* [PATCH 00/30] [v3] KAISER: unmap most of the kernel from userspace page tables
@ 2017-11-10 19:30 ` Dave Hansen
  0 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86, jgross

Thanks, everyone, for all the reviews thus far.  I hope I managed to
address all the feedback given, except for the TODOs of
course.  This is a pretty minor update compared to v1->v2.

These patches are all on top of Andy's entry changes here:

	https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/entry_consolidation

Changes from v2:
 * Reword documentation removing "we"
 * Fix some whitespace damage
 * Fix up MAX ASID values off-by-one noted by Peter Z
 * Change CodingStyle stuff from Borislav comments
 * Always use _KERNPG_TABLE for pmd_populate_kernel().

Changes from v1:
 * Updated to be on top of Andy L's new entry code
 * Allow global pages again, and use them for pages mapped into
   userspace page tables.
 * Use trampoline stack instead of process stack at entry so no
   longer need to map process stack (big win in fork() speed)
 * Made the page table walking less generic by restricting it
   to kernel addresses and !_PAGE_USER pages.
 * Added a debugfs file to enable/disable CR3 switching at
   runtime.  This does not remove all the KAISER overhead, but
   it removes the largest source.
 * Use runtime disable with Xen to permit Xen-PV guests with
   KAISER=y.
 * Moved assembly code from "core" to "prepare assembly" patch
 * Pass full register name to asm macros
 * Remove double stack switch in entry_SYSENTER_compat
 * Disable vsyscall native case when KAISER=y
 * Separate PER_CPU_USER_MAPPED generic definitions from use
   by arch/x86/.

TODO:
 * Allow dumping the shadow page tables with the ptdump code
 * Put LDT at top of userspace
 * Create separate tlb flushing functions for user and kernel
 * Chase down the source of the new !CR4.PGE warning that 0day
   found with i386

---

tl;dr:

KAISER makes it harder to defeat KASLR, but makes syscalls and
interrupts slower.  These patches are based on work from a team at
Graz University of Technology posted here[1].  The major addition is
support for Intel PCIDs which builds on top of Andy Lutomirski's PCID
work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
for a wide variety of use cases.

Full Description:

KAISER is a countermeasure against attacks on kernel address
information.  There are at least three existing, published,
approaches using the shared user/kernel mapping and hardware features
to defeat KASLR.  One approach referenced in the paper locates the
kernel by observing differences in page fault timing between
present-but-inaccessible kernel pages and non-present pages.

KAISER addresses this by unmapping (most of) the kernel when
userspace runs.  It leaves the existing page tables largely alone and
refers to them as "kernel page tables".  For running userspace, a new
"shadow" copy of the page tables is allocated for each process.  The
shadow page tables map all the same user memory as the "kernel" copy,
but map only a minimal set of kernel memory.

When we enter the kernel via syscalls, interrupts or exceptions,
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the "shadow" copy is used.  Process
Context IDentifiers (PCIDs) are used to ensure that the TLB is not
flushed when switching between page tables, which makes syscalls
roughly 2x faster than without them.  PCIDs are usable on Haswell and
newer CPUs (the ones with "v4", or called fourth-generation Core).

The minimal kernel page tables try to map only what is needed to
enter/exit the kernel such as the entry/exit functions, interrupt
descriptors (IDT) and the kernel trampoline stacks.  This minimal set
of data can still reveal the kernel's ASLR base address.  But, this
minimal kernel data is all trusted, which makes it harder to exploit
than data in the kernel direct map which contains loads of
user-controlled data.

KAISER will affect performance for anything that does system calls or
interrupts: everything.  Just the new instructions (CR3 manipulation)
add a few hundred cycles to a syscall or interrupt.  Most workloads
that we have run show single-digit regressions.  5% is a good round
number for what is typical.  The worst we have seen is a roughly 30%
regression on a loopback networking test that did a ton of syscalls
and context switches.  More details about possible performance
impacts are in the new Documentation/ file.

This code is based on a version I downloaded from
(https://github.com/IAIK/KAISER).  It has been heavily modified.

The approach is described in detail in a paper[2].  However, there is
some incorrect information in the paper, both on how Linux and
the hardware work.  For instance, I do not share the opinion that
KAISER has "runtime overhead of only 0.28%".  Please rely on this
patch series as the canonical source of information about this
submission.

Here is one example of how the kernel image grows when CONFIG_KAISER
is turned on.  Most of the size increase is presumably from additional
alignment requirements for mapping entry/exit code and structures.

    text    data     bss      dec filename
11786064 7356724 2928640 22071428 vmlinux-nokaiser
11798203 7371704 2928640 22098547 vmlinux-kaiser
  +12139  +14980       0   +27119

To give folks an idea what the performance impact is like, I took
the following test and ran it single-threaded:

	https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c

It's a pretty quick syscall so this shows how much KAISER slows
down syscalls (and how much PCIDs help).  The units here are
lseeks/second:

        no kaiser: 5.2M
    kaiser+  pcid: 3.0M
    kaiser+nopcid: 2.2M

"nopcid" is literally with the "nopcid" command-line option which
turns PCIDs off entirely.

Thanks to:
The original KAISER team at Graz University of Technology.
Andy Lutomirski for all the help with the entry code.
Kirill Shutemov for a helpful review of the code.

1. https://github.com/IAIK/KAISER
2. https://gruss.cc/files/kaiser.pdf

--

The code is available here:

	https://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-kaiser.git/

 Documentation/x86/kaiser.txt                | 160 +++++
 arch/x86/Kconfig                            |   8 +
 arch/x86/entry/calling.h                    |  89 +++
 arch/x86/entry/entry_64.S                   |  44 +-
 arch/x86/entry/entry_64_compat.S            |   8 +
 arch/x86/events/intel/ds.c                  |  49 +-
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/desc.h                 |   2 +-
 arch/x86/include/asm/kaiser.h               |  62 ++
 arch/x86/include/asm/mmu_context.h          |  29 +-
 arch/x86/include/asm/pgalloc.h              |  37 +-
 arch/x86/include/asm/pgtable.h              |  20 +-
 arch/x86/include/asm/pgtable_64.h           | 135 +++++
 arch/x86/include/asm/pgtable_types.h        |  25 +-
 arch/x86/include/asm/processor.h            |   2 +-
 arch/x86/include/asm/tlbflush.h             | 232 +++++++-
 arch/x86/include/uapi/asm/processor-flags.h |   3 +-
 arch/x86/kernel/cpu/common.c                |  21 +-
 arch/x86/kernel/espfix_64.c                 |  27 +-
 arch/x86/kernel/head_64.S                   |  30 +-
 arch/x86/kernel/ldt.c                       |  25 +-
 arch/x86/kernel/process.c                   |   2 +-
 arch/x86/kernel/process_64.c                |   2 +-
 arch/x86/kernel/traps.c                     |  46 +-
 arch/x86/kvm/x86.c                          |   3 +-
 arch/x86/mm/Makefile                        |   1 +
 arch/x86/mm/init.c                          |  75 ++-
 arch/x86/mm/kaiser.c                        | 627 ++++++++++++++++++++
 arch/x86/mm/pageattr.c                      |  18 +-
 arch/x86/mm/pgtable.c                       |  16 +-
 arch/x86/mm/tlb.c                           | 105 +++-
 include/asm-generic/vmlinux.lds.h           |  17 +
 include/linux/kaiser.h                      |  34 ++
 include/linux/percpu-defs.h                 |  30 +
 init/main.c                                 |   3 +
 kernel/fork.c                               |   1 +
 security/Kconfig                            |  10 +
 37 files changed, 1851 insertions(+), 148 deletions(-)

Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>

* [PATCH 01/30] x86, mm: do not set _PAGE_USER for init_mm page tables
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, tglx, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

init_mm is for kernel-exclusive use.  If someone is allocating page
tables for it, do not set _PAGE_USER on them.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/pgalloc.h |   37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff -puN arch/x86/include/asm/pgalloc.h~kaiser-prep-clear-_PAGE_USER-for-init_mm arch/x86/include/asm/pgalloc.h
--- a/arch/x86/include/asm/pgalloc.h~kaiser-prep-clear-_PAGE_USER-for-init_mm	2017-11-10 11:22:04.991244960 -0800
+++ b/arch/x86/include/asm/pgalloc.h	2017-11-10 11:22:04.994244960 -0800
@@ -61,20 +61,41 @@ static inline void __pte_free_tlb(struct
 	___pte_free_tlb(tlb, pte);
 }
 
+/*
+ * init_mm is for kernel-exclusive use.  Any page tables that
+ * are setup for it should not be usable by userspace.
+ *
+ * This also *signals* to code (like KAISER) that this page table
+ * entry is for kernel-exclusive use.
+ */
+static inline pteval_t mm_pgtable_flags(struct mm_struct *mm)
+{
+	if (!mm || (mm == &init_mm))
+		return _KERNPG_TABLE;
+	return _PAGE_TABLE;
+}
+
 static inline void pmd_populate_kernel(struct mm_struct *mm,
 				       pmd_t *pmd, pte_t *pte)
 {
+	/*
+	 * Since we are populating a kernel pmd, always use
+	 * _KERNPG_TABLE and ignore mm
+	 */
+	pteval_t pgtable_flags = _KERNPG_TABLE;
+
 	paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
-	set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
+	set_pmd(pmd, __pmd(__pa(pte) | pgtable_flags));
 }
 
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 				struct page *pte)
 {
+	pteval_t pgtable_flags = mm_pgtable_flags(mm);
 	unsigned long pfn = page_to_pfn(pte);
 
 	paravirt_alloc_pte(mm, pfn);
-	set_pmd(pmd, __pmd(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE));
+	set_pmd(pmd, __pmd(((pteval_t)pfn << PAGE_SHIFT) | pgtable_flags));
 }
 
 #define pmd_pgtable(pmd) pmd_page(pmd)
@@ -117,16 +138,20 @@ extern void pud_populate(struct mm_struc
 #else	/* !CONFIG_X86_PAE */
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 {
+	pteval_t pgtable_flags = mm_pgtable_flags(mm);
+
 	paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
-	set_pud(pud, __pud(_PAGE_TABLE | __pa(pmd)));
+	set_pud(pud, __pud(__pa(pmd) | pgtable_flags));
 }
 #endif	/* CONFIG_X86_PAE */
 
 #if CONFIG_PGTABLE_LEVELS > 3
 static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
 {
+	pteval_t pgtable_flags = mm_pgtable_flags(mm);
+
 	paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
-	set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
+	set_p4d(p4d, __p4d(__pa(pud) | pgtable_flags));
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
@@ -155,8 +180,10 @@ static inline void __pud_free_tlb(struct
 #if CONFIG_PGTABLE_LEVELS > 4
 static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
 {
+	pteval_t pgtable_flags = mm_pgtable_flags(mm);
+
 	paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
-	set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+	set_pgd(pgd, __pgd(__pa(p4d) | pgtable_flags));
 }
 
 static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
_

* [PATCH 02/30] x86, tlb: Make CR4-based TLB flushes more robust
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

The existing CR4-based TLB flush currently requires global pages
to be supported *and* enabled.  But the hardware only needs
them to be supported.

Make the code more robust by allowing the initial state of
X86_CR4_PGE to be on *or* off.  In addition, if called in an
unexpected state (X86_CR4_PGE=0), issue a warning.  X86_CR4_PGE=0
is certainly unexpected and should not be ignored if encountered.

This essentially gives the best of both worlds: a TLB flush no
matter what, and a warning if the TLB flush is called in an
unexpected way (X86_CR4_PGE=0).

The XOR change was suggested by Kirill Shutemov.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |   22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-prep-make-cr4-writes-tolerate-clear-pge arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-prep-make-cr4-writes-tolerate-clear-pge	2017-11-10 11:22:05.534244958 -0800
+++ b/arch/x86/include/asm/tlbflush.h	2017-11-10 11:22:05.538244958 -0800
@@ -247,12 +247,24 @@ static inline void __native_flush_tlb(vo
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
 {
-	unsigned long cr4;
+	unsigned long cr4 = this_cpu_read(cpu_tlbstate.cr4);
 
-	cr4 = this_cpu_read(cpu_tlbstate.cr4);
-	/* clear PGE */
-	native_write_cr4(cr4 & ~X86_CR4_PGE);
-	/* write old PGE again and flush TLBs */
+	/*
+	 * This function is only called on systems that support X86_CR4_PGE
+	 * and where we expect X86_CR4_PGE to be set.  Warn if we are called
+	 * without PGE set.
+	 */
+	WARN_ON_ONCE(!(cr4 & X86_CR4_PGE));
+
+	/*
+	 * Architecturally, any _change_ to X86_CR4_PGE will fully flush the
+	 * TLB of all entries including all entries in all PCIDs and all
+	 * global pages.  Make sure that we _change_ the bit, regardless of
+	 * whether we had X86_CR4_PGE set in the first place.
+	 */
+	native_write_cr4(cr4 ^ X86_CR4_PGE);
+
+	/* Put original CR4 value back: */
 	native_write_cr4(cr4);
 }
 
_

* [PATCH 03/30] x86/mm: Document X86_CR4_PGE toggling behavior
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

The comment says it all here.  The problem here is that the
X86_CR4_PGE bit affects all PCIDs in a way that is totally
obscure.

This makes it easier for someone to grep for PCID-related code
and documents the expected hardware behavior.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-prep-document-cr4-pge-behavior arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-prep-document-cr4-pge-behavior	2017-11-10 11:22:06.079244957 -0800
+++ b/arch/x86/include/asm/tlbflush.h	2017-11-10 11:22:06.082244957 -0800
@@ -257,10 +257,12 @@ static inline void __native_flush_tlb_gl
 	WARN_ON_ONCE(!(cr4 & X86_CR4_PGE));
 
 	/*
-	 * Architecturally, any _change_ to X86_CR4_PGE will fully flush the
-	 * TLB of all entries including all entries in all PCIDs and all
-	 * global pages.  Make sure that we _change_ the bit, regardless of
+	 * Architecturally, any _change_ to X86_CR4_PGE will fully flush
+	 * all entries.  Make sure that we _change_ the bit, regardless of
 	 * whether we had X86_CR4_PGE set in the first place.
+	 *
+	 * Note that just toggling PGE *also* flushes all entries from all
+	 * PCIDs, regardless of the state of X86_CR4_PCIDE.
 	 */
 	native_write_cr4(cr4 ^ X86_CR4_PGE);
 
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 04/30] x86, kaiser: disable global pages by default with KAISER
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, bp, tglx, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

Global pages stay in the TLB across context switches.  Since all contexts
share the same kernel mapping, these mappings are marked as global pages
so kernel entries in the TLB are not flushed out on a context switch.

But merely having these entries in the TLB opens up a timing side
channel that an attacker can use [1].

That means that even when KAISER switches page tables on return to user
space the global pages would stay in the TLB cache.

Disable global pages so that kernel TLB entries can be flushed before
returning to user space. This way, all accesses to kernel addresses from
userspace result in a TLB miss independent of the existence of a kernel
mapping.

Replace _PAGE_GLOBAL by __PAGE_KERNEL_GLOBAL and keep _PAGE_GLOBAL
available so that it can still be used for a few selected kernel mappings
which must be visible to userspace, when KAISER is enabled, like the
entry/exit code and data.
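
The effect of the new macro on the pageattr.c fixups can be sketched in
plain C.  This is an illustration only: fixup_global() is a hypothetical
helper, and its kernel_global argument stands in for the compile-time
value of __PAGE_KERNEL_GLOBAL (0 with CONFIG_KAISER=y, _PAGE_GLOBAL
otherwise).  The bit positions are the usual x86 ones.

```c
#include <assert.h>
#include <stdint.h>

#define _PAGE_PRESENT	(1ULL << 0)	/* x86 PTE bit 0 */
#define _PAGE_GLOBAL	(1ULL << 8)	/* x86 PTE bit 8 */

/*
 * Hypothetical helper mirroring the pattern used in pageattr.c:
 * present entries get the "kernel global" bits, non-present entries
 * have them cleared.  When kernel_global is 0, both branches are
 * no-ops, so a _PAGE_GLOBAL bit that was set explicitly is left alone.
 */
static uint64_t fixup_global(uint64_t prot, uint64_t kernel_global)
{
	assert(kernel_global == 0 || kernel_global == _PAGE_GLOBAL);

	if (prot & _PAGE_PRESENT)
		prot |= kernel_global;
	else
		prot &= ~kernel_global;
	return prot;
}
```

With kernel_global == 0 the function returns its input unchanged, which
is how default kernel mappings stop being global while the few mappings
that set _PAGE_GLOBAL by hand keep it.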

1. The double-page-fault attack:
   http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/pgtable_types.h |   14 +++++++++++++-
 b/arch/x86/mm/pageattr.c               |   16 ++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff -puN arch/x86/include/asm/pgtable_types.h~kaiser-prep-disable-global-pages arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~kaiser-prep-disable-global-pages	2017-11-10 11:22:06.621244956 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2017-11-10 11:22:06.626244956 -0800
@@ -179,8 +179,20 @@ enum page_cache_mode {
 #define PAGE_READONLY_EXEC	__pgprot(_PAGE_PRESENT | _PAGE_USER |	\
 					 _PAGE_ACCESSED)
 
+/*
+ * Disable global pages for anything using the default
+ * __PAGE_KERNEL* macros.  PGE will still be enabled
+ * and _PAGE_GLOBAL may still be used carefully.
+ */
+#ifdef CONFIG_KAISER
+#define __PAGE_KERNEL_GLOBAL	0
+#else
+#define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
+#endif
+
 #define __PAGE_KERNEL_EXEC						\
-	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL)
+	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED |	\
+	 __PAGE_KERNEL_GLOBAL)
 #define __PAGE_KERNEL		(__PAGE_KERNEL_EXEC | _PAGE_NX)
 
 #define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
diff -puN arch/x86/mm/pageattr.c~kaiser-prep-disable-global-pages arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c~kaiser-prep-disable-global-pages	2017-11-10 11:22:06.623244956 -0800
+++ b/arch/x86/mm/pageattr.c	2017-11-10 11:22:06.627244956 -0800
@@ -585,9 +585,9 @@ try_preserve_large_page(pte_t *kpte, uns
 	 * for the ancient hardware that doesn't support it.
 	 */
 	if (pgprot_val(req_prot) & _PAGE_PRESENT)
-		pgprot_val(req_prot) |= _PAGE_PSE | _PAGE_GLOBAL;
+		pgprot_val(req_prot) |= _PAGE_PSE | __PAGE_KERNEL_GLOBAL;
 	else
-		pgprot_val(req_prot) &= ~(_PAGE_PSE | _PAGE_GLOBAL);
+		pgprot_val(req_prot) &= ~(_PAGE_PSE | __PAGE_KERNEL_GLOBAL);
 
 	req_prot = canon_pgprot(req_prot);
 
@@ -705,9 +705,9 @@ __split_large_page(struct cpa_data *cpa,
 	 * for the ancient hardware that doesn't support it.
 	 */
 	if (pgprot_val(ref_prot) & _PAGE_PRESENT)
-		pgprot_val(ref_prot) |= _PAGE_GLOBAL;
+		pgprot_val(ref_prot) |= __PAGE_KERNEL_GLOBAL;
 	else
-		pgprot_val(ref_prot) &= ~_PAGE_GLOBAL;
+		pgprot_val(ref_prot) &= ~__PAGE_KERNEL_GLOBAL;
 
 	/*
 	 * Get the target pfn from the original entry:
@@ -938,9 +938,9 @@ static void populate_pte(struct cpa_data
 	 * support it.
 	 */
 	if (pgprot_val(pgprot) & _PAGE_PRESENT)
-		pgprot_val(pgprot) |= _PAGE_GLOBAL;
+		pgprot_val(pgprot) |= __PAGE_KERNEL_GLOBAL;
 	else
-		pgprot_val(pgprot) &= ~_PAGE_GLOBAL;
+		pgprot_val(pgprot) &= ~__PAGE_KERNEL_GLOBAL;
 
 	pgprot = canon_pgprot(pgprot);
 
@@ -1242,9 +1242,9 @@ repeat:
 		 * support it.
 		 */
 		if (pgprot_val(new_prot) & _PAGE_PRESENT)
-			pgprot_val(new_prot) |= _PAGE_GLOBAL;
+			pgprot_val(new_prot) |= __PAGE_KERNEL_GLOBAL;
 		else
-			pgprot_val(new_prot) &= ~_PAGE_GLOBAL;
+			pgprot_val(new_prot) &= ~__PAGE_KERNEL_GLOBAL;
 
 		/*
 		 * We need to keep the pfn from the existing PTE,
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 05/30] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

This is largely code from Andy Lutomirski.  I fixed a few bugs
in it, and added a few SWITCH_TO_* spots.

KAISER needs to switch to a different CR3 value when it enters
the kernel and switch back when it exits.  This essentially
needs to be done in assembly: before any C code runs on entry
and after the last C code has run on exit.

This is extra challenging because the switching context is
tricky: the registers that can be clobbered can vary.  It is also
hard to store things on the stack because there is an established
ABI (ptregs) or the stack is entirely unsafe to use.

This patch establishes a set of macros that allow changing to
the user and kernel CR3 values.
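
The arithmetic behind the ADJUST_*_CR3 macros can be sketched in C.  The
helper names below are hypothetical, and the sketch assumes the 8k PGD
pair is 8k-aligned so that bit 12 is free to select between the kernel
half and the user half.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT		12
#define KAISER_SWITCH_MASK	(1ULL << PAGE_SHIFT)

/* Assumption: the PGD pair starts on an 8k boundary. */
static uint64_t cr3_for_pgd_pair(uint64_t pgd_phys)
{
	assert((pgd_phys & ((2ULL << PAGE_SHIFT) - 1)) == 0);
	return pgd_phys;
}

/* Like ADJUST_KERNEL_CR3: clear the switch bit. */
static uint64_t adjust_kernel_cr3(uint64_t cr3)
{
	return cr3 & ~KAISER_SWITCH_MASK;
}

/* Like ADJUST_USER_CR3: set the switch bit, one page up. */
static uint64_t adjust_user_cr3(uint64_t cr3)
{
	return cr3 | KAISER_SWITCH_MASK;
}
```

Both helpers are idempotent, so applying the kernel adjustment to a CR3
value that already points at the kernel page tables is harmless.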

Interactions with SWAPGS: previous versions of the KAISER code
relied on having per-cpu scratch space to save/restore a register
that can be used for the CR3 MOV.  The %GS register is used to
index into our per-cpu space, so SWAPGS *had* to be done before
the CR3 switch.  That scratch space is gone now, but the semantic
that SWAPGS must be done before the CR3 MOV is retained.  This is
good to keep because it is not that hard to do and it allows us
to do things like add per-cpu debugging information to help us
figure out what goes wrong sometimes.

What this does in the NMI code is worth pointing out.  NMIs
can interrupt *any* context and they can also be nested with
NMIs interrupting other NMIs.  The comments below
".Lnmi_from_kernel" explain the format of the stack during this
situation.  Changing the format of this stack is not a fun
exercise: I tried.  Instead of storing the old CR3 value on the
stack, this patch depends on the *regular* register save/restore
mechanism and then uses %r14 to keep CR3 during the NMI.  It is
callee-saved and will not be clobbered by the C NMI handlers that
get called.
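
The save/switch/restore flow used on the paranoid and NMI paths can be
modeled in C as follows.  This is a sketch under stated assumptions:
struct cr3_model and the function names are hypothetical, and the
returned value plays the role of the callee-saved register (%r14) that
holds the interrupted context's CR3.

```c
#include <assert.h>
#include <stdint.h>

#define KAISER_SWITCH_MASK	(1ULL << 12)

/* Hypothetical model of the CR3 register; not a kernel structure. */
struct cr3_model {
	uint64_t cr3;
	unsigned int cr3_writes;	/* number of CR3 writes performed */
};

static void model_write_cr3(struct cr3_model *m, uint64_t val)
{
	m->cr3 = val;
	m->cr3_writes++;
}

/*
 * Like SAVE_AND_SWITCH_TO_KERNEL_CR3: remember the incoming CR3 and
 * write CR3 only if the switch bit says user page tables are active.
 */
static uint64_t save_and_switch_to_kernel_cr3(struct cr3_model *m)
{
	uint64_t saved = m->cr3;

	if (saved & KAISER_SWITCH_MASK)
		model_write_cr3(m, saved & ~KAISER_SWITCH_MASK);
	assert(!(m->cr3 & KAISER_SWITCH_MASK));
	return saved;
}

/* Like RESTORE_CR3: unconditionally write the saved value back. */
static void restore_cr3(struct cr3_model *m, uint64_t saved)
{
	model_write_cr3(m, saved);
}
```

Entering from a context that already runs on the kernel page tables
costs no CR3 write on entry, but the restore is unconditional, as the
comment in RESTORE_CR3 explains.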

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/entry/calling.h         |   65 +++++++++++++++++++++++++++++++++++++
 b/arch/x86/entry/entry_64.S        |   34 ++++++++++++++++---
 b/arch/x86/entry/entry_64_compat.S |    8 ++++
 3 files changed, 102 insertions(+), 5 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-luto-base-cr3-work arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-luto-base-cr3-work	2017-11-10 11:22:07.191244954 -0800
+++ b/arch/x86/entry/calling.h	2017-11-10 11:22:07.198244954 -0800
@@ -1,5 +1,6 @@
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
+#include <asm/cpufeatures.h>
 
 /*
 
@@ -186,6 +187,70 @@ For 32-bit we have the following convent
 #endif
 .endm
 
+#ifdef CONFIG_KAISER
+
+/* KAISER PGDs are 8k.  We flip bit 12 to switch between the two halves: */
+#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+
+.macro ADJUST_KERNEL_CR3 reg:req
+	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
+	andq	$(~KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro ADJUST_USER_CR3 reg:req
+	/* Move CR3 up a page to the user page tables: */
+	orq	$(KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_KERNEL_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_USER_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	movq	%cr3, %r\scratch_reg
+	movq	%r\scratch_reg, \save_reg
+	/*
+	 * Is the switch bit zero?  This means the address is
+	 * up in real KAISER patches in a moment.
+	 */
+	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
+	jz	.Ldone_\@
+
+	ADJUST_KERNEL_CR3 %r\scratch_reg
+	movq	%r\scratch_reg, %cr3
+
+.Ldone_\@:
+.endm
+
+.macro RESTORE_CR3 save_reg:req
+	/*
+	 * We could avoid the CR3 write if not changing its value,
+	 * but that requires a CR3 read *and* a scratch register.
+	 */
+	movq	\save_reg, %cr3
+.endm
+
+#else /* CONFIG_KAISER=n: */
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+.endm
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.endm
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+.macro RESTORE_CR3 save_reg:req
+.endm
+
+#endif
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff -puN arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64_compat.S
--- a/arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work	2017-11-10 11:22:07.193244954 -0800
+++ b/arch/x86/entry/entry_64_compat.S	2017-11-10 11:22:07.198244954 -0800
@@ -91,6 +91,9 @@ ENTRY(entry_SYSENTER_compat)
 	pushq   $0			/* pt_regs->r15 = 0 */
 	cld
 
+	/* We just saved all the registers, so safe to clobber %rdi */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	/*
 	 * SYSENTER doesn't filter flags, so we need to clear NT and AC
 	 * ourselves.  To save a few cycles, we can check whether
@@ -214,6 +217,8 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
@@ -240,6 +245,7 @@ sysret32_from_system_call:
 	popq	%rsi			/* pt_regs->si */
 	popq	%rdi			/* pt_regs->di */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%r8
         /*
          * USERGS_SYSRET32 does:
          *  GSBASE = user's GS base
@@ -324,6 +330,8 @@ ENTRY(entry_INT80_compat)
 	pushq   %r15                    /* pt_regs->r15 */
 	cld
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
+
 	movq	%rsp, %rdi			/* pt_regs pointer */
 	call	sync_regs
 	movq	%rax, %rsp			/* switch stack */
diff -puN arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64.S
--- a/arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work	2017-11-10 11:22:07.194244954 -0800
+++ b/arch/x86/entry/entry_64.S	2017-11-10 11:22:07.199244954 -0800
@@ -147,8 +147,6 @@ ENTRY(entry_SYSCALL_64)
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
-	TRACE_IRQS_OFF
-
 	/* Construct struct pt_regs on stack */
 	pushq	$__USER_DS			/* pt_regs->ss */
 	pushq	PER_CPU_VAR(rsp_scratch)	/* pt_regs->sp */
@@ -169,6 +167,13 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
 	UNWIND_HINT_REGS extra=0
 
+	/* NB: right here, all regs except r11 are live. */
+
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
+
+	/* Must wait until we have the kernel CR3 to call C functions: */
+	TRACE_IRQS_OFF
+
 	/*
 	 * If we need to do entry work or if we guess we'll need to do
 	 * exit work, go straight to the slow path.
@@ -340,6 +345,7 @@ syscall_return_via_sysret:
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 
 	popq	%rdi
 	popq	%rsp
@@ -679,6 +685,8 @@ GLOBAL(swapgs_restore_regs_and_return_to
 	 * We can do future final exit work right here.
 	 */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+
 	/* Restore RDI. */
 	popq	%rdi
 	SWAPGS
@@ -1167,7 +1175,11 @@ ENTRY(paranoid_entry)
 	js	1f				/* negative -> in kernel */
 	SWAPGS
 	xorl	%ebx, %ebx
-1:	ret
+
+1:
+	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
+
+	ret
 END(paranoid_entry)
 
 /*
@@ -1189,6 +1201,7 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
+	RESTORE_CR3	%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1217,6 +1230,9 @@ ENTRY(error_entry)
 	 */
 	SWAPGS
 
+	/* We have user CR3.  Change to kernel CR3. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+
 .Lerror_entry_from_usermode_after_swapgs:
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
@@ -1263,9 +1279,10 @@ ENTRY(error_entry)
 
 .Lerror_bad_iret:
 	/*
-	 * We came from an IRET to user mode, so we have user gsbase.
-	 * Switch to kernel gsbase:
+	 * We came from an IRET to user mode, so we have user
+	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	SWAPGS
 
 	/*
@@ -1298,6 +1315,10 @@ END(error_exit)
 /*
  * Runs on exception stack.  Xen PV does not go through this path at all,
  * so we can use real assembly here.
+ *
+ * Registers:
+ *	%r14: Used to save/restore the CR3 of the interrupted context
+ *	      when KAISER is in use.  Do not clobber.
  */
 ENTRY(nmi)
 	UNWIND_HINT_IRET_REGS
@@ -1389,6 +1410,7 @@ ENTRY(nmi)
 	UNWIND_HINT_REGS
 	ENCODE_FRAME_POINTER
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	/*
 	 * At this point we no longer need to worry about stack damage
 	 * due to nesting -- we're on the normal thread stack and we're
@@ -1613,6 +1635,8 @@ end_repeat_nmi:
 	movq	$-1, %rsi
 	call	do_nmi
 
+	RESTORE_CR3 save_reg=%r14
+
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
 nmi_swapgs:
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 05/30] x86, kaiser: prepare assembly for entry/exit CR3 switching
@ 2017-11-10 19:31   ` Dave Hansen
  0 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

This is largely code from Andy Lutomirski.  I fixed a few bugs
in it, and added a few SWITCH_TO_* spots.

KAISER needs to switch to a different CR3 value when it enters
the kernel and switch back when it exits.  This essentially
needs to be done in assembly: before any C code runs on entry
and after the last C code has run on exit.

This is extra challenging because the switching context is
tricky: the registers that can be clobbered can vary.  It is also
hard to store things on the stack because there is an established
ABI (ptregs) or the stack is entirely unsafe to use.

This patch establishes a set of macros that allow changing to
the user and kernel CR3 values.

Interactions with SWAPGS: previous versions of the KAISER code
relied on having per-cpu scratch space to save/restore a register
that can be used for the CR3 MOV.  The %GS register is used to
index into our per-cpu space, so SWAPGS *had* to be done before
the CR3 switch.  That scratch space is gone now, but the semantic
that SWAPGS must be done before the CR3 MOV is retained.  This is
good to keep because it is not that hard to do and it allows us
to do things like add per-cpu debugging information to help us
figure out what goes wrong sometimes.

What this does in the NMI code is worth pointing out.  NMIs
can interrupt *any* context and they can also be nested with
NMIs interrupting other NMIs.  The comments below
".Lnmi_from_kernel" explain the format of the stack during this
situation.  Changing the format of this stack is not a fun
exercise: I tried.  Instead of storing the old CR3 value on the
stack, this patch depends on the *regular* register save/restore
mechanism and then uses %r14 to keep CR3 during the NMI.  It is
callee-saved and will not be clobbered by the C NMI handlers that
get called.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/entry/calling.h         |   65 +++++++++++++++++++++++++++++++++++++
 b/arch/x86/entry/entry_64.S        |   34 ++++++++++++++++---
 b/arch/x86/entry/entry_64_compat.S |    8 ++++
 3 files changed, 102 insertions(+), 5 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-luto-base-cr3-work arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-luto-base-cr3-work	2017-11-10 11:22:07.191244954 -0800
+++ b/arch/x86/entry/calling.h	2017-11-10 11:22:07.198244954 -0800
@@ -1,5 +1,6 @@
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
+#include <asm/cpufeatures.h>
 
 /*
 
@@ -186,6 +187,70 @@ For 32-bit we have the following convent
 #endif
 .endm
 
+#ifdef CONFIG_KAISER
+
+/* KAISER PGDs are 8k.  We flip bit 12 to switch between the two halves: */
+#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+
+.macro ADJUST_KERNEL_CR3 reg:req
+	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
+	andq	$(~KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro ADJUST_USER_CR3 reg:req
+	/* Move CR3 up a page to the user page tables: */
+	orq	$(KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_KERNEL_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_USER_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	movq	%cr3, %r\scratch_reg
+	movq	%r\scratch_reg, \save_reg
+	/*
+	 * Is the switch bit zero?  This means the address is
+	 * up in real KAISER patches in a moment.
+	 */
+	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
+	jz	.Ldone_\@
+
+	ADJUST_KERNEL_CR3 %r\scratch_reg
+	movq	%r\scratch_reg, %cr3
+
+.Ldone_\@:
+.endm
+
+.macro RESTORE_CR3 save_reg:req
+	/*
+	 * We could avoid the CR3 write if not changing its value,
+	 * but that requires a CR3 read *and* a scratch register.
+	 */
+	movq	\save_reg, %cr3
+.endm
+
+#else /* CONFIG_KAISER=n: */
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+.endm
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.endm
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+.macro RESTORE_CR3 save_reg:req
+.endm
+
+#endif
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff -puN arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64_compat.S
--- a/arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work	2017-11-10 11:22:07.193244954 -0800
+++ b/arch/x86/entry/entry_64_compat.S	2017-11-10 11:22:07.198244954 -0800
@@ -91,6 +91,9 @@ ENTRY(entry_SYSENTER_compat)
 	pushq   $0			/* pt_regs->r15 = 0 */
 	cld
 
+	/* We just saved all the registers, so safe to clobber %rdi */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	/*
 	 * SYSENTER doesn't filter flags, so we need to clear NT and AC
 	 * ourselves.  To save a few cycles, we can check whether
@@ -214,6 +217,8 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
@@ -240,6 +245,7 @@ sysret32_from_system_call:
 	popq	%rsi			/* pt_regs->si */
 	popq	%rdi			/* pt_regs->di */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%r8
         /*
          * USERGS_SYSRET32 does:
          *  GSBASE = user's GS base
@@ -324,6 +330,8 @@ ENTRY(entry_INT80_compat)
 	pushq   %r15                    /* pt_regs->r15 */
 	cld
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
+
 	movq	%rsp, %rdi			/* pt_regs pointer */
 	call	sync_regs
 	movq	%rax, %rsp			/* switch stack */
diff -puN arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64.S
--- a/arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work	2017-11-10 11:22:07.194244954 -0800
+++ b/arch/x86/entry/entry_64.S	2017-11-10 11:22:07.199244954 -0800
@@ -147,8 +147,6 @@ ENTRY(entry_SYSCALL_64)
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
-	TRACE_IRQS_OFF
-
 	/* Construct struct pt_regs on stack */
 	pushq	$__USER_DS			/* pt_regs->ss */
 	pushq	PER_CPU_VAR(rsp_scratch)	/* pt_regs->sp */
@@ -169,6 +167,13 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
 	UNWIND_HINT_REGS extra=0
 
+	/* NB: right here, all regs except r11 are live. */
+
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
+
+	/* Must wait until we have the kernel CR3 to call C functions: */
+	TRACE_IRQS_OFF
+
 	/*
 	 * If we need to do entry work or if we guess we'll need to do
 	 * exit work, go straight to the slow path.
@@ -340,6 +345,7 @@ syscall_return_via_sysret:
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 
 	popq	%rdi
 	popq	%rsp
@@ -679,6 +685,8 @@ GLOBAL(swapgs_restore_regs_and_return_to
 	 * We can do future final exit work right here.
 	 */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+
 	/* Restore RDI. */
 	popq	%rdi
 	SWAPGS
@@ -1167,7 +1175,11 @@ ENTRY(paranoid_entry)
 	js	1f				/* negative -> in kernel */
 	SWAPGS
 	xorl	%ebx, %ebx
-1:	ret
+
+1:
+	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
+
+	ret
 END(paranoid_entry)
 
 /*
@@ -1189,6 +1201,7 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
+	RESTORE_CR3	%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1217,6 +1230,9 @@ ENTRY(error_entry)
 	 */
 	SWAPGS
 
+	/* We have user CR3.  Change to kernel CR3. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+
 .Lerror_entry_from_usermode_after_swapgs:
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
@@ -1263,9 +1279,10 @@ ENTRY(error_entry)
 
 .Lerror_bad_iret:
 	/*
-	 * We came from an IRET to user mode, so we have user gsbase.
-	 * Switch to kernel gsbase:
+	 * We came from an IRET to user mode, so we have user
+	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	SWAPGS
 
 	/*
@@ -1298,6 +1315,10 @@ END(error_exit)
 /*
  * Runs on exception stack.  Xen PV does not go through this path at all,
  * so we can use real assembly here.
+ *
+ * Registers:
+ *	%r14: Used to save/restore the CR3 of the interrupted context
+ *	      when KAISER is in use.  Do not clobber.
  */
 ENTRY(nmi)
 	UNWIND_HINT_IRET_REGS
@@ -1389,6 +1410,7 @@ ENTRY(nmi)
 	UNWIND_HINT_REGS
 	ENCODE_FRAME_POINTER
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	/*
 	 * At this point we no longer need to worry about stack damage
 	 * due to nesting -- we're on the normal thread stack and we're
@@ -1613,6 +1635,8 @@ end_repeat_nmi:
 	movq	$-1, %rsi
 	call	do_nmi
 
+	RESTORE_CR3 save_reg=%r14
+
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
 nmi_swapgs:
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 06/30] x86, kaiser: introduce user-mapped per-cpu areas
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

This patch creates a new kind of per-cpu data that is mapped and
can be used no matter which copy of the page tables is active.
Users of this new section will be forthcoming.
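
As a rough illustration of how the new section names come together
(a userspace sketch, not kernel code): the per-cpu macros build the
section string by pasting adjacent string literals.  The base section
name assumes the SMP case, and the demo_* helpers are hypothetical
names made up for this sketch.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins for the kernel definitions (SMP case assumed). */
#define PER_CPU_BASE_SECTION		".data..percpu"
#define PER_CPU_SHARED_ALIGNED_SECTION	"..shared_aligned"
#define USER_MAPPED_SECTION		"..user_mapped"	/* CONFIG_KAISER=y */

/* Adjacent string literals are concatenated at compile time. */
#define DEMO_SECTION(sec)	PER_CPU_BASE_SECTION sec

/* Section name used by the plain user-mapped variant in this sketch. */
static const char *demo_user_mapped_section(void)
{
	return DEMO_SECTION(USER_MAPPED_SECTION);
}

/* Section name used by the shared-aligned user-mapped variant. */
static const char *demo_shared_aligned_section(void)
{
	return DEMO_SECTION(USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION);
}

/* Section name used by the page-aligned user-mapped variant. */
static const char *demo_page_aligned_section(void)
{
	return DEMO_SECTION(USER_MAPPED_SECTION"..page_aligned");
}

/* Hypothetical helper: is this section part of the user-mapped group? */
static int demo_is_user_mapped(const char *section)
{
	static const char prefix[] = PER_CPU_BASE_SECTION USER_MAPPED_SECTION;

	assert(section != NULL);
	return strncmp(section, prefix, strlen(prefix)) == 0;
}
```

The three resulting names match the input sections that the linker
script change in this patch places between
__per_cpu_user_mapped_start and __per_cpu_user_mapped_end.  With
CONFIG_KAISER=n, USER_MAPPED_SECTION is "" and the names collapse to
the ordinary per-cpu sections.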

Thanks to Hugh Dickins for cleanups to this code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/include/asm-generic/vmlinux.lds.h |    7 +++++++
 b/include/linux/percpu-defs.h       |   30 ++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff -puN include/asm-generic/vmlinux.lds.h~kaiser-prep-user-mapped-percpu include/asm-generic/vmlinux.lds.h
--- a/include/asm-generic/vmlinux.lds.h~kaiser-prep-user-mapped-percpu	2017-11-10 11:22:07.802244953 -0800
+++ b/include/asm-generic/vmlinux.lds.h	2017-11-10 11:22:07.807244953 -0800
@@ -807,7 +807,14 @@
  */
 #define PERCPU_INPUT(cacheline)						\
 	VMLINUX_SYMBOL(__per_cpu_start) = .;				\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .;		\
 	*(.data..percpu..first)						\
+	. = ALIGN(cacheline);						\
+	*(.data..percpu..user_mapped)					\
+	*(.data..percpu..user_mapped..shared_aligned)			\
+	. = ALIGN(PAGE_SIZE);						\
+	*(.data..percpu..user_mapped..page_aligned)			\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .;			\
 	. = ALIGN(PAGE_SIZE);						\
 	*(.data..percpu..page_aligned)					\
 	. = ALIGN(cacheline);						\
diff -puN include/linux/percpu-defs.h~kaiser-prep-user-mapped-percpu include/linux/percpu-defs.h
--- a/include/linux/percpu-defs.h~kaiser-prep-user-mapped-percpu	2017-11-10 11:22:07.804244953 -0800
+++ b/include/linux/percpu-defs.h	2017-11-10 11:22:07.807244953 -0800
@@ -35,6 +35,12 @@
 
 #endif
 
+#ifdef CONFIG_KAISER
+#define USER_MAPPED_SECTION "..user_mapped"
+#else
+#define USER_MAPPED_SECTION ""
+#endif
+
 /*
  * Base implementations of per-CPU variable declarations and definitions, where
  * the section in which the variable is to be placed is provided by the
@@ -115,6 +121,12 @@
 #define DEFINE_PER_CPU(type, name)					\
 	DEFINE_PER_CPU_SECTION(type, name, "")
 
+#define DECLARE_PER_CPU_USER_MAPPED(type, name)				\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
+#define DEFINE_PER_CPU_USER_MAPPED(type, name)				\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
 /*
  * Declaration/definition used for per-CPU variables that must come first in
  * the set of variables.
@@ -144,6 +156,14 @@
 	DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \
 	____cacheline_aligned_in_smp
 
+#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
 #define DECLARE_PER_CPU_ALIGNED(type, name)				\
 	DECLARE_PER_CPU_SECTION(type, name, PER_CPU_ALIGNED_SECTION)	\
 	____cacheline_aligned
@@ -162,6 +182,16 @@
 #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name)				\
 	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
 	__aligned(PAGE_SIZE)
+/*
+ * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode.
+ */
+#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
+
+#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
 
 /*
  * Declaration/definition used for per-CPU variables that must be read mostly.
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 07/30] x86, kaiser: mark per-cpu data structures required for entry/exit
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

Here's a short summary of the things mapped to userspace:
 * gdt_page holds the GDT, whose virtual address is loaded into the
   CPU with the LGDT instruction.  The CPU needs it to define the
   segments and cannot run without it.
 * cpu_tss tells the CPU, among other things, where the new stacks are
   after user<->kernel transitions.  Needed by the CPU to make ring
   transitions.
 * exception_stacks are needed at interrupt and exception entry.
   Among other things, they provide temporary space for saving a
   register so that it can be clobbered to load the kernel CR3.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/desc.h      |    2 +-
 b/arch/x86/include/asm/processor.h |    2 +-
 b/arch/x86/kernel/cpu/common.c     |    4 ++--
 b/arch/x86/kernel/process.c        |    2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff -puN arch/x86/include/asm/desc.h~kaiser-prep-x86-percpu-user-mapped arch/x86/include/asm/desc.h
--- a/arch/x86/include/asm/desc.h~kaiser-prep-x86-percpu-user-mapped	2017-11-10 11:22:08.376244951 -0800
+++ b/arch/x86/include/asm/desc.h	2017-11-10 11:22:08.385244951 -0800
@@ -45,7 +45,7 @@ struct gdt_page {
 	struct desc_struct gdt[GDT_ENTRIES];
 } __attribute__((aligned(PAGE_SIZE)));
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page);
+DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);
 
 /* Provide the original GDT */
 static inline struct desc_struct *get_cpu_gdt_rw(unsigned int cpu)
diff -puN arch/x86/include/asm/processor.h~kaiser-prep-x86-percpu-user-mapped arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~kaiser-prep-x86-percpu-user-mapped	2017-11-10 11:22:08.378244951 -0800
+++ b/arch/x86/include/asm/processor.h	2017-11-10 11:22:08.386244951 -0800
@@ -346,7 +346,7 @@ struct tss_struct {
 	unsigned long		SYSENTER_stack[64];
 } ____cacheline_aligned;
 
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff -puN arch/x86/kernel/cpu/common.c~kaiser-prep-x86-percpu-user-mapped arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~kaiser-prep-x86-percpu-user-mapped	2017-11-10 11:22:08.380244951 -0800
+++ b/arch/x86/kernel/cpu/common.c	2017-11-10 11:22:08.386244951 -0800
@@ -98,7 +98,7 @@ static const struct cpu_dev default_cpu
 
 static const struct cpu_dev *this_cpu = &default_cpu;
 
-DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page) = { .gdt = {
 #ifdef CONFIG_X86_64
 	/*
 	 * We need valid kernel segments for data and code in long mode too
@@ -1343,7 +1343,7 @@ static const unsigned int exception_stac
 	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
 };
 
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
 
 /* May not be marked __init: used by software suspend */
diff -puN arch/x86/kernel/process.c~kaiser-prep-x86-percpu-user-mapped arch/x86/kernel/process.c
--- a/arch/x86/kernel/process.c~kaiser-prep-x86-percpu-user-mapped	2017-11-10 11:22:08.382244951 -0800
+++ b/arch/x86/kernel/process.c	2017-11-10 11:22:08.387244951 -0800
@@ -46,7 +46,7 @@
  * section. Since TSS's are completely CPU-local, we want them
  * on exact cacheline boundaries, to eliminate cacheline ping-pong.
  */
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss) = {
 	.x86_tss = {
 		/*
 		 * .sp0 is only used when entering ring 0 from a lower
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 08/30] x86, kaiser: unmap kernel from userspace page tables (core patch)
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, richard.fellner, moritz.lipp,
	daniel.gruss, michael.schwarz, luto, torvalds, keescook, hughd,
	x86


From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology: https://github.com/IAIK/KAISER .  This series would not have
been possible without their work as a starting point.

KAISER is a countermeasure against side channel attacks against kernel
virtual memory.  It leaves the existing page tables largely alone and
refers to them as the "kernel page tables".  It adds a "shadow" pgd for
every process which is intended for use when running userspace.  The
shadow pgd maps all the same user memory as the "kernel" copy, but
only maps a minimal set of kernel memory.
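
To make the user/kernel split concrete, here is a minimal userspace
sketch modeled on the is_userspace_pgd() helper in the pgtable_64.h
hunk of this patch.  The demo_* names, the 4k page size constant, and
the static array standing in for a pgd page are assumptions of the
sketch, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE		4096UL	/* x86-64 base page size */
#define DEMO_PTRS_PER_PGD	512	/* 4096 / sizeof(uint64_t) */

/* Stand-in for one page-aligned top-level page table. */
static _Alignas(4096) uint64_t demo_pgd[DEMO_PTRS_PER_PGD];

/*
 * True when the entry pointer lies in the lower half of its page,
 * i.e. entry index 0..255, the half that covers user addresses.
 */
static bool demo_is_userspace_pgd(const void *entry)
{
	uintptr_t ptr = (uintptr_t)entry;

	return (ptr % DEMO_PAGE_SIZE) < (DEMO_PAGE_SIZE / 2);
}

/* Classify entry 'index' of a page-aligned pgd. */
static bool demo_index_is_user(const uint64_t *pgd, unsigned int index)
{
	assert(((uintptr_t)pgd % DEMO_PAGE_SIZE) == 0);
	assert(index < DEMO_PTRS_PER_PGD);
	return demo_is_userspace_pgd(&pgd[index]);
}
```

Entries for which this returns true are the ones that get copied
into both the kernel and the shadow copy.  Entries in the upper half
stay in the kernel copy only.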

Whenever entering the kernel (syscalls, interrupts, exceptions), the
pgd is switched to the "kernel" copy.  When switching back to user
mode, the shadow pgd is used.
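
The relationship between the two copies can be sketched in userspace
as follows.  This patch places the kernel pgd in the first 4k and the
shadow pgd in the second 4k of an 8k-aligned, 8k-sized block, so
moving between them only toggles bit 12 (PAGE_SHIFT) of the pointer.
The demo_* names and the use of aligned_alloc() as a stand-in for an
order-1 page allocation are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	(1UL << DEMO_PAGE_SHIFT)

/* Set one bit in a pointer value. */
static void *demo_ptr_set_bit(void *ptr, int bit)
{
	assert(bit >= 0 && bit < 64);
	return (void *)((uintptr_t)ptr | ((uintptr_t)1 << bit));
}

/* Clear one bit in a pointer value. */
static void *demo_ptr_clear_bit(void *ptr, int bit)
{
	assert(bit >= 0 && bit < 64);
	return (void *)((uintptr_t)ptr & ~((uintptr_t)1 << bit));
}

/* Kernel copy -> shadow (user) copy: second 4k of the pair. */
static uint64_t *demo_shadow_pgd(uint64_t *pgd)
{
	return demo_ptr_set_bit(pgd, DEMO_PAGE_SHIFT);
}

/* Shadow (user) copy -> kernel copy: first 4k of the pair. */
static uint64_t *demo_normal_pgd(uint64_t *pgd)
{
	return demo_ptr_clear_bit(pgd, DEMO_PAGE_SHIFT);
}

/* 8k block aligned to 8k, like an order-1 page.  Caller frees it. */
static uint64_t *demo_alloc_pgd_pair(void)
{
	return aligned_alloc(2 * DEMO_PAGE_SIZE, 2 * DEMO_PAGE_SIZE);
}
```

Because the block is 8k-aligned, bit 12 of the base address is always
clear, so the same arithmetic also maps a pointer to entry N of one
copy onto entry N of the other.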

The minimal kernel mappings in the shadow page tables cover only what
is needed to enter/exit the kernel, such as the entry/exit functions
themselves and the interrupt descriptor table (IDT).

Changes from original KAISER patch:
 * Gobs of coding style cleanups
 * The original patch tried to allocate an order-2 page, then
   8k-align the result.  That's silly since order-2 is already
   guaranteed to be 16k-aligned.  Removed that gunk and just
   allocate an order-1 page.
 * Handle (or at least detect and warn on) allocation failures
 * Use _KERNPG_TABLE, not _PAGE_TABLE when creating mappings for
   the kernel in the shadow (user) page tables.
 * BUG_ON() for !pte_none() case was totally insane: it checked
   the physical address of the 'struct page' against the physical
   address of the page being mapped.
 * Added 5-level page table support
 * Never free kaiser page tables.  We don't have the locking to
   keep them from getting referenced during the freeing process.
 * Use a totally different scheme in the entry code.  The
   original code just fell apart in horrific ways in debug faults,
   NMIs, or when iret faults.  Big thanks to Andy Lutomirski for
   reducing the number of places that needed to be patched.  He
   made the code a ton simpler.
 * Use new entry trampoline instead of mapping process stacks.
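
For the _KERNPG_TABLE item above, a small userspace sketch of the
difference may help.  The bit positions are the architectural x86
page-table flag bits.  The DEMO_* composites are modeled on the
kernel's _PAGE_TABLE and _KERNPG_TABLE as I understand them at the
time of this series, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-table entry flag bits. */
#define DEMO_PAGE_PRESENT	(1ULL << 0)
#define DEMO_PAGE_RW		(1ULL << 1)
#define DEMO_PAGE_USER		(1ULL << 2)
#define DEMO_PAGE_ACCESSED	(1ULL << 5)
#define DEMO_PAGE_DIRTY		(1ULL << 6)

/* Modeled on _PAGE_TABLE: usable from user mode. */
#define DEMO_PAGE_TABLE		(DEMO_PAGE_PRESENT | DEMO_PAGE_RW | \
				 DEMO_PAGE_USER | DEMO_PAGE_ACCESSED | \
				 DEMO_PAGE_DIRTY)

/* Modeled on _KERNPG_TABLE: same bits, minus the user bit. */
#define DEMO_KERNPG_TABLE	(DEMO_PAGE_PRESENT | DEMO_PAGE_RW | \
				 DEMO_PAGE_ACCESSED | DEMO_PAGE_DIRTY)

/* Build a table entry from a page-aligned physical address and flags. */
static uint64_t demo_make_table_entry(uint64_t phys, uint64_t flags)
{
	return (phys & ~0xfffULL) | flags;
}

/* The CPU only allows user-mode access if the user bit is set. */
static bool demo_user_accessible(uint64_t entry)
{
	return (entry & DEMO_PAGE_USER) != 0;
}
```

Since the user bit has to be set at every level of the walk for a
user-mode access to succeed, creating the intermediate entries
without it keeps the kernel mappings in the shadow tables
inaccessible from user mode even though they are present.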

Note: The original KAISER authors signed-off on their patch.  Some of
their code has been broken out into other patches in this series, but
their SoB was only retained here.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/Documentation/x86/kaiser.txt      |  160 +++++++++++++
 b/arch/x86/entry/calling.h          |    1 
 b/arch/x86/entry/entry_64.S         |   15 +
 b/arch/x86/include/asm/kaiser.h     |   57 ++++
 b/arch/x86/include/asm/pgtable.h    |    6 
 b/arch/x86/include/asm/pgtable_64.h |   93 +++++++
 b/arch/x86/kernel/espfix_64.c       |   17 +
 b/arch/x86/kernel/head_64.S         |   14 -
 b/arch/x86/kernel/traps.c           |   46 +++
 b/arch/x86/mm/Makefile              |    1 
 b/arch/x86/mm/kaiser.c              |  423 ++++++++++++++++++++++++++++++++++++
 b/arch/x86/mm/pageattr.c            |    2 
 b/arch/x86/mm/pgtable.c             |   16 +
 b/include/linux/kaiser.h            |   29 ++
 b/init/main.c                       |    3 
 b/kernel/fork.c                     |    1 
 16 files changed, 867 insertions(+), 17 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-base arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-base	2017-11-10 11:22:09.005244950 -0800
+++ b/arch/x86/entry/calling.h	2017-11-10 11:22:09.030244950 -0800
@@ -1,6 +1,7 @@
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
+#include <asm/page_types.h>
 
 /*
 
diff -puN arch/x86/entry/entry_64.S~kaiser-base arch/x86/entry/entry_64.S
--- a/arch/x86/entry/entry_64.S~kaiser-base	2017-11-10 11:22:09.007244950 -0800
+++ b/arch/x86/entry/entry_64.S	2017-11-10 11:22:09.031244950 -0800
@@ -145,6 +145,16 @@ ENTRY(entry_SYSCALL_64)
 
 	swapgs
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
+
+	/*
+	 * We need a good kernel CR3 to be able to map the process
+	 * stack, but we need a scratch register to be able to load
+	 * CR3.  We could create another PER_CPU_VAR(), but %rsp is
+	 * actually clobberable right now.  Just use it.  It will only
+	 * be insane for a couple of instructions.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
@@ -169,8 +179,6 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 
 	/* NB: right here, all regs except r11 are live. */
 
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
-
 	/* Must wait until we have the kernel CR3 to call C functions: */
 	TRACE_IRQS_OFF
 
@@ -1269,6 +1277,7 @@ ENTRY(error_entry)
 	 * gsbase and proceed.  We'll fix up the exception and land in
 	 * .Lgs_change's error handler with kernel gsbase.
 	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	SWAPGS
 	jmp .Lerror_entry_done
 
@@ -1382,6 +1391,7 @@ ENTRY(nmi)
 
 	swapgs
 	cld
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -1410,7 +1420,6 @@ ENTRY(nmi)
 	UNWIND_HINT_REGS
 	ENCODE_FRAME_POINTER
 
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	/*
 	 * At this point we no longer need to worry about stack damage
 	 * due to nesting -- we're on the normal thread stack and we're
diff -puN /dev/null arch/x86/include/asm/kaiser.h
--- /dev/null	2017-11-06 07:51:38.702108459 -0800
+++ b/arch/x86/include/asm/kaiser.h	2017-11-10 11:22:09.031244950 -0800
@@ -0,0 +1,57 @@
+#ifndef _ASM_X86_KAISER_H
+#define _ASM_X86_KAISER_H
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Based on work published here: https://github.com/IAIK/KAISER
+ * Modified by Dave Hansen <dave.hansen@intel.com> to actually work.
+ */
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_KAISER
+/**
+ *  kaiser_add_mapping - map a kernel range into the user page tables
+ *  @addr: the start address of the range
+ *  @size: the size of the range
+ *  @flags: The mapping flags of the pages
+ *
+ *  Use this on all data and code that need to be mapped into both
+ *  copies of the page tables.  This includes the code that switches
+ *  to/from userspace and all of the hardware structures that are
+ *  virtually-addressed and needed in userspace like the interrupt
+ *  table.
+ */
+extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
+			      unsigned long flags);
+
+/**
+ *  kaiser_remove_mapping - remove a kernel mapping from the user page tables
+ *  @start: the start address of the range
+ *  @size: the size of the range
+ */
+extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
+
+/**
+ *  kaiser_init - Initialize the shadow mapping
+ *
+ *  Most parts of the shadow mapping can be mapped upon boot
+ *  time.  Only per-process things like the thread stacks
+ *  or a new LDT have to be mapped at runtime.  These boot-
+ *  time mappings are permanent and never unmapped.
+ */
+extern void kaiser_init(void);
+
+#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_KAISER_H */
diff -puN arch/x86/include/asm/pgtable_64.h~kaiser-base arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h~kaiser-base	2017-11-10 11:22:09.009244950 -0800
+++ b/arch/x86/include/asm/pgtable_64.h	2017-11-10 11:22:09.032244950 -0800
@@ -130,9 +130,88 @@ static inline pud_t native_pudp_get_and_
 #endif
 }
 
+#ifdef CONFIG_KAISER
+/*
+ * All top-level KAISER page tables are order-1 pages (8k-aligned
+ * and 8k in size).  The kernel one is at the beginning 4k and
+ * the user (shadow) one is in the last 4k.  To switch between
+ * them, you just need to flip bit 12 (PAGE_SHIFT) in their addresses.
+ */
+#define KAISER_PGTABLE_SWITCH_BIT	PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+	__ptr |= (1<<bit);
+	return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+	__ptr &= ~(1<<bit);
+	return (void *)__ptr;
+}
+
+static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp)
+{
+	return ptr_set_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline pgd_t *native_get_normal_pgd(pgd_t *pgdp)
+{
+	return ptr_clear_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *native_get_shadow_p4d(p4d_t *p4dp)
+{
+	return ptr_set_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *native_get_normal_p4d(p4d_t *p4dp)
+{
+	return ptr_clear_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_KAISER */
+
+/*
+ * Page table pages are page-aligned.  The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ * This returns true for user pages that need to get copied into
+ * both the user and kernel copies of the page tables, and false
+ * for kernel pages that should only be in the kernel copy.
+ */
+static inline bool is_userspace_pgd(void *__ptr)
+{
+	unsigned long ptr = (unsigned long)__ptr;
+
+	return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2));
+}
+
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
+#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
+	/*
+	 * set_pgd() does not get called when we are running
+	 * CONFIG_X86_5LEVEL=n.  So, just hack around it.  We
+	 * know here that we have a p4d but that it is really at
+	 * the top level of the page tables; it is really just a
+	 * pgd.
+	 */
+	/* Do we need to also populate the shadow p4d? */
+	if (is_userspace_pgd(p4dp))
+		native_get_shadow_p4d(p4dp)->pgd = p4d.pgd;
+	/*
+	 * Even if the entry is *mapping* userspace, ensure
+	 * that userspace can not use it.  This way, if we
+	 * get out to userspace with the wrong CR3 value,
+	 * userspace will crash instead of running.
+	 */
+	if (!p4d.pgd.pgd)
+		p4dp->pgd.pgd = p4d.pgd.pgd | _PAGE_NX;
+#else /* CONFIG_KAISER */
 	*p4dp = p4d;
+#endif
 }
 
 static inline void native_p4d_clear(p4d_t *p4d)
@@ -146,7 +225,21 @@ static inline void native_p4d_clear(p4d_
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
+#ifdef CONFIG_KAISER
+	/* Do we need to also populate the shadow pgd? */
+	if (is_userspace_pgd(pgdp))
+		native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
+	/*
+	 * Even if the entry is mapping userspace, ensure
+	 * that it is unusable for userspace.  This way,
+	 * if we get out to userspace with the wrong CR3
+	 * value, userspace will crash instead of running.
+	 */
+	if (!pgd_none(pgd))
+		pgdp->pgd = pgd.pgd | _PAGE_NX;
+#else /* CONFIG_KAISER */
 	*pgdp = pgd;
+#endif
 }
 
 static inline void native_pgd_clear(pgd_t *pgd)
diff -puN arch/x86/include/asm/pgtable.h~kaiser-base arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~kaiser-base	2017-11-10 11:22:09.011244950 -0800
+++ b/arch/x86/include/asm/pgtable.h	2017-11-10 11:22:09.032244950 -0800
@@ -1105,6 +1105,12 @@ static inline void pmdp_set_wrprotect(st
 static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 {
        memcpy(dst, src, count * sizeof(pgd_t));
+#ifdef CONFIG_KAISER
+	/* Clone the shadow pgd part as well */
+	memcpy(native_get_shadow_pgd(dst),
+	       native_get_shadow_pgd(src),
+	       count * sizeof(pgd_t));
+#endif
 }
 
 #define PTE_SHIFT ilog2(PTRS_PER_PTE)
diff -puN arch/x86/kernel/espfix_64.c~kaiser-base arch/x86/kernel/espfix_64.c
--- a/arch/x86/kernel/espfix_64.c~kaiser-base	2017-11-10 11:22:09.013244950 -0800
+++ b/arch/x86/kernel/espfix_64.c	2017-11-10 11:22:09.032244950 -0800
@@ -41,6 +41,7 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
+#include <asm/kaiser.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -128,6 +129,22 @@ void __init init_espfix_bsp(void)
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
+	/*
+	 * Just copy the top-level PGD that is mapping the espfix
+	 * area to ensure it is mapped into the shadow user page
+	 * tables.
+	 *
+	 * For 5-level paging, we should have already populated
+	 * the espfix pgd when kaiser_init() pre-populated all
+	 * the pgd entries.  The above p4d_alloc() would never do
+	 * anything and the p4d_populate() would be done to a p4d
+	 * already mapped in the userspace pgd.
+	 */
+#ifdef CONFIG_KAISER
+	if (CONFIG_PGTABLE_LEVELS <= 4)
+		set_pgd(native_get_shadow_pgd(pgd),
+			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
+#endif
 
 	/* Randomize the locations */
 	init_espfix_random();
diff -puN arch/x86/kernel/head_64.S~kaiser-base arch/x86/kernel/head_64.S
--- a/arch/x86/kernel/head_64.S~kaiser-base	2017-11-10 11:22:09.015244950 -0800
+++ b/arch/x86/kernel/head_64.S	2017-11-10 11:22:09.033244950 -0800
@@ -339,6 +339,14 @@ GLOBAL(early_recursion_flag)
 	.balign	PAGE_SIZE; \
 GLOBAL(name)
 
+#ifdef CONFIG_KAISER
+#define NEXT_PGD_PAGE(name) \
+	.balign 2 * PAGE_SIZE; \
+GLOBAL(name)
+#else
+#define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#endif
+
 /* Automate the creation of 1 to 1 mapping pmd entries */
 #define PMDS(START, PERM, COUNT)			\
 	i = 0 ;						\
@@ -348,7 +356,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_top_pgt)
+NEXT_PGD_PAGE(early_top_pgt)
 	.fill	511,8,0
 #ifdef CONFIG_X86_5LEVEL
 	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
@@ -362,10 +370,10 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
diff -puN arch/x86/kernel/traps.c~kaiser-base arch/x86/kernel/traps.c
--- a/arch/x86/kernel/traps.c~kaiser-base	2017-11-10 11:22:09.017244950 -0800
+++ b/arch/x86/kernel/traps.c	2017-11-10 11:22:09.033244950 -0800
@@ -329,6 +329,43 @@ __visible void __noreturn handle_stack_o
 }
 #endif
 
+/*
+ * This "fakes" a #GP from userspace upon returning (iret'ing)
+ * from this double fault.
+ */
+void setup_fake_gp_at_iret(struct pt_regs *regs)
+{
+	unsigned long *new_stack_top = (unsigned long *)
+		(this_cpu_read(cpu_tss.x86_tss.ist[0]) - 0x1500);
+
+	/*
+	 * Set up a stack just like the hardware would for a #GP.
+	 *
+	 * This format is an "iret frame", plus the error code
+	 * that the hardware puts on the stack for us for
+	 * exceptions.  (see struct pt_regs).
+	 */
+	new_stack_top[-1] = regs->ss;
+	new_stack_top[-2] = regs->sp;
+	new_stack_top[-3] = regs->flags;
+	new_stack_top[-4] = regs->cs;
+	new_stack_top[-5] = regs->ip;
+	new_stack_top[-6] = 0;	/* faked #GP error code */
+
+	/*
+	 * 'regs' points to the "iret frame" for *this*
+	 * exception, *not* the #GP we are faking.  Here,
+	 * we are telling 'iret' to jump to general_protection
+	 * when returning from this double fault.
+	 */
+	regs->ip = (unsigned long)general_protection;
+	/*
+	 * Make iret move the stack to the "fake #GP" stack
+	 * we created above.
+	 */
+	regs->sp = (unsigned long)&new_stack_top[-6];
+}
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
@@ -354,14 +391,7 @@ dotraplinkage void do_double_fault(struc
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs = task_pt_regs(current);
-
-		/* Fake a #GP(0) from userspace. */
-		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
-		normal_regs->orig_ax = 0;  /* Missing (lost) #GP error code */
-		regs->ip = (unsigned long)general_protection;
-		regs->sp = (unsigned long)&normal_regs->orig_ax;
-
+		setup_fake_gp_at_iret(regs);
 		return;
 	}
 #endif
diff -puN /dev/null arch/x86/mm/kaiser.c
--- /dev/null	2017-11-06 07:51:38.702108459 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:09.034244950 -0800
@@ -0,0 +1,423 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * This code is based in part on work published here:
+ *
+ *	https://github.com/IAIK/KAISER
+ *
+ * The original work was written and signed off for the Linux
+ * kernel by:
+ *
+ *   Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
+ *   Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
+ *   Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
+ *   Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
+ *
+ * Major changes to the original code by: Dave Hansen <dave.hansen@intel.com>
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+
+#include <asm/kaiser.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+/*
+ * At runtime, the only things we map are some things for CPU
+ * hotplug, and stacks for new processes.  No two CPUs will ever
+ * be populating the same addresses, so we only need to ensure
+ * that we protect between two CPUs trying to allocate and
+ * populate the same page table page.
+ *
+ * Only take this lock when doing a set_p[4um]d(), but it is not
+ * needed for doing a set_pte().  We assume that only the *owner*
+ * of a given allocation will be doing this for _their_
+ * allocation.
+ *
+ * This ensures that once a system has been running for a while
+ * and there have been stacks all over and these page tables
+ * are fully populated, there will be no further acquisitions of
+ * this lock.
+ */
+static DEFINE_SPINLOCK(shadow_table_allocation_lock);
+
+/*
+ * This is only for walking kernel addresses.  We use it to help
+ * recreate the "shadow" page tables which are used while we are in
+ * userspace.
+ *
+ * This can be called on any kernel memory addresses and will work
+ * with any page sizes and any types: normal linear map memory,
+ * vmalloc(), even kmap().
+ *
+ * Note: this is only used when mapping new *kernel* entries into
+ * the user/shadow page tables.  It is never used for userspace
+ * addresses.
+ *
+ * Returns -1 on error.
+ */
+static inline unsigned long get_pa_from_kernel_map(unsigned long vaddr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	/* We should only be asked to walk kernel addresses */
+	if (vaddr < PAGE_OFFSET) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	pgd = pgd_offset_k(vaddr);
+	/*
+	 * We made all the kernel PGDs present in kaiser_init().
+	 * We expect them to stay that way.
+	 */
+	if (pgd_none(*pgd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+	/*
+	 * PGDs are either 512GB or 256TB on all x86_64
+	 * configurations.  We don't handle these.
+	 */
+	if (pgd_large(*pgd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	p4d = p4d_offset(pgd, vaddr);
+	if (p4d_none(*p4d)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	pud = pud_offset(p4d, vaddr);
+	if (pud_none(*pud)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pud_large(*pud))
+		return (pud_pfn(*pud) << PAGE_SHIFT) | (vaddr & ~PUD_PAGE_MASK);
+
+	pmd = pmd_offset(pud, vaddr);
+	if (pmd_none(*pmd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pmd_large(*pmd))
+		return (pmd_pfn(*pmd) << PAGE_SHIFT) | (vaddr & ~PMD_PAGE_MASK);
+
+	pte = pte_offset_kernel(pmd, vaddr);
+	if (pte_none(*pte)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	return (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK);
+}
+
+/*
+ * Walk the shadow copy of the page tables (optionally) trying to
+ * allocate page table pages on the way down.  Does not support
+ * large pages since the data we are mapping is (generally) not
+ * large enough or aligned to 2MB.
+ *
+ * Note: this is only used when mapping *new* kernel data into the
+ * user/shadow page tables.  It is never used for userspace data.
+ *
+ * Returns a pointer to a PTE on success, or NULL on failure.
+ */
+#define KAISER_WALK_ATOMIC  0x1
+static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
+					   unsigned long flags)
+{
+	pte_t *pte;
+	pmd_t *pmd;
+	pud_t *pud;
+	p4d_t *p4d;
+	pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(address));
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+
+	if (flags & KAISER_WALK_ATOMIC) {
+		gfp &= ~GFP_KERNEL;
+		gfp |= __GFP_HIGH | __GFP_ATOMIC;
+	}
+
+	if (address < PAGE_OFFSET) {
+		WARN_ONCE(1, "attempt to walk user address\n");
+		return NULL;
+	}
+
+	if (pgd_none(*pgd)) {
+		WARN_ONCE(1, "All shadow pgds should have been populated\n");
+		return NULL;
+	}
+	BUILD_BUG_ON(pgd_large(*pgd) != 0);
+
+	p4d = p4d_offset(pgd, address);
+	BUILD_BUG_ON(p4d_large(*p4d) != 0);
+	if (p4d_none(*p4d)) {
+		unsigned long new_pud_page = __get_free_page(gfp);
+		if (!new_pud_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (p4d_none(*p4d))
+			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
+		else
+			free_page(new_pud_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pud = pud_offset(p4d, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pud_large(*pud)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pud_none(*pud)) {
+		unsigned long new_pmd_page = __get_free_page(gfp);
+		if (!new_pmd_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pud_none(*pud))
+			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
+		else
+			free_page(new_pmd_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pmd = pmd_offset(pud, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pmd_large(*pmd)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pmd_none(*pmd)) {
+		unsigned long new_pte_page = __get_free_page(gfp);
+		if (!new_pte_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pmd_none(*pmd))
+			set_pmd(pmd, __pmd(_KERNPG_TABLE  | __pa(new_pte_page)));
+		else
+			free_page(new_pte_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pte = pte_offset_kernel(pmd, address);
+	if (pte_flags(*pte) & _PAGE_USER) {
+		WARN_ONCE(1, "attempt to walk to user pte\n");
+		return NULL;
+	}
+	return pte;
+}
+
+/*
+ * Given a kernel address, @__start_addr, copy that mapping into
+ * the user (shadow) page tables.  This may need to allocate page
+ * table pages.
+ */
+int kaiser_add_user_map(const void *__start_addr, unsigned long size,
+			unsigned long flags)
+{
+	pte_t *pte;
+	unsigned long start_addr = (unsigned long)__start_addr;
+	unsigned long address = start_addr & PAGE_MASK;
+	unsigned long end_addr = PAGE_ALIGN(start_addr + size);
+	unsigned long target_address;
+
+	for (; address < end_addr; address += PAGE_SIZE) {
+		target_address = get_pa_from_kernel_map(address);
+		if (target_address == -1)
+			return -EIO;
+
+		pte = kaiser_shadow_pagetable_walk(address, 0);
+		/*
+		 * Errors come from either -ENOMEM for a page
+		 * table page, or something screwy that did a
+		 * WARN_ON().  Just return -ENOMEM.
+		 */
+		if (!pte)
+			return -ENOMEM;
+		if (pte_none(*pte)) {
+			set_pte(pte, __pte(flags | target_address));
+		} else {
+			pte_t tmp;
+			set_pte(&tmp, __pte(flags | target_address));
+			WARN_ON_ONCE(!pte_same(*pte, tmp));
+		}
+	}
+	return 0;
+}
+
+int kaiser_add_user_map_ptrs(const void *__start_addr,
+			     const void *__end_addr,
+			     unsigned long flags)
+{
+	return kaiser_add_user_map(__start_addr,
+				   __end_addr - __start_addr,
+				   flags);
+}
+
+/*
+ * Ensure that the top level of the (shadow) page tables are
+ * entirely populated.  This ensures that all processes that get
+ * forked have the same entries.  This way, we do not have to
+ * ever go set up new entries in older processes.
+ *
+ * Note: we never free these, so there are no updates to them
+ * after this.
+ */
+static void __init kaiser_init_all_pgds(void)
+{
+	pgd_t *pgd;
+	int i = 0;
+
+	pgd = native_get_shadow_pgd(pgd_offset_k(0UL));
+	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
+		unsigned long addr = PAGE_OFFSET + i * PGDIR_SIZE;
+#if CONFIG_PGTABLE_LEVELS > 4
+		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
+		if (!p4d) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
+#else /* CONFIG_PGTABLE_LEVELS <= 4 */
+		pud_t *pud = pud_alloc_one(&init_mm, addr);
+		if (!pud) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
+#endif /* CONFIG_PGTABLE_LEVELS */
+	}
+}
+
+/*
+ * The page table allocations in here can theoretically fail, but
+ * we can not do much about it in early boot.  Do the checking
+ * and warning in a macro to make it more readable.
+ */
+#define kaiser_add_user_map_early(start, size, flags) do {	\
+	int __ret = kaiser_add_user_map(start, size, flags);	\
+	WARN_ON(__ret);						\
+} while (0)
+
+#define kaiser_add_user_map_ptrs_early(start, end, flags) do {		\
+	int __ret = kaiser_add_user_map_ptrs(start, end, flags);	\
+	WARN_ON(__ret);							\
+} while (0)
+
+extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
+/*
+ * If anything in here fails, we will likely die on one of the
+ * first kernel->user transitions and init will die.  But, we
+ * will have most of the kernel up by then and should be able to
+ * get a clean warning out of it.  If we BUG_ON() here, we run
+ * the risk of being before we have good console output.
+ *
+ * When KAISER is enabled, we remove _PAGE_GLOBAL from all of the
+ * kernel PTE permissions.  This ensures that the TLB entries for
+ * the kernel are not available when in userspace.  However, for
+ * the pages that are available to userspace *anyway*, we might as
+ * well continue to map them _PAGE_GLOBAL and enjoy the potential
+ * performance advantages.
+ */
+void __init kaiser_init(void)
+{
+	int cpu;
+
+	kaiser_init_all_pgds();
+
+	for_each_possible_cpu(cpu) {
+		void *percpu_vaddr = __per_cpu_user_mapped_start +
+				     per_cpu_offset(cpu);
+		unsigned long percpu_sz = __per_cpu_user_mapped_end -
+					  __per_cpu_user_mapped_start;
+		kaiser_add_user_map_early(percpu_vaddr, percpu_sz,
+					  __PAGE_KERNEL | _PAGE_GLOBAL);
+	}
+
+	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
+				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
+
+	/* the fixed map address of the idt_table */
+	kaiser_add_user_map_early((void *)idt_descr.address,
+				  sizeof(gate_desc) * NR_VECTORS,
+				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
+}
+
+int kaiser_add_mapping(unsigned long addr, unsigned long size,
+		       unsigned long flags)
+{
+	return kaiser_add_user_map((const void *)addr, size, flags);
+}
+
+void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+	unsigned long addr;
+
+	/* The shadow page tables always use small pages: */
+	for (addr = start; addr < start + size; addr += PAGE_SIZE) {
+		/*
+		 * Do an "atomic" walk in case this got called from an atomic
+		 * context.  This should not do any allocations because we
+		 * should only be walking things that are known to be mapped.
+		 */
+		pte_t *pte = kaiser_shadow_pagetable_walk(addr, KAISER_WALK_ATOMIC);
+
+		/*
+		 * We are removing a mapping that should
+		 * exist.  WARN if it was not there:
+		 */
+		if (!pte) {
+			WARN_ON_ONCE(1);
+			continue;
+		}
+
+		pte_clear(&init_mm, addr, pte);
+	}
+	/*
+	 * This ensures that the TLB entries used to map this data are
+	 * no longer usable on *this* CPU.  We theoretically want to
+	 * flush the entries on all CPUs here, but that's too
+	 * expensive right now: this is called to unmap process
+	 * stacks in the exit() path.
+	 *
+	 * This can change if we get to the point where this is not
+	 * in a remotely hot path, like only called via write_ldt().
+	 *
+	 * Note: we could probably also just invalidate the individual
+	 * addresses to take care of *this* PCID and then do a
+	 * tlb_flush_shared_nonglobals() to ensure that all other
+	 * PCIDs get flushed before being used again.
+	 */
+	__native_flush_tlb_global();
+}
diff -puN arch/x86/mm/Makefile~kaiser-base arch/x86/mm/Makefile
--- a/arch/x86/mm/Makefile~kaiser-base	2017-11-10 11:22:09.019244950 -0800
+++ b/arch/x86/mm/Makefile	2017-11-10 11:22:09.034244950 -0800
@@ -45,6 +45,7 @@ obj-$(CONFIG_NUMA_EMU)		+= numa_emulatio
 obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
+obj-$(CONFIG_KAISER)		+= kaiser.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff -puN arch/x86/mm/pageattr.c~kaiser-base arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c~kaiser-base	2017-11-10 11:22:09.020244950 -0800
+++ b/arch/x86/mm/pageattr.c	2017-11-10 11:22:09.035244950 -0800
@@ -859,7 +859,7 @@ static void unmap_pmd_range(pud_t *pud,
 			pud_clear(pud);
 }
 
-static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
+void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
 {
 	pud_t *pud = pud_offset(p4d, start);
 
diff -puN arch/x86/mm/pgtable.c~kaiser-base arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c~kaiser-base	2017-11-10 11:22:09.022244950 -0800
+++ b/arch/x86/mm/pgtable.c	2017-11-10 11:22:09.035244950 -0800
@@ -354,14 +354,26 @@ static inline void _pgd_free(pgd_t *pgd)
 		kmem_cache_free(pgd_cache, pgd);
 }
 #else
+
+#ifdef CONFIG_KAISER
+/*
+ * Instead of one pgd, we acquire two pgds.  Being order-1, it is
+ * both 8k in size and 8k-aligned.  That lets us just flip bit 12
+ * in a pointer to swap between the two 4k halves.
+ */
+#define PGD_ALLOCATION_ORDER 1
+#else
+#define PGD_ALLOCATION_ORDER 0
+#endif
+
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_page(PGALLOC_GFP);
+	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
 }
 
 static inline void _pgd_free(pgd_t *pgd)
 {
-	free_page((unsigned long)pgd);
+	free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
 }
 #endif /* CONFIG_X86_PAE */
 
diff -puN /dev/null Documentation/x86/kaiser.txt
--- /dev/null	2017-11-06 07:51:38.702108459 -0800
+++ b/Documentation/x86/kaiser.txt	2017-11-10 11:22:09.035244950 -0800
@@ -0,0 +1,160 @@
+Overview
+========
+
+KAISER is a countermeasure against attacks on kernel address
+information.  There are at least three existing, published,
+approaches using the shared user/kernel mapping and hardware features
+to defeat KASLR.  One approach referenced in the paper locates the
+kernel by observing differences in page fault timing between
+present-but-inaccessible kernel pages and non-present pages.
+
+When we enter the kernel via syscalls, interrupts or exceptions,
+page tables are switched to the full "kernel" copy.  When the
+system switches back to user mode, the user/shadow copy is used.
+
+The minimalistic kernel portion of the user page tables tries to
+map only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor
+table (IDT).
+
+This helps ensure that side-channel attacks that leverage the
+paging structures do not function when KAISER is enabled.  It
+can be enabled by setting CONFIG_KAISER=y.
+
+Page Table Management
+=====================
+
+KAISER logically keeps a "copy" of the page tables which unmap
+the kernel while in userspace.  The kernel manages the page
+tables as normal, but the "copying" is done with a few tricks
+that mean that we do not have to manage two full copies.
+
+The first trick is that for any new kernel mapping, we
+presume that we do not want it mapped to userspace.  That means
+we normally have no copying to do.  We only copy the kernel
+entries over to the shadow in response to a kaiser_add_*()
+call, which is rare.
+
+For a new userspace mapping, the kernel makes the entries in
+its page tables like normal.  The only difference is when the
+kernel makes entries in the top (PGD) level.  In addition to
+setting the entry in the main kernel PGD, a copy of the entry
+is made in the shadow PGD.
+
+PGD entries always point to another page table.  Two PGD
+entries pointing to the same thing gives us shared page tables
+for all the lower entries.  This leaves a single, shared set of
+userspace page tables to manage.  One PTE to lock, one set
+of accessed bits, dirty bits, etc.
+
+Overhead
+========
+
+Protection against side-channel attacks is important.  But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes an additional 4k per process).
+  b. The pre-allocated second-level (p4d or pud) kernel page
+     table pages cost ~1MB of additional memory at boot.  This
+     is not totally wasted because some of these pages would
+     have been needed eventually for normal kernel page tables
+     and things in the vmalloc() area like vmemmap[].
+  c. Statically-allocated structures and entry/exit text must
+     be padded out to 4k (or 8k for PGDs) so they can be mapped
+     into the user page tables.  This bloats the kernel image
+     by ~20-30k.
+  d. The shadow page tables eventually grow to map all of used
+     vmalloc() space.  They can have roughly the same memory
+     consumption as the vmalloc() page tables.
+
+2. Runtime Cost
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though.)  Moves to CR3 are on the order of a hundred
+     cycles, and we need one at entry and another at exit.
+  b. Task stacks must be mapped/unmapped.  We need to walk
+     and modify the shadow page tables at fork() and exit().
+  c. Global pages are disabled.  This feature of the MMU
+     allows different processes to share TLB entries mapping
+     the kernel.  Losing the feature means potentially more
+     TLB misses after a context switch.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when we switch
+     the page tables.  This makes switching the page tables
+     (at context switch, or kernel entry/exit) cheaper.  But,
+     on systems with PCID support, the context switch code
+     must flush both the user and kernel entries out of the
+     TLB, with an INVPCID in addition to the CR3 write.  This
+     INVPCID is generally slower than a CR3 write, but still
+     on the order of a hundred cycles.
+  e. The shadow page tables must be populated for each new
+     process.  Even without KAISER, since we share all of the
+     kernel mappings in all processes, we can do all this
+     population for kernel addresses at the top level of the
+     page tables (the PGD level).  But, with KAISER, we now
+     have *two* kernel mappings: one in the kernel page tables
+     that maps everything and one in the user/shadow page
+     tables mapping the "minimal" kernel.  At fork(), we
+     copy the portion of the shadow PGD that maps the minimal
+     kernel structures in addition to the normal kernel one.
+  f. In addition to the fork()-time copying, we must also
+     update the shadow PGD any time a set_pgd() is done on a
+     PGD used to map userspace.  This ensures that the kernel
+     and user/shadow copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB.  That means that each syscall, interrupt
+     or exception flushes the TLB.
+
+Possible Future Work:
+1. We can be more careful about not actually writing to CR3
+   unless we actually switch it.
+2. Try to have dedicated entry/exit kernel stacks so we do
+   not have to map/unmap the task/thread stacks.
+3. Compress the user/shadow-mapped data to be mapped together
+   underneath a single PGD entry.
+4. Re-enable global pages, but use them for mappings in the
+   user/shadow page tables.  This would allow the kernel to
+   take advantage of TLB entries that were established from
+   the user page tables.  This might speed up the entry/exit
+   code or userspace since it will not have to reload all of
+   its TLB entries.  However, its upside is limited by PCID
+   being used.
+5. Allow KAISER to be enabled/disabled at runtime so folks can
+   run a single kernel image.
+
+Debugging:
+
+Bugs in KAISER cause a few different signatures of crashes
+that are worth noting here.
+
+ * Crashes in early boot, especially around CPU bringup.  Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
+   like screwing up a page table switch.  Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI.  The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts.  Also caused by incorrectly mapping NMI
+   code.  NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace.  entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults.  Caused by touching non-kaiser-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not kaiser-mapped.
+ * Failures of the selftests/x86 code.  Usually a bug in one of the
+   more obscure corners of entry_64.S.
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs.  These have
+   tended to be TLB invalidation issues.  Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
diff -puN /dev/null include/linux/kaiser.h
--- /dev/null	2017-11-06 07:51:38.702108459 -0800
+++ b/include/linux/kaiser.h	2017-11-10 11:22:09.036244950 -0800
@@ -0,0 +1,29 @@
+#ifndef _INCLUDE_KAISER_H
+#define _INCLUDE_KAISER_H
+
+#ifdef CONFIG_KAISER
+#include <asm/kaiser.h>
+#else
+
+/*
+ * These stubs are used whenever CONFIG_KAISER is off, which
+ * includes architectures that support KAISER, but have it
+ * disabled.
+ */
+
+static inline void kaiser_init(void)
+{
+}
+
+static inline void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+}
+
+static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
+				     unsigned long flags)
+{
+	return 0;
+}
+
+#endif /* !CONFIG_KAISER */
+#endif /* _INCLUDE_KAISER_H */
diff -puN init/main.c~kaiser-base init/main.c
--- a/init/main.c~kaiser-base	2017-11-10 11:22:09.025244950 -0800
+++ b/init/main.c	2017-11-10 11:22:09.036244950 -0800
@@ -75,6 +75,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
+#include <linux/kaiser.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
 #include <linux/sched_clock.h>
@@ -504,6 +505,8 @@ static void __init mm_init(void)
 	pgtable_init();
 	vmalloc_init();
 	ioremap_huge_init();
+	/* This just needs to be done before we first run userspace: */
+	kaiser_init();
 }
 
 asmlinkage __visible void __init start_kernel(void)
diff -puN kernel/fork.c~kaiser-base kernel/fork.c
--- a/kernel/fork.c~kaiser-base	2017-11-10 11:22:09.027244950 -0800
+++ b/kernel/fork.c	2017-11-10 11:22:09.037244950 -0800
@@ -70,6 +70,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/freezer.h>
+#include <linux/kaiser.h>
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
_

* [PATCH 08/30] x86, kaiser: unmap kernel from userspace page tables (core patch)
@ 2017-11-10 19:31   ` Dave Hansen
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, richard.fellner, moritz.lipp,
	daniel.gruss, michael.schwarz, luto, torvalds, keescook, hughd,
	x86


From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology: https://github.com/IAIK/KAISER .  This work would not have
been possible without their work as a starting point.

KAISER is a countermeasure against side channel attacks against kernel
virtual memory.  It leaves the existing page tables largely alone and
refers to them as the "kernel" page tables.  It adds a "shadow" pgd for
every process which is intended for use when running userspace.  The
shadow pgd maps all the same user memory as the "kernel" copy, but
only maps a minimal set of kernel memory.

Whenever entering the kernel (syscalls, interrupts, exceptions), the
pgd is switched to the "kernel" copy.  When switching back to user
mode, the shadow pgd is used.

The minimal kernel portion of the shadow page tables maps only what is
needed to enter/exit the kernel, such as the entry/exit functions
themselves and the interrupt descriptor table (IDT).

Changes from original KAISER patch:
 * Gobs of coding style cleanups
 * The original patch tried to allocate an order-2 page, then
   8k-align the result.  That's silly since order-2 is already
   guaranteed to be 16k-aligned.  Removed that gunk and just
   allocate an order-1 page.
 * Handle (or at least detect and warn on) allocation failures
 * Use _KERNPG_TABLE, not _PAGE_TABLE when creating mappings for
   the kernel in the shadow (user) page tables.
 * BUG_ON() for !pte_none() case was totally insane: it checked
   the physical address of the 'struct page' against the physical
   address of the page being mapped.
 * Added 5-level page table support
 * Never free kaiser page tables.  We don't have the locking to
   keep them from getting referenced during the freeing process.
 * Use a totally different scheme in the entry code.  The
   original code just fell apart in horrific ways in debug faults,
   NMIs, or when iret faults.  Big thanks to Andy Lutomirski for
   reducing the number of places that needed to be patched.  He
   made the code a ton simpler.
 * Use new entry trampoline instead of mapping process stacks.
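
The order-1 allocation item above relies on simple pointer arithmetic:
because the pair of pgds is 8k in size and 8k-aligned, bit 12 of any
pointer into it selects the kernel or the shadow half.  A minimal
userspace sketch of that idea follows.  It is an illustration only, not
code from this series: the sketch_* names and SKETCH_PAGE_* constants
are hypothetical, and aligned_alloc() merely stands in for the kernel's
order-1 page allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for PAGE_SHIFT / PAGE_SIZE on x86-64. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uintptr_t)1 << SKETCH_PAGE_SHIFT)

/* Setting bit 12 selects the second (shadow/user) 4k half. */
static void *sketch_shadow_half(void *p)
{
	return (void *)((uintptr_t)p | SKETCH_PAGE_SIZE);
}

/* Clearing bit 12 selects the first (kernel) 4k half. */
static void *sketch_kernel_half(void *p)
{
	return (void *)((uintptr_t)p & ~SKETCH_PAGE_SIZE);
}

/*
 * Allocate an 8k-aligned, 8k-sized buffer (the size and alignment an
 * order-1 page allocation provides) and check the pointer arithmetic.
 * Returns 0 on success, -1 if the allocation failed.
 */
static int sketch_check_halves(void)
{
	char *pair = aligned_alloc(2 * SKETCH_PAGE_SIZE, 2 * SKETCH_PAGE_SIZE);

	if (!pair)
		return -1;

	/* Bit 12 of an 8k-aligned address is always clear. */
	assert(((uintptr_t)pair & SKETCH_PAGE_SIZE) == 0);
	assert(sketch_shadow_half(pair) == (void *)(pair + SKETCH_PAGE_SIZE));
	assert(sketch_kernel_half(pair + SKETCH_PAGE_SIZE) == (void *)pair);
	/* An entry pointer keeps its offset within the 4k page. */
	assert(sketch_shadow_half(pair + 8) ==
	       (void *)(pair + SKETCH_PAGE_SIZE + 8));

	free(pair);
	return 0;
}
```

With only 4k alignment, bit 12 of the base address could already be
set, and the "shadow" half would then not be a distinct page.  That is
why the pair needs 8k alignment, and an order-1 allocation already
provides it.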

Note: The original KAISER authors signed-off on their patch.  Some of
their code has been broken out into other patches in this series, but
their SoB was only retained here.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/Documentation/x86/kaiser.txt      |  160 +++++++++++++
 b/arch/x86/entry/calling.h          |    1 
 b/arch/x86/entry/entry_64.S         |   15 +
 b/arch/x86/include/asm/kaiser.h     |   57 ++++
 b/arch/x86/include/asm/pgtable.h    |    6 
 b/arch/x86/include/asm/pgtable_64.h |   93 +++++++
 b/arch/x86/kernel/espfix_64.c       |   17 +
 b/arch/x86/kernel/head_64.S         |   14 -
 b/arch/x86/kernel/traps.c           |   46 +++
 b/arch/x86/mm/Makefile              |    1 
 b/arch/x86/mm/kaiser.c              |  423 ++++++++++++++++++++++++++++++++++++
 b/arch/x86/mm/pageattr.c            |    2 
 b/arch/x86/mm/pgtable.c             |   16 +
 b/include/linux/kaiser.h            |   29 ++
 b/init/main.c                       |    3 
 b/kernel/fork.c                     |    1 
 16 files changed, 867 insertions(+), 17 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-base arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-base	2017-11-10 11:22:09.005244950 -0800
+++ b/arch/x86/entry/calling.h	2017-11-10 11:22:09.030244950 -0800
@@ -1,6 +1,7 @@
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
+#include <asm/page_types.h>
 
 /*
 
diff -puN arch/x86/entry/entry_64.S~kaiser-base arch/x86/entry/entry_64.S
--- a/arch/x86/entry/entry_64.S~kaiser-base	2017-11-10 11:22:09.007244950 -0800
+++ b/arch/x86/entry/entry_64.S	2017-11-10 11:22:09.031244950 -0800
@@ -145,6 +145,16 @@ ENTRY(entry_SYSCALL_64)
 
 	swapgs
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
+
+	/*
+	 * We need a good kernel CR3 to be able to map the process
+	 * stack, but we need a scratch register to be able to load
+	 * CR3.  We could create another PER_CPU_VAR(), but %rsp is
+	 * actually clobberable right now.  Just use it.  It will only
+	 * be insane for a couple of instructions.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
@@ -169,8 +179,6 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 
 	/* NB: right here, all regs except r11 are live. */
 
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
-
 	/* Must wait until we have the kernel CR3 to call C functions: */
 	TRACE_IRQS_OFF
 
@@ -1269,6 +1277,7 @@ ENTRY(error_entry)
 	 * gsbase and proceed.  We'll fix up the exception and land in
 	 * .Lgs_change's error handler with kernel gsbase.
 	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	SWAPGS
 	jmp .Lerror_entry_done
 
@@ -1382,6 +1391,7 @@ ENTRY(nmi)
 
 	swapgs
 	cld
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -1410,7 +1420,6 @@ ENTRY(nmi)
 	UNWIND_HINT_REGS
 	ENCODE_FRAME_POINTER
 
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	/*
 	 * At this point we no longer need to worry about stack damage
 	 * due to nesting -- we're on the normal thread stack and we're
diff -puN /dev/null arch/x86/include/asm/kaiser.h
--- /dev/null	2017-11-06 07:51:38.702108459 -0800
+++ b/arch/x86/include/asm/kaiser.h	2017-11-10 11:22:09.031244950 -0800
@@ -0,0 +1,57 @@
+#ifndef _ASM_X86_KAISER_H
+#define _ASM_X86_KAISER_H
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Based on work published here: https://github.com/IAIK/KAISER
+ * Modified by Dave Hansen <dave.hansen@intel.com> to actually work.
+ */
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_KAISER
+/**
+ *  kaiser_add_mapping - map a kernel range into the user page tables
+ *  @addr: the start address of the range
+ *  @size: the size of the range
+ *  @flags: The mapping flags of the pages
+ *
+ *  Use this on all data and code that need to be mapped into both
+ *  copies of the page tables.  This includes the code that switches
+ *  to/from userspace and all of the hardware structures that are
+ *  virtually-addressed and needed in userspace like the interrupt
+ *  table.
+ */
+extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
+			      unsigned long flags);
+
+/**
+ *  kaiser_remove_mapping - remove a kernel mapping from the user page tables
+ *  @start: the start address of the range
+ *  @size: the size of the range
+ */
+extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
+
+/**
+ *  kaiser_init - Initialize the shadow mapping
+ *
+ *  Most parts of the shadow mapping can be mapped upon boot
+ *  time.  Only per-process things like the thread stacks
+ *  or a new LDT have to be mapped at runtime.  These boot-
+ *  time mappings are permanent and never unmapped.
+ */
+extern void kaiser_init(void);
+
+#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_KAISER_H */
diff -puN arch/x86/include/asm/pgtable_64.h~kaiser-base arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h~kaiser-base	2017-11-10 11:22:09.009244950 -0800
+++ b/arch/x86/include/asm/pgtable_64.h	2017-11-10 11:22:09.032244950 -0800
@@ -130,9 +130,88 @@ static inline pud_t native_pudp_get_and_
 #endif
 }
 
+#ifdef CONFIG_KAISER
+/*
+ * All top-level KAISER page tables are order-1 pages (8k-aligned
+ * and 8k in size).  The kernel one is at the beginning 4k and
+ * the user (shadow) one is in the last 4k.  To switch between
+ * them, you just need to flip bit 12 in their addresses.
+ */
+#define KAISER_PGTABLE_SWITCH_BIT	PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+	__ptr |= (1<<bit);
+	return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+	__ptr &= ~(1<<bit);
+	return (void *)__ptr;
+}
+
+static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp)
+{
+	return ptr_set_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline pgd_t *native_get_normal_pgd(pgd_t *pgdp)
+{
+	return ptr_clear_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *native_get_shadow_p4d(p4d_t *p4dp)
+{
+	return ptr_set_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *native_get_normal_p4d(p4d_t *p4dp)
+{
+	return ptr_clear_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_KAISER */
+
+/*
+ * Page table pages are page-aligned.  The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ * This returns true for user pages that need to get copied into
+ * both the user and kernel copies of the page tables, and false
+ * for kernel pages that should only be in the kernel copy.
+ */
+static inline bool is_userspace_pgd(void *__ptr)
+{
+	unsigned long ptr = (unsigned long)__ptr;
+
+	return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2));
+}
+
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
+#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
+	/*
+	 * set_pgd() does not get called when we are running
+	 * CONFIG_X86_5LEVEL=n.  So, just hack around it.  We
+	 * know here that we have a p4d but that it is really at
+	 * the top level of the page tables; it is really just a
+	 * pgd.
+	 */
+	/* Do we need to also populate the shadow p4d? */
+	if (is_userspace_pgd(p4dp))
+		native_get_shadow_p4d(p4dp)->pgd = p4d.pgd;
+	/*
+	 * Even if the entry is *mapping* userspace, ensure
+	 * that userspace can not use it.  This way, if we
+	 * get out to userspace with the wrong CR3 value,
+	 * userspace will crash instead of running.
+	 */
+	if (!p4d.pgd.pgd)
+		p4dp->pgd.pgd = p4d.pgd.pgd | _PAGE_NX;
+#else /* CONFIG_KAISER */
 	*p4dp = p4d;
+#endif
 }
 
 static inline void native_p4d_clear(p4d_t *p4d)
@@ -146,7 +225,21 @@ static inline void native_p4d_clear(p4d_
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
+#ifdef CONFIG_KAISER
+	/* Do we need to also populate the shadow pgd? */
+	if (is_userspace_pgd(pgdp))
+		native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
+	/*
+	 * Even if the entry is mapping userspace, ensure
+	 * that it is unusable for userspace.  This way,
+	 * if we get out to userspace with the wrong CR3
+	 * value, userspace will crash instead of running.
+	 */
+	if (!pgd_none(pgd))
+		pgdp->pgd = pgd.pgd | _PAGE_NX;
+#else /* CONFIG_KAISER */
 	*pgdp = pgd;
+#endif
 }
 
 static inline void native_pgd_clear(pgd_t *pgd)
diff -puN arch/x86/include/asm/pgtable.h~kaiser-base arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~kaiser-base	2017-11-10 11:22:09.011244950 -0800
+++ b/arch/x86/include/asm/pgtable.h	2017-11-10 11:22:09.032244950 -0800
@@ -1105,6 +1105,12 @@ static inline void pmdp_set_wrprotect(st
 static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 {
        memcpy(dst, src, count * sizeof(pgd_t));
+#ifdef CONFIG_KAISER
+	/* Clone the shadow pgd part as well */
+	memcpy(native_get_shadow_pgd(dst),
+	       native_get_shadow_pgd(src),
+	       count * sizeof(pgd_t));
+#endif
 }
 
 #define PTE_SHIFT ilog2(PTRS_PER_PTE)
diff -puN arch/x86/kernel/espfix_64.c~kaiser-base arch/x86/kernel/espfix_64.c
--- a/arch/x86/kernel/espfix_64.c~kaiser-base	2017-11-10 11:22:09.013244950 -0800
+++ b/arch/x86/kernel/espfix_64.c	2017-11-10 11:22:09.032244950 -0800
@@ -41,6 +41,7 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
+#include <asm/kaiser.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -128,6 +129,22 @@ void __init init_espfix_bsp(void)
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
+	/*
+	 * Just copy the top-level PGD that is mapping the espfix
+	 * area to ensure it is mapped into the shadow user page
+	 * tables.
+	 *
+	 * For 5-level paging, we should have already populated
+	 * the espfix pgd when kaiser_init() pre-populated all
+	 * the pgd entries.  The above p4d_alloc() would never do
+	 * anything and the p4d_populate() would be done to a p4d
+	 * already mapped in the userspace pgd.
+	 */
+#ifdef CONFIG_KAISER
+	if (CONFIG_PGTABLE_LEVELS <= 4)
+		set_pgd(native_get_shadow_pgd(pgd),
+			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
+#endif
 
 	/* Randomize the locations */
 	init_espfix_random();
diff -puN arch/x86/kernel/head_64.S~kaiser-base arch/x86/kernel/head_64.S
--- a/arch/x86/kernel/head_64.S~kaiser-base	2017-11-10 11:22:09.015244950 -0800
+++ b/arch/x86/kernel/head_64.S	2017-11-10 11:22:09.033244950 -0800
@@ -339,6 +339,14 @@ GLOBAL(early_recursion_flag)
 	.balign	PAGE_SIZE; \
 GLOBAL(name)
 
+#ifdef CONFIG_KAISER
+#define NEXT_PGD_PAGE(name) \
+	.balign 2 * PAGE_SIZE; \
+GLOBAL(name)
+#else
+#define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#endif
+
 /* Automate the creation of 1 to 1 mapping pmd entries */
 #define PMDS(START, PERM, COUNT)			\
 	i = 0 ;						\
@@ -348,7 +356,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_top_pgt)
+NEXT_PGD_PAGE(early_top_pgt)
 	.fill	511,8,0
 #ifdef CONFIG_X86_5LEVEL
 	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
@@ -362,10 +370,10 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
diff -puN arch/x86/kernel/traps.c~kaiser-base arch/x86/kernel/traps.c
--- a/arch/x86/kernel/traps.c~kaiser-base	2017-11-10 11:22:09.017244950 -0800
+++ b/arch/x86/kernel/traps.c	2017-11-10 11:22:09.033244950 -0800
@@ -329,6 +329,43 @@ __visible void __noreturn handle_stack_o
 }
 #endif
 
+/*
+ * This "fakes" a #GP from userspace upon returning (iret'ing)
+ * from this double fault.
+ */
+void setup_fake_gp_at_iret(struct pt_regs *regs)
+{
+	unsigned long *new_stack_top = (unsigned long *)
+		(this_cpu_read(cpu_tss.x86_tss.ist[0]) - 0x1500);
+
+	/*
+	 * Set up a stack just like the hardware would for a #GP.
+	 *
+	 * This format is an "iret frame", plus the error code
+	 * that the hardware puts on the stack for us for
+	 * exceptions.  (see struct pt_regs).
+	 */
+	new_stack_top[-1] = regs->ss;
+	new_stack_top[-2] = regs->sp;
+	new_stack_top[-3] = regs->flags;
+	new_stack_top[-4] = regs->cs;
+	new_stack_top[-5] = regs->ip;
+	new_stack_top[-6] = 0;	/* faked #GP error code */
+
+	/*
+	 * 'regs' points to the "iret frame" for *this*
+	 * exception, *not* the #GP we are faking.  Here,
+	 * we are telling 'iret' to jump to general_protection
+	 * when returning from this double fault.
+	 */
+	regs->ip = (unsigned long)general_protection;
+	/*
+	 * Make iret move the stack to the "fake #GP" stack
+	 * we created above.
+	 */
+	regs->sp = (unsigned long)&new_stack_top[-6];
+}
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
@@ -354,14 +391,7 @@ dotraplinkage void do_double_fault(struc
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs = task_pt_regs(current);
-
-		/* Fake a #GP(0) from userspace. */
-		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
-		normal_regs->orig_ax = 0;  /* Missing (lost) #GP error code */
-		regs->ip = (unsigned long)general_protection;
-		regs->sp = (unsigned long)&normal_regs->orig_ax;
-
+		setup_fake_gp_at_iret(regs);
 		return;
 	}
 #endif
diff -puN /dev/null arch/x86/mm/kaiser.c
--- /dev/null	2017-11-06 07:51:38.702108459 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:09.034244950 -0800
@@ -0,0 +1,423 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * This code is based in part on work published here:
+ *
+ *	https://github.com/IAIK/KAISER
+ *
+ * The original work was written by and signed off for the Linux
+ * kernel by:
+ *
+ *   Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
+ *   Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
+ *   Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
+ *   Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
+ *
+ * Major changes to the original code by: Dave Hansen <dave.hansen@intel.com>
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+
+#include <asm/kaiser.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+/*
+ * At runtime, the only things we map are some things for CPU
+ * hotplug, and stacks for new processes.  No two CPUs will ever
+ * be populating the same addresses, so we only need to ensure
+ * that we protect between two CPUs trying to allocate and
+ * populate the same page table page.
+ *
+ * Only take this lock when doing a set_p[4um]d(), but it is not
+ * needed for doing a set_pte().  We assume that only the *owner*
+ * of a given allocation will be doing this for _their_
+ * allocation.
+ *
+ * This ensures that once a system has been running for a while
+ * and there have been stacks all over and these page tables
+ * are fully populated, there will be no further acquisitions of
+ * this lock.
+ */
+static DEFINE_SPINLOCK(shadow_table_allocation_lock);
+
+/*
+ * This is only for walking kernel addresses.  We use it to help
+ * recreate the "shadow" page tables which are used while we are in
+ * userspace.
+ *
+ * This can be called on any kernel memory addresses and will work
+ * with any page sizes and any types: normal linear map memory,
+ * vmalloc(), even kmap().
+ *
+ * Note: this is only used when mapping new *kernel* entries into
+ * the user/shadow page tables.  It is never used for userspace
+ * addresses.
+ *
+ * Returns -1 on error.
+ */
+static inline unsigned long get_pa_from_kernel_map(unsigned long vaddr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	/* We should only be asked to walk kernel addresses */
+	if (vaddr < PAGE_OFFSET) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	pgd = pgd_offset_k(vaddr);
+	/*
+	 * We made all the kernel PGDs present in kaiser_init().
+	 * We expect them to stay that way.
+	 */
+	if (pgd_none(*pgd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+	/*
+	 * PGDs are either 512GB or 256TB on all x86_64
+	 * configurations.  We don't handle these.
+	 */
+	if (pgd_large(*pgd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	p4d = p4d_offset(pgd, vaddr);
+	if (p4d_none(*p4d)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	pud = pud_offset(p4d, vaddr);
+	if (pud_none(*pud)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pud_large(*pud))
+		return (pud_pfn(*pud) << PAGE_SHIFT) | (vaddr & ~PUD_PAGE_MASK);
+
+	pmd = pmd_offset(pud, vaddr);
+	if (pmd_none(*pmd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pmd_large(*pmd))
+		return (pmd_pfn(*pmd) << PAGE_SHIFT) | (vaddr & ~PMD_PAGE_MASK);
+
+	pte = pte_offset_kernel(pmd, vaddr);
+	if (pte_none(*pte)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	return (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK);
+}
+
+/*
+ * Walk the shadow copy of the page tables (optionally) trying to
+ * allocate page table pages on the way down.  Does not support
+ * large pages since the data we are mapping is (generally) not
+ * large enough or aligned to 2MB.
+ *
+ * Note: this is only used when mapping *new* kernel data into the
+ * user/shadow page tables.  It is never used for userspace data.
+ *
+ * Returns a pointer to a PTE on success, or NULL on failure.
+ */
+#define KAISER_WALK_ATOMIC  0x1
+static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
+					   unsigned long flags)
+{
+	pte_t *pte;
+	pmd_t *pmd;
+	pud_t *pud;
+	p4d_t *p4d;
+	pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(address));
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+
+	if (flags & KAISER_WALK_ATOMIC) {
+		gfp &= ~GFP_KERNEL;
+		gfp |= __GFP_HIGH | __GFP_ATOMIC;
+	}
+
+	if (address < PAGE_OFFSET) {
+		WARN_ONCE(1, "attempt to walk user address\n");
+		return NULL;
+	}
+
+	if (pgd_none(*pgd)) {
+		WARN_ONCE(1, "All shadow pgds should have been populated\n");
+		return NULL;
+	}
+	BUILD_BUG_ON(pgd_large(*pgd) != 0);
+
+	p4d = p4d_offset(pgd, address);
+	BUILD_BUG_ON(p4d_large(*p4d) != 0);
+	if (p4d_none(*p4d)) {
+		unsigned long new_pud_page = __get_free_page(gfp);
+		if (!new_pud_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (p4d_none(*p4d))
+			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
+		else
+			free_page(new_pud_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pud = pud_offset(p4d, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pud_large(*pud)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pud_none(*pud)) {
+		unsigned long new_pmd_page = __get_free_page(gfp);
+		if (!new_pmd_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pud_none(*pud))
+			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
+		else
+			free_page(new_pmd_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pmd = pmd_offset(pud, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pmd_large(*pmd)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pmd_none(*pmd)) {
+		unsigned long new_pte_page = __get_free_page(gfp);
+		if (!new_pte_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pmd_none(*pmd))
+			set_pmd(pmd, __pmd(_KERNPG_TABLE  | __pa(new_pte_page)));
+		else
+			free_page(new_pte_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pte = pte_offset_kernel(pmd, address);
+	if (pte_flags(*pte) & _PAGE_USER) {
+		WARN_ONCE(1, "attempt to walk to user pte\n");
+		return NULL;
+	}
+	return pte;
+}
+
+/*
+ * Given a kernel address, @__start_addr, copy that mapping into
+ * the user (shadow) page tables.  This may need to allocate page
+ * table pages.
+ */
+int kaiser_add_user_map(const void *__start_addr, unsigned long size,
+			unsigned long flags)
+{
+	pte_t *pte;
+	unsigned long start_addr = (unsigned long)__start_addr;
+	unsigned long address = start_addr & PAGE_MASK;
+	unsigned long end_addr = PAGE_ALIGN(start_addr + size);
+	unsigned long target_address;
+
+	for (; address < end_addr; address += PAGE_SIZE) {
+		target_address = get_pa_from_kernel_map(address);
+		if (target_address == -1)
+			return -EIO;
+
+		pte = kaiser_shadow_pagetable_walk(address, 0);
+		/*
+		 * Errors come from either -ENOMEM for a page
+		 * table page, or something screwy that did a
+		 * WARN_ON().  Just return -ENOMEM.
+		 */
+		if (!pte)
+			return -ENOMEM;
+		if (pte_none(*pte)) {
+			set_pte(pte, __pte(flags | target_address));
+		} else {
+			pte_t tmp;
+			set_pte(&tmp, __pte(flags | target_address));
+			WARN_ON_ONCE(!pte_same(*pte, tmp));
+		}
+	}
+	return 0;
+}
+
+int kaiser_add_user_map_ptrs(const void *__start_addr,
+			     const void *__end_addr,
+			     unsigned long flags)
+{
+	return kaiser_add_user_map(__start_addr,
+				   __end_addr - __start_addr,
+				   flags);
+}
+
+/*
+ * Ensure that the top level of the (shadow) page tables are
+ * entirely populated.  This ensures that all processes that get
+ * forked have the same entries.  This way, we do not have to
+ * ever go set up new entries in older processes.
+ *
+ * Note: we never free these, so there are no updates to them
+ * after this.
+ */
+static void __init kaiser_init_all_pgds(void)
+{
+	pgd_t *pgd;
+	int i = 0;
+
+	pgd = native_get_shadow_pgd(pgd_offset_k(0UL));
+	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
+		unsigned long addr = PAGE_OFFSET + i * PGDIR_SIZE;
+#if CONFIG_PGTABLE_LEVELS > 4
+		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
+		if (!p4d) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
+#else /* CONFIG_PGTABLE_LEVELS <= 4 */
+		pud_t *pud = pud_alloc_one(&init_mm, addr);
+		if (!pud) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
+#endif /* CONFIG_PGTABLE_LEVELS */
+	}
+}
+
+/*
+ * The page table allocations in here can theoretically fail, but
+ * we can not do much about it in early boot.  Do the checking
+ * and warning in a macro to make it more readable.
+ */
+#define kaiser_add_user_map_early(start, size, flags) do {	\
+	int __ret = kaiser_add_user_map(start, size, flags);	\
+	WARN_ON(__ret);						\
+} while (0)
+
+#define kaiser_add_user_map_ptrs_early(start, end, flags) do {		\
+	int __ret = kaiser_add_user_map_ptrs(start, end, flags);	\
+	WARN_ON(__ret);							\
+} while (0)
+
+extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
+/*
+ * If anything in here fails, we will likely die on one of the
+ * first kernel->user transitions and init will die.  But, we
+ * will have most of the kernel up by then and should be able to
+ * get a clean warning out of it.  If we BUG_ON() here, we run
+ * the risk of being before we have good console output.
+ *
+ * When KAISER is enabled, we remove _PAGE_GLOBAL from all of the
+ * kernel PTE permissions.  This ensures that the TLB entries for
+ * the kernel are not available when in userspace.  However, for
+ * the pages that are available to userspace *anyway*, we might as
+ * well continue to map them _PAGE_GLOBAL and enjoy the potential
+ * performance advantages.
+ */
+void __init kaiser_init(void)
+{
+	int cpu;
+
+	kaiser_init_all_pgds();
+
+	for_each_possible_cpu(cpu) {
+		void *percpu_vaddr = __per_cpu_user_mapped_start +
+				     per_cpu_offset(cpu);
+		unsigned long percpu_sz = __per_cpu_user_mapped_end -
+					  __per_cpu_user_mapped_start;
+		kaiser_add_user_map_early(percpu_vaddr, percpu_sz,
+					  __PAGE_KERNEL | _PAGE_GLOBAL);
+	}
+
+	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
+				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
+
+	/* the fixed map address of the idt_table */
+	kaiser_add_user_map_early((void *)idt_descr.address,
+				  sizeof(gate_desc) * NR_VECTORS,
+				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
+}
+
+int kaiser_add_mapping(unsigned long addr, unsigned long size,
+		       unsigned long flags)
+{
+	return kaiser_add_user_map((const void *)addr, size, flags);
+}
+
+void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+	unsigned long addr;
+
+	/* The shadow page tables always use small pages: */
+	for (addr = start; addr < start + size; addr += PAGE_SIZE) {
+		/*
+		 * Do an "atomic" walk in case this got called from an atomic
+		 * context.  This should not do any allocations because we
+		 * should only be walking things that are known to be mapped.
+		 */
+		pte_t *pte = kaiser_shadow_pagetable_walk(addr, KAISER_WALK_ATOMIC);
+
+		/*
+		 * We are removing a mapping that should
+		 * exist.  WARN if it was not there:
+		 */
+		if (!pte) {
+			WARN_ON_ONCE(1);
+			continue;
+		}
+
+		pte_clear(&init_mm, addr, pte);
+	}
+	/*
+	 * This ensures that the TLB entries used to map this data are
+	 * no longer usable on *this* CPU.  We theoretically want to
+	 * flush the entries on all CPUs here, but that's too
+	 * expensive right now: this is called to unmap process
+	 * stacks in the exit() path.
+	 *
+	 * This can change if we get to the point where this is not
+	 * in a remotely hot path, like only called via write_ldt().
+	 *
+	 * Note: we could probably also just invalidate the individual
+	 * addresses to take care of *this* PCID and then do a
+	 * tlb_flush_shared_nonglobals() to ensure that all other
+	 * PCIDs get flushed before being used again.
+	 */
+	__native_flush_tlb_global();
+}
diff -puN arch/x86/mm/Makefile~kaiser-base arch/x86/mm/Makefile
--- a/arch/x86/mm/Makefile~kaiser-base	2017-11-10 11:22:09.019244950 -0800
+++ b/arch/x86/mm/Makefile	2017-11-10 11:22:09.034244950 -0800
@@ -45,6 +45,7 @@ obj-$(CONFIG_NUMA_EMU)		+= numa_emulatio
 obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
+obj-$(CONFIG_KAISER)		+= kaiser.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff -puN arch/x86/mm/pageattr.c~kaiser-base arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c~kaiser-base	2017-11-10 11:22:09.020244950 -0800
+++ b/arch/x86/mm/pageattr.c	2017-11-10 11:22:09.035244950 -0800
@@ -859,7 +859,7 @@ static void unmap_pmd_range(pud_t *pud,
 			pud_clear(pud);
 }
 
-static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
+void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
 {
 	pud_t *pud = pud_offset(p4d, start);
 
diff -puN arch/x86/mm/pgtable.c~kaiser-base arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c~kaiser-base	2017-11-10 11:22:09.022244950 -0800
+++ b/arch/x86/mm/pgtable.c	2017-11-10 11:22:09.035244950 -0800
@@ -354,14 +354,26 @@ static inline void _pgd_free(pgd_t *pgd)
 		kmem_cache_free(pgd_cache, pgd);
 }
 #else
+
+#ifdef CONFIG_KAISER
+/*
+ * Instead of one pgd, we acquire two pgds.  Being order-1, it is
+ * both 8k in size and 8k-aligned.  That lets us just flip bit 12
+ * in a pointer to swap between the two 4k halves.
+ */
+#define PGD_ALLOCATION_ORDER 1
+#else
+#define PGD_ALLOCATION_ORDER 0
+#endif
+
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_page(PGALLOC_GFP);
+	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
 }
 
 static inline void _pgd_free(pgd_t *pgd)
 {
-	free_page((unsigned long)pgd);
+	free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
 }
 #endif /* CONFIG_X86_PAE */
 
diff -puN /dev/null Documentation/x86/kaiser.txt
--- /dev/null	2017-11-06 07:51:38.702108459 -0800
+++ b/Documentation/x86/kaiser.txt	2017-11-10 11:22:09.035244950 -0800
@@ -0,0 +1,160 @@
+Overview
+========
+
+KAISER is a countermeasure against attacks on kernel address
+information.  There are at least three existing, published,
+approaches using the shared user/kernel mapping and hardware features
+to defeat KASLR.  One approach referenced in the paper locates the
+kernel by observing differences in page fault timing between
+present-but-inaccessible kernel pages and non-present pages.
+
+When we enter the kernel via syscalls, interrupts or exceptions,
+page tables are switched to the full "kernel" copy.  When the
+system switches back to user mode, the user/shadow copy is used.
+
+The minimalistic kernel portion of the user page tables tries to
+map only what is needed to enter/exit the kernel, such as the
+entry/exit functions themselves and the interrupt descriptor
+table (IDT).
+
+This helps ensure that side-channel attacks that leverage the
+paging structures do not function when KAISER is enabled.  It
+can be enabled by setting CONFIG_KAISER=y.
+
+Page Table Management
+=====================
+
+KAISER logically keeps a "copy" of the page tables which unmap
+the kernel while in userspace.  The kernel manages the page
+tables as normal, but the "copying" is done with a few tricks
+that mean that we do not have to manage two full copies.
+
+The first trick is that for any new kernel mapping, we
+presume that we do not want it mapped to userspace.  That means
+we normally have no copying to do.  We only copy the kernel
+entries over to the shadow in response to a kaiser_add_*()
+call, which is rare.
+
+For a new userspace mapping, the kernel makes the entries in
+its page tables like normal.  The only difference is when the
+kernel makes entries in the top (PGD) level.  In addition to
+setting the entry in the main kernel PGD, a copy of the entry
+is made in the shadow PGD.
+
+PGD entries always point to another page table.  Two PGD
+entries pointing to the same thing give us shared page tables
+for all the lower entries.  This leaves a single, shared set of
+userspace page tables to manage: one PTE to lock, one set
+of accessed bits, dirty bits, etc.
+
+Overhead
+========
+
+Protection against side-channel attacks is important.  But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes 4k per process).
+  b. The pre-allocated second-level (p4d or pud) kernel page
+     table pages cost ~1MB of additional memory at boot.  This
+     is not totally wasted because some of these pages would
+     have been needed eventually for normal kernel page tables
+     and things in the vmalloc() area like vmemmap[].
+  c. Statically-allocated structures and entry/exit text must
+     be padded out to 4k (or 8k for PGDs) so they can be mapped
+     into the user page tables.  This bloats the kernel image
+     by ~20-30k.
+  d. The shadow page tables eventually grow to map all of used
+     vmalloc() space.  They can have roughly the same memory
+     consumption as the vmalloc() page tables.
+
+2. Runtime Cost
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though.)  Moves to CR3 are on the order of a hundred
+     cycles, and we need one at entry and another at exit.
+  b. Task stacks must be mapped/unmapped.  We need to walk
+     and modify the shadow page tables at fork() and exit().
+  c. Global pages are disabled.  This feature of the MMU
+     allows different processes to share TLB entries mapping
+     the kernel.  Losing the feature means potentially more
+     TLB misses after a context switch.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when we switch
+     the page tables.  This makes switching the page tables
+     (at context switch, or kernel entry/exit) cheaper.  But,
+     on systems with PCID support, the context switch code
+     must flush both the user and kernel entries out of the
+     TLB, with an INVPCID in addition to the CR3 write.  This
+     INVPCID is generally slower than a CR3 write, but still
+     on the order of a hundred cycles.
+  e. The shadow page tables must be populated for each new
+     process.  Even without KAISER, since we share all of the
+     kernel mappings in all processes, we can do all this
+     population for kernel addresses at the top level of the
+     page tables (the PGD level).  But, with KAISER, we now
+     have *two* kernel mappings: one in the kernel page tables
+     that maps everything and one in the user/shadow page
+     tables mapping the "minimal" kernel.  At fork(), we
+     copy the portion of the shadow PGD that maps the minimal
+     kernel structures in addition to the normal kernel one.
+  f. In addition to the fork()-time copying, we must also
+     update the shadow PGD any time a set_pgd() is done on a
+     PGD used to map userspace.  This ensures that the kernel
+     and user/shadow copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB.  That means that each syscall, interrupt
+     or exception flushes the TLB.
+
+Possible Future Work:
+1. We can be more careful about not actually writing to CR3
+   unless we actually switch it.
+2. Try to have dedicated entry/exit kernel stacks so we do
+   not have to map/unmap the task/thread stacks.
+3. Compress the user/shadow-mapped data to be mapped together
+   underneath a single PGD entry.
+4. Re-enable global pages, but use them for mappings in the
+   user/shadow page tables.  This would allow the kernel to
+   take advantage of TLB entries that were established from
+   the user page tables.  This might speed up the entry/exit
+   code or userspace since it will not have to reload all of
+   its TLB entries.  However, its upside is limited by PCID
+   being used.
+5. Allow KAISER to be enabled/disabled at runtime so folks can
+   run a single kernel image.
+
+Debugging:
+
+Bugs in KAISER cause a few different signatures of crashes
+that are worth noting here.
+
+ * Crashes in early boot, especially around CPU bringup.  Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
+   like screwing up a page table switch.  Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI.  The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts.  Also caused by incorrectly mapping NMI
+   code.  NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace.  entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults.  Caused by touching non-kaiser-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not kaiser-mapped.
+ * Failures of the selftests/x86 code.  Usually a bug in one of the
+   more obscure corners of entry_64.S.
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs.  These have
+   tended to be TLB invalidation issues.  Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
diff -puN /dev/null include/linux/kaiser.h
--- /dev/null	2017-11-06 07:51:38.702108459 -0800
+++ b/include/linux/kaiser.h	2017-11-10 11:22:09.036244950 -0800
@@ -0,0 +1,29 @@
+#ifndef _INCLUDE_KAISER_H
+#define _INCLUDE_KAISER_H
+
+#ifdef CONFIG_KAISER
+#include <asm/kaiser.h>
+#else
+
+/*
+ * These stubs are used whenever CONFIG_KAISER is off, which
+ * includes architectures that support KAISER, but have it
+ * disabled.
+ */
+
+static inline void kaiser_init(void)
+{
+}
+
+static inline void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+}
+
+static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
+				     unsigned long flags)
+{
+	return 0;
+}
+
+#endif /* !CONFIG_KAISER */
+#endif /* _INCLUDE_KAISER_H */
diff -puN init/main.c~kaiser-base init/main.c
--- a/init/main.c~kaiser-base	2017-11-10 11:22:09.025244950 -0800
+++ b/init/main.c	2017-11-10 11:22:09.036244950 -0800
@@ -75,6 +75,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
+#include <linux/kaiser.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
 #include <linux/sched_clock.h>
@@ -504,6 +505,8 @@ static void __init mm_init(void)
 	pgtable_init();
 	vmalloc_init();
 	ioremap_huge_init();
+	/* This just needs to be done before we first run userspace: */
+	kaiser_init();
 }
 
 asmlinkage __visible void __init start_kernel(void)
diff -puN kernel/fork.c~kaiser-base kernel/fork.c
--- a/kernel/fork.c~kaiser-base	2017-11-10 11:22:09.027244950 -0800
+++ b/kernel/fork.c	2017-11-10 11:22:09.037244950 -0800
@@ -70,6 +70,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/freezer.h>
+#include <linux/kaiser.h>
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 09/30] x86, kaiser: only populate shadow page tables for userspace
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

KAISER has two copies of the page tables: one for the kernel and
one for when running in userspace.  There is also a kernel
portion of each of the page tables: the part that *maps* the
kernel.

The kernel portion is relatively static and uses pre-populated
PGDs.  Nobody ever calls set_pgd() on the kernel portion during
normal operation.

The userspace portion of the page tables is updated frequently as
userspace pages are mapped and page table pages are allocated.
These updates of the userspace *portion* of the tables need to be
reflected into both the kernel and user/shadow copies.

The original KAISER patches did this by effectively looking at
the address that is being updated.  If it is <PAGE_OFFSET, the
update is considered to be for the userspace portion of the page
tables, and an entry must also be made in the shadow.

However, this has a wrinkle: there are a few places where low
addresses are used in supervisor (kernel) mode.  When EFI calls
are made, they use what are traditionally user addresses in
supervisor mode and trip over these checks.  The trampoline code
that is used for booting secondary CPUs has a similar issue.

Remember, there are two things that KAISER needs performed on a
userspace PGD:

 1. Populate the shadow itself
 2. Poison the kernel PGD so it can not be used by userspace.

This patch only performs these actions when dealing with a user
address *and* the PGD has _PAGE_USER set.  That way, in-kernel
users of low addresses typically used by userspace are not
accidentally poisoned.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---
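
The decision described in the changelog above can be sketched as a
small userspace model.  This is only an illustration under stated
assumptions, not the kernel code: the model_* names and struct
model_pgd_pair are hypothetical, the PGD is addressed by index rather
than by pointer (index < 256 stands in for the lower half of the page),
and the WARN_ON_ONCE() path is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_USER		(1ULL << 2)	/* x86 U/S bit */
#define MODEL_PAGE_NX		(1ULL << 63)	/* x86 XD bit */
#define MODEL_PGD_ENTRIES	512

/* Hypothetical stand-in for the kernel PGD and its user/shadow copy. */
struct model_pgd_pair {
	uint64_t kernel[MODEL_PGD_ENTRIES];
	uint64_t shadow[MODEL_PGD_ENTRIES];
};

/* Lower half of the top level maps userspace, upper half the kernel. */
static bool model_maps_userspace(size_t index)
{
	return index < MODEL_PGD_ENTRIES / 2;
}

/*
 * Store 'val' at 'index'.  Returns the value that ends up in the
 * kernel copy; the shadow copy is updated as a side effect.
 */
static uint64_t model_set_pgd(struct model_pgd_pair *p, size_t index,
			      uint64_t val)
{
	if (val & MODEL_PAGE_USER) {
		if (model_maps_userspace(index)) {
			/* The shadow gets the entry as-is ... */
			p->shadow[index] = val;
			/* ... and the kernel copy is poisoned with NX. */
			val |= MODEL_PAGE_NX;
		}
	} else if (val == 0) {
		/* Clearing an entry: mirror the clear for user slots. */
		if (model_maps_userspace(index))
			p->shadow[index] = 0;
	}
	p->kernel[index] = val;
	return val;
}
```

In this model, a _PAGE_USER entry written to a low slot is mirrored
into the shadow and comes back with NX set, while a supervisor-only
entry in a low slot (the EFI or trampoline case) is stored untouched
and never reaches the shadow.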

 b/arch/x86/include/asm/pgtable_64.h |   94 +++++++++++++++++++++++-------------
 1 file changed, 61 insertions(+), 33 deletions(-)

diff -puN arch/x86/include/asm/pgtable_64.h~kaiser-set-pgd-careful-plus-NX arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h~kaiser-set-pgd-careful-plus-NX	2017-11-10 11:22:09.932244947 -0800
+++ b/arch/x86/include/asm/pgtable_64.h	2017-11-10 11:22:09.935244947 -0800
@@ -177,38 +177,76 @@ static inline p4d_t *native_get_normal_p
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
- * This returns true for user pages that need to get copied into
- * both the user and kernel copies of the page tables, and false
- * for kernel pages that should only be in the kernel copy.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
  */
-static inline bool is_userspace_pgd(void *__ptr)
+static inline bool pgdp_maps_userspace(void *__ptr)
 {
 	unsigned long ptr = (unsigned long)__ptr;
 
 	return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2));
 }
 
+/*
+ * Does this PGD allow access via userspace?
+ */
+static inline bool pgd_userspace_access(pgd_t pgd)
+{
+	return (pgd.pgd & _PAGE_USER);
+}
+
+/*
+ * Returns the pgd_t that the kernel should use in its page tables.
+ */
+static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+#ifdef CONFIG_KAISER
+	if (pgd_userspace_access(pgd)) {
+		if (pgdp_maps_userspace(pgdp)) {
+			/*
+			 * The user/shadow page tables get the full
+			 * PGD, accessible to userspace:
+			 */
+			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
+			/*
+			 * For the copy of the pgd that the kernel
+			 * uses, make it unusable to userspace.  This
+			 * ensures if we get out to userspace with the
+			 * wrong CR3 value, userspace will crash
+			 * instead of running.
+			 */
+			pgd.pgd |= _PAGE_NX;
+		}
+	} else if (!pgd.pgd) {
+		/*
+		 * We are clearing the PGD and can not check  _PAGE_USER
+		 * in the zero'd PGD.  We never do this on the
+		 * pre-populated kernel PGDs, except for pgd_bad().
+		 */
+		if (pgdp_maps_userspace(pgdp)) {
+			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
+		} else {
+			/*
+			 * Uh, we are very confused.  We have been
+			 * asked to clear a PGD that is in the kernel
+			 * part of the address space.  We preallocated
+			 * all the KAISER PGDs, so this should never
+			 * happen.
+			 */
+			WARN_ON_ONCE(1);
+		}
+	}
+#endif
+	/* return the copy of the PGD we want the kernel to use: */
+	return pgd;
+}
+
+
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
 #if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
-	/*
-	 * set_pgd() does not get called when we are running
-	 * CONFIG_X86_5LEVEL=y.  So, just hack around it.  We
-	 * know here that we have a p4d but that it is really at
-	 * the top level of the page tables; it is really just a
-	 * pgd.
-	 */
-	/* Do we need to also populate the shadow p4d? */
-	if (is_userspace_pgd(p4dp))
-		native_get_shadow_p4d(p4dp)->pgd = p4d.pgd;
-	/*
-	 * Even if the entry is *mapping* userspace, ensure
-	 * that userspace can not use it.  This way, if we
-	 * get out to userspace with the wrong CR3 value,
-	 * userspace will crash instead of running.
-	 */
-	if (!p4d.pgd.pgd)
-		p4dp->pgd.pgd = p4d.pgd.pgd | _PAGE_NX;
+	p4dp->pgd = kaiser_set_shadow_pgd(&p4dp->pgd, p4d.pgd);
 #else /* CONFIG_KAISER */
 	*p4dp = p4d;
 #endif
@@ -226,17 +264,7 @@ static inline void native_p4d_clear(p4d_
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 #ifdef CONFIG_KAISER
-	/* Do we need to also populate the shadow pgd? */
-	if (is_userspace_pgd(pgdp))
-		native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
-	/*
-	 * Even if the entry is mapping userspace, ensure
-	 * that it is unusable for userspace.  This way,
-	 * if we get out to userspace with the wrong CR3
-	 * value, userspace will crash instead of running.
-	 */
-	if (!pgd_none(pgd))
-		pgdp->pgd = pgd.pgd | _PAGE_NX;
+	*pgdp = kaiser_set_shadow_pgd(pgdp, pgd);
 #else /* CONFIG_KAISER */
 	*pgdp = pgd;
 #endif
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 10/30] x86, kaiser: allow NX poison to be set in p4d/pgd
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

The user portion of the kernel page tables uses the NX bit to
poison them for userspace.  But, that trips the p4d/pgd_bad()
checks.  Make sure it does not do that.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---
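
The effect on pgd_bad() can be checked with a small userspace model.
This is a sketch only: the MODEL_* constants follow the architectural
x86 bit positions, _PAGE_ENC is ignored, and the 'kaiser' parameter is
a hypothetical stand-in for IS_ENABLED(CONFIG_KAISER).

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_PRESENT	(1ULL << 0)
#define MODEL_PAGE_RW		(1ULL << 1)
#define MODEL_PAGE_USER		(1ULL << 2)
#define MODEL_PAGE_ACCESSED	(1ULL << 5)
#define MODEL_PAGE_DIRTY	(1ULL << 6)
#define MODEL_PAGE_NX		(1ULL << 63)

/* Flags expected on an entry that points to a lower-level table. */
#define MODEL_KERNPG_TABLE	(MODEL_PAGE_PRESENT | MODEL_PAGE_RW | \
				 MODEL_PAGE_ACCESSED | MODEL_PAGE_DIRTY)

/*
 * Model of the patched pgd_bad(): an entry is "bad" when, after
 * masking the flags we agree to ignore, anything other than the
 * table flags is left over.
 */
static bool model_pgd_bad(uint64_t flags, bool kaiser)
{
	uint64_t ignore_flags = MODEL_PAGE_USER;

	if (kaiser)
		ignore_flags |= MODEL_PAGE_NX;

	return (flags & ~ignore_flags) != MODEL_KERNPG_TABLE;
}
```

A poisoned user entry (table flags plus _PAGE_USER plus NX) is
reported as bad when 'kaiser' is false and as good when it is true,
which is the behavior change this patch is after.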

 b/arch/x86/include/asm/pgtable.h |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/pgtable.h~kaiser-p4d-allow-nx arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~kaiser-p4d-allow-nx	2017-11-10 11:22:10.474244946 -0800
+++ b/arch/x86/include/asm/pgtable.h	2017-11-10 11:22:10.478244946 -0800
@@ -845,7 +845,12 @@ static inline pud_t *pud_offset(p4d_t *p
 
 static inline int p4d_bad(p4d_t p4d)
 {
-	return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+	unsigned long ignore_flags = _KERNPG_TABLE | _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KAISER))
+		ignore_flags |= _PAGE_NX;
+
+	return (p4d_flags(p4d) & ~ignore_flags) != 0;
 }
 #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
 
@@ -879,7 +884,12 @@ static inline p4d_t *p4d_offset(pgd_t *p
 
 static inline int pgd_bad(pgd_t pgd)
 {
-	return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	unsigned long ignore_flags = _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KAISER))
+		ignore_flags |= _PAGE_NX;
+
+	return (pgd_flags(pgd) & ~ignore_flags) != _KERNPG_TABLE;
 }
 
 static inline int pgd_none(pgd_t pgd)
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 11/30] x86, kaiser: make sure static PGDs are 8k in size
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

A few PGDs come out of the kernel binary instead of being
allocated dynamically.  Before this patch, they are all
8k-aligned, but they must also be 8k in *size*.

The original KAISER patch did not do this.  It probably just
lucked out that it did not trample over data after the last PGD.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---
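
Why the size matters can be seen in a userspace sketch of the "flip
bit 12" trick.  Assumptions: model_top_pgt, model_shadow_pgd() and
model_kernel_pgd() are hypothetical names, and the user/shadow half is
taken to be the second 4k page of the pair, as the fill appended after
each static PGD suggests.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE	4096UL

/*
 * Hypothetical static PGD pair: 8k long and 8k-aligned.  The first
 * 512 entries are the kernel copy, the second 512 the shadow copy.
 */
static _Alignas(2 * 4096) uint64_t model_top_pgt[2 * 512];

/* Kernel-copy slot -> matching shadow slot: set bit 12. */
static uint64_t *model_shadow_pgd(uint64_t *pgdp)
{
	return (uint64_t *)((uintptr_t)pgdp | MODEL_PAGE_SIZE);
}

/* Shadow slot -> matching kernel-copy slot: clear bit 12. */
static uint64_t *model_kernel_pgd(uint64_t *pgdp)
{
	return (uint64_t *)((uintptr_t)pgdp & ~(uintptr_t)MODEL_PAGE_SIZE);
}
```

If the array were only 4k long, model_shadow_pgd() would still return
an address 4k past the slot it was given, and a write through it would
land in whatever the linker placed after the PGD.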

 b/arch/x86/kernel/head_64.S |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff -puN arch/x86/kernel/head_64.S~kaiser-head_S-pgds-need-8k-too arch/x86/kernel/head_64.S
--- a/arch/x86/kernel/head_64.S~kaiser-head_S-pgds-need-8k-too	2017-11-10 11:22:11.018244945 -0800
+++ b/arch/x86/kernel/head_64.S	2017-11-10 11:22:11.021244945 -0800
@@ -340,11 +340,24 @@ GLOBAL(early_recursion_flag)
 GLOBAL(name)
 
 #ifdef CONFIG_KAISER
+/*
+ * Each PGD needs to be 8k long and 8k aligned.  We do not
+ * ever go out to userspace with these, so we do not
+ * strictly *need* the second page, but this allows us to
+ * have a single set_pgd() implementation that does not
+ * need to worry about whether it has 4k or 8k to work
+ * with.
+ *
+ * This ensures PGDs are 8k long:
+ */
+#define KAISER_USER_PGD_FILL	512
+/* This ensures they are 8k-aligned: */
 #define NEXT_PGD_PAGE(name) \
 	.balign 2 * PAGE_SIZE; \
 GLOBAL(name)
 #else
 #define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#define KAISER_USER_PGD_FILL	0
 #endif
 
 /* Automate the creation of 1 to 1 mapping pmd entries */
@@ -363,6 +376,7 @@ NEXT_PGD_PAGE(early_top_pgt)
 #else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #endif
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -372,6 +386,7 @@ NEXT_PAGE(early_dynamic_pgts)
 #ifndef CONFIG_XEN
 NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
+	.fill	KAISER_USER_PGD_FILL,8,0
 #else
 NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -380,6 +395,7 @@ NEXT_PGD_PAGE(init_top_pgt)
 	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(level3_ident_pgt)
 	.quad	level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 11/30] x86, kaiser: make sure static PGDs are 8k in size
@ 2017-11-10 19:31   ` Dave Hansen
  0 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

A few PGDs come out of the kernel binary instead of being
allocated dynamically.  Before this patch, they are all
8k-aligned, but they must also be 8k in *size*.

The original KAISER patch did not do this.  It probably just
lucked out that it did not trample over data after the last PGD.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/head_64.S |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff -puN arch/x86/kernel/head_64.S~kaiser-head_S-pgds-need-8k-too arch/x86/kernel/head_64.S
--- a/arch/x86/kernel/head_64.S~kaiser-head_S-pgds-need-8k-too	2017-11-10 11:22:11.018244945 -0800
+++ b/arch/x86/kernel/head_64.S	2017-11-10 11:22:11.021244945 -0800
@@ -340,11 +340,24 @@ GLOBAL(early_recursion_flag)
 GLOBAL(name)
 
 #ifdef CONFIG_KAISER
+/*
+ * Each PGD needs to be 8k long and 8k aligned.  We do not
+ * ever go out to userspace with these, so we do not
+ * strictly *need* the second page, but this allows us to
+ * have a single set_pgd() implementation that does not
+ * need to worry about whether it has 4k or 8k to work
+ * with.
+ *
+ * This ensures PGDs are 8k long:
+ */
+#define KAISER_USER_PGD_FILL	512
+/* This ensures they are 8k-aligned: */
 #define NEXT_PGD_PAGE(name) \
 	.balign 2 * PAGE_SIZE; \
 GLOBAL(name)
 #else
 #define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#define KAISER_USER_PGD_FILL	0
 #endif
 
 /* Automate the creation of 1 to 1 mapping pmd entries */
@@ -363,6 +376,7 @@ NEXT_PGD_PAGE(early_top_pgt)
 #else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #endif
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -372,6 +386,7 @@ NEXT_PAGE(early_dynamic_pgts)
 #ifndef CONFIG_XEN
 NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
+	.fill	KAISER_USER_PGD_FILL,8,0
 #else
 NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -380,6 +395,7 @@ NEXT_PGD_PAGE(init_top_pgt)
 	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(level3_ident_pgt)
 	.quad	level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

The GDT is used to control the x86 segmentation mechanism.  It
must be virtually mapped when switching segments or at IRET
time when switching between userspace and kernel.

The original KAISER patch did not do this.  I have no idea how
it ever worked.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/cpu/common.c |   15 +++++++++++++++
 b/arch/x86/mm/kaiser.c         |   10 ++++++++++
 2 files changed, 25 insertions(+)

diff -puN arch/x86/kernel/cpu/common.c~kaiser-user-map-gdt-pages arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~kaiser-user-map-gdt-pages	2017-11-10 11:22:11.559244943 -0800
+++ b/arch/x86/kernel/cpu/common.c	2017-11-10 11:22:11.564244943 -0800
@@ -5,6 +5,7 @@
 #include <linux/export.h>
 #include <linux/percpu.h>
 #include <linux/string.h>
+#include <linux/kaiser.h>
 #include <linux/ctype.h>
 #include <linux/delay.h>
 #include <linux/sched/mm.h>
@@ -487,6 +488,20 @@ static inline void setup_fixmap_gdt(int
 #endif
 
 	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
+
+	/* CPU 0's mapping is done in kaiser_init() */
+	if (cpu) {
+		int ret;
+
+		ret = kaiser_add_mapping((unsigned long) get_cpu_gdt_ro(cpu),
+					 PAGE_SIZE, __PAGE_KERNEL_RO);
+		/*
+		 * We do not have a good way to fail CPU bringup.
+		 * Just WARN about it and hope we boot far enough
+		 * to get a good log out.
+		 */
+		WARN_ON(ret);
+	}
 }
 
 /* Load the original GDT from the per-cpu structure */
diff -puN arch/x86/mm/kaiser.c~kaiser-user-map-gdt-pages arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-user-map-gdt-pages	2017-11-10 11:22:11.560244943 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:11.565244943 -0800
@@ -372,6 +372,16 @@ void __init kaiser_init(void)
 	kaiser_add_user_map_early((void *)idt_descr.address,
 				  sizeof(gate_desc) * NR_VECTORS,
 				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
+
+	/*
+	 * We could theoretically do this in setup_fixmap_gdt().
+	 * But, we would need to rewrite the above page table
+	 * allocation code to use the bootmem allocator.  The
+	 * buddy allocator is not available at the time that we
+	 * call setup_fixmap_gdt() for CPU 0.
+	 */
+	kaiser_add_user_map_early(get_cpu_gdt_ro(0), PAGE_SIZE,
+				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
 }
 
 int kaiser_add_mapping(unsigned long addr, unsigned long size,
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 13/30] x86, kaiser: map dynamically-allocated LDTs
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

Normally, a process has a NULL mm->context.ldt.  But, there is a
syscall for a process to set a new one.  If a process does that,
the LDT must be mapped into the user page tables, just like the
default copy.

The original KAISER patch missed this case.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/ldt.c |   25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff -puN arch/x86/kernel/ldt.c~kaiser-user-map-new-ldts arch/x86/kernel/ldt.c
--- a/arch/x86/kernel/ldt.c~kaiser-user-map-new-ldts	2017-11-10 11:22:12.127244942 -0800
+++ b/arch/x86/kernel/ldt.c	2017-11-10 11:22:12.131244942 -0800
@@ -10,6 +10,7 @@
 #include <linux/gfp.h>
 #include <linux/sched.h>
 #include <linux/string.h>
+#include <linux/kaiser.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/syscalls.h>
@@ -56,11 +57,21 @@ static void flush_ldt(void *__mm)
 	refresh_ldt_segments();
 }
 
+static void __free_ldt_struct(struct ldt_struct *ldt)
+{
+	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
+		vfree_atomic(ldt->entries);
+	else
+		free_page((unsigned long)ldt->entries);
+	kfree(ldt);
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
 	struct ldt_struct *new_ldt;
 	unsigned int alloc_size;
+	int ret;
 
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
@@ -88,6 +99,12 @@ static struct ldt_struct *alloc_ldt_stru
 		return NULL;
 	}
 
+	ret = kaiser_add_mapping((unsigned long)new_ldt->entries, alloc_size,
+				 __PAGE_KERNEL | _PAGE_GLOBAL);
+	if (ret) {
+		__free_ldt_struct(new_ldt);
+		return NULL;
+	}
 	new_ldt->nr_entries = num_entries;
 	return new_ldt;
 }
@@ -114,12 +131,10 @@ static void free_ldt_struct(struct ldt_s
 	if (likely(!ldt))
 		return;
 
+	kaiser_remove_mapping((unsigned long)ldt->entries,
+			      ldt->nr_entries * LDT_ENTRY_SIZE);
 	paravirt_free_ldt(ldt->entries, ldt->nr_entries);
-	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
-		vfree_atomic(ldt->entries);
-	else
-		free_page((unsigned long)ldt->entries);
-	kfree(ldt);
+	__free_ldt_struct(ldt);
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 14/30] x86, kaiser: map espfix structures
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

There is some rather arcane code to help when an IRET returns
to 16-bit segments.  It is referred to as the "espfix" code.
This consists of a few per-cpu variables:

	espfix_stack: tells us where the stack is allocated
		      (the bottom)
	espfix_waddr: tells us where %rsp may be pointed
		      (the top)

These are in addition to the stack itself.  All three things must
be mapped for the espfix code to function.

Note: the espfix code runs with a kernel GSBASE, but user
(shadow) page tables.  A switch to the kernel page tables could
be performed instead of mapping these structures, but mapping
them is simpler and less likely to break the assembly.  To switch
over to the kernel copy, additional temporary storage would be
required which is in short supply in this context.

The original KAISER patch missed this case.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/espfix_64.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff -puN arch/x86/kernel/espfix_64.c~kaiser-user-map-espfix arch/x86/kernel/espfix_64.c
--- a/arch/x86/kernel/espfix_64.c~kaiser-user-map-espfix	2017-11-10 11:22:12.669244941 -0800
+++ b/arch/x86/kernel/espfix_64.c	2017-11-10 11:22:12.673244941 -0800
@@ -33,6 +33,7 @@
 
 #include <linux/init.h>
 #include <linux/init_task.h>
+#include <linux/kaiser.h>
 #include <linux/kernel.h>
 #include <linux/percpu.h>
 #include <linux/gfp.h>
@@ -41,7 +42,6 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
-#include <asm/kaiser.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -61,8 +61,8 @@
 #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_stack);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_waddr);
 
 /* Initialization mutex - should this be a spinlock? */
 static DEFINE_MUTEX(espfix_init_mutex);
@@ -225,4 +225,10 @@ done:
 	per_cpu(espfix_stack, cpu) = addr;
 	per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
 				      + (addr & ~PAGE_MASK);
+	/*
+	 * _PAGE_GLOBAL is not really required.  This is not a hot
+	 * path, but we do it here for consistency.
+	 */
+	kaiser_add_mapping((unsigned long)stack_page, PAGE_SIZE,
+			__PAGE_KERNEL | _PAGE_GLOBAL);
 }
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 15/30] x86, kaiser: map entry stack variables
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

There are times where the kernel is entered but there is not a
safe stack, like at SYSCALL entry.  To obtain a safe stack, the
per-cpu variables 'rsp_scratch' and 'cpu_current_top_of_stack'
are used to save the old %rsp value and to find where the kernel
stack should start.

CR3 cannot be modified in place.  It can only be written with a
'MOV' from another register, which means a register must be
clobbered in order to do any CR3 manipulation.  User-mapping
these variables allows us to obtain a safe stack and use it for
temporary storage *before* CR3 is switched.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/cpu/common.c |    2 +-
 b/arch/x86/kernel/process_64.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff -puN arch/x86/kernel/cpu/common.c~kaiser-user-map-stack-helper-vars arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~kaiser-user-map-stack-helper-vars	2017-11-10 11:22:13.203244939 -0800
+++ b/arch/x86/kernel/cpu/common.c	2017-11-10 11:22:13.209244939 -0800
@@ -1447,7 +1447,7 @@ DEFINE_PER_CPU_ALIGNED(struct stack_cana
  * trampoline, not the thread stack.  Use an extra percpu variable to track
  * the top of the kernel stack directly.
  */
-DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) =
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, cpu_current_top_of_stack) =
 	(unsigned long)&init_thread_union + THREAD_SIZE;
 EXPORT_PER_CPU_SYMBOL(cpu_current_top_of_stack);
 
diff -puN arch/x86/kernel/process_64.c~kaiser-user-map-stack-helper-vars arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c~kaiser-user-map-stack-helper-vars	2017-11-10 11:22:13.205244939 -0800
+++ b/arch/x86/kernel/process_64.c	2017-11-10 11:22:13.209244939 -0800
@@ -59,7 +59,7 @@
 #include <asm/unistd_32_ia32.h>
 #endif
 
-__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
+__visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, rsp_scratch);
 
 /* Prints also some state that isn't saved in the pt_regs */
 void __show_regs(struct pt_regs *regs, int all)
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 16/30] x86, kaiser: map trace interrupt entry
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

All of the interrupt entry/exit code is in a special section
(.irqentry.text).  This enables the ftrace code to figure out
when the kernel is executing in the "grey area" of interrupt
handling before the C code has taken over and marked the data
structures indicating that an interrupt is in progress.

KAISER needs to map this section into the user page tables
because it contains the assembly that helps us enter interrupt
routines.  In addition to the assembly which KAISER *needs*, the
section also contains the first C function that handles an
interrupt.  This is unfortunate, but it doesn't really hurt
anything.

This patch also aligns the .entry.text and .irqentry.text sections
to PAGE_SIZE boundaries.  This ensures that only the _required_
text is mapped.

Without this alignment, code might be mapped inadvertently as a
result of sharing a page with code that is intentionally mapped.
The extra mapping is harmless in itself, but it makes debugging
hard: code that works only because it happens to share a mapped
page can start failing after an unrelated change in build alignment.

This was missed in the original KAISER patch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/mm/kaiser.c              |   14 ++++++++++++++
 b/include/asm-generic/vmlinux.lds.h |   10 ++++++++++
 2 files changed, 24 insertions(+)

diff -puN arch/x86/mm/kaiser.c~kaiser-user-map-trace-irqentry_text arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-user-map-trace-irqentry_text	2017-11-10 11:22:13.763244938 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:13.768244938 -0800
@@ -30,6 +30,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/init.h>
+#include <linux/interrupt.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
 #include <linux/uaccess.h>
@@ -382,6 +383,19 @@ void __init kaiser_init(void)
 	 */
 	kaiser_add_user_map_early(get_cpu_gdt_ro(0), PAGE_SIZE,
 				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
+
+	/*
+	 * .irqentry.text helps us identify code that runs before
+	 * we get a chance to call entering_irq().  This includes
+	 * the interrupt entry assembly plus the first C function
+	 * that gets called.  KAISER does not need the C code
+	 * mapped.  We just use the .irqentry.text section as-is
+	 * to avoid having to carve out a new section for the
+	 * assembly only.
+	 */
+	kaiser_add_user_map_ptrs_early(__irqentry_text_start,
+				       __irqentry_text_end,
+				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
 }
 
 int kaiser_add_mapping(unsigned long addr, unsigned long size,
diff -puN include/asm-generic/vmlinux.lds.h~kaiser-user-map-trace-irqentry_text include/asm-generic/vmlinux.lds.h
--- a/include/asm-generic/vmlinux.lds.h~kaiser-user-map-trace-irqentry_text	2017-11-10 11:22:13.765244938 -0800
+++ b/include/asm-generic/vmlinux.lds.h	2017-11-10 11:22:13.769244938 -0800
@@ -59,6 +59,12 @@
 /* Align . to a 8 byte boundary equals to maximum function alignment. */
 #define ALIGN_FUNCTION()  . = ALIGN(8)
 
+#ifdef CONFIG_KAISER
+#define ALIGN_KAISER()	. = ALIGN(PAGE_SIZE);
+#else
+#define ALIGN_KAISER()
+#endif
+
 /*
  * LD_DEAD_CODE_DATA_ELIMINATION option enables -fdata-sections, which
  * generates .data.identifier sections, which need to be pulled in with
@@ -493,15 +499,19 @@
 		VMLINUX_SYMBOL(__kprobes_text_end) = .;
 
 #define ENTRY_TEXT							\
+		ALIGN_KAISER();						\
 		ALIGN_FUNCTION();					\
 		VMLINUX_SYMBOL(__entry_text_start) = .;			\
 		*(.entry.text)						\
+		ALIGN_KAISER();						\
 		VMLINUX_SYMBOL(__entry_text_end) = .;
 
 #define IRQENTRY_TEXT							\
+		ALIGN_KAISER();						\
 		ALIGN_FUNCTION();					\
 		VMLINUX_SYMBOL(__irqentry_text_start) = .;		\
 		*(.irqentry.text)					\
+		ALIGN_KAISER();						\
 		VMLINUX_SYMBOL(__irqentry_text_end) = .;
 
 #define SOFTIRQENTRY_TEXT						\
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 17/30] x86, kaiser: map debug IDT tables
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

The IDT is another structure which the CPU references via a
virtual address.  The CPU needs the debug IDT tables to handle an
interrupt that arrives in userspace, so they must be mapped into
the user copy of the page tables.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/mm/kaiser.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff -puN arch/x86/mm/kaiser.c~kaiser-user-map-trace-and-debug-idt arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-user-map-trace-and-debug-idt	2017-11-10 11:22:14.332244936 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:14.336244936 -0800
@@ -286,6 +286,14 @@ int kaiser_add_user_map_ptrs(const void
 				   flags);
 }
 
+static int kaiser_user_map_ptr_early(const void *start_addr, unsigned long size,
+				 unsigned long flags)
+{
+	int ret = kaiser_add_user_map(start_addr, size, flags);
+	WARN_ON(ret);
+	return ret;
+}
+
 /*
  * Ensure that the top level of the (shadow) page tables are
  * entirely populated.  This ensures that all processes that get
@@ -374,6 +382,10 @@ void __init kaiser_init(void)
 				  sizeof(gate_desc) * NR_VECTORS,
 				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
 
+	kaiser_user_map_ptr_early(&debug_idt_table,
+				  sizeof(gate_desc) * NR_VECTORS,
+				  __PAGE_KERNEL | _PAGE_GLOBAL);
+
 	/*
 	 * We could theoretically do this in setup_fixmap_gdt().
 	 * But, we would need to rewrite the above page table
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, hughd, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook, x86


From: Hugh Dickins <hughd@google.com>
[Dave] Add explicit _PAGE_GLOBAL
[Dave] remove KAISER #ifdefs by moving kmalloc() to plain page allocator
[Dave] reword the commit message a bit to be consistent with other patches

The BTS and PEBS buffers both have their virtual addresses
programmed into the hardware.  This means that any access to them
is performed via the page tables.  The times that the hardware
accesses these are entirely dependent on how the performance
monitoring hardware events are set up.  In other words, there is
no way for the kernel to tell when the hardware might access
these buffers.

To avoid perf crashes, place 'debug_store' in the user-mapped
per-cpu area instead of dynamically allocating it.  Also use the
page allocator plus kaiser_add_mapping() to keep the BTS and PEBS
buffers user-mapped (that is, present in the user mapping, though
visible only to kernel and hardware).  The PEBS fixup buffer does
not need this treatment.

The need for a user-mapped struct debug_store showed up before doing
any conscious perf testing: in a couple of kernel paging oopses on
Westmere, implicating the debug_store offset of the per-cpu area.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---
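
Note (not part of the commit): moving from kzalloc_node() to the
page allocator means each buffer occupies a power-of-two number of
whole pages, and dsfree() has to be passed the same size as
dsalloc() so that both derive the same order.  The rounding can be
sketched in user-space C as below.  sketch_get_order() is a
hypothetical stand-in written for this note, assuming 4 KiB pages
and size > 0; it is not the kernel's get_order() implementation.

```c
#include <stddef.h>

/* Assumed page geometry for the illustration (4 KiB pages). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Smallest order with (PAGE_SIZE << order) >= size, for size > 0. */
static unsigned int sketch_get_order(size_t size)
{
	size_t pages = (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	unsigned int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

/* Bytes actually handed out by an order-based allocation of 'size'. */
static size_t sketch_alloc_bytes(size_t size)
{
	return SKETCH_PAGE_SIZE << sketch_get_order(size);
}
```

A five-page request, for example, rounds up to order 3 (eight
pages), so freeing with a different size could compute a different
order than the allocation used.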

 b/arch/x86/events/intel/ds.c |   49 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 12 deletions(-)

diff -puN arch/x86/events/intel/ds.c~kaiser-user-map-virtually-addressed-performance-monitoring-buffers arch/x86/events/intel/ds.c
--- a/arch/x86/events/intel/ds.c~kaiser-user-map-virtually-addressed-performance-monitoring-buffers	2017-11-10 11:22:14.866244935 -0800
+++ b/arch/x86/events/intel/ds.c	2017-11-10 11:22:14.869244935 -0800
@@ -2,11 +2,15 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 
+#include <asm/kaiser.h>
 #include <asm/perf_event.h>
 #include <asm/insn.h>
 
 #include "../perf_event.h"
 
+static
+DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store, cpu_debug_store);
+
 /* The size of a BTS record in bytes: */
 #define BTS_RECORD_SIZE		24
 
@@ -278,6 +282,31 @@ void fini_debug_store_on_cpu(int cpu)
 
 static DEFINE_PER_CPU(void *, insn_buffer);
 
+static void *dsalloc(size_t size, gfp_t flags, int node)
+{
+	unsigned int order = get_order(size);
+	struct page *page;
+	unsigned long addr;
+
+	page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+	if (!page)
+		return NULL;
+	addr = (unsigned long)page_address(page);
+	if (kaiser_add_mapping(addr, size, __PAGE_KERNEL | _PAGE_GLOBAL) < 0) {
+		__free_pages(page, order);
+		addr = 0;
+	}
+	return (void *)addr;
+}
+
+static void dsfree(const void *buffer, size_t size)
+{
+	if (!buffer)
+		return;
+	kaiser_remove_mapping((unsigned long)buffer, size);
+	free_pages((unsigned long)buffer, get_order(size));
+}
+
 static int alloc_pebs_buffer(int cpu)
 {
 	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
@@ -288,7 +317,7 @@ static int alloc_pebs_buffer(int cpu)
 	if (!x86_pmu.pebs)
 		return 0;
 
-	buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
+	buffer = dsalloc(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
 	if (unlikely(!buffer))
 		return -ENOMEM;
 
@@ -299,7 +328,7 @@ static int alloc_pebs_buffer(int cpu)
 	if (x86_pmu.intel_cap.pebs_format < 2) {
 		ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
 		if (!ibuffer) {
-			kfree(buffer);
+			dsfree(buffer, x86_pmu.pebs_buffer_size);
 			return -ENOMEM;
 		}
 		per_cpu(insn_buffer, cpu) = ibuffer;
@@ -325,7 +354,8 @@ static void release_pebs_buffer(int cpu)
 	kfree(per_cpu(insn_buffer, cpu));
 	per_cpu(insn_buffer, cpu) = NULL;
 
-	kfree((void *)(unsigned long)ds->pebs_buffer_base);
+	dsfree((void *)(unsigned long)ds->pebs_buffer_base,
+			x86_pmu.pebs_buffer_size);
 	ds->pebs_buffer_base = 0;
 }
 
@@ -339,7 +369,7 @@ static int alloc_bts_buffer(int cpu)
 	if (!x86_pmu.bts)
 		return 0;
 
-	buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+	buffer = dsalloc(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
 	if (unlikely(!buffer)) {
 		WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
 		return -ENOMEM;
@@ -365,19 +395,15 @@ static void release_bts_buffer(int cpu)
 	if (!ds || !x86_pmu.bts)
 		return;
 
-	kfree((void *)(unsigned long)ds->bts_buffer_base);
+	dsfree((void *)(unsigned long)ds->bts_buffer_base, BTS_BUFFER_SIZE);
 	ds->bts_buffer_base = 0;
 }
 
 static int alloc_ds_buffer(int cpu)
 {
-	int node = cpu_to_node(cpu);
-	struct debug_store *ds;
-
-	ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node);
-	if (unlikely(!ds))
-		return -ENOMEM;
+	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
 
+	memset(ds, 0, sizeof(*ds));
 	per_cpu(cpu_hw_events, cpu).ds = ds;
 
 	return 0;
@@ -391,7 +417,6 @@ static void release_ds_buffer(int cpu)
 		return;
 
 	per_cpu(cpu_hw_events, cpu).ds = NULL;
-	kfree(ds);
 }
 
 void release_ds_buffers(void)
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 19/30] x86, mm: Move CR3 construction functions
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

For flushing the TLB, the ASID which has been programmed into the
hardware must be known.  That differs from what is in 'cpu_tlbstate'.

Add functions to transform the 'cpu_tlbstate' values into the one
programmed into the hardware (CR3).

It's not easy to include mmu_context.h into tlbflush.h, so just move
the CR3 building over to tlbflush.h.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---
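
Note (not part of the commit): the ASID-to-CR3 transformation that
build_cr3() and build_cr3_noflush() perform can be sketched in
user-space C as below.  The sketch_* names are hypothetical, the
SME encryption mask applied by __sme_pa() is deliberately left
out, and the no-flush hint is assumed to be bit 63 of CR3.
assert() stands in for VM_WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: bit 63 of CR3 is the "do not flush" hint under PCID. */
#define SKETCH_CR3_NOFLUSH (UINT64_C(1) << 63)
/* Largest ASID the patch accepts: ASID + 1 must fit in 12 PCID bits. */
#define SKETCH_MAX_ASID 4094

/* Mirrors build_cr3(): the PCID field carries ASID + 1 when PCID is on. */
static uint64_t sketch_build_cr3(uint64_t pgd_pa, uint16_t asid, int have_pcid)
{
	/* The page-table root is page aligned, so the low 12 bits are free. */
	assert((pgd_pa & 0xfff) == 0);

	if (!have_pcid) {
		assert(asid == 0);
		return pgd_pa;
	}
	assert(asid <= SKETCH_MAX_ASID);
	return pgd_pa | (uint64_t)(asid + 1);
}

/* Mirrors build_cr3_noflush(): same value plus the no-flush hint. */
static uint64_t sketch_build_cr3_noflush(uint64_t pgd_pa, uint16_t asid)
{
	assert((pgd_pa & 0xfff) == 0);
	assert(asid <= SKETCH_MAX_ASID);
	return pgd_pa | (uint64_t)(asid + 1) | SKETCH_CR3_NOFLUSH;
}

/* Inverse direction: recover the ASID from a PCID-enabled CR3 value. */
static uint16_t sketch_cr3_to_asid(uint64_t cr3)
{
	return (uint16_t)((cr3 & 0xfff) - 1);
}
```

For a page-table root at 0x1000, ASID 0 yields 0x1001 with PCID
and 0x1000 without, which is why the value in 'cpu_tlbstate' and
the value in the hardware register differ by one.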

 b/arch/x86/include/asm/mmu_context.h |   29 +----------------------------
 b/arch/x86/include/asm/tlbflush.h    |   27 +++++++++++++++++++++++++++
 b/arch/x86/mm/tlb.c                  |    8 ++++----
 3 files changed, 32 insertions(+), 32 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~kaiser-pcid-pre-build-func-move arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~kaiser-pcid-pre-build-func-move	2017-11-10 11:22:15.405244934 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2017-11-10 11:22:15.412244934 -0800
@@ -281,33 +281,6 @@ static inline bool arch_vma_access_permi
 }
 
 /*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
-
-static inline unsigned long build_cr3(struct mm_struct *mm, u16 asid)
-{
-	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
-		return __sme_pa(mm->pgd) | (asid + 1);
-	} else {
-		VM_WARN_ON_ONCE(asid != 0);
-		return __sme_pa(mm->pgd);
-	}
-}
-
-static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
-{
-	VM_WARN_ON_ONCE(asid > 4094);
-	return __sme_pa(mm->pgd) | (asid + 1) | CR3_NOFLUSH;
-}
-
-/*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
  *
@@ -316,7 +289,7 @@ static inline unsigned long build_cr3_no
  */
 static inline unsigned long __get_current_cr3_fast(void)
 {
-	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm),
+	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
 		this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 
 	/* For now, be very restrictive about when this can be called. */
diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-func-move arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-func-move	2017-11-10 11:22:15.407244934 -0800
+++ b/arch/x86/include/asm/tlbflush.h	2017-11-10 11:22:15.412244934 -0800
@@ -74,6 +74,33 @@ static inline u64 inc_mm_tlb_gen(struct
 	return new_tlb_gen;
 }
 
+/*
+ * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+ * bits.  This serves two purposes.  It prevents a nasty situation in
+ * which PCID-unaware code saves CR3, loads some other value (with PCID
+ * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+ * the saved ASID was nonzero.  It also means that any bugs involving
+ * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+ * deterministically.
+ */
+struct pgd_t;
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+{
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		VM_WARN_ON_ONCE(asid > 4094);
+		return __sme_pa(pgd) | (asid + 1);
+	} else {
+		VM_WARN_ON_ONCE(asid != 0);
+		return __sme_pa(pgd);
+	}
+}
+
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > 4094);
+	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
diff -puN arch/x86/mm/tlb.c~kaiser-pcid-pre-build-func-move arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~kaiser-pcid-pre-build-func-move	2017-11-10 11:22:15.408244934 -0800
+++ b/arch/x86/mm/tlb.c	2017-11-10 11:22:15.412244934 -0800
@@ -127,7 +127,7 @@ void switch_mm_irqs_off(struct mm_struct
 	 * isn't free.
 	 */
 #ifdef CONFIG_DEBUG_VM
-	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev, prev_asid))) {
+	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
 		/*
 		 * If we were to BUG here, we'd be very likely to kill
 		 * the system so hard that we don't see the call trace.
@@ -194,12 +194,12 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next, new_asid));
+			write_cr3(build_cr3(next->pgd, new_asid));
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
 					TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next, new_asid));
+			write_cr3(build_cr3_noflush(next->pgd, new_asid));
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
 		}
 
@@ -277,7 +277,7 @@ void initialize_tlbstate_and_flush(void)
 		!(cr4_read_shadow() & X86_CR4_PCIDE));
 
 	/* Force ASID 0 and force a TLB flush. */
-	write_cr3(build_cr3(mm, 0));
+	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 19/30] x86, mm: Move CR3 construction functions
@ 2017-11-10 19:31   ` Dave Hansen
  0 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

For flushing the TLB, the ASID which has been programmed into the
hardware must be known.  That differs from what is in 'cpu_tlbstate'.

Add functions to transform the 'cpu_tlbstate' values into the one
programmed into the hardware (CR3).

It's not easy to include mmu_context.h into tlbflush.h, so just move
the CR3 building over to tlbflush.h.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/mmu_context.h |   29 +----------------------------
 b/arch/x86/include/asm/tlbflush.h    |   27 +++++++++++++++++++++++++++
 b/arch/x86/mm/tlb.c                  |    8 ++++----
 3 files changed, 32 insertions(+), 32 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~kaiser-pcid-pre-build-func-move arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~kaiser-pcid-pre-build-func-move	2017-11-10 11:22:15.405244934 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2017-11-10 11:22:15.412244934 -0800
@@ -281,33 +281,6 @@ static inline bool arch_vma_access_permi
 }
 
 /*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
-
-static inline unsigned long build_cr3(struct mm_struct *mm, u16 asid)
-{
-	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
-		return __sme_pa(mm->pgd) | (asid + 1);
-	} else {
-		VM_WARN_ON_ONCE(asid != 0);
-		return __sme_pa(mm->pgd);
-	}
-}
-
-static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
-{
-	VM_WARN_ON_ONCE(asid > 4094);
-	return __sme_pa(mm->pgd) | (asid + 1) | CR3_NOFLUSH;
-}
-
-/*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
  *
@@ -316,7 +289,7 @@ static inline unsigned long build_cr3_no
  */
 static inline unsigned long __get_current_cr3_fast(void)
 {
-	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm),
+	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
 		this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 
 	/* For now, be very restrictive about when this can be called. */
diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-func-move arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-func-move	2017-11-10 11:22:15.407244934 -0800
+++ b/arch/x86/include/asm/tlbflush.h	2017-11-10 11:22:15.412244934 -0800
@@ -74,6 +74,33 @@ static inline u64 inc_mm_tlb_gen(struct
 	return new_tlb_gen;
 }
 
+/*
+ * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+ * bits.  This serves two purposes.  It prevents a nasty situation in
+ * which PCID-unaware code saves CR3, loads some other value (with PCID
+ * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+ * the saved ASID was nonzero.  It also means that any bugs involving
+ * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+ * deterministically.
+ */
+struct pgd_t;
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+{
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		VM_WARN_ON_ONCE(asid > 4094);
+		return __sme_pa(pgd) | (asid + 1);
+	} else {
+		VM_WARN_ON_ONCE(asid != 0);
+		return __sme_pa(pgd);
+	}
+}
+
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > 4094);
+	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
diff -puN arch/x86/mm/tlb.c~kaiser-pcid-pre-build-func-move arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~kaiser-pcid-pre-build-func-move	2017-11-10 11:22:15.408244934 -0800
+++ b/arch/x86/mm/tlb.c	2017-11-10 11:22:15.412244934 -0800
@@ -127,7 +127,7 @@ void switch_mm_irqs_off(struct mm_struct
 	 * isn't free.
 	 */
 #ifdef CONFIG_DEBUG_VM
-	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev, prev_asid))) {
+	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
 		/*
 		 * If we were to BUG here, we'd be very likely to kill
 		 * the system so hard that we don't see the call trace.
@@ -194,12 +194,12 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next, new_asid));
+			write_cr3(build_cr3(next->pgd, new_asid));
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
 					TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next, new_asid));
+			write_cr3(build_cr3_noflush(next->pgd, new_asid));
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
 		}
 
@@ -277,7 +277,7 @@ void initialize_tlbstate_and_flush(void)
 		!(cr4_read_shadow() & X86_CR4_PCIDE));
 
 	/* Force ASID 0 and force a TLB flush. */
-	write_cr3(build_cr3(mm, 0));
+	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 20/30] x86, mm: remove hard-coded ASID limit checks
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

First, it's nice to remove the magic numbers.

Second, KAISER is going to consume half of the available ASID
space.  The space is currently unused, but add a comment to spell
out this new restriction.
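
As a quick check on the arithmetic behind the new macro, here is a
minimal userspace sketch (not part of the patch).  The helper name
max_asid_available() is made up for illustration; it mirrors the
MAX_ASID_AVAILABLE definition with the number of consumed bits
passed as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors CR3_HW_ASID_BITS from the patch: CR3 has 12 PCID bits. */
#define CR3_HW_ASID_BITS 12

/*
 * Hypothetical helper (illustration only): the largest usable
 * software ASID when 'consumed_bits' of the hardware field are
 * set aside.  One -1 because ASIDs are zero-based, another -1
 * because hardware PCID 0 is reserved for non-PCID-aware users.
 */
static inline uint16_t max_asid_available(unsigned int consumed_bits)
{
	assert(consumed_bits < CR3_HW_ASID_BITS);
	return (uint16_t)((1u << (CR3_HW_ASID_BITS - consumed_bits)) - 2);
}
```

With no bits consumed this gives 4094, the old hard-coded limit.
With one bit consumed it gives 2046, i.e. half of the space.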

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-asids-macros arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-asids-macros	2017-11-10 11:22:15.990244932 -0800
+++ b/arch/x86/include/asm/tlbflush.h	2017-11-10 11:22:15.994244932 -0800
@@ -74,6 +74,19 @@ static inline u64 inc_mm_tlb_gen(struct
 	return new_tlb_gen;
 }
 
+/* There are 12 bits of space for ASIDS in CR3 */
+#define CR3_HW_ASID_BITS 12
+/* When enabled, KAISER consumes a single bit for user/kernel switches */
+#define KAISER_CONSUMED_ASID_BITS 0
+
+#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS-KAISER_CONSUMED_ASID_BITS)
+/*
+ * ASIDs are zero-based: 0->MAX_ASID_AVAILABLE are valid.  -1 below
+ * to account for them being zero-based.  Another -1 is because ASID 0
+ * is reserved for use by non-PCID-aware users.
+ */
+#define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
+
 /*
  * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
  * bits.  This serves two purposes.  It prevents a nasty situation in
@@ -87,7 +100,7 @@ struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
+		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 		return __sme_pa(pgd) | (asid + 1);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
@@ -97,7 +110,7 @@ static inline unsigned long build_cr3(pg
 
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
-	VM_WARN_ON_ONCE(asid > 4094);
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
 }
 
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 21/30] x86, mm: put mmu-to-h/w ASID translation in one place
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

There are effectively two ASID types:
1. The one stored in the mmu_context that goes from 0->5
2. The one programmed into the hardware that goes from 1->6

This consolidates the places that convert between the two (by
doing +1) into a single function, which also gives us a nice place
to comment.  KAISER will also need to know, given an ASID, which
hardware ASID to flush for the userspace mapping.
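
The translation can be tried out in userspace with a rough sketch
like the one below.  It is not the kernel code: sketch_build_cr3()
is a made-up stand-in for build_cr3(), it takes a plain physical
address instead of a pgd_t pointer, and it assumes PCID is enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Same limit as the patch with 12 available bits: (1 << 12) - 2. */
#define MAX_ASID_AVAILABLE ((1u << 12) - 2)

/* mmu_context ASID -> hardware PCID: +1 so that PCID 0 stays unused. */
static inline uint16_t kern_asid(uint16_t asid)
{
	assert(asid <= MAX_ASID_AVAILABLE);
	return (uint16_t)(asid + 1);
}

/*
 * Stand-in for build_cr3() (illustration only).  'pgd_pa' is a
 * page-aligned physical address, so its low 12 bits are free to
 * carry the hardware PCID.
 */
static inline uint64_t sketch_build_cr3(uint64_t pgd_pa, uint16_t asid)
{
	assert((pgd_pa & 0xfffull) == 0);
	return pgd_pa | kern_asid(asid);
}
```

So mmu_context ASID 5 ends up as hardware PCID 6 in the low bits
of the CR3 value.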

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |   30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-kern arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-kern	2017-11-10 11:22:16.521244931 -0800
+++ b/arch/x86/include/asm/tlbflush.h	2017-11-10 11:22:16.525244931 -0800
@@ -87,21 +87,26 @@ static inline u64 inc_mm_tlb_gen(struct
  */
 #define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
 
-/*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
+static inline u16 kern_asid(u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+	 * bits.  This serves two purposes.  It prevents a nasty situation in
+	 * which PCID-unaware code saves CR3, loads some other value (with PCID
+	 * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+	 * the saved ASID was nonzero.  It also means that any bugs involving
+	 * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+	 * deterministically.
+	 */
+	return asid + 1;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-		return __sme_pa(pgd) | (asid + 1);
+		return __sme_pa(pgd) | kern_asid(asid);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
 		return __sme_pa(pgd);
@@ -111,7 +116,8 @@ static inline unsigned long build_cr3(pg
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+	VM_WARN_ON_ONCE(!this_cpu_has(X86_FEATURE_PCID));
+	return __sme_pa(pgd) | kern_asid(asid) | CR3_NOFLUSH;
 }
 
 #ifdef CONFIG_PARAVIRT
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 22/30] x86, pcid, kaiser: allow flushing for future ASID switches
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

If changing the page tables in such a way that an invalidation of
all contexts (aka. PCIDs / ASIDs) is required, they can be
actively invalidated by:

 1. INVPCID for each PCID (works for single pages too).
 2. Load CR3 with each PCID without the NOFLUSH bit set
 3. Load CR3 with the NOFLUSH bit set for each and do
    INVLPG for each address.

But, none of these are really feasible since there are ~6 ASIDs (12 with
KAISER) at the time that invalidation is required.  Instead of
actively invalidating them, invalidate the *current* context and
also mark the cpu_tlbstate _quickly_ to indicate future invalidation
to be required.

At the next context switch, look for this indicator
('all_other_ctxs_invalid' being set) and invalidate all of the
cpu_tlbstate.ctxs[] entries.

This ensures that any future context switches will do a full flush
of the TLB, picking up the previous changes.
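
The deferred scheme can be modelled in userspace roughly as below.
This is only a sketch: struct tlb_state_sketch, mark_other_ctxs_invalid()
and find_cached_asid() are invented names standing in for the per-cpu
cpu_tlbstate, tlb_flush_shared_nonglobals() and the lookup loop in
choose_new_asid(), and the state is passed explicitly instead of
being per-cpu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_NR_DYN_ASIDS 6

struct tlb_ctx {
	uint64_t ctx_id;	/* 0 means "nothing cached in this slot" */
};

/* Simplified stand-in for the per-cpu cpu_tlbstate. */
struct tlb_state_sketch {
	uint16_t loaded_mm_asid;
	bool all_other_ctxs_invalid;
	struct tlb_ctx ctxs[TLB_NR_DYN_ASIDS];
};

/* Cheap part, done at flush time: just remember that work is pending. */
static void mark_other_ctxs_invalid(struct tlb_state_sketch *ts)
{
	ts->all_other_ctxs_invalid = true;
}

/* Deferred part: forget every cached context except the loaded one. */
static void clear_non_loaded_ctxs(struct tlb_state_sketch *ts)
{
	for (uint16_t asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
		if (asid == ts->loaded_mm_asid)
			continue;
		ts->ctxs[asid].ctx_id = 0;
	}
	ts->all_other_ctxs_invalid = false;
}

/*
 * At context switch: return the slot that still caches 'ctx_id',
 * or -1 if none does and the caller has to flush.
 */
static int find_cached_asid(struct tlb_state_sketch *ts, uint64_t ctx_id)
{
	assert(ctx_id != 0);	/* real ctx_ids start at 1 */
	if (ts->all_other_ctxs_invalid)
		clear_non_loaded_ctxs(ts);
	for (int asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
		if (ts->ctxs[asid].ctx_id == ctx_id)
			return asid;
	}
	return -1;
}
```

Once the flag has been set, a lookup for any context other than the
loaded one misses, which is what forces the full flush on the next
switch to it.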

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |   47 +++++++++++++++++++++++++++++---------
 b/arch/x86/mm/tlb.c               |   35 ++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 10 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-clear-pcid-cache arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-clear-pcid-cache	2017-11-10 11:22:17.055244930 -0800
+++ b/arch/x86/include/asm/tlbflush.h	2017-11-10 11:22:17.060244930 -0800
@@ -184,6 +184,17 @@ struct tlb_state {
 	bool is_lazy;
 
 	/*
+	 * If set we changed the page tables in such a way that we
+	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
+	 * This tells us to go invalidate all the non-loaded ctxs[]
+	 * on the next context switch.
+	 *
+	 * The current ctx was kept up-to-date as it ran and does not
+	 * need to be invalidated.
+	 */
+	bool all_other_ctxs_invalid;
+
+	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
 	 */
@@ -260,6 +271,19 @@ static inline unsigned long cr4_read_sha
 	return this_cpu_read(cpu_tlbstate.cr4);
 }
 
+static inline void tlb_flush_shared_nonglobals(void)
+{
+	/*
+	 * With global pages, all of the shared kernel page tables
+	 * are set as _PAGE_GLOBAL.  We have no shared nonglobals
+	 * and nothing to do here.
+	 */
+	if (IS_ENABLED(CONFIG_X86_GLOBAL_PAGES))
+		return;
+
+	this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, true);
+}
+
 /*
  * Save some of cr4 feature set we're using (e.g.  Pentium 4MB
  * enable and PPro Global page enable), so that any CPU's that boot
@@ -289,6 +313,10 @@ static inline void __native_flush_tlb(vo
 	preempt_disable();
 	native_write_cr3(__native_read_cr3());
 	preempt_enable();
+	/*
+	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
+	 * without PCIDs flushes all non-globals.
+	 */
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -348,24 +376,23 @@ static inline void __native_flush_tlb_si
 
 static inline void __flush_tlb_all(void)
 {
-	if (boot_cpu_has(X86_FEATURE_PGE))
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
 		__flush_tlb_global();
-	else
+	} else {
 		__flush_tlb();
-
-	/*
-	 * Note: if we somehow had PCID but not PGE, then this wouldn't work --
-	 * we'd end up flushing kernel translations for the current ASID but
-	 * we might fail to flush kernel translations for other cached ASIDs.
-	 *
-	 * To avoid this issue, we force PCID off if PGE is off.
-	 */
+		tlb_flush_shared_nonglobals();
+	}
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
 {
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
 	__flush_tlb_single(addr);
+	/*
+	 * Invalidate other address spaces inaccessible to single-page
+	 * invalidation:
+	 */
+	tlb_flush_shared_nonglobals();
 }
 
 #define TLB_FLUSH_ALL	-1UL
diff -puN arch/x86/mm/tlb.c~kaiser-pcid-pre-clear-pcid-cache arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~kaiser-pcid-pre-clear-pcid-cache	2017-11-10 11:22:17.057244930 -0800
+++ b/arch/x86/mm/tlb.c	2017-11-10 11:22:17.060244930 -0800
@@ -28,6 +28,38 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+/*
+ * We get here when we do something requiring a TLB invalidation
+ * but could not go invalidate all of the contexts.  We do the
+ * necessary invalidation by clearing out the 'ctx_id' which
+ * forces a TLB flush when the context is loaded.
+ */
+void clear_non_loaded_ctxs(void)
+{
+	u16 asid;
+
+	/*
+	 * This is only expected to be set if we have disabled
+	 * kernel _PAGE_GLOBAL pages.
+	 */
+	if (IS_ENABLED(CONFIG_X86_GLOBAL_PAGES)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
+		/* Do not need to flush the current asid */
+		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
+			continue;
+		/*
+		 * Make sure the next time we go to switch to
+		 * this asid, we do a flush:
+		 */
+		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
+	}
+	this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, false);
+}
+
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
 
@@ -42,6 +74,9 @@ static void choose_new_asid(struct mm_st
 		return;
 	}
 
+	if (this_cpu_read(cpu_tlbstate.all_other_ctxs_invalid))
+		clear_non_loaded_ctxs();
+
 	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
 		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
 		    next->context.ctx_id)
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

Short summary: Use x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls.  Speed them up.  Makes context switches
and TLB flushing slower.

Background:

KAISER keeps two copies of the page tables.  Switches between the
copies are performed by writing to the CR3 register.  But, CR3
was really designed for context switches and writes to it also
flush the entire TLB (modulo global pages).  This TLB flush
increases the cost of interrupts and context switches.  For
syscall-heavy microbenchmarks it can cut the rate of syscalls by
2/3.

The kernel recently gained support for an Intel CPU feature
called Process Context IDentifiers (PCID) thanks to Andy
Lutomirski.  This feature is intended to allow you to switch
between contexts without flushing the TLB.

Implementation:

PCIDs can be used to avoid flushing the TLB at kernel entry/exit.
This speeds up both interrupts and syscalls.

First, the kernel and userspace must be assigned different ASIDs.
On entry from userspace, move over to the kernel page tables
*and* ASID.  On exit, restore the user page tables and ASID.
Fortunately, the ASID is programmed via CR3, which is already
being used to switch between the user and kernel page tables.
This gives us convenient, one-stop shopping.

The CR3 write which is used to switch between processes provides
all the TLB flushing normally required at context switch time.
But, with KAISER, that CR3 write only flushes the current
(kernel) ASID.  An extra TLB flush operation is now required in
order to flush the user ASID.  This new instruction (INVPCID) is
probably ~100 cycles, but this is done with the assumption that
the time lost in context switches is more than made up for by
lower cost of interrupts and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs.  However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", or called fourth-generation Core).  This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3 and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use.  TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive.  This is because
   there is no way to flush a kernel address that might be loaded
   in *EVERY* PCID.  Right now, there are "only" ~12 of these per-cpu,
   but that's too painful to use the MOV-to-CR3 to flush them.  That
   leaves only the MOV-to-CR4.
2. Every userspace flush (even for a single page) requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for
      the kernel ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the
      flush done for each.  For instance, what is currently a
      single instruction without KAISER:

		invpcid_flush_one(current_pcid, addr);

      becomes this with KAISER:

		invpcid_flush_one(current_kern_pcid, addr);
		invpcid_flush_one(current_user_pcid, addr);

      and this without INVPCID:

		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, fully disable PCIDs with KAISER when INVPCID is not
available.  This is fixable, but it's an optimization that can be
performed later.
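
To make the "~12 of these per-cpu" in item 1 concrete, here is a small
sketch that enumerates the PCIDs a kernel address could be cached
under.  It assumes the patch's TLB_NR_DYN_ASIDS of 6, the ASID+1 offset
from kern_asid() and switch bit 11; kern_pcid(), user_pcid() and
pcids_to_flush() are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_NR_DYN_ASIDS           6
#define X86_CR3_KAISER_SWITCH_BIT  11

/* Kernel PCID: the dynamic ASID plus one, as in kern_asid(). */
static uint16_t kern_pcid(uint16_t asid)
{
	assert(asid < TLB_NR_DYN_ASIDS);
	return (uint16_t)(asid + 1);
}

/* User PCID: the kernel PCID with the switch bit set, as in user_asid(). */
static uint16_t user_pcid(uint16_t asid)
{
	return (uint16_t)(kern_pcid(asid) | (1u << X86_CR3_KAISER_SWITCH_BIT));
}

/* Fill out[] with every PCID that may hold a stale entry; return the count. */
static size_t pcids_to_flush(uint16_t out[2 * TLB_NR_DYN_ASIDS])
{
	size_t n = 0;

	for (uint16_t asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
		out[n++] = kern_pcid(asid);
		out[n++] = user_pcid(asid);
	}
	return n;
}
```

Without INVPCID, each entry in that list would need its own CR3 switch
to flush a single address, which is why the MOV-to-CR4 global flush is
the only practical fallback.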

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER.  The first way they can be used
is as "TLB preservation across context-switch", which is what
Andy Lutomirski's 4.14 PCID code does.  They can also be used as
a "KAISER syscall/interrupt accelerator".  If we just use them to
speed up syscall/interrupts (and ignore the context-switch TLB
preservation), then the deficiency of not having INVPCID
becomes much less onerous.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/entry/calling.h                    |   25 +++-
 b/arch/x86/entry/entry_64.S                   |    1 
 b/arch/x86/include/asm/cpufeatures.h          |    1 
 b/arch/x86/include/asm/pgtable_types.h        |   11 ++
 b/arch/x86/include/asm/tlbflush.h             |  137 +++++++++++++++++++++-----
 b/arch/x86/include/uapi/asm/processor-flags.h |    3 
 b/arch/x86/kvm/x86.c                          |    3 
 b/arch/x86/mm/init.c                          |   75 +++++++++-----
 b/arch/x86/mm/tlb.c                           |   66 ++++++++++++
 9 files changed, 262 insertions(+), 60 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-pcid arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-pcid	2017-11-10 11:22:17.618244928 -0800
+++ b/arch/x86/entry/calling.h	2017-11-10 11:22:17.637244928 -0800
@@ -2,6 +2,7 @@
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
 #include <asm/page_types.h>
+#include <asm/pgtable_types.h>
 
 /*
 
@@ -191,16 +192,20 @@ For 32-bit we have the following convent
 #ifdef CONFIG_KAISER
 
 /* KAISER PGDs are 8k.  We flip bit 12 to switch between the two halves: */
-#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_PGTABLES_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_MASK     (KAISER_SWITCH_PGTABLES_MASK|\
+				(1<<X86_CR3_KAISER_SWITCH_BIT))
 
 .macro ADJUST_KERNEL_CR3 reg:req
-	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
-	andq	$(~KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+	/* Clear PCID and "KAISER bit", point CR3 at kernel pagetables: */
+	andq    $(~KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro ADJUST_USER_CR3 reg:req
-	/* Move CR3 up a page to the user page tables: */
-	orq	$(KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+	/* Set user PCID bit, and move CR3 up a page to the user page tables: */
+	orq     $(KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -219,8 +224,14 @@ For 32-bit we have the following convent
 	movq	%cr3, %r\scratch_reg
 	movq	%r\scratch_reg, \save_reg
 	/*
-	 * Is the switch bit zero?  This means the address is
-	 * up in real KAISER patches in a moment.
+	 * Is the "switch mask" all zero?  That means that both of
+	 * these are zero:
+	 *
+	 *	1. The user/kernel PCID bit, and
+	 *	2. The user/kernel "bit" that points CR3 to the
+	 *	   bottom half of the 8k PGD
+	 *
+	 * That indicates a kernel CR3 value, not user/shadow.
 	 */
 	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
 	jz	.Ldone_\@
diff -puN arch/x86/entry/entry_64.S~kaiser-pcid arch/x86/entry/entry_64.S
--- a/arch/x86/entry/entry_64.S~kaiser-pcid	2017-11-10 11:22:17.620244928 -0800
+++ b/arch/x86/entry/entry_64.S	2017-11-10 11:22:17.637244928 -0800
@@ -602,6 +602,7 @@ END(irq_entries_start)
 	 * tracking that we're in kernel mode.
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 	movq	%rsp, %rdi			/* pt_regs pointer */
 	call	sync_regs
diff -puN arch/x86/include/asm/cpufeatures.h~kaiser-pcid arch/x86/include/asm/cpufeatures.h
--- a/arch/x86/include/asm/cpufeatures.h~kaiser-pcid	2017-11-10 11:22:17.622244928 -0800
+++ b/arch/x86/include/asm/cpufeatures.h	2017-11-10 11:22:17.638244928 -0800
@@ -198,6 +198,7 @@
 #define X86_FEATURE_CAT_L3	( 7*32+ 4) /* Cache Allocation Technology L3 */
 #define X86_FEATURE_CAT_L2	( 7*32+ 5) /* Cache Allocation Technology L2 */
 #define X86_FEATURE_CDP_L3	( 7*32+ 6) /* Code and Data Prioritization L3 */
+#define X86_FEATURE_INVPCID_SINGLE ( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
 
 #define X86_FEATURE_HW_PSTATE	( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
diff -puN arch/x86/include/asm/pgtable_types.h~kaiser-pcid arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~kaiser-pcid	2017-11-10 11:22:17.623244928 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2017-11-10 11:22:17.638244928 -0800
@@ -139,6 +139,17 @@
 			 _PAGE_SOFT_DIRTY)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
+/* The ASID is the lower 12 bits of CR3 */
+#define X86_CR3_PCID_ASID_MASK  (_AC((1<<12)-1, UL))
+
+/* Mask for all the PCID-related bits in CR3: */
+#define X86_CR3_PCID_MASK       (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_MASK)
+
+/* Make sure this is only usable in KAISER #ifdef'd code: */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#endif
+
 /*
  * The cache modes defined here are used to translate between pure SW usage
  * and the HW defined cache mode bits and/or PAT entries.
diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid	2017-11-10 11:22:17.625244928 -0800
+++ b/arch/x86/include/asm/tlbflush.h	2017-11-10 11:22:17.638244928 -0800
@@ -77,7 +77,12 @@ static inline u64 inc_mm_tlb_gen(struct
 /* There are 12 bits of space for ASIDS in CR3 */
 #define CR3_HW_ASID_BITS 12
 /* When enabled, KAISER consumes a single bit for user/kernel switches */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#define KAISER_CONSUMED_ASID_BITS 1
+#else
 #define KAISER_CONSUMED_ASID_BITS 0
+#endif
 
 #define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS-KAISER_CONSUMED_ASID_BITS)
 /*
@@ -87,21 +92,62 @@ static inline u64 inc_mm_tlb_gen(struct
  */
 #define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define TLB_NR_DYN_ASIDS 6
+
 static inline u16 kern_asid(u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+
+#ifdef CONFIG_KAISER
+	/*
+	 * Make sure that the dynamic ASID space does not conflict
+	 * with the bit we are using to switch between user and
+	 * kernel ASIDs.
+	 */
+	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1<<X86_CR3_KAISER_SWITCH_BIT));
+
 	/*
-	 * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
-	 * bits.  This serves two purposes.  It prevents a nasty situation in
-	 * which PCID-unaware code saves CR3, loads some other value (with PCID
-	 * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
-	 * the saved ASID was nonzero.  It also means that any bugs involving
-	 * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
-	 * deterministically.
+	 * The ASID being passed in here should have respected
+	 * the MAX_ASID_AVAILABLE and thus never have the switch
+	 * bit set.
+	 */
+	VM_WARN_ON_ONCE(asid & (1<<X86_CR3_KAISER_SWITCH_BIT));
+#endif
+	/*
+	 * The dynamically-assigned ASIDs that get passed in are
+	 * small (<TLB_NR_DYN_ASIDS).  They never have the high
+	 * switch bit set, so do not bother to clear it.
+	 */
+
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1
+	 * into the PCID bits.  This serves two purposes.  It
+	 * prevents a nasty situation in which PCID-unaware code
+	 * saves CR3, loads some other value (with PCID == 0),
+	 * and then restores CR3, thus corrupting the TLB for
+	 * ASID 0 if the saved ASID was nonzero.  It also means
+	 * that any bugs involving loading a PCID-enabled CR3
+	 * with CR4.PCIDE off will trigger deterministically.
 	 */
 	return asid + 1;
 }
 
+/*
+ * The user ASID is just the kernel one, plus the "switch bit".
+ */
+static inline u16 user_asid(u16 asid)
+{
+	u16 ret = kern_asid(asid);
+#ifdef CONFIG_KAISER
+	ret |= 1<<X86_CR3_KAISER_SWITCH_BIT;
+#endif
+	return ret;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
@@ -144,12 +190,6 @@ static inline bool tlb_defer_switch_to_i
 	return !static_cpu_has(X86_FEATURE_PCID);
 }
 
-/*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
- */
-#define TLB_NR_DYN_ASIDS 6
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -305,18 +345,42 @@ extern void initialize_tlbstate_and_flus
 
 static inline void __native_flush_tlb(void)
 {
+	if (!cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		/*
+		 * native_write_cr3() only clears the current PCID if
+		 * CR4 has X86_CR4_PCIDE set.  In other words, this does
+		 * not fully flush the TLB if PCIDs are in use.
+		 *
+		 * With KAISER and PCIDs, this means that we did not
+		 * flush the user PCID.  Warn if it gets called.
+		 */
+		if (IS_ENABLED(CONFIG_KAISER))
+			WARN_ON_ONCE(this_cpu_read(cpu_tlbstate.cr4) &
+				     X86_CR4_PCIDE);
+		/*
+		 * If current->mm == NULL then we borrow a mm
+		 * which may change during a task switch and
+		 * therefore we must not be preempted while we
+		 * write CR3 back:
+		 */
+		preempt_disable();
+		native_write_cr3(__native_read_cr3());
+		preempt_enable();
+		/*
+		 * Does not need tlb_flush_shared_nonglobals()
+		 * since the CR3 write without PCIDs flushes all
+		 * non-globals.
+		 */
+		return;
+	}
 	/*
-	 * If current->mm == NULL then we borrow a mm which may change during a
-	 * task switch and therefore we must not be preempted while we write CR3
-	 * back:
-	 */
-	preempt_disable();
-	native_write_cr3(__native_read_cr3());
-	preempt_enable();
-	/*
-	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
-	 * without PCIDs flushes all non-globals.
+	 * We are no longer using globals with KAISER, so a
+	 * "nonglobals" flush would work too. But, this is more
+	 * conservative.
+	 *
+	 * Note, this works with CR4.PCIDE=0 or 1.
 	 */
+	invpcid_flush_all();
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -352,6 +416,8 @@ static inline void __native_flush_tlb_gl
 		/*
 		 * Using INVPCID is considerably faster than a pair of writes
 		 * to CR4 sandwiched inside an IRQ flag save/restore.
+		 *
+		 * Note, this works with CR4.PCIDE=0 or 1.
 		 */
 		invpcid_flush_all();
 		return;
@@ -371,7 +437,30 @@ static inline void __native_flush_tlb_gl
 
 static inline void __native_flush_tlb_single(unsigned long addr)
 {
-	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
+	/*
+	 * Some platforms #GP if we call invpcid(type=1/2) before
+	 * CR4.PCIDE=1.  Just call invlpg in the case we are called
+	 * early.
+	 */
+	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+		return;
+	}
+	/* Flush the address out of both PCIDs. */
+	/*
+	 * An optimization here might be to determine addresses
+	 * that are only kernel-mapped and only flush the kernel
+	 * ASID.  But, userspace flushes are probably much more
+	 * important performance-wise.
+	 *
+	 * Make sure to do only a single invpcid when KAISER is
+	 * disabled and we have only a single ASID.
+	 */
+	if (kern_asid(loaded_mm_asid) != user_asid(loaded_mm_asid))
+		invpcid_flush_one(user_asid(loaded_mm_asid), addr);
+	invpcid_flush_one(kern_asid(loaded_mm_asid), addr);
 }
 
 static inline void __flush_tlb_all(void)
diff -puN arch/x86/include/uapi/asm/processor-flags.h~kaiser-pcid arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~kaiser-pcid	2017-11-10 11:22:17.627244928 -0800
+++ b/arch/x86/include/uapi/asm/processor-flags.h	2017-11-10 11:22:17.639244928 -0800
@@ -77,7 +77,8 @@
 #define X86_CR3_PWT		_BITUL(X86_CR3_PWT_BIT)
 #define X86_CR3_PCD_BIT		4 /* Page Cache Disable */
 #define X86_CR3_PCD		_BITUL(X86_CR3_PCD_BIT)
-#define X86_CR3_PCID_MASK	_AC(0x00000fff,UL) /* PCID Mask */
+#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
+#define X86_CR3_PCID_NOFLUSH    _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
 
 /*
  * Intel CPU features in CR4
diff -puN arch/x86/kvm/x86.c~kaiser-pcid arch/x86/kvm/x86.c
--- a/arch/x86/kvm/x86.c~kaiser-pcid	2017-11-10 11:22:17.629244928 -0800
+++ b/arch/x86/kvm/x86.c	2017-11-10 11:22:17.641244928 -0800
@@ -805,7 +805,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
 			return 1;
 
 		/* PCID can not be enabled when cr3[11:0]!=000H or EFER.LMA=0 */
-		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_MASK) || !is_long_mode(vcpu))
+		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_ASID_MASK) ||
+		    !is_long_mode(vcpu))
 			return 1;
 	}
 
diff -puN arch/x86/mm/init.c~kaiser-pcid arch/x86/mm/init.c
--- a/arch/x86/mm/init.c~kaiser-pcid	2017-11-10 11:22:17.631244928 -0800
+++ b/arch/x86/mm/init.c	2017-11-10 11:22:17.641244928 -0800
@@ -196,34 +196,59 @@ static void __init probe_page_size_mask(
 
 static void setup_pcid(void)
 {
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_PCID)) {
-		if (boot_cpu_has(X86_FEATURE_PGE)) {
-			/*
-			 * This can't be cr4_set_bits_and_update_boot() --
-			 * the trampoline code can't handle CR4.PCIDE and
-			 * it wouldn't do any good anyway.  Despite the name,
-			 * cr4_set_bits_and_update_boot() doesn't actually
-			 * cause the bits in question to remain set all the
-			 * way through the secondary boot asm.
-			 *
-			 * Instead, we brute-force it and set CR4.PCIDE
-			 * manually in start_secondary().
-			 */
-			cr4_set_bits(X86_CR4_PCIDE);
-		} else {
-			/*
-			 * flush_tlb_all(), as currently implemented, won't
-			 * work if PCID is on but PGE is not.  Since that
-			 * combination doesn't exist on real hardware, there's
-			 * no reason to try to fully support it, but it's
-			 * polite to avoid corrupting data if we're on
-			 * an improperly configured VM.
-			 */
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
+		/*
+		 * KAISER uses a PCID for the kernel and another
+		 * for userspace.  Both PCIDs need to be flushed
+		 * when the TLB flush functions are called.  But,
+		 * flushing *another* PCID is insane without
+		 * INVPCID.  Just avoid using PCIDs at all if we
+		 * have KAISER and do not have INVPCID.
+		 */
+		if (!IS_ENABLED(CONFIG_X86_GLOBAL_PAGES) &&
+		    !boot_cpu_has(X86_FEATURE_INVPCID)) {
 			setup_clear_cpu_cap(X86_FEATURE_PCID);
+			return;
 		}
+		/*
+		 * This can't be cr4_set_bits_and_update_boot() --
+		 * the trampoline code can't handle CR4.PCIDE and
+		 * it wouldn't do any good anyway.  Despite the name,
+		 * cr4_set_bits_and_update_boot() doesn't actually
+		 * cause the bits in question to remain set all the
+		 * way through the secondary boot asm.
+		 *
+		 * Instead, we brute-force it and set CR4.PCIDE
+		 * manually in start_secondary().
+		 */
+		cr4_set_bits(X86_CR4_PCIDE);
+
+		/*
+		 * INVPCID's single-context modes (2/3) only work
+		 * if we set X86_CR4_PCIDE, *and* we have INVPCID
+		 * support.  It's unusable on systems that have
+		 * X86_CR4_PCIDE clear, or that have no INVPCID
+		 * support at all.
+		 */
+		if (boot_cpu_has(X86_FEATURE_INVPCID))
+			setup_force_cpu_cap(X86_FEATURE_INVPCID_SINGLE);
+	} else {
+		/*
+		 * flush_tlb_all(), as currently implemented, won't
+		 * work if PCID is on but PGE is not.  Since that
+		 * combination doesn't exist on real hardware, there's
+		 * no reason to try to fully support it, but it's
+		 * polite to avoid corrupting data if we're on
+		 * an improperly configured VM.
+		 */
+		setup_clear_cpu_cap(X86_FEATURE_PCID);
 	}
-#endif
 }
 
 #ifdef CONFIG_X86_32
diff -puN arch/x86/mm/tlb.c~kaiser-pcid arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~kaiser-pcid	2017-11-10 11:22:17.633244928 -0800
+++ b/arch/x86/mm/tlb.c	2017-11-10 11:22:17.642244928 -0800
@@ -100,6 +100,68 @@ static void choose_new_asid(struct mm_st
 	*need_flush = true;
 }
 
+/*
+ * Given a kernel asid, flush the corresponding KAISER
+ * user ASID.
+ */
+static void flush_user_asid(pgd_t *pgd, u16 kern_asid)
+{
+	/* There is no user ASID if KAISER is off */
+	if (!IS_ENABLED(CONFIG_KAISER))
+		return;
+	/*
+	 * We only have a single ASID if PCID is off and the CR3
+	 * write will have flushed it.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_PCID))
+		return;
+	/*
+	 * With PCIDs enabled, write_cr3() only flushes TLB
+	 * entries for the current (kernel) ASID.  This leaves
+	 * old TLB entries for the user ASID in place and we must
+	 * flush that context separately.  We can theoretically
+	 * delay doing this until we actually load up the
+	 * userspace CR3, but do it here for simplicity.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		invpcid_flush_single_context(user_asid(kern_asid));
+	} else {
+		/*
+		 * On systems with PCIDs, but no INVPCID, the only
+		 * way to flush a PCID is a CR3 write.  Note that
+		 * we use the kernel page tables with the *user*
+		 * ASID here.
+		 */
+		unsigned long user_asid_flush_cr3;
+		user_asid_flush_cr3 = build_cr3(pgd, user_asid(kern_asid));
+		write_cr3(user_asid_flush_cr3);
+		/*
+		 * We do not use PCIDs with KAISER unless we also
+		 * have INVPCID.  Getting here is unexpected.
+		 */
+		WARN_ON_ONCE(1);
+	}
+}
+
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+{
+	unsigned long new_mm_cr3;
+
+	if (need_flush) {
+		flush_user_asid(pgdir, new_asid);
+		new_mm_cr3 = build_cr3(pgdir, new_asid);
+	} else {
+		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+	}
+
+	/*
+	 * Caution: many callers of this function expect
+	 * that load_cr3() is serializing and orders TLB
+	 * fills with respect to the mm_cpumask writes.
+	 */
+	write_cr3(new_mm_cr3);
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -229,12 +291,12 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, true);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
 					TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, false);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
 		}
 
_

* [PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster
@ 2017-11-10 19:31   ` Dave Hansen
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

Short summary: Use x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls.  Speed them up.  Makes context switches
and TLB flushing slower.

Background:

KAISER keeps two copies of the page tables.  Switches between the
copies are performed by writing to the CR3 register.  But, CR3
was really designed for context switches and writes to it also
flush the entire TLB (modulo global pages).  This TLB flush
increases the cost of interrupts and context switches.  For
syscall-heavy microbenchmarks it can cut the rate of syscalls by
2/3.

The kernel recently gained support for an Intel CPU feature
called Process Context IDentifiers (PCID) thanks to Andy
Lutomirski.  This feature is intended to allow you to switch
between contexts without flushing the TLB.

Implementation:

PCIDs can be used to avoid flushing the TLB at kernel entry/exit.
This speeds up both interrupts and syscalls.

First, the kernel and userspace must be assigned different ASIDs.
On entry from userspace, move over to the kernel page tables
*and* ASID.  On exit, restore the user page tables and ASID.
Fortunately, the ASID is programmed via CR3, which is already
being used to switch between the user and kernel page tables.
This gives us convenient, one-stop shopping.

The CR3 write which is used to switch between processes provides
all the TLB flushing normally required at context switch time.
But, with KAISER, that CR3 write only flushes the current
(kernel) ASID.  An extra TLB flush operation is now required in
order to flush the user ASID.  This extra instruction (INVPCID)
probably costs ~100 cycles, but the assumption is that the time
lost in context switches is more than made up for by the lower
cost of interrupts and syscalls.
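
The extra flush described above can be pictured with a toy model (not
kernel code; every name here is hypothetical).  It only tracks which
PCIDs may still hold cached translations, and assumes the user PCID is
the kernel PCID with bit 11 set, as in this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PCIDS  4096
#define USER_BIT  (1u << 11)

/* Toy model: valid[p] is true while PCID p may hold cached translations. */
struct toy_tlb {
	bool valid[NR_PCIDS];
};

/* Models a flushing CR3 write: only the PCID being loaded is flushed. */
static void toy_write_cr3_flush(struct toy_tlb *t, uint16_t pcid)
{
	assert(pcid < NR_PCIDS);
	t->valid[pcid] = false;
}

/* Models INVPCID single-context: flushes a PCID without loading it. */
static void toy_invpcid_single_context(struct toy_tlb *t, uint16_t pcid)
{
	assert(pcid < NR_PCIDS);
	t->valid[pcid] = false;
}

/* Context switch that needs a flush: user PCID first, then the kernel one. */
static void toy_switch_mm_flush(struct toy_tlb *t, uint16_t kern_pcid)
{
	toy_invpcid_single_context(t, (uint16_t)(kern_pcid | USER_BIT));
	toy_write_cr3_flush(t, kern_pcid);
}
```

Dropping the toy_invpcid_single_context() call would leave the user
PCID marked valid after the switch, which is the stale-entry problem
the extra flush exists to prevent.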

Support:

PCIDs are generally available on Sandybridge and newer CPUs.  However,
the accompanying INVPCID instruction did not become available until
Haswell (the Xeons with "v3", also called fourth-generation Core).  This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3, and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use.  TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4, which is very expensive.  This is because
   there is no way to flush a kernel address that might be loaded
   in *EVERY* PCID.  Right now, there are "only" ~12 of these per-cpu,
   but that is still too many to flush one by one with MOV-to-CR3.
   That leaves only the MOV-to-CR4.
2. Every userspace flush (even for a single page) requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for
      the kernel ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the
      flush done for each.  For instance, what is currently a
      single instruction without KAISER:

		invpcid_flush_one(current_pcid, addr);

      becomes this with KAISER:

		invpcid_flush_one(current_kern_pcid, addr);
		invpcid_flush_one(current_user_pcid, addr);

      and this without INVPCID:

		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, fully disable PCIDs with KAISER when INVPCID is not
available.  This is fixable, but it's an optimization that can be
performed later.
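
The resulting policy can be summarized in a few lines.  This is a
simplified sketch, not the setup_pcid() code in the diff: struct
cpu_caps and use_pcid() are hypothetical, and the KAISER state is passed
in as a plain flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the CPU features that matter here. */
struct cpu_caps {
	bool pcid;
	bool pge;
	bool invpcid;
};

/* Decide whether CR4.PCIDE should be turned on at boot. */
static bool use_pcid(const struct cpu_caps *c, bool kaiser)
{
	assert(c != NULL);

	/* PCID without PGE does not exist on real hardware; refuse it. */
	if (!c->pcid || !c->pge)
		return false;

	/* With KAISER, flushing the other PCID needs INVPCID. */
	if (kaiser && !c->invpcid)
		return false;

	return true;
}
```

Under this policy a Sandybridge-class CPU (PCID but no INVPCID) keeps
PCIDs when KAISER is off and loses them when KAISER is on, while a
Haswell-class CPU keeps them in both cases.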

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER.  The first way they can be used
is as "TLB preservation across context-switch", which is what
Andy Lutomirski's 4.14 PCID code does.  They can also be used as
a "KAISER syscall/interrupt accelerator".  If we just use them to
speed up syscall/interrupts (and ignore the context-switch TLB
preservation), then the deficiency of not having INVPCID
becomes much less onerous.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/entry/calling.h                    |   25 +++-
 b/arch/x86/entry/entry_64.S                   |    1 
 b/arch/x86/include/asm/cpufeatures.h          |    1 
 b/arch/x86/include/asm/pgtable_types.h        |   11 ++
 b/arch/x86/include/asm/tlbflush.h             |  137 +++++++++++++++++++++-----
 b/arch/x86/include/uapi/asm/processor-flags.h |    3 
 b/arch/x86/kvm/x86.c                          |    3 
 b/arch/x86/mm/init.c                          |   75 +++++++++-----
 b/arch/x86/mm/tlb.c                           |   66 ++++++++++++
 9 files changed, 262 insertions(+), 60 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-pcid arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-pcid	2017-11-10 11:22:17.618244928 -0800
+++ b/arch/x86/entry/calling.h	2017-11-10 11:22:17.637244928 -0800
@@ -2,6 +2,7 @@
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
 #include <asm/page_types.h>
+#include <asm/pgtable_types.h>
 
 /*
 
@@ -191,16 +192,20 @@ For 32-bit we have the following convent
 #ifdef CONFIG_KAISER
 
 /* KAISER PGDs are 8k.  We flip bit 12 to switch between the two halves: */
-#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_PGTABLES_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_MASK     (KAISER_SWITCH_PGTABLES_MASK|\
+				(1<<X86_CR3_KAISER_SWITCH_BIT))
 
 .macro ADJUST_KERNEL_CR3 reg:req
-	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
-	andq	$(~KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+	/* Clear PCID and "KAISER bit", point CR3 at kernel pagetables: */
+	andq    $(~KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro ADJUST_USER_CR3 reg:req
-	/* Move CR3 up a page to the user page tables: */
-	orq	$(KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+	/* Set user PCID bit, and move CR3 up a page to the user page tables: */
+	orq     $(KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -219,8 +224,14 @@ For 32-bit we have the following convent
 	movq	%cr3, %r\scratch_reg
 	movq	%r\scratch_reg, \save_reg
 	/*
-	 * Is the switch bit zero?  This means the address is
-	 * up in real KAISER patches in a moment.
+	 * Is the "switch mask" all zero?  That means that both of
+	 * these are zero:
+	 *
+	 *	1. The user/kernel PCID bit, and
+	 *	2. The user/kernel "bit" that points CR3 to the
+	 *	   bottom half of the 8k PGD
+	 *
+	 * That indicates a kernel CR3 value, not user/shadow.
 	 */
 	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
 	jz	.Ldone_\@
diff -puN arch/x86/entry/entry_64.S~kaiser-pcid arch/x86/entry/entry_64.S
--- a/arch/x86/entry/entry_64.S~kaiser-pcid	2017-11-10 11:22:17.620244928 -0800
+++ b/arch/x86/entry/entry_64.S	2017-11-10 11:22:17.637244928 -0800
@@ -602,6 +602,7 @@ END(irq_entries_start)
 	 * tracking that we're in kernel mode.
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 	movq	%rsp, %rdi			/* pt_regs pointer */
 	call	sync_regs
diff -puN arch/x86/include/asm/cpufeatures.h~kaiser-pcid arch/x86/include/asm/cpufeatures.h
--- a/arch/x86/include/asm/cpufeatures.h~kaiser-pcid	2017-11-10 11:22:17.622244928 -0800
+++ b/arch/x86/include/asm/cpufeatures.h	2017-11-10 11:22:17.638244928 -0800
@@ -198,6 +198,7 @@
 #define X86_FEATURE_CAT_L3	( 7*32+ 4) /* Cache Allocation Technology L3 */
 #define X86_FEATURE_CAT_L2	( 7*32+ 5) /* Cache Allocation Technology L2 */
 #define X86_FEATURE_CDP_L3	( 7*32+ 6) /* Code and Data Prioritization L3 */
+#define X86_FEATURE_INVPCID_SINGLE ( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
 
 #define X86_FEATURE_HW_PSTATE	( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
diff -puN arch/x86/include/asm/pgtable_types.h~kaiser-pcid arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~kaiser-pcid	2017-11-10 11:22:17.623244928 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2017-11-10 11:22:17.638244928 -0800
@@ -139,6 +139,17 @@
 			 _PAGE_SOFT_DIRTY)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
+/* The ASID is the lower 12 bits of CR3 */
+#define X86_CR3_PCID_ASID_MASK  (_AC((1<<12)-1, UL))
+
+/* Mask for all the PCID-related bits in CR3: */
+#define X86_CR3_PCID_MASK       (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_MASK)
+
+/* Make sure this is only usable in KAISER #ifdef'd code: */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#endif
+
 /*
  * The cache modes defined here are used to translate between pure SW usage
  * and the HW defined cache mode bits and/or PAT entries.
diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid	2017-11-10 11:22:17.625244928 -0800
+++ b/arch/x86/include/asm/tlbflush.h	2017-11-10 11:22:17.638244928 -0800
@@ -77,7 +77,12 @@ static inline u64 inc_mm_tlb_gen(struct
 /* There are 12 bits of space for ASIDS in CR3 */
 #define CR3_HW_ASID_BITS 12
 /* When enabled, KAISER consumes a single bit for user/kernel switches */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#define KAISER_CONSUMED_ASID_BITS 1
+#else
 #define KAISER_CONSUMED_ASID_BITS 0
+#endif
 
 #define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS-KAISER_CONSUMED_ASID_BITS)
 /*
@@ -87,21 +92,62 @@ static inline u64 inc_mm_tlb_gen(struct
  */
 #define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define TLB_NR_DYN_ASIDS 6
+
 static inline u16 kern_asid(u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+
+#ifdef CONFIG_KAISER
+	/*
+	 * Make sure that the dynamic ASID space does not conflict
+	 * with the bit we are using to switch between user and
+	 * kernel ASIDs.
+	 */
+	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1<<X86_CR3_KAISER_SWITCH_BIT));
+
 	/*
-	 * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
-	 * bits.  This serves two purposes.  It prevents a nasty situation in
-	 * which PCID-unaware code saves CR3, loads some other value (with PCID
-	 * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
-	 * the saved ASID was nonzero.  It also means that any bugs involving
-	 * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
-	 * deterministically.
+	 * The ASID being passed in here should have respected
+	 * the MAX_ASID_AVAILABLE and thus never have the switch
+	 * bit set.
+	 */
+	VM_WARN_ON_ONCE(asid & (1<<X86_CR3_KAISER_SWITCH_BIT));
+#endif
+	/*
+	 * The dynamically-assigned ASIDs that get passed in are
+	 * small (<TLB_NR_DYN_ASIDS).  They never have the high
+	 * switch bit set, so do not bother to clear it.
+	 */
+
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1
+	 * into the PCID bits.  This serves two purposes.  It
+	 * prevents a nasty situation in which PCID-unaware code
+	 * saves CR3, loads some other value (with PCID == 0),
+	 * and then restores CR3, thus corrupting the TLB for
+	 * ASID 0 if the saved ASID was nonzero.  It also means
+	 * that any bugs involving loading a PCID-enabled CR3
+	 * with CR4.PCIDE off will trigger deterministically.
 	 */
 	return asid + 1;
 }
 
+/*
+ * The user ASID is just the kernel one, plus the "switch bit".
+ */
+static inline u16 user_asid(u16 asid)
+{
+	u16 ret = kern_asid(asid);
+#ifdef CONFIG_KAISER
+	ret |= 1<<X86_CR3_KAISER_SWITCH_BIT;
+#endif
+	return ret;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
@@ -144,12 +190,6 @@ static inline bool tlb_defer_switch_to_i
 	return !static_cpu_has(X86_FEATURE_PCID);
 }
 
-/*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
- */
-#define TLB_NR_DYN_ASIDS 6
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -305,18 +345,42 @@ extern void initialize_tlbstate_and_flus
 
 static inline void __native_flush_tlb(void)
 {
+	if (!cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		/*
+		 * native_write_cr3() only clears the current PCID if
+		 * CR4 has X86_CR4_PCIDE set.  In other words, this does
+		 * not fully flush the TLB if PCIDs are in use.
+		 *
+		 * With KAISER and PCIDs, this means that we did not
+		 * flush the user PCID.  Warn if it gets called.
+		 */
+		if (IS_ENABLED(CONFIG_KAISER))
+			WARN_ON_ONCE(this_cpu_read(cpu_tlbstate.cr4) &
+				     X86_CR4_PCIDE);
+		/*
+		 * If current->mm == NULL then we borrow a mm
+		 * which may change during a task switch and
+		 * therefore we must not be preempted while we
+		 * write CR3 back:
+		 */
+		preempt_disable();
+		native_write_cr3(__native_read_cr3());
+		preempt_enable();
+		/*
+		 * Does not need tlb_flush_shared_nonglobals()
+		 * since the CR3 write without PCIDs flushes all
+		 * non-globals.
+		 */
+		return;
+	}
 	/*
-	 * If current->mm == NULL then we borrow a mm which may change during a
-	 * task switch and therefore we must not be preempted while we write CR3
-	 * back:
-	 */
-	preempt_disable();
-	native_write_cr3(__native_read_cr3());
-	preempt_enable();
-	/*
-	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
-	 * without PCIDs flushes all non-globals.
+	 * We are no longer using globals with KAISER, so a
+	 * "nonglobals" flush would work too. But, this is more
+	 * conservative.
+	 *
+	 * Note, this works with CR4.PCIDE=0 or 1.
 	 */
+	invpcid_flush_all();
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -352,6 +416,8 @@ static inline void __native_flush_tlb_gl
 		/*
 		 * Using INVPCID is considerably faster than a pair of writes
 		 * to CR4 sandwiched inside an IRQ flag save/restore.
+		 *
+		 * Note, this works with CR4.PCIDE=0 or 1.
 		 */
 		invpcid_flush_all();
 		return;
@@ -371,7 +437,30 @@ static inline void __native_flush_tlb_gl
 
 static inline void __native_flush_tlb_single(unsigned long addr)
 {
-	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
+	/*
+	 * Some platforms #GP if we call invpcid(type=1/2) before
+	 * CR4.PCIDE=1.  Just call invlpg in the case we are called
+	 * early.
+	 */
+	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+		return;
+	}
+	/* Flush the address out of both PCIDs. */
+	/*
+	 * An optimization here might be to determine addresses
+	 * that are only kernel-mapped and only flush the kernel
+	 * ASID.  But, userspace flushes are probably much more
+	 * important performance-wise.
+	 *
+	 * Make sure to do only a single invpcid when KAISER is
+	 * disabled and we have only a single ASID.
+	 */
+	if (kern_asid(loaded_mm_asid) != user_asid(loaded_mm_asid))
+		invpcid_flush_one(user_asid(loaded_mm_asid), addr);
+	invpcid_flush_one(kern_asid(loaded_mm_asid), addr);
 }
 
 static inline void __flush_tlb_all(void)
diff -puN arch/x86/include/uapi/asm/processor-flags.h~kaiser-pcid arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~kaiser-pcid	2017-11-10 11:22:17.627244928 -0800
+++ b/arch/x86/include/uapi/asm/processor-flags.h	2017-11-10 11:22:17.639244928 -0800
@@ -77,7 +77,8 @@
 #define X86_CR3_PWT		_BITUL(X86_CR3_PWT_BIT)
 #define X86_CR3_PCD_BIT		4 /* Page Cache Disable */
 #define X86_CR3_PCD		_BITUL(X86_CR3_PCD_BIT)
-#define X86_CR3_PCID_MASK	_AC(0x00000fff,UL) /* PCID Mask */
+#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
+#define X86_CR3_PCID_NOFLUSH    _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
 
 /*
  * Intel CPU features in CR4
diff -puN arch/x86/kvm/x86.c~kaiser-pcid arch/x86/kvm/x86.c
--- a/arch/x86/kvm/x86.c~kaiser-pcid	2017-11-10 11:22:17.629244928 -0800
+++ b/arch/x86/kvm/x86.c	2017-11-10 11:22:17.641244928 -0800
@@ -805,7 +805,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
 			return 1;
 
 		/* PCID can not be enabled when cr3[11:0]!=000H or EFER.LMA=0 */
-		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_MASK) || !is_long_mode(vcpu))
+		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_ASID_MASK) ||
+		    !is_long_mode(vcpu))
 			return 1;
 	}
 
diff -puN arch/x86/mm/init.c~kaiser-pcid arch/x86/mm/init.c
--- a/arch/x86/mm/init.c~kaiser-pcid	2017-11-10 11:22:17.631244928 -0800
+++ b/arch/x86/mm/init.c	2017-11-10 11:22:17.641244928 -0800
@@ -196,34 +196,59 @@ static void __init probe_page_size_mask(
 
 static void setup_pcid(void)
 {
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_PCID)) {
-		if (boot_cpu_has(X86_FEATURE_PGE)) {
-			/*
-			 * This can't be cr4_set_bits_and_update_boot() --
-			 * the trampoline code can't handle CR4.PCIDE and
-			 * it wouldn't do any good anyway.  Despite the name,
-			 * cr4_set_bits_and_update_boot() doesn't actually
-			 * cause the bits in question to remain set all the
-			 * way through the secondary boot asm.
-			 *
-			 * Instead, we brute-force it and set CR4.PCIDE
-			 * manually in start_secondary().
-			 */
-			cr4_set_bits(X86_CR4_PCIDE);
-		} else {
-			/*
-			 * flush_tlb_all(), as currently implemented, won't
-			 * work if PCID is on but PGE is not.  Since that
-			 * combination doesn't exist on real hardware, there's
-			 * no reason to try to fully support it, but it's
-			 * polite to avoid corrupting data if we're on
-			 * an improperly configured VM.
-			 */
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
+		/*
+		 * KAISER uses a PCID for the kernel and another
+		 * for userspace.  Both PCIDs need to be flushed
+		 * when the TLB flush functions are called.  But,
+		 * flushing *another* PCID is insane without
+		 * INVPCID.  Just avoid using PCIDs at all if we
+		 * have KAISER and do not have INVPCID.
+		 */
+		if (!IS_ENABLED(CONFIG_X86_GLOBAL_PAGES) &&
+		    !boot_cpu_has(X86_FEATURE_INVPCID)) {
 			setup_clear_cpu_cap(X86_FEATURE_PCID);
+			return;
 		}
+		/*
+		 * This can't be cr4_set_bits_and_update_boot() --
+		 * the trampoline code can't handle CR4.PCIDE and
+		 * it wouldn't do any good anyway.  Despite the name,
+		 * cr4_set_bits_and_update_boot() doesn't actually
+		 * cause the bits in question to remain set all the
+		 * way through the secondary boot asm.
+		 *
+		 * Instead, we brute-force it and set CR4.PCIDE
+		 * manually in start_secondary().
+		 */
+		cr4_set_bits(X86_CR4_PCIDE);
+
+		/*
+		 * INVPCID's single-context modes (2/3) only work
+		 * if we set X86_CR4_PCIDE, *and* we have INVPCID
+		 * support.  It's unusable on systems that have
+		 * X86_CR4_PCIDE clear, or that have no INVPCID
+		 * support at all.
+		 */
+		if (boot_cpu_has(X86_FEATURE_INVPCID))
+			setup_force_cpu_cap(X86_FEATURE_INVPCID_SINGLE);
+	} else {
+		/*
+		 * flush_tlb_all(), as currently implemented, won't
+		 * work if PCID is on but PGE is not.  Since that
+		 * combination doesn't exist on real hardware, there's
+		 * no reason to try to fully support it, but it's
+		 * polite to avoid corrupting data if we're on
+		 * an improperly configured VM.
+		 */
+		setup_clear_cpu_cap(X86_FEATURE_PCID);
 	}
-#endif
 }
 
 #ifdef CONFIG_X86_32
diff -puN arch/x86/mm/tlb.c~kaiser-pcid arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~kaiser-pcid	2017-11-10 11:22:17.633244928 -0800
+++ b/arch/x86/mm/tlb.c	2017-11-10 11:22:17.642244928 -0800
@@ -100,6 +100,68 @@ static void choose_new_asid(struct mm_st
 	*need_flush = true;
 }
 
+/*
+ * Given a kernel asid, flush the corresponding KAISER
+ * user ASID.
+ */
+static void flush_user_asid(pgd_t *pgd, u16 kern_asid)
+{
+	/* There is no user ASID if KAISER is off */
+	if (!IS_ENABLED(CONFIG_KAISER))
+		return;
+	/*
+	 * We only have a single ASID if PCID is off and the CR3
+	 * write will have flushed it.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_PCID))
+		return;
+	/*
+	 * With PCIDs enabled, write_cr3() only flushes TLB
+	 * entries for the current (kernel) ASID.  This leaves
+	 * old TLB entries for the user ASID in place and we must
+	 * flush that context separately.  We can theoretically
+	 * delay doing this until we actually load up the
+	 * userspace CR3, but do it here for simplicity.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		invpcid_flush_single_context(user_asid(kern_asid));
+	} else {
+		/*
+		 * On systems with PCIDs, but no INVPCID, the only
+		 * way to flush a PCID is a CR3 write.  Note that
+		 * we use the kernel page tables with the *user*
+		 * ASID here.
+		 */
+		unsigned long user_asid_flush_cr3;
+		user_asid_flush_cr3 = build_cr3(pgd, user_asid(kern_asid));
+		write_cr3(user_asid_flush_cr3);
+		/*
+		 * We do not use PCIDs with KAISER unless we also
+		 * have INVPCID.  Getting here is unexpected.
+		 */
+		WARN_ON_ONCE(1);
+	}
+}
+
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+{
+	unsigned long new_mm_cr3;
+
+	if (need_flush) {
+		flush_user_asid(pgdir, new_asid);
+		new_mm_cr3 = build_cr3(pgdir, new_asid);
+	} else {
+		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+	}
+
+	/*
+	 * Caution: many callers of this function expect
+	 * that load_cr3() is serializing and orders TLB
+	 * fills with respect to the mm_cpumask writes.
+	 */
+	write_cr3(new_mm_cr3);
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -229,12 +291,12 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, true);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
 					TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, false);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
 		}
 
_
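
For readers following the ASID arithmetic in this patch, here is a minimal
user-space sketch of how kern_asid() and user_asid() relate, and why
__native_flush_tlb_single() ends up issuing either one or two flushes.
It is a model, not the kernel code: the value 11 for
X86_CR3_KAISER_SWITCH_BIT is an assumption (the bit is defined elsewhere
in the series), the kaiser_on parameter stands in for the
#ifdef CONFIG_KAISER in the real code, and flushes_needed() is a
hypothetical helper that does not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value; the real definition lives elsewhere in the series. */
#define X86_CR3_KAISER_SWITCH_BIT 11

/* Kernel PCID is ASID+1, so ASID-aware code never uses PCID 0. */
static uint16_t kern_asid(uint16_t asid)
{
	/* Mirrors the VM_WARN_ON_ONCE(): the switch bit must be clear. */
	assert(!(asid & (1u << X86_CR3_KAISER_SWITCH_BIT)));
	return (uint16_t)(asid + 1);
}

/* User PCID is the kernel PCID plus the "switch bit" (KAISER only). */
static uint16_t user_asid(uint16_t asid, bool kaiser_on)
{
	uint16_t ret = kern_asid(asid);

	if (kaiser_on)
		ret |= (uint16_t)(1u << X86_CR3_KAISER_SWITCH_BIT);
	return ret;
}

/*
 * Hypothetical helper: how many single-address INVPCID operations a
 * flush of one address needs.  Two when the user and kernel PCIDs
 * differ, one when they are the same value.
 */
static int flushes_needed(uint16_t asid, bool kaiser_on)
{
	return kern_asid(asid) != user_asid(asid, kaiser_on) ? 2 : 1;
}
```

Under these assumptions ASID 0 maps to kernel PCID 0x001 and user PCID
0x801, which is why the second flush is skipped only when KAISER is off.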

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 24/30] x86, kaiser: disable native VSYSCALL
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

The KAISER code attempts to "poison" the user portion of the kernel page
tables.  It detects the entries that it wants to poison in two ways:
 * Looking for addresses >= PAGE_OFFSET
 * Looking for entries without _PAGE_USER set

But, to allow the _PAGE_USER check to work, it must never be set on
init_mm entries, and an earlier patch in this series ensured that it
will never be set.

The VDSO is at an address >= PAGE_OFFSET and it is also mapped by init_mm.
Because of the earlier, KAISER-enforced restriction, _PAGE_USER is never
set which makes the VDSO unreadable to userspace.

This makes the "NATIVE" case totally unusable since userspace can not
even see the memory any more.  Disable it whenever KAISER is enabled.

Also add some help text about how KAISER might affect the emulation
case as well.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org

---
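Not part of the commit: a rough user-space model of why the native
vsyscall page stops being usable, as described above.  The fixed
vsyscall address and the _PAGE_USER bit position are the usual x86-64
values; PAGE_OFFSET is assumed to be the non-randomized 4-level
direct-map base, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET	0xffff880000000000ULL	/* assumed, no KASLR */
#define VSYSCALL_ADDR	0xffffffffff600000ULL	/* (-10UL << 20) */
#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_USER	(1ULL << 2)

/* Hypothetical: first check from the description above. */
static bool is_kernel_address(uint64_t addr)
{
	return addr >= PAGE_OFFSET;
}

/* Hypothetical: second check; userspace needs _PAGE_USER to read. */
static bool user_can_read(uint64_t entry)
{
	return (entry & _PAGE_PRESENT) && (entry & _PAGE_USER);
}
```

The vsyscall page satisfies is_kernel_address(), and with _PAGE_USER
never set on init_mm entries user_can_read() is false for it, which is
the situation the "depends on ! KAISER" line guards against.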

 b/arch/x86/Kconfig |    8 ++++++++
 1 file changed, 8 insertions(+)

diff -puN arch/x86/Kconfig~kaiser-no-vsyscall arch/x86/Kconfig
--- a/arch/x86/Kconfig~kaiser-no-vsyscall	2017-11-10 11:22:18.366244926 -0800
+++ b/arch/x86/Kconfig	2017-11-10 11:22:18.370244926 -0800
@@ -2231,6 +2231,9 @@ choice
 
 	config LEGACY_VSYSCALL_NATIVE
 		bool "Native"
+		# The VSYSCALL page comes from the kernel page tables
+		# and is not available when KAISER is enabled.
+		depends on ! KAISER
 		help
 		  Actual executable code is located in the fixed vsyscall
 		  address mapping, implementing time() efficiently. Since
@@ -2248,6 +2251,11 @@ choice
 		  exploits. This configuration is recommended when userspace
 		  still uses the vsyscall area.
 
+		  When KAISER is enabled, the vsyscall area will become
+		  unreadable.  This emulation option still works, but KAISER
+		  will make it harder to do things like trace code using the
+		  emulation.
+
 	config LEGACY_VSYSCALL_NONE
 		bool "None"
 		help
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 25/30] x86, kaiser: add debugfs file to turn KAISER on/off at runtime
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

Add a debugfs file, "kaiser-enabled", that later patches in this series
will use to turn KAISER on and off at runtime.  Right now, it's not
wired up to do anything useful.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---
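Not part of the commit: a user-space sketch of the validation the write
handler below performs (parse an integer with base auto-detection, then
accept only 0 or 1).  parse_kaiser_enable() is a hypothetical stand-in
built on strtoul(); the kernel code uses the kstrto*() helpers instead,
and their exact edge-case behavior may differ from this model.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the handler's input validation. */
static int parse_kaiser_enable(const char *buf, unsigned int *enable)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(buf, &end, 0);	/* base 0: auto-detect 0x / 0 prefixes */
	if (errno || end == buf)
		return -EINVAL;		/* overflow, or no digits at all */
	if (*end == '\n')		/* tolerate the newline that echo adds */
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */
	if (val > 1)
		return -EINVAL;		/* only 0 and 1 are meaningful */

	*enable = (unsigned int)val;
	return 0;
}
```

Since the file is created under arch_debugfs_dir, it would be expected
to appear in the x86 directory of a mounted debugfs, readable and
writable by root only (S_IRUSR | S_IWUSR).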

 b/arch/x86/mm/kaiser.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff -puN arch/x86/mm/kaiser.c~kaiser-dynamic-debugfs arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-dynamic-debugfs	2017-11-10 11:22:18.900244925 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:18.904244925 -0800
@@ -29,6 +29,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/bug.h>
+#include <linux/debugfs.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/spinlock.h>
@@ -457,3 +458,50 @@ void kaiser_remove_mapping(unsigned long
 	 */
 	__native_flush_tlb_global();
 }
+
+int kaiser_enabled = 1;
+static ssize_t kaiser_enabled_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+
+	len = sprintf(buf, "%d\n", kaiser_enabled);
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t kaiser_enabled_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	char buf[32];
+	ssize_t len;
+	unsigned int enable;
+
+	len = min(count, sizeof(buf) - 1);
+	if (copy_from_user(buf, user_buf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (kstrtouint(buf, 0, &enable))
+		return -EINVAL;
+
+	if (enable > 1)
+		return -EINVAL;
+
+	WRITE_ONCE(kaiser_enabled, enable);
+	return count;
+}
+
+static const struct file_operations fops_kaiser_enabled = {
+	.read = kaiser_enabled_read_file,
+	.write = kaiser_enabled_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init create_kaiser_enabled(void)
+{
+	debugfs_create_file("kaiser-enabled", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_kaiser_enabled);
+	return 0;
+}
+late_initcall(create_kaiser_enabled);
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 26/30] x86, kaiser: add a function to check for KAISER being enabled
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

Currently, all of the checks for KAISER are compile-time checks.

Runtime checks are needed for turning it on/off at runtime.

Add a function to do that.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---
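Not part of the commit: a compact sketch of the pattern the two hunks
below implement between them.  With CONFIG_KAISER defined the check
reads a runtime variable; without it the check is a constant, so callers
need no #ifdef of their own.  Both variants are collapsed into one file
here purely for illustration; in the patch they live in separate
headers.

```c
#include <assert.h>
#include <stdbool.h>

#ifdef CONFIG_KAISER
/* Runtime switch; in the series it is flipped through debugfs. */
static int kaiser_enabled = 1;

static inline bool kaiser_active(void)
{
	return kaiser_enabled;
}
#else
/* Stub: a constant lets the compiler discard KAISER-only branches. */
static inline bool kaiser_active(void)
{
	return false;
}
#endif
```

Built without -DCONFIG_KAISER, kaiser_active() is always false; built
with it, the result follows kaiser_enabled.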

 b/arch/x86/include/asm/kaiser.h |    5 +++++
 b/include/linux/kaiser.h        |    4 ++++
 2 files changed, 9 insertions(+)

diff -puN arch/x86/include/asm/kaiser.h~kaiser-dynamic-check-func arch/x86/include/asm/kaiser.h
--- a/arch/x86/include/asm/kaiser.h~kaiser-dynamic-check-func	2017-11-10 11:22:19.435244924 -0800
+++ b/arch/x86/include/asm/kaiser.h	2017-11-10 11:22:19.440244924 -0800
@@ -50,6 +50,11 @@ extern void kaiser_remove_mapping(unsign
  */
 extern void kaiser_init(void);
 
+static inline bool kaiser_active(void)
+{
+	extern int kaiser_enabled;
+	return kaiser_enabled;
+}
 #endif
 
 #endif /* __ASSEMBLY__ */
diff -puN include/linux/kaiser.h~kaiser-dynamic-check-func include/linux/kaiser.h
--- a/include/linux/kaiser.h~kaiser-dynamic-check-func	2017-11-10 11:22:19.437244924 -0800
+++ b/include/linux/kaiser.h	2017-11-10 11:22:19.440244924 -0800
@@ -25,5 +25,9 @@ static inline int kaiser_add_mapping(uns
 	return 0;
 }
 
+static inline bool kaiser_active(void)
+{
+	return 0;
+}
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 27/30] x86, kaiser: un-poison PGDs at runtime
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

With KAISER, kernel PGDs that map userspace are "poisoned" with
the NX bit.  This ensures that if a kernel->user CR3 switch is
missed, userspace crashes instead of running in an unhardened
state.

Add helpers to poison and un-poison those PGDs.  This code will be
needed in a moment when KAISER is turned on and off at runtime.

Note that an __ASSEMBLY__ #ifdef is now required since kaiser.h
is indirectly included into assembly.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---
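Not part of the commit: a user-space model of the poison/un-poison walk
added below, using plain 64-bit integers in place of pgd_t.  The bit
positions are the usual x86-64 ones.  The real code stops at the first
entry for which pgdp_maps_userspace() is false; this sketch assumes that
is simply the lower half of the PGD page, and poison_pgd_page() is a
hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_PGD	512
#define USER_PGD_PTRS	(PTRS_PER_PGD / 2)	/* assumed: lower half is user */
#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_NX	(1ULL << 63)

enum poison { KAISER_POISON, KAISER_UNPOISON };

/* Set or clear NX on every present entry that maps userspace. */
static void poison_pgd_page(uint64_t *pgd_page, enum poison do_poison)
{
	for (int i = 0; i < USER_PGD_PTRS; i++) {
		/* Non-present entries are left untouched. */
		if (!(pgd_page[i] & _PAGE_PRESENT))
			continue;

		if (do_poison == KAISER_POISON)
			pgd_page[i] |= _PAGE_NX;
		else
			pgd_page[i] &= ~_PAGE_NX;
	}
}
```

Kernel-half entries and empty slots are never modified, so running the
walk with KAISER_UNPOISON after KAISER_POISON restores the user-half
entries to their original values.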

 b/arch/x86/include/asm/pgtable_64.h |   16 ++++++++++++++-
 b/arch/x86/mm/kaiser.c              |   38 ++++++++++++++++++++++++++++++++++++
 b/include/linux/kaiser.h            |    3 +-
 3 files changed, 55 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/pgtable_64.h~kaiser-dynamic-unpoison-pgd arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h~kaiser-dynamic-unpoison-pgd	2017-11-10 11:22:19.992244922 -0800
+++ b/arch/x86/include/asm/pgtable_64.h	2017-11-10 11:22:19.998244922 -0800
@@ -2,6 +2,7 @@
 #define _ASM_X86_PGTABLE_64_H
 
 #include <linux/const.h>
+#include <linux/kaiser.h>
 #include <asm/pgtable_64_types.h>
 
 #ifndef __ASSEMBLY__
@@ -196,6 +197,18 @@ static inline bool pgd_userspace_access(
 	return (pgd.pgd & _PAGE_USER);
 }
 
+static inline void kaiser_poison_pgd(pgd_t *pgd)
+{
+	if (pgd->pgd & _PAGE_PRESENT)
+		pgd->pgd |= _PAGE_NX;
+}
+
+static inline void kaiser_unpoison_pgd(pgd_t *pgd)
+{
+	if (pgd->pgd & _PAGE_PRESENT)
+		pgd->pgd &= ~_PAGE_NX;
+}
+
 /*
  * Returns the pgd_t that the kernel should use in its page tables.
  */
@@ -216,7 +229,8 @@ static inline pgd_t kaiser_set_shadow_pg
 			 * wrong CR3 value, userspace will crash
 			 * instead of running.
 			 */
-			pgd.pgd |= _PAGE_NX;
+			if (kaiser_active())
+				kaiser_poison_pgd(&pgd);
 		}
 	} else if (!pgd.pgd) {
 		/*
diff -puN arch/x86/mm/kaiser.c~kaiser-dynamic-unpoison-pgd arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-dynamic-unpoison-pgd	2017-11-10 11:22:19.993244922 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:19.999244922 -0800
@@ -488,6 +488,9 @@ static ssize_t kaiser_enabled_write_file
 	if (enable > 1)
 		return -EINVAL;
 
+	if (kaiser_enabled == enable)
+		return count;
+
 	WRITE_ONCE(kaiser_enabled, enable);
 	return count;
 }
@@ -505,3 +508,38 @@ static int __init create_kaiser_enabled(
 	return 0;
 }
 late_initcall(create_kaiser_enabled);
+
+enum poison {
+	KAISER_POISON,
+	KAISER_UNPOISON
+};
+void kaiser_poison_pgd_page(pgd_t *pgd_page, enum poison do_poison)
+{
+	int i = 0;
+
+	for (i = 0; i < PTRS_PER_PGD; i++) {
+		pgd_t *pgd = &pgd_page[i];
+
+		/* Stop once we hit kernel addresses: */
+		if (!pgdp_maps_userspace(pgd))
+			break;
+
+		if (do_poison == KAISER_POISON)
+			kaiser_poison_pgd(pgd);
+		else
+			kaiser_unpoison_pgd(pgd);
+	}
+
+}
+
+void kaiser_poison_pgds(enum poison do_poison)
+{
+	struct page *page;
+
+	spin_lock(&pgd_lock);
+	list_for_each_entry(page, &pgd_list, lru) {
+		pgd_t *pgd = (pgd_t *)page_address(page);
+		kaiser_poison_pgd_page(pgd, do_poison);
+	}
+	spin_unlock(&pgd_lock);
+}
diff -puN include/linux/kaiser.h~kaiser-dynamic-unpoison-pgd include/linux/kaiser.h
--- a/include/linux/kaiser.h~kaiser-dynamic-unpoison-pgd	2017-11-10 11:22:19.995244922 -0800
+++ b/include/linux/kaiser.h	2017-11-10 11:22:19.999244922 -0800
@@ -4,7 +4,7 @@
 #ifdef CONFIG_KAISER
 #include <asm/kaiser.h>
 #else
-
+#ifndef __ASSEMBLY__
 /*
  * These stubs are used whenever CONFIG_KAISER is off, which
  * includes architectures that support KAISER, but have it
@@ -29,5 +29,6 @@ static inline bool kaiser_active(void)
 {
 	return 0;
 }
+#endif /* __ASSEMBLY__ */
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 27/30] x86, kaiser: un-poison PGDs at runtime
@ 2017-11-10 19:31   ` Dave Hansen
  0 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

With KAISER, kernel PGDs that map userspace are "poisoned" with
the NX bit.  This ensures that if a kernel->user CR3 switch is
missed, userspace crashes instead of running in an unhardened
state.

Add helpers to poison and un-poison those PGDs.  This code will be
needed in a moment when KAISER is turned on and off at runtime.

Note that an __ASSEMBLY__ #ifdef is now required since kaiser.h
is indirectly included into assembly.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/pgtable_64.h |   16 ++++++++++++++-
 b/arch/x86/mm/kaiser.c              |   38 ++++++++++++++++++++++++++++++++++++
 b/include/linux/kaiser.h            |    3 +-
 3 files changed, 55 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/pgtable_64.h~kaiser-dynamic-unpoison-pgd arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h~kaiser-dynamic-unpoison-pgd	2017-11-10 11:22:19.992244922 -0800
+++ b/arch/x86/include/asm/pgtable_64.h	2017-11-10 11:22:19.998244922 -0800
@@ -2,6 +2,7 @@
 #define _ASM_X86_PGTABLE_64_H
 
 #include <linux/const.h>
+#include <linux/kaiser.h>
 #include <asm/pgtable_64_types.h>
 
 #ifndef __ASSEMBLY__
@@ -196,6 +197,18 @@ static inline bool pgd_userspace_access(
 	return (pgd.pgd & _PAGE_USER);
 }
 
+static inline void kaiser_poison_pgd(pgd_t *pgd)
+{
+	if (pgd->pgd & _PAGE_PRESENT)
+		pgd->pgd |= _PAGE_NX;
+}
+
+static inline void kaiser_unpoison_pgd(pgd_t *pgd)
+{
+	if (pgd->pgd & _PAGE_PRESENT)
+		pgd->pgd &= ~_PAGE_NX;
+}
+
 /*
  * Returns the pgd_t that the kernel should use in its page tables.
  */
@@ -216,7 +229,8 @@ static inline pgd_t kaiser_set_shadow_pg
 			 * wrong CR3 value, userspace will crash
 			 * instead of running.
 			 */
-			pgd.pgd |= _PAGE_NX;
+			if (kaiser_active())
+				kaiser_poison_pgd(&pgd);
 		}
 	} else if (!pgd.pgd) {
 		/*
diff -puN arch/x86/mm/kaiser.c~kaiser-dynamic-unpoison-pgd arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-dynamic-unpoison-pgd	2017-11-10 11:22:19.993244922 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:19.999244922 -0800
@@ -488,6 +488,9 @@ static ssize_t kaiser_enabled_write_file
 	if (enable > 1)
 		return -EINVAL;
 
+	if (kaiser_enabled == enable)
+		return count;
+
 	WRITE_ONCE(kaiser_enabled, enable);
 	return count;
 }
@@ -505,3 +508,38 @@ static int __init create_kaiser_enabled(
 	return 0;
 }
 late_initcall(create_kaiser_enabled);
+
+enum poison {
+	KAISER_POISON,
+	KAISER_UNPOISON
+};
+void kaiser_poison_pgd_page(pgd_t *pgd_page, enum poison do_poison)
+{
+	int i = 0;
+
+	for (i = 0; i < PTRS_PER_PGD; i++) {
+		pgd_t *pgd = &pgd_page[i];
+
+		/* Stop once we hit kernel addresses: */
+		if (!pgdp_maps_userspace(pgd))
+			break;
+
+		if (do_poison == KAISER_POISON)
+			kaiser_poison_pgd(pgd);
+		else
+			kaiser_unpoison_pgd(pgd);
+	}
+
+}
+
+void kaiser_poison_pgds(enum poison do_poison)
+{
+	struct page *page;
+
+	spin_lock(&pgd_lock);
+	list_for_each_entry(page, &pgd_list, lru) {
+		pgd_t *pgd = (pgd_t *)page_address(page);
+		kaiser_poison_pgd_page(pgd, do_poison);
+	}
+	spin_unlock(&pgd_lock);
+}
diff -puN include/linux/kaiser.h~kaiser-dynamic-unpoison-pgd include/linux/kaiser.h
--- a/include/linux/kaiser.h~kaiser-dynamic-unpoison-pgd	2017-11-10 11:22:19.995244922 -0800
+++ b/include/linux/kaiser.h	2017-11-10 11:22:19.999244922 -0800
@@ -4,7 +4,7 @@
 #ifdef CONFIG_KAISER
 #include <asm/kaiser.h>
 #else
-
+#ifndef __ASSEMBLY__
 /*
  * These stubs are used whenever CONFIG_KAISER is off, which
  * includes architectures that support KAISER, but have it
@@ -29,5 +29,6 @@ static inline bool kaiser_active(void)
 {
 	return 0;
 }
+#endif /* __ASSEMBLY__ */
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 28/30] x86, kaiser: allow KAISER to be enabled/disabled at runtime
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:31   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

The KAISER CR3 switches are expensive for many reasons.  Not all systems
benefit from the protection provided by KAISER.  Some of them can not
pay the high performance cost.

This patch adds a debugfs file.  To disable KAISER, you do:

	echo 0 > /sys/kernel/debug/x86/kaiser-enabled

and to re-enable it, you can:

	echo 1 > /sys/kernel/debug/x86/kaiser-enabled

This is a *minimal* implementation.  There are certainly plenty of
optimizations that can be done on top of this by using ALTERNATIVES
among other things.

This does, however, completely remove all the KAISER-based CR3 writes.
This permits a paravirtualized system that can not tolerate CR3
writes to theoretically survive with CONFIG_KAISER=y, albeit with
/sys/kernel/debug/x86/kaiser-enabled=0.
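
As a rough user-space illustration of how the flag check removes the CR3 writes, consider the sketch below.  Everything prefixed model_ is hypothetical.  The mask value is an assumption, as is the idea that the kernel copy of the page tables is selected by clearing that bit and the user copy by setting it.

```c
#include <stdint.h>

/* Assumed value: the CR3 bit that selects the user copy of the page tables. */
#define MODEL_SWITCH_MASK (UINT64_C(1) << 12)

static uint64_t model_asm_do_switch = 1;	/* models kaiser_asm_do_switch[0] */
static uint64_t model_cr3;			/* models the CR3 register */
static unsigned int model_cr3_writes;		/* counts simulated CR3 writes */

static void model_write_cr3(uint64_t val)
{
	model_cr3 = val;
	model_cr3_writes++;
}

/* Models SWITCH_TO_KERNEL_CR3 with the JUMP_IF_KAISER_OFF check in front. */
static void model_switch_to_kernel_cr3(void)
{
	if (!(model_asm_do_switch & 1))
		return;		/* KAISER off: skip the CR3 read and write */
	model_write_cr3(model_cr3 & ~MODEL_SWITCH_MASK);
}

/* Models SWITCH_TO_USER_CR3 with the same check. */
static void model_switch_to_user_cr3(void)
{
	if (!(model_asm_do_switch & 1))
		return;
	model_write_cr3(model_cr3 | MODEL_SWITCH_MASK);
}
```

With the flag cleared, both helpers return before touching the simulated register, which is the property a CR3-intolerant paravirtualized guest depends on.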

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/entry/calling.h |   12 +++++++
 b/arch/x86/mm/kaiser.c     |   70 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 78 insertions(+), 4 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-dynamic-asm arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-dynamic-asm	2017-11-10 11:22:20.575244921 -0800
+++ b/arch/x86/entry/calling.h	2017-11-10 11:22:20.580244921 -0800
@@ -208,19 +208,29 @@ For 32-bit we have the following convent
 	orq     $(KAISER_SWITCH_MASK), \reg
 .endm
 
+.macro JUMP_IF_KAISER_OFF	label
+	testq   $1, kaiser_asm_do_switch
+	jz      \label
+.endm
+
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	JUMP_IF_KAISER_OFF	.Lswitch_done_\@
 	mov	%cr3, \scratch_reg
 	ADJUST_KERNEL_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lswitch_done_\@:
 .endm
 
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
+	JUMP_IF_KAISER_OFF	.Lswitch_done_\@
 	mov	%cr3, \scratch_reg
 	ADJUST_USER_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lswitch_done_\@:
 .endm
 
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	JUMP_IF_KAISER_OFF	.Ldone_\@
 	movq	%cr3, %r\scratch_reg
 	movq	%r\scratch_reg, \save_reg
 	/*
@@ -243,11 +253,13 @@ For 32-bit we have the following convent
 .endm
 
 .macro RESTORE_CR3 save_reg:req
+	JUMP_IF_KAISER_OFF	.Ldone_\@
 	/*
 	 * We could avoid the CR3 write if not changing its value,
 	 * but that requires a CR3 read *and* a scratch register.
 	 */
 	movq	\save_reg, %cr3
+.Ldone_\@:
 .endm
 
 #else /* CONFIG_KAISER=n: */
diff -puN arch/x86/mm/kaiser.c~kaiser-dynamic-asm arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-dynamic-asm	2017-11-10 11:22:20.577244921 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:20.581244921 -0800
@@ -42,6 +42,9 @@
 #include <asm/tlbflush.h>
 #include <asm/desc.h>
 
+__aligned(PAGE_SIZE)
+unsigned long kaiser_asm_do_switch[PAGE_SIZE/sizeof(unsigned long)] = { 1 };
+
 /*
  * At runtime, the only things we map are some things for CPU
  * hotplug, and stacks for new processes.  No two CPUs will ever
@@ -366,6 +369,9 @@ void __init kaiser_init(void)
 
 	kaiser_init_all_pgds();
 
+	kaiser_add_user_map_early(&kaiser_asm_do_switch, PAGE_SIZE,
+				  __PAGE_KERNEL | _PAGE_GLOBAL);
+
 	for_each_possible_cpu(cpu) {
 		void *percpu_vaddr = __per_cpu_user_mapped_start +
 				     per_cpu_offset(cpu);
@@ -470,6 +476,56 @@ static ssize_t kaiser_enabled_read_file(
 	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
 }
 
+enum poison {
+	KAISER_POISON,
+	KAISER_UNPOISON
+};
+void kaiser_poison_pgds(enum poison do_poison);
+
+void kaiser_do_disable(void)
+{
+	/* Make sure the kernel PGDs are usable by userspace: */
+	kaiser_poison_pgds(KAISER_UNPOISON);
+
+	/*
+	 * Make sure all the CPUs have the poison clear in their TLBs.
+	 * This also functions as a barrier to ensure that everyone
+	 * sees the unpoisoned PGDs.
+	 */
+	flush_tlb_all();
+
+	/* Tell the assembly code to stop switching CR3. */
+	kaiser_asm_do_switch[0] = 0;
+
+	/*
+	 * Make sure everybody does an interrupt.  This means that
+	 * they have gone through a SWITCH_TO_KERNEL_CR3 and are no
+	 * longer running on the userspace CR3.  If we did not do
+	 * this, we might have CPUs running on the shadow page tables
+	 * that then enter the kernel and think they do *not* need to
+	 * switch.
+	 */
+	flush_tlb_all();
+}
+
+void kaiser_do_enable(void)
+{
+	/* Tell the assembly code to start switching CR3: */
+	kaiser_asm_do_switch[0] = 1;
+
+	/* Make sure everyone can see the kaiser_asm_do_switch update: */
+	synchronize_rcu();
+
+	/*
+	 * Now that userspace is no longer using the kernel copy of
+	 * the page tables, we can poison it:
+	 */
+	kaiser_poison_pgds(KAISER_POISON);
+
+	/* Make sure all the CPUs see the poison: */
+	flush_tlb_all();
+}
+
 static ssize_t kaiser_enabled_write_file(struct file *file,
 		 const char __user *user_buf, size_t count, loff_t *ppos)
 {
@@ -491,7 +547,17 @@ static ssize_t kaiser_enabled_write_file
 	if (kaiser_enabled == enable)
 		return count;
 
+	/*
+	 * This tells the page table code to stop poisoning PGDs
+	 */
 	WRITE_ONCE(kaiser_enabled, enable);
+	synchronize_rcu();
+
+	if (enable)
+		kaiser_do_enable();
+	else
+		kaiser_do_disable();
+
 	return count;
 }
 
@@ -509,10 +575,6 @@ static int __init create_kaiser_enabled(
 }
 late_initcall(create_kaiser_enabled);
 
-enum poison {
-	KAISER_POISON,
-	KAISER_UNPOISON
-};
 void kaiser_poison_pgd_page(pgd_t *pgd_page, enum poison do_poison)
 {
 	int i = 0;
_
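
The ordering in kaiser_do_disable() and kaiser_do_enable() can be read as preserving one invariant: whenever CR3 switching is off, userspace runs on the kernel PGDs, so those PGDs must not be poisoned.  The sketch below checks that invariant after every step.  It is a simplified model with hypothetical names, and it leaves out the TLB flushes and synchronize_rcu() calls that make each step visible to other CPUs.

```c
#include <assert.h>
#include <stdbool.h>

struct model_state {
	bool asm_do_switch;	/* models kaiser_asm_do_switch[0] */
	bool pgds_poisoned;	/* models NX set on kernel PGDs that map userspace */
};

/* Not switching CR3 while the PGDs are poisoned would crash userspace. */
static bool model_state_is_safe(const struct model_state *s)
{
	return s->asm_do_switch || !s->pgds_poisoned;
}

/* Disable: unpoison first, only then stop switching CR3. */
static void model_do_disable(struct model_state *s)
{
	s->pgds_poisoned = false;
	assert(model_state_is_safe(s));
	s->asm_do_switch = false;
	assert(model_state_is_safe(s));
}

/* Enable: start switching CR3 first, only then poison. */
static void model_do_enable(struct model_state *s)
{
	s->asm_do_switch = true;
	assert(model_state_is_safe(s));
	s->pgds_poisoned = true;
	assert(model_state_is_safe(s));
}
```

Swapping the two assignments in either function makes the first assertion fire, which is one way to see why the patch orders the steps as it does.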

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 29/30] x86, kaiser: add Kconfig
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:32   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

PARAVIRT generally requires that the kernel not manage its own page
tables.  It also means that the hypervisor and kernel must agree
wholeheartedly about what format the page tables are in and what
they contain.  KAISER, unfortunately, changes the rules and they
can not be used together.

I've seen conflicting feedback from maintainers lately about whether
they want the Kconfig magic to go first or last in a patch series.
It's going last here because the partially-applied series leads to
kernels that can not boot in a bunch of cases.  I did a run through
the entire series with CONFIG_KAISER=y to look for build errors,
though.

Note from Hugh Dickins on why it depends on SMP:

	It is absurd that KAISER should depend on SMP, but
	apparently nobody has tried a UP build before: which
	breaks on implicit declaration of function
	'per_cpu_offset' in arch/x86/mm/kaiser.c.

	Now, you would expect that to be trivially fixed up; but
	looking at the System.map when that block is #ifdef'ed
	out of kaiser_init(), I see that in a UP build
	__per_cpu_user_mapped_end is precisely at
	__per_cpu_user_mapped_start, and the items carefully
	gathered into that section for user-mapping on SMP,
	dispersed elsewhere on UP.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/security/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN security/Kconfig~kaiser-kconfig security/Kconfig
--- a/security/Kconfig~kaiser-kconfig	2017-11-10 11:22:21.138244919 -0800
+++ b/security/Kconfig	2017-11-10 11:22:21.141244919 -0800
@@ -54,6 +54,16 @@ config SECURITY_NETWORK
 	  implement socket and networking access controls.
 	  If you are unsure how to answer this question, answer N.
 
+config KAISER
+	bool "Remove the kernel mapping in user mode"
+	depends on X86_64 && SMP && !PARAVIRT
+	help
+	  This feature reduces the number of hardware side channels by
+	  ensuring that the majority of kernel addresses are not mapped
+	  into userspace.
+
+	  See Documentation/x86/kaiser.txt for more details.
+
 config SECURITY_INFINIBAND
 	bool "Infiniband Security Hooks"
 	depends on SECURITY && INFINIBAND
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 30/30] x86, kaiser, xen: Dynamically disable KAISER when running under Xen PV
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-10 19:32   ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 19:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, jgross, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


From: Dave Hansen <dave.hansen@linux.intel.com>

If you paravirtualize the MMU, you can not use KAISER.  This boils down
to the fact that KAISER needs to do CR3 writes in places that it is not
feasible to do real hypercalls.

If Xen PV is detected to be in use, do not do the KAISER CR3 switches.

I don't think this is too big of a deal for Xen.  I was under the
impression that the Xen guest kernel and Xen guest userspace didn't
share an address space *anyway* so Xen PV is not normally even exposed
to the kinds of things that KAISER protects against.

This allows KAISER=y kernels to be deployed in environments that also
require PARAVIRT=y.
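
A minimal model of the boot-time decision is sketched below.  The struct and function names are hypothetical, and the xen_pv argument stands in for the cpu_feature_enabled(X86_FEATURE_XENPV) check in the patch.

```c
#include <stdbool.h>

struct model_kaiser_flags {
	int enabled;			/* models kaiser_enabled */
	unsigned long asm_do_switch;	/* models kaiser_asm_do_switch[0] */
};

/*
 * Both flags start out zero and are only turned on at the end of
 * initialization, and only when Xen PV is not in use.
 */
static struct model_kaiser_flags model_kaiser_init(bool xen_pv)
{
	struct model_kaiser_flags flags = { 0, 0 };

	if (!xen_pv) {
		flags.asm_do_switch = 1;
		flags.enabled = 1;
	}
	return flags;
}
```

Defaulting both flags to off means the early entry code performs no CR3 switches until initialization has decided they are safe.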

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/mm/kaiser.c |   24 ++++++++++++++++++++++--
 b/security/Kconfig     |    2 +-
 2 files changed, 23 insertions(+), 3 deletions(-)

diff -puN arch/x86/mm/kaiser.c~kaiser-disable-for-xen-pv arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-disable-for-xen-pv	2017-11-10 11:22:21.668244918 -0800
+++ b/arch/x86/mm/kaiser.c	2017-11-10 11:22:21.673244918 -0800
@@ -42,8 +42,20 @@
 #include <asm/tlbflush.h>
 #include <asm/desc.h>
 
+/*
+ * We need a two-stage enable/disable.  One (kaiser_enabled) to stop
+ * the ongoing work that keeps KAISER from being disabled (like PGD
+ * poisoning) and another (kaiser_asm_do_switch) that we set when it
+ * is completely safe to run without doing KAISER switches.
+ */
+int kaiser_enabled;
+
+/*
+ * Sized and aligned so that we can easily map it out to userspace
+ * for use before we have done the assembly CR3 switching.
+ */
 __aligned(PAGE_SIZE)
-unsigned long kaiser_asm_do_switch[PAGE_SIZE/sizeof(unsigned long)] = { 1 };
+unsigned long kaiser_asm_do_switch[PAGE_SIZE/sizeof(unsigned long)];
 
 /*
  * At runtime, the only things we map are some things for CPU
@@ -415,6 +427,15 @@ void __init kaiser_init(void)
 	kaiser_add_user_map_ptrs_early(__irqentry_text_start,
 				       __irqentry_text_end,
 				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
+
+	if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
+		pr_info("x86/kaiser: Xen PV detected, disabling "
+			"KAISER protection\n");
+	} else {
+		pr_info("x86/kaiser: Unmapping kernel while in userspace\n");
+		kaiser_asm_do_switch[0] = 1;
+		kaiser_enabled = 1;
+	}
 }
 
 int kaiser_add_mapping(unsigned long addr, unsigned long size,
@@ -465,7 +486,6 @@ void kaiser_remove_mapping(unsigned long
 	__native_flush_tlb_global();
 }
 
-int kaiser_enabled = 1;
 static ssize_t kaiser_enabled_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
 {
diff -puN security/Kconfig~kaiser-disable-for-xen-pv security/Kconfig
--- a/security/Kconfig~kaiser-disable-for-xen-pv	2017-11-10 11:22:21.670244918 -0800
+++ b/security/Kconfig	2017-11-10 11:22:21.673244918 -0800
@@ -56,7 +56,7 @@ config SECURITY_NETWORK
 
 config KAISER
 	bool "Remove the kernel mapping in user mode"
-	depends on X86_64 && SMP && !PARAVIRT
+	depends on X86_64 && SMP
 	help
 	  This feature reduces the number of hardware side channels by
 	  ensuring that the majority of kernel addresses are not mapped
_

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 21/30] x86, mm: put mmu-to-h/w ASID translation in one place
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-10 22:03     ` Andy Lutomirski
  -1 siblings, 0 replies; 149+ messages in thread
From: Andy Lutomirski @ 2017-11-10 22:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, richard.fellner, Andrew Lutomirski,
	Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On Fri, Nov 10, 2017 at 11:31 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> There are effectively two ASID types:
> 1. The one stored in the mmu_context that goes from 0->5
> 2. The one programmed into the hardware that goes from 1->6
>
> This consolidates the locations where converting between the two
> (by doing +1) to a single place which gives us a nice place to
> comment.  KAISER will also need to, given an ASID, know which
> hardware ASID to flush for the userspace mapping.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
> Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
> Cc: Richard Fellner <richard.fellner@student.tugraz.at>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <keescook@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: x86@kernel.org
> ---
>
>  b/arch/x86/include/asm/tlbflush.h |   30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)
>
> diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-kern arch/x86/include/asm/tlbflush.h
> --- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-kern        2017-11-10 11:22:16.521244931 -0800
> +++ b/arch/x86/include/asm/tlbflush.h   2017-11-10 11:22:16.525244931 -0800
> @@ -87,21 +87,26 @@ static inline u64 inc_mm_tlb_gen(struct
>   */
>  #define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
>
> -/*
> - * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
> - * bits.  This serves two purposes.  It prevents a nasty situation in
> - * which PCID-unaware code saves CR3, loads some other value (with PCID
> - * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
> - * the saved ASID was nonzero.  It also means that any bugs involving
> - * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
> - * deterministically.
> - */
> +static inline u16 kern_asid(u16 asid)
> +{
> +       VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
> +       /*
> +        * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
> +        * bits.  This serves two purposes.  It prevents a nasty situation in
> +        * which PCID-unaware code saves CR3, loads some other value (with PCID
> +        * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
> +        * the saved ASID was nonzero.  It also means that any bugs involving
> +        * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
> +        * deterministically.
> +        */
> +       return asid + 1;
> +}

This seems really error-prone.  Maybe we should have a pcid_t type and
make all the interfaces that want a h/w PCID take pcid_t.

--Andy

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 21/30] x86, mm: put mmu-to-h/w ASID translation in one place
  2017-11-10 22:03     ` Andy Lutomirski
@ 2017-11-10 22:09       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-10 22:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, richard.fellner, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On 11/10/2017 02:03 PM, Andy Lutomirski wrote:
>> +static inline u16 kern_asid(u16 asid)
>> +{
>> +       VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
>> +       /*
>> +        * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
>> +        * bits.  This serves two purposes.  It prevents a nasty situation in
>> +        * which PCID-unaware code saves CR3, loads some other value (with PCID
>> +        * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
>> +        * the saved ASID was nonzero.  It also means that any bugs involving
>> +        * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
>> +        * deterministically.
>> +        */
>> +       return asid + 1;
>> +}
> This seems really error-prone.  Maybe we should have a pcid_t type and
> make all the interfaces that want a h/w PCID take pcid_t.

Yeah, totally agree.  I actually had a nasty bug or two around this area
because of this.

I divided them among hw_asid_t and sw_asid_t.  You can turn a sw_asid_t
into a kernel hw_asid_t or a user hw_asid_t.  But, it caused too much
churn across the TLB flushing code so I shelved it for now.

I'd love to come back and fix this up properly though.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 21/30] x86, mm: put mmu-to-h/w ASID translation in one place
  2017-11-10 22:09       ` Dave Hansen
@ 2017-11-10 22:10         ` Andy Lutomirski
  -1 siblings, 0 replies; 149+ messages in thread
From: Andy Lutomirski @ 2017-11-10 22:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, richard.fellner, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML

On Fri, Nov 10, 2017 at 2:09 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 11/10/2017 02:03 PM, Andy Lutomirski wrote:
>>> +static inline u16 kern_asid(u16 asid)
>>> +{
>>> +       VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
>>> +       /*
>>> +        * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
>>> +        * bits.  This serves two purposes.  It prevents a nasty situation in
>>> +        * which PCID-unaware code saves CR3, loads some other value (with PCID
>>> +        * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
>>> +        * the saved ASID was nonzero.  It also means that any bugs involving
>>> +        * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
>>> +        * deterministically.
>>> +        */
>>> +       return asid + 1;
>>> +}
>> This seems really error-prone.  Maybe we should have a pcid_t type and
>> make all the interfaces that want a h/w PCID take pcid_t.
>
> Yeah, totally agree.  I actually had a nasty bug or two around this area
> because of this.
>
> I divided them among hw_asid_t and sw_asid_t.  You can turn a sw_asid_t
> into a kernel hw_asid_t or a user hw_asid_t.  But, it caused too much
> churn across the TLB flushing code so I shelved it for now.
>
> I'd love to come back and fix this up properly though.

In the long run, I would go with int for the sw asid and pcid_t for
the PCID.  After all, we index arrays with the SW one.
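
A minimal sketch of the typed split discussed above, assuming a plain int
for the software ASID and a one-member struct for the hardware PCID.  The
names kern_pcid(), pcid_bits() and SKETCH_MAX_SW_ASID are hypothetical and
do not come from the patch; only the +1 translation and the 0->5 range are
taken from the quoted changelog.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed upper bound, matching the 0->5 range in the quoted changelog. */
#define SKETCH_MAX_SW_ASID 5

/* Hypothetical wrapper: a hardware PCID is a distinct type, not a bare u16. */
typedef struct { uint16_t val; } pcid_t;

/* The software ASID stays a plain int because it is used to index arrays. */
static inline pcid_t kern_pcid(int sw_asid)
{
	assert(sw_asid >= 0 && sw_asid <= SKETCH_MAX_SW_ASID);
	/* Same +1 translation as kern_asid() in the quoted patch. */
	return (pcid_t){ .val = (uint16_t)(sw_asid + 1) };
}

/* Hypothetical consumer: it only accepts an already-translated PCID. */
static inline uint16_t pcid_bits(pcid_t pcid)
{
	return pcid.val;
}
```

With this shape, passing a raw software ASID to pcid_bits() fails to
compile, because an int does not convert to pcid_t.  That is the class of
mix-up the +1 convention otherwise leaves to code review.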

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-14 18:20     ` Peter Zijlstra
  -1 siblings, 0 replies; 149+ messages in thread
From: Peter Zijlstra @ 2017-11-14 18:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, hughd, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook, x86

On Fri, Nov 10, 2017 at 11:31:39AM -0800, Dave Hansen wrote:
>  static int alloc_ds_buffer(int cpu)
>  {
> +	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
>  
> +	memset(ds, 0, sizeof(*ds));

Still wondering about that memset...

>  	per_cpu(cpu_hw_events, cpu).ds = ds;
>  
>  	return 0;

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers
  2017-11-14 18:20     ` Peter Zijlstra
@ 2017-11-14 18:28       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-14 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, hughd, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook, x86

On 11/14/2017 10:20 AM, Peter Zijlstra wrote:
> On Fri, Nov 10, 2017 at 11:31:39AM -0800, Dave Hansen wrote:
>>  static int alloc_ds_buffer(int cpu)
>>  {
>> +	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
>>  
>> +	memset(ds, 0, sizeof(*ds));
> Still wondering about that memset...

My guess is that it was done to mirror the zeroing done by the original
kzalloc().  But, I think you're right that it's zero'd already by virtue
of being static:

static
DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store,
cpu_debug_store);

I'll queue a cleanup, or update it if I re-post the set.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers
  2017-11-14 18:28       ` Dave Hansen
@ 2017-11-14 19:10         ` Hugh Dickins
  -1 siblings, 0 replies; 149+ messages in thread
From: Hugh Dickins @ 2017-11-14 19:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, linux-kernel, linux-mm, hughd, moritz.lipp,
	daniel.gruss, michael.schwarz, richard.fellner, luto, torvalds,
	keescook, x86

On Tue, 14 Nov 2017, Dave Hansen wrote:
> On 11/14/2017 10:20 AM, Peter Zijlstra wrote:
> > On Fri, Nov 10, 2017 at 11:31:39AM -0800, Dave Hansen wrote:
> >>  static int alloc_ds_buffer(int cpu)
> >>  {
> >> +	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
> >>  
> >> +	memset(ds, 0, sizeof(*ds));
> > Still wondering about that memset...

Sorry, my attention is far away at the moment.

> 
> My guess is that it was done to mirror the zeroing done by the original
> kzalloc().

You guess right.

> But, I think you're right that it's zero'd already by virtue
> of being static:
> 
> static
> DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store,
> cpu_debug_store);
> 
> I'll queue a cleanup, or update it if I re-post the set.

I was about to agree, but now I'm not so sure.  I don't know much
about these PMC things, but at a glance it looks like what is reserved
by x86_reserve_hardware() may later be released by x86_release_hardware(),
and then later reserved again by x86_reserve_hardware().  And although
the static per-cpu area would be zeroed the first time, the second time
it will contain data left over from before, so really needs the memset?

Hugh

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers
  2017-11-14 19:10         ` Hugh Dickins
@ 2017-11-14 19:24           ` Andy Lutomirski
  -1 siblings, 0 replies; 149+ messages in thread
From: Andy Lutomirski @ 2017-11-14 19:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Hansen, Peter Zijlstra, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, richard.fellner,
	Andrew Lutomirski, Linus Torvalds, Kees Cook, X86 ML

On Tue, Nov 14, 2017 at 11:10 AM, Hugh Dickins <hughd@google.com> wrote:
> On Tue, 14 Nov 2017, Dave Hansen wrote:
>> On 11/14/2017 10:20 AM, Peter Zijlstra wrote:
>> > On Fri, Nov 10, 2017 at 11:31:39AM -0800, Dave Hansen wrote:
>> >>  static int alloc_ds_buffer(int cpu)
>> >>  {
>> >> +  struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
>> >>
>> >> +  memset(ds, 0, sizeof(*ds));
>> > Still wondering about that memset...
>
> Sorry, my attention is far away at the moment.
>
>>
>> My guess is that it was done to mirror the zeroing done by the original
>> kzalloc().
>
> You guess right.
>
>> But, I think you're right that it's zero'd already by virtue
>> of being static:
>>
>> static
>> DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store,
>> cpu_debug_store);
>>
>> I'll queue a cleanup, or update it if I re-post the set.
>
> I was about to agree, but now I'm not so sure.  I don't know much
> about these PMC things, but at a glance it looks like what is reserved
> by x86_reserve_hardware() may later be released by x86_release_hardware(),
> and then later reserved again by x86_reserve_hardware().  And although
> the static per-cpu area would be zeroed the first time, the second time
> it will contain data left over from before, so really needs the memset?
>

For an upstream solution, I would really really like to see
DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED and friends completely gone
and to use cpu_entry_area instead.  I don't know whether this has any
material impact on this particular discussion, though.

--Andy

> Hugh

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 04/30] x86, kaiser: disable global pages by default with KAISER
  2017-11-10 19:31   ` Dave Hansen
  (?)
@ 2017-11-14 19:38   ` Rik van Riel
  2017-11-26 14:48       ` Ingo Molnar
  -1 siblings, 1 reply; 149+ messages in thread
From: Rik van Riel @ 2017-11-14 19:38 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: linux-mm, bp, tglx, moritz.lipp, daniel.gruss, michael.schwarz,
	richard.fellner, luto, torvalds, keescook, hughd, x86

On Fri, 2017-11-10 at 11:31 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Global pages stay in the TLB across context switches.  Since all contexts
> share the same kernel mapping, these mappings are marked as global pages
> so kernel entries in the TLB are not flushed out on a context switch.
> 
> But, even having these entries in the TLB opens up something that an
> attacker can use [1].
> 
> That means that even when KAISER switches page tables on return to user
> space the global pages would stay in the TLB cache.
> 
> Disable global pages so that kernel TLB entries can be flushed before
> returning to user space. This way, all accesses to kernel addresses from
> userspace result in a TLB miss independent of the existence of a kernel
> mapping.
> 
> Replace _PAGE_GLOBAL by __PAGE_KERNEL_GLOBAL and keep _PAGE_GLOBAL
> available so that it can still be used for a few selected kernel mappings
> which must be visible to userspace, when KAISER is enabled, like the
> entry/exit code and data.

Nice changelog.

Why am I pointing this out?

> +++ b/arch/x86/include/asm/pgtable_types.h	2017-11-10 11:22:06.626244956 -0800
> @@ -179,8 +179,20 @@ enum page_cache_mode {
>  #define PAGE_READONLY_EXEC	__pgprot(_PAGE_PRESENT | _PAGE_USER |	\
>  					 _PAGE_ACCESSED)
>  
> +/*
> + * Disable global pages for anything using the default
> + * __PAGE_KERNEL* macros.  PGE will still be enabled
> + * and _PAGE_GLOBAL may still be used carefully.
> + */
> +#ifdef CONFIG_KAISER
> +#define __PAGE_KERNEL_GLOBAL	0
> +#else
> +#define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
> +#endif
> +					

The comment above could use a little more info
on why things are done that way, though :)
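
As a rough illustration of what the quoted hunk achieves, here is a
standalone sketch.  The bit positions are the architectural x86 ones; the
SKETCH_* macros and is_global() are hypothetical simplifications, and
CONFIG_KAISER is defined by hand only to show the KAISER=y case.

```c
#include <stdint.h>

/* x86 page-table bits: present, writable, accessed, dirty, global. */
#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_RW	(1ULL << 1)
#define _PAGE_ACCESSED	(1ULL << 5)
#define _PAGE_DIRTY	(1ULL << 6)
#define _PAGE_GLOBAL	(1ULL << 8)

/* Defined here only for the sketch; normally this comes from Kconfig. */
#define CONFIG_KAISER 1

#ifdef CONFIG_KAISER
#define __PAGE_KERNEL_GLOBAL	0
#else
#define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
#endif

/* Simplified default kernel protection; the real macro carries more bits. */
#define SKETCH_PAGE_KERNEL \
	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY | \
	 __PAGE_KERNEL_GLOBAL)

/* A selected mapping that must stay global opts in explicitly. */
#define SKETCH_PAGE_KERNEL_USER_VISIBLE	(SKETCH_PAGE_KERNEL | _PAGE_GLOBAL)

/* Returns nonzero when the protection value has the global bit set. */
static inline int is_global(uint64_t prot)
{
	return (prot & _PAGE_GLOBAL) != 0;
}
```

With CONFIG_KAISER defined, is_global(SKETCH_PAGE_KERNEL) is 0 while the
explicitly global variant still reports 1, which is the "default off, opt
in carefully" behaviour the comment in the patch describes.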

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers
  2017-11-14 19:10         ` Hugh Dickins
@ 2017-11-15  9:41           ` Peter Zijlstra
  -1 siblings, 0 replies; 149+ messages in thread
From: Peter Zijlstra @ 2017-11-15  9:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook, x86

On Tue, Nov 14, 2017 at 11:10:23AM -0800, Hugh Dickins wrote:
> I was about to agree, but now I'm not so sure.  I don't know much
> about these PMC things, but at a glance it looks like what is reserved
> by x86_reserve_hardware() may later be released by x86_release_hardware(),
> and then later reserved again by x86_reserve_hardware().  And although
> the static per-cpu area would be zeroed the first time, the second time
> it will contain data left over from before, so really needs the memset?

Ah, yes. It does get reused. I think it's still fine, but yes, let's keep
it. Better safe than sorry, and it's not a hot path in any case.

Thanks!
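
The reuse concern can be shown with a small userspace sketch.  It assumes
a simplified stand-in for struct debug_store and hypothetical
reserve_ds()/release_ds() helpers; it models only the storage lifetime,
not the real x86_reserve_hardware() path.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct debug_store; the real layout differs. */
struct ds_sketch {
	uint64_t bts_index;
	uint64_t pebs_index;
};

/* Static storage: zero-initialized once, before first use. */
static struct ds_sketch ds_area;

/* Models the reserve step; 'clear' selects whether the memset is kept. */
static struct ds_sketch *reserve_ds(int clear)
{
	if (clear)
		memset(&ds_area, 0, sizeof(ds_area));
	return &ds_area;
}

/* Models the release step: nothing is freed, so the old contents remain. */
static void release_ds(struct ds_sketch *ds)
{
	(void)ds;
}
```

Reserving, writing, releasing and reserving again without the clear hands
back the old values; only the first reservation benefits from the static
zero initialization.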

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-16 19:19     ` Andrea Arcangeli
  -1 siblings, 0 replies; 149+ messages in thread
From: Andrea Arcangeli @ 2017-11-16 19:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

Hello,

On Fri, Nov 10, 2017 at 11:31:50AM -0800, Dave Hansen wrote:
> Hugh Dickins also points out that PCIDs really have two distinct
> use-cases in the context of KAISER.  The first way they can be used

I don't see why you try to retain such a minor optimization for newer
Intel chips when at the same time you prevent KAISER from running with
good performance on older Intel chips like SandyBridge/IvyBridge, which
would create a major performance regression for those two. I'd prefer
if you revert the PCID feature of v4.14 when KAISER is enabled (at
build time would be enough initially), and you use just two asids to
only accelerate enter/exit kernel and you flush the whole TLB over mm
switch like Hugh suggested. It may not even be worth flushing over
cr4, as you've only two asids to deal with anyway.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster
  2017-11-16 19:19     ` Andrea Arcangeli
@ 2017-11-16 19:25       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-16 19:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

On 11/16/2017 11:19 AM, Andrea Arcangeli wrote:
> On Fri, Nov 10, 2017 at 11:31:50AM -0800, Dave Hansen wrote:
>> Hugh Dickins also points out that PCIDs really have two distinct
>> use-cases in the context of KAISER.  The first way they can be used
> I don't see why you try to retain such a minor optimization for newer
> Intel chips when at the same you prevent KAISER to run with good
> performance on older Intel chips like SandyBridge/IvyBridge which
> would create a major performance regression for those two.

This was more straightforward to do.

The other way requires having *TWO* PCID modes.  So, we need to
disambiguate the two modes in the existing infrastructure in addition to
adding KAISER.

Had I gone and done that, my fear was that we would be left with no
usable PCIDs on *any* hardware.  So, this was easier, I went and did it
first, and I'd love to see someone add support for PCIDs on those older
non-INVPCID systems.  "Someone" may even be me, but it'll be in v2.

Patches welcome before then. :)

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 05/30] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-20 12:17     ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-20 12:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

On Fri, 10 Nov 2017, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This is largely code from Andy Lutomirski.  I fixed a few bugs
> in it, and added a few SWITCH_TO_* spots.
> 
> KAISER needs to switch to a different CR3 value when it enters
> the kernel and switch back when it exits.  This essentially
> needs to be done before leaving assembly code.
> 
> This is extra challenging because the switching context is
> tricky: the registers that can be clobbered can vary.  It is also
> hard to store things on the stack because there is an established
> ABI (ptregs) or the stack is entirely unsafe to use.

Changelog nitpicking starts here

> This patch establishes a set of macros that allow changing to

s/This patch establishes/Establish/

> the user and kernel CR3 values.
> 
> Interactions with SWAPGS: previous versions of the KAISER code
> relied on having per-cpu scratch space to save/restore a register
> that can be used for the CR3 MOV.  The %GS register is used to
> index into our per-cpu space, so SWAPGS *had* to be done before

s/our/the/

> the CR3 switch.  That scratch space is gone now, but the semantic
> that SWAPGS must be done before the CR3 MOV is retained.  This is
> good to keep because it is not that hard to do and it allows us

s/us//

> to do things like add per-cpu debugging information to help us
> figure out what goes wrong sometimes.

the part after 'information' is fairy tale mode and redundant. Debugging
information says it all, right?

> What this does in the NMI code is worth pointing out.  NMIs
> can interrupt *any* context and they can also be nested with
> NMIs interrupting other NMIs.  The comments below
> ".Lnmi_from_kernel" explain the format of the stack during this
> situation.  Changing the format of this stack is not a fun
> exercise: I tried.  Instead of storing the old CR3 value on the
> stack, this patch depend on the *regular* register save/restore
> mechanism and then uses %r14 to keep CR3 during the NMI.  It is
> callee-saved and will not be clobbered by the C NMI handlers that
> get called.

  The comments below ".Lnmi_from_kernel" explain the format of the stack
  during this situation. Changing this stack format is too complex and
  risky, so the following solution has been used:

  Instead of storing the old CR3 value on the stack, depend on the regular
  register save/restore mechanism and use %r14 to hold CR3 during the
  NMI. r14 is callee-saved and will not be clobbered by the C NMI handlers
  that get called.

End of nitpicking

> +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> +	movq	%cr3, %r\scratch_reg
> +	movq	%r\scratch_reg, \save_reg
> +	/*
> +	 * Is the switch bit zero?  This means the address is
> +	 * up in real KAISER patches in a moment.

  	 * If the switch bit is zero, CR3 points at the kernel page tables
	 * already.
Hmm?
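
To make the suggested wording concrete, here is a small C model of the
check. It is only a sketch: it assumes the kernel and user PGDs sit in
adjacent 4k pages, so that a single CR3 bit (called SWITCH_BIT here and
taken to be bit 12) selects between them. The names are illustrative
and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: bit 12 of CR3 selects the user (shadow) PGD. */
#define SWITCH_BIT	(1ULL << 12)

/* Nonzero if CR3 currently points at the user page tables. */
static int cr3_is_user(uint64_t cr3)
{
	return (cr3 & SWITCH_BIT) != 0;
}

/*
 * Model of SAVE_AND_SWITCH_TO_KERNEL_CR3: remember the old CR3 and
 * return the value to run the kernel with.  If the switch bit is zero,
 * CR3 points at the kernel page tables already and is returned as-is.
 */
static uint64_t save_and_switch_to_kernel(uint64_t cr3, uint64_t *saved)
{
	assert(saved != 0);
	*saved = cr3;
	if (!cr3_is_user(cr3))
		return cr3;
	return cr3 & ~SWITCH_BIT;
}

/* Model of RESTORE_CR3: go back to whatever was live on entry. */
static uint64_t restore_cr3(uint64_t saved)
{
	return saved;
}
```

The point of saving the old value unconditionally is that an NMI or
paranoid entry can arrive with either set of page tables live, so the
exit path must restore what it found rather than assume user mode.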

>  /*
> @@ -1189,6 +1201,7 @@ ENTRY(paranoid_exit)
>  	testl	%ebx, %ebx			/* swapgs needed? */
>  	jnz	.Lparanoid_exit_no_swapgs
>  	TRACE_IRQS_IRETQ
> +	RESTORE_CR3	%r14

You have the named macro arguments everywhere, just not here.

Other than that.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 00/30] [v3] KAISER: unmap most of the kernel from userspace page tables
  2017-11-10 19:30 ` Dave Hansen
@ 2017-11-20 16:02   ` Juerg Haefliger
  -1 siblings, 0 replies; 149+ messages in thread
From: Juerg Haefliger @ 2017-11-20 16:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, linux-mm, moritz.lipp, daniel.gruss, michael.schwarz,
	richard.fellner, luto, Linus Torvalds, keescook, hughd, x86,
	jgross

On Fri, Nov 10, 2017 at 8:30 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> Thanks, everyone for all the reviews thus far.  I hope I managed to
> address all the feedback given so far, except for the TODOs of
> course.  This is a pretty minor update compared to v1->v2.
>
> These patches are all on top of Andy's entry changes here:
>
>         https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/entry_consolidation
>
> Changes from v2:
>  * Reword documentation removing "we"
>  * Fix some whitespace damage
>  * Fix up MAX ASID values off-by-one noted by Peter Z
>  * Change CodingStyle stuff from Borislav comments
>  * Always use _KERNPG_TABLE for pmd_populate_kernel().
>
> Changes from v1:
>  * Updated to be on top of Andy L's new entry code
>  * Allow global pages again, and use them for pages mapped into
>    userspace page tables.
>  * Use trampoline stack instead of process stack at entry so no
>    longer need to map process stack (big win in fork() speed)
>  * Made the page table walking less generic by restricting it
>    to kernel addresses and !_PAGE_USER pages.
>  * Added a debugfs file to enable/disable CR3 switching at
>    runtime.  This does not remove all the KAISER overhead, but
>    it removes the largest source.
>  * Use runtime disable with Xen to permit Xen-PV guests with
>    KAISER=y.
>  * Moved assembly code from "core" to "prepare assembly" patch
>  * Pass full register name to asm macros
>  * Remove double stack switch in entry_SYSENTER_compat
>  * Disable vsyscall native case when KAISER=y
>  * Separate PER_CPU_USER_MAPPED generic definitions from use
>    by arch/x86/.
>
> TODO:
>  * Allow dumping the shadow page tables with the ptdump code
>  * Put LDT at top of userspace
>  * Create separate tlb flushing functions for user and kernel
>  * Chase down the source of the new !CR4.PGE warning that 0day
>    found with i386
>
> ---
>
> tl;dr:
>
> KAISER makes it harder to defeat KASLR, but makes syscalls and
> interrupts slower.  These patches are based on work from a team at
> Graz University of Technology posted here[1].  The major addition is
> support for Intel PCIDs which builds on top of Andy Lutomirski's PCID
> work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
> for a wide variety of use cases.
>
> Full Description:
>
> KAISER is a countermeasure against attacks on kernel address
> information.  There are at least three existing, published,
> approaches using the shared user/kernel mapping and hardware features
> to defeat KASLR.  One approach referenced in the paper locates the
> kernel by observing differences in page fault timing between
> present-but-inaccessible kernel pages and non-present pages.
>
> KAISER addresses this by unmapping (most of) the kernel when
> userspace runs.  It leaves the existing page tables largely alone and
> refers to them as "kernel page tables".  For running userspace, a new
> "shadow" copy of the page tables is allocated for each process.  The
> shadow page tables map all the same user memory as the "kernel" copy,
> but only map a minimal set of kernel memory.
>
> When we enter the kernel via syscalls, interrupts or exceptions,
> page tables are switched to the full "kernel" copy.  When the system
> switches back to user mode, the "shadow" copy is used.  Process
> Context IDentifiers (PCIDs) are used to ensure that the TLB is not
> flushed when switching between page tables, which makes syscalls
> roughly 2x faster than without it.  PCIDs are usable on Haswell and
> newer CPUs (the ones with "v4", or called fourth-generation Core).
>
> The minimal kernel page tables try to map only what is needed to
> enter/exit the kernel such as the entry/exit functions, interrupt
> descriptors (IDT) and the kernel trampoline stacks.  This minimal set
> of data can still reveal the kernel's ASLR base address.  But, this
> minimal kernel data is all trusted, which makes it harder to exploit
> than data in the kernel direct map which contains loads of
> user-controlled data.
>
> KAISER will affect performance for anything that does system calls or
> interrupts: everything.  Just the new instructions (CR3 manipulation)
> add a few hundred cycles to a syscall or interrupt.  Most workloads
> that we have run show single-digit regressions.  5% is a good round
> number for what is typical.  The worst we have seen is a roughly 30%
> regression on a loopback networking test that did a ton of syscalls
> and context switches.  More details about possible performance
> impacts are in the new Documentation/ file.
>
> This code is based on a version I downloaded from
> (https://github.com/IAIK/KAISER).  It has been heavily modified.
>
> The approach is described in detail in a paper[2].  However, there is
> some incorrect and information in the paper, both on how Linux and
> the hardware works.  For instance, I do not share the opinion that
> KAISER has "runtime overhead of only 0.28%".  Please rely on this
> patch series as the canonical source of information about this
> submission.
>
> Here is one example of how the kernel image grows with CONFIG_KAISER
> on and off.  Most of the size increase is presumably from additional
> alignment requirements for mapping entry/exit code and structures.
>
>     text    data     bss      dec filename
> 11786064 7356724 2928640 22071428 vmlinux-nokaiser
> 11798203 7371704 2928640 22098547 vmlinux-kaiser
>   +12139  +14980       0   +27119
>
> To give folks an idea what the performance impact is like, I took
> the following test and ran it single-threaded:
>
>         https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
>
> It's a pretty quick syscall so this shows how much KAISER slows
> down syscalls (and how much PCIDs help).  The units here are
> lseeks/second:
>
>         no kaiser: 5.2M
>     kaiser+  pcid: 3.0M
>     kaiser+nopcid: 2.2M
>
> "nopcid" is literally with the "nopcid" command-line option which
> turns PCIDs off entirely.
>
> Thanks to:
> The original KAISER team at Graz University of Technology.
> Andy Lutomirski for all the help with the entry code.
> Kirill Shutemov for a helpful review of the code.
>
> 1. https://github.com/IAIK/KAISER
> 2. https://gruss.cc/files/kaiser.pdf
>
> --
>
> The code is available here:
>
>         https://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-kaiser.git/
>
>  Documentation/x86/kaiser.txt                | 160 +++++
>  arch/x86/Kconfig                            |   8 +
>  arch/x86/entry/calling.h                    |  89 +++
>  arch/x86/entry/entry_64.S                   |  44 +-
>  arch/x86/entry/entry_64_compat.S            |   8 +
>  arch/x86/events/intel/ds.c                  |  49 +-
>  arch/x86/include/asm/cpufeatures.h          |   1 +
>  arch/x86/include/asm/desc.h                 |   2 +-
>  arch/x86/include/asm/kaiser.h               |  62 ++
>  arch/x86/include/asm/mmu_context.h          |  29 +-
>  arch/x86/include/asm/pgalloc.h              |  37 +-
>  arch/x86/include/asm/pgtable.h              |  20 +-
>  arch/x86/include/asm/pgtable_64.h           | 135 +++++
>  arch/x86/include/asm/pgtable_types.h        |  25 +-
>  arch/x86/include/asm/processor.h            |   2 +-
>  arch/x86/include/asm/tlbflush.h             | 232 +++++++-
>  arch/x86/include/uapi/asm/processor-flags.h |   3 +-
>  arch/x86/kernel/cpu/common.c                |  21 +-
>  arch/x86/kernel/espfix_64.c                 |  27 +-
>  arch/x86/kernel/head_64.S                   |  30 +-
>  arch/x86/kernel/ldt.c                       |  25 +-
>  arch/x86/kernel/process.c                   |   2 +-
>  arch/x86/kernel/process_64.c                |   2 +-
>  arch/x86/kernel/traps.c                     |  46 +-
>  arch/x86/kvm/x86.c                          |   3 +-
>  arch/x86/mm/Makefile                        |   1 +
>  arch/x86/mm/init.c                          |  75 ++-
>  arch/x86/mm/kaiser.c                        | 627 ++++++++++++++++++++
>  arch/x86/mm/pageattr.c                      |  18 +-
>  arch/x86/mm/pgtable.c                       |  16 +-
>  arch/x86/mm/tlb.c                           | 105 +++-
>  include/asm-generic/vmlinux.lds.h           |  17 +
>  include/linux/kaiser.h                      |  34 ++
>  include/linux/percpu-defs.h                 |  30 +
>  init/main.c                                 |   3 +
>  kernel/fork.c                               |   1 +
>  security/Kconfig                            |  10 +
>  37 files changed, 1851 insertions(+), 148 deletions(-)
>
> Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
> Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
> Cc: Richard Fellner <richard.fellner@student.tugraz.at>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <keescook@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>

I get a compilation error with:
CONFIG_RANDOMIZE_BASE=y

  OBJCOPY arch/x86/boot/compressed/vmlinux.bin
  RELOCS  arch/x86/boot/compressed/vmlinux.relocs
  CC      arch/x86/boot/compressed/early_serial_console.o
  CC      arch/x86/boot/compressed/kaslr.o
  CC      arch/x86/boot/compressed/pagetable.o
  CC      arch/x86/boot/compressed/misc.o
  GZIP    arch/x86/boot/compressed/vmlinux.bin.gz
  MKPIGGY arch/x86/boot/compressed/piggy.S
  AS      arch/x86/boot/compressed/piggy.o
  DATAREL arch/x86/boot/compressed/vmlinux
  LD      arch/x86/boot/compressed/vmlinux
arch/x86/boot/compressed/pagetable.o: In function `kernel_ident_mapping_init':
pagetable.c:(.text+0x31b): undefined reference to `kaiser_enabled'
arch/x86/boot/compressed/Makefile:106: recipe for target
'arch/x86/boot/compressed/vmlinux' failed
make[2]: *** [arch/x86/boot/compressed/vmlinux] Error 1
arch/x86/boot/Makefile:112: recipe for target
'arch/x86/boot/compressed/vmlinux' failed
make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2
arch/x86/Makefile:295: recipe for target 'bzImage' failed
make: *** [bzImage] Error 2

Compiles fine with:
# CONFIG_RANDOMIZE_BASE is not set
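
A link failure of this shape is commonly avoided by giving the
decompressor a constant stub instead of the real variable, since
arch/x86/boot/compressed/ is linked separately from the kernel proper.
The sketch below is hypothetical and self-contained:
BOOT_COMPRESSED_BUILD, ident_map_pgd_flags() and the flag values are
invented for illustration, and this is not a proposed patch.

```c
#include <assert.h>

/* Pretend this translation unit is the decompressor (hypothetical macro). */
#define BOOT_COMPRESSED_BUILD 1

#ifdef BOOT_COMPRESSED_BUILD
/* No KAISER state exists this early: fold the check to a constant. */
#define kaiser_enabled 0
#else
/* Regular kernel build: the variable is defined elsewhere. */
extern int kaiser_enabled;
#endif

#define PGD_FLAGS_BASE	0x067u		/* illustrative value only */
#define PGD_FLAG_NX	0x800u		/* illustrative value only */

/* Shared helper, compiled into both the kernel and the decompressor. */
static unsigned int ident_map_pgd_flags(void)
{
	unsigned int flags = PGD_FLAGS_BASE;

	if (kaiser_enabled)	/* becomes "if (0)" in the stub case */
		flags |= PGD_FLAG_NX;
	return flags;
}
```

With the stub in place the shared code no longer emits a reference to
the symbol, so the separate link of the compressed image has nothing
left to resolve.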

...Juerg


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 00/30] [v3] KAISER: unmap most of the kernel from userspace page tables
@ 2017-11-20 16:02   ` Juerg Haefliger
  0 siblings, 0 replies; 149+ messages in thread
From: Juerg Haefliger @ 2017-11-20 16:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, linux-mm, moritz.lipp, daniel.gruss, michael.schwarz,
	richard.fellner, luto, Linus Torvalds, keescook, hughd, x86,
	jgross

On Fri, Nov 10, 2017 at 8:30 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> Thanks, everyone for all the reviews thus far.  I hope I managed to
> address all the feedback given so far, except for the TODOs of
> course.  This is a pretty minor update compared to v1->v2.
>
> These patches are all on top of Andy's entry changes here:
>
>         https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/entry_consolidation
>
> Changes from v2:
>  * Reword documentation removing "we"
>  * Fix some whitespace damage
>  * Fix up MAX ASID values off-by-one noted by Peter Z
>  * Change CodingStyle stuff from Borislav comments
>  * Always use _KERNPG_TABLE for pmd_populate_kernel().
>
> Changes from v1:
>  * Updated to be on top of Andy L's new entry code
>  * Allow global pages again, and use them for pages mapped into
>    userspace page tables.
>  * Use trampoline stack instead of process stack at entry so no
>    longer need to map process stack (big win in fork() speed)
>  * Made the page table walking less generic by restricting it
>    to kernel addresses and !_PAGE_USER pages.
>  * Added a debugfs file to enable/disable CR3 switching at
>    runtime.  This does not remove all the KAISER overhead, but
>    it removes the largest source.
>  * Use runtime disable with Xen to permit Xen-PV guests with
>    KAISER=y.
>  * Moved assembly code from "core" to "prepare assembly" patch
>  * Pass full register name to asm macros
>  * Remove double stack switch in entry_SYSENTER_compat
>  * Disable vsyscall native case when KAISER=y
>  * Separate PER_CPU_USER_MAPPED generic definitions from use
>    by arch/x86/.
>
> TODO:
>  * Allow dumping the shadow page tables with the ptdump code
>  * Put LDT at top of userspace
>  * Create separate tlb flushing functions for user and kernel
>  * Chase down the source of the new !CR4.PGE warning that 0day
>    found with i386
>
> ---
>
> tl;dr:
>
> KAISER makes it harder to defeat KASLR, but makes syscalls and
> interrupts slower.  These patches are based on work from a team at
> Graz University of Technology posted here[1].  The major addition is
> support for Intel PCIDs which builds on top of Andy Lutomorski's PCID
> work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
> for a wide variety of use cases.
>
> Full Description:
>
> KAISER is a countermeasure against attacks on kernel address
> information.  There are at least three existing, published,
> approaches using the shared user/kernel mapping and hardware features
> to defeat KASLR.  One approach referenced in the paper locates the
> kernel by observing differences in page fault timing between
> present-but-inaccessable kernel pages and non-present pages.
>
> KAISER addresses this by unmapping (most of) the kernel when
> userspace runs.  It leaves the existing page tables largely alone and
> refers to them as "kernel page tables".  For running userspace, a new
> "shadow" copy of the page tables is allocated for each process.  The
> shadow page tables map all the same user memory as the "kernel" copy,
> but only maps a minimal set of kernel memory.
>
> When we enter the kernel via syscalls, interrupts or exceptions,
> page tables are switched to the full "kernel" copy.  When the system
> switches back to user mode, the "shadow" copy is used.  Process
> Context IDentifiers (PCIDs) are used to to ensure that the TLB is not
> flushed when switching between page tables, which makes syscalls
> roughly 2x faster than without it.  PCIDs are usable on Haswell and
> newer CPUs (the ones with "v4", or called fourth-generation Core).
>
> The minimal kernel page tables try to map only what is needed to
> enter/exit the kernel such as the entry/exit functions, interrupt
> descriptors (IDT) and the kernel trampoline stacks.  This minimal set
> of data can still reveal the kernel's ASLR base address.  But, this
> minimal kernel data is all trusted, which makes it harder to exploit
> than data in the kernel direct map which contains loads of
> user-controlled data.
>
> KAISER will affect performance for anything that does system calls or
> interrupts: everything.  Just the new instructions (CR3 manipulation)
> add a few hundred cycles to a syscall or interrupt.  Most workloads
> that we have run show single-digit regressions.  5% is a good round
> number for what is typical.  The worst we have seen is a roughly 30%
> regression on a loopback networking test that did a ton of syscalls
> and context switches.  More details about possible performance
> impacts are in the new Documentation/ file.
>
> This code is based on a version I downloaded from
> (https://github.com/IAIK/KAISER).  It has been heavily modified.
>
> The approach is described in detail in a paper[2].  However, there is
> some incorrect and information in the paper, both on how Linux and
> the hardware works.  For instance, I do not share the opinion that
> KAISER has "runtime overhead of only 0.28%".  Please rely on this
> patch series as the canonical source of information about this
> submission.
>
> Here is one example of how the kernel image grow with CONFIG_KAISER
> on and off.  Most of the size increase is presumably from additional
> alignment requirements for mapping entry/exit code and structures.
>
>     text    data     bss      dec filename
> 11786064 7356724 2928640 22071428 vmlinux-nokaiser
> 11798203 7371704 2928640 22098547 vmlinux-kaiser
>   +12139  +14980       0   +27119
>
> To give folks an idea what the performance impact is like, I took
> the following test and ran it single-threaded:
>
>         https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
>
> It's a pretty quick syscall so this shows how much KAISER slows
> down syscalls (and how much PCIDs help).  The units here are
> lseeks/second:
>
>         no kaiser: 5.2M
>     kaiser+  pcid: 3.0M
>     kaiser+nopcid: 2.2M
>
> "nopcid" is literally with the "nopcid" command-line option which
> turns PCIDs off entirely.
>
> Thanks to:
> The original KAISER team at Graz University of Technology.
> Andy Lutomirski for all the help with the entry code.
> Kirill Shutemov for a helpful review of the code.
>
> 1. https://github.com/IAIK/KAISER
> 2. https://gruss.cc/files/kaiser.pdf
>
> --
>
> The code is available here:
>
>         https://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-kaiser.git/
>
>  Documentation/x86/kaiser.txt                | 160 +++++
>  arch/x86/Kconfig                            |   8 +
>  arch/x86/entry/calling.h                    |  89 +++
>  arch/x86/entry/entry_64.S                   |  44 +-
>  arch/x86/entry/entry_64_compat.S            |   8 +
>  arch/x86/events/intel/ds.c                  |  49 +-
>  arch/x86/include/asm/cpufeatures.h          |   1 +
>  arch/x86/include/asm/desc.h                 |   2 +-
>  arch/x86/include/asm/kaiser.h               |  62 ++
>  arch/x86/include/asm/mmu_context.h          |  29 +-
>  arch/x86/include/asm/pgalloc.h              |  37 +-
>  arch/x86/include/asm/pgtable.h              |  20 +-
>  arch/x86/include/asm/pgtable_64.h           | 135 +++++
>  arch/x86/include/asm/pgtable_types.h        |  25 +-
>  arch/x86/include/asm/processor.h            |   2 +-
>  arch/x86/include/asm/tlbflush.h             | 232 +++++++-
>  arch/x86/include/uapi/asm/processor-flags.h |   3 +-
>  arch/x86/kernel/cpu/common.c                |  21 +-
>  arch/x86/kernel/espfix_64.c                 |  27 +-
>  arch/x86/kernel/head_64.S                   |  30 +-
>  arch/x86/kernel/ldt.c                       |  25 +-
>  arch/x86/kernel/process.c                   |   2 +-
>  arch/x86/kernel/process_64.c                |   2 +-
>  arch/x86/kernel/traps.c                     |  46 +-
>  arch/x86/kvm/x86.c                          |   3 +-
>  arch/x86/mm/Makefile                        |   1 +
>  arch/x86/mm/init.c                          |  75 ++-
>  arch/x86/mm/kaiser.c                        | 627 ++++++++++++++++++++
>  arch/x86/mm/pageattr.c                      |  18 +-
>  arch/x86/mm/pgtable.c                       |  16 +-
>  arch/x86/mm/tlb.c                           | 105 +++-
>  include/asm-generic/vmlinux.lds.h           |  17 +
>  include/linux/kaiser.h                      |  34 ++
>  include/linux/percpu-defs.h                 |  30 +
>  init/main.c                                 |   3 +
>  kernel/fork.c                               |   1 +
>  security/Kconfig                            |  10 +
>  37 files changed, 1851 insertions(+), 148 deletions(-)
>
> Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
> Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
> Cc: Richard Fellner <richard.fellner@student.tugraz.at>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <keescook@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>

I get a compilation error with:
CONFIG_RANDOMIZE_BASE=y

  OBJCOPY arch/x86/boot/compressed/vmlinux.bin
  RELOCS  arch/x86/boot/compressed/vmlinux.relocs
  CC      arch/x86/boot/compressed/early_serial_console.o
  CC      arch/x86/boot/compressed/kaslr.o
  CC      arch/x86/boot/compressed/pagetable.o
  CC      arch/x86/boot/compressed/misc.o
  GZIP    arch/x86/boot/compressed/vmlinux.bin.gz
  MKPIGGY arch/x86/boot/compressed/piggy.S
  AS      arch/x86/boot/compressed/piggy.o
  DATAREL arch/x86/boot/compressed/vmlinux
  LD      arch/x86/boot/compressed/vmlinux
arch/x86/boot/compressed/pagetable.o: In function `kernel_ident_mapping_init':
pagetable.c:(.text+0x31b): undefined reference to `kaiser_enabled'
arch/x86/boot/compressed/Makefile:106: recipe for target 'arch/x86/boot/compressed/vmlinux' failed
make[2]: *** [arch/x86/boot/compressed/vmlinux] Error 1
arch/x86/boot/Makefile:112: recipe for target 'arch/x86/boot/compressed/vmlinux' failed
make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2
arch/x86/Makefile:295: recipe for target 'bzImage' failed
make: *** [bzImage] Error 2

Compiles fine with:
# CONFIG_RANDOMIZE_BASE is not set

...Juerg



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 08/30] x86, kaiser: unmap kernel from userspace page tables (core patch)
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-20 17:21     ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-20 17:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, richard.fellner, moritz.lipp,
	daniel.gruss, michael.schwarz, luto, torvalds, keescook, hughd,
	x86

On Fri, 10 Nov 2017, Dave Hansen wrote:
> diff -puN arch/x86/entry/entry_64.S~kaiser-base arch/x86/entry/entry_64.S
> --- a/arch/x86/entry/entry_64.S~kaiser-base	2017-11-10 11:22:09.007244950 -0800
> +++ b/arch/x86/entry/entry_64.S	2017-11-10 11:22:09.031244950 -0800
> @@ -145,6 +145,16 @@ ENTRY(entry_SYSCALL_64)
>  
>  	swapgs
>  	movq	%rsp, PER_CPU_VAR(rsp_scratch)
> +
> +	/*
> +	 * We need a good kernel CR3 to be able to map the process
> +	 * stack, but we need a scratch register to be able to load
> +	 * CR3.  We could create another PER_CPU_VAR(), but %rsp is
> +	 * actually clobberable right now.  Just use it.  It will only
> +	 * be insane for one a couple instructions.
> +	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp

Shouldn't this be in the patch which introduces all that SWITCH macro magic?

>  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>  
>  	/* Construct struct pt_regs on stack */
> @@ -169,8 +179,6 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>  
>  	/* NB: right here, all regs except r11 are live. */

Stale comment

>  
> -	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
> -
>  	/* Must wait until we have the kernel CR3 to call C functions: */
>  	TRACE_IRQS_OFF
>  
> @@ -1269,6 +1277,7 @@ ENTRY(error_entry)
>  	 * gsbase and proceed.  We'll fix up the exception and land in
>  	 * .Lgs_change's error handler with kernel gsbase.
>  	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax

See above.

>  	SWAPGS
>  	jmp .Lerror_entry_done
>  
> @@ -1382,6 +1391,7 @@ ENTRY(nmi)
>  
>  	swapgs
>  	cld
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
>  	movq	%rsp, %rdx
>  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>  	UNWIND_HINT_IRET_REGS base=%rdx offset=8
> @@ -1410,7 +1420,6 @@ ENTRY(nmi)
>  	UNWIND_HINT_REGS
>  	ENCODE_FRAME_POINTER
>  
> -	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi

Ditto

> +#ifdef CONFIG_KAISER
> +/*
> + * All top-level KAISER page tables are order-1 pages (8k-aligned
> + * and 8k in size).  The kernel one is at the beginning 4k and
> + * the user (shadow) one is in the last 4k.  To switch between
> + * them, you just need to flip the 12th bit in their addresses.
> + */
> +#define KAISER_PGTABLE_SWITCH_BIT	PAGE_SHIFT
> +
> +/*
> + * This generates better code than the inline assembly in
> + * __set_bit().
> + */
> +static inline void *ptr_set_bit(void *ptr, int bit)
> +{
> +	unsigned long __ptr = (unsigned long)ptr;

Newline between declaration and code please.

> +	__ptr |= (1<<bit);

  __ptr |= 1UL << bit;

> +	return (void *)__ptr;
> +}
> +static inline void *ptr_clear_bit(void *ptr, int bit)
> +{
> +	unsigned long __ptr = (unsigned long)ptr;
> +	__ptr &= ~(1<<bit);
> +	return (void *)__ptr;

Ditto

> +}
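To illustrate the 1UL point: in `1<<bit` the constant has type int, so the shift is done in int and only then widened. For bit 12 that happens to work, but it stops being well-defined once the bit index reaches the width of int. A minimal userspace sketch follows, assuming 4k pages and using uint64_t in place of unsigned long; the sketch_* names are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4k pages, so the pgd switch bit is bit 12. */
#define SKETCH_PAGE_SHIFT 12

/* Set one bit; the constant is widened to 64 bits before the shift. */
static inline uint64_t sketch_set_bit(uint64_t ptr, unsigned int bit)
{
	assert(bit < 64);

	return ptr | ((uint64_t)1 << bit);
}

/* Clear one bit; the mask is built in 64 bits, so no sign extension is involved. */
static inline uint64_t sketch_clear_bit(uint64_t ptr, unsigned int bit)
{
	assert(bit < 64);

	return ptr & ~((uint64_t)1 << bit);
}
```

With an 8k-aligned pair of top-level tables, setting bit SKETCH_PAGE_SHIFT on the address of the first 4k yields the address of the second 4k, and clearing it goes back.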

> +/*
> + * Page table pages are page-aligned.  The lower half of the top
> + * level is used for userspace and the top half for the kernel.
> + * This returns true for user pages that need to get copied into
> + * both the user and kernel copies of the page tables, and false
> + * for kernel pages that should only be in the kernel copy.
> + */
> +static inline bool is_userspace_pgd(void *__ptr)
> +{
> +	unsigned long ptr = (unsigned long)__ptr;
> +
> +	return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2));

The outer brackets are not required and the obvious way to write that is:

  	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);

I guess the compiler is smart enough to figure that out itself, but ...

> +}
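For reference, the two forms are equivalent whenever the page size is a power of two. The sketch below is a userspace illustration only, assuming 4k pages and 8-byte top-level entries (so entries 0..255 sit in the first 2k of the page); the SKETCH_* and sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
	      "the mask form relies on a power-of-two page size");

/* Modulo form, as in the quoted patch. */
static inline bool sketch_in_lower_half_mod(uint64_t ptr)
{
	return (ptr % SKETCH_PAGE_SIZE) < (SKETCH_PAGE_SIZE / 2);
}

/* Mask form, as suggested in the review. */
static inline bool sketch_in_lower_half_mask(uint64_t ptr)
{
	return (ptr & ~SKETCH_PAGE_MASK) < (SKETCH_PAGE_SIZE / 2);
}
```

Both return true for an entry pointer at page offset 0x7f8 (entry 255) and false at offset 0x800 (entry 256).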
> +
>  static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
>  {
> +#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
> +	/*
> +	 * set_pgd() does not get called when we are running
> +	 * CONFIG_X86_5LEVEL=y.  So, just hack around it.  We
> +	 * know here that we have a p4d but that it is really at
> +	 * the top level of the page tables; it is really just a
> +	 * pgd.
> +	 */
> +	/* Do we need to also populate the shadow p4d? */
> +	if (is_userspace_pgd(p4dp))
> +		native_get_shadow_p4d(p4dp)->pgd = p4d.pgd;

native_get_shadow_p4d() is kinda confusing, as it suggests that we get the
entry, not the pointer to it. native_get_shadow_p4d_ptr() is what it
actually wants to be, but a setter, e.g. native_set_shadow...() (we also
have set_pgd()), would be more obvious I think.

> +	/*
> +	 * Even if the entry is *mapping* userspace, ensure
> +	 * that userspace can not use it.  This way, if we
> +	 * get out to userspace with the wrong CR3 value,
> +	 * userspace will crash instead of running.
> +	 */
> +	if (!p4d.pgd.pgd)
> +		p4dp->pgd.pgd = p4d.pgd.pgd | _PAGE_NX;

Confused. Contrary to the comment this sets the NX bit on every non null
entry.

> +#else /* CONFIG_KAISER */
>  	*p4dp = p4d;
> +#endif
>  }

>  static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
>  {
>         memcpy(dst, src, count * sizeof(pgd_t));
> +#ifdef CONFIG_KAISER
> +	/* Clone the shadow pgd part as well */
> +	memcpy(native_get_shadow_pgd(dst),
> +	       native_get_shadow_pgd(src),
> +	       count * sizeof(pgd_t));

Nitpick: this fits in two lines

> +#endif
>  }

>  /*
>   * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
> @@ -128,6 +129,22 @@ void __init init_espfix_bsp(void)
>  	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
>  	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
>  	p4d_populate(&init_mm, p4d, espfix_pud_page);
> +	/*
> +	 * Just copy the top-level PGD that is mapping the espfix
> +	 * area to ensure it is mapped into the shadow user page
> +	 * tables.
> +	 *
> +	 * For 5-level paging, we should have already populated

should we have it populated or is it de facto populated?

> +	 * the espfix pgd when kaiser_init() pre-populated all
> +	 * the pgd entries.  The above p4d_alloc() would never do
> +	 * anything and the p4d_populate() would be done to a p4d
> +	 * already mapped in the userspace pgd.
> +	 */
> +#ifdef CONFIG_KAISER
> +	if (CONFIG_PGTABLE_LEVELS <= 4)
> +		set_pgd(native_get_shadow_pgd(pgd),
> +			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));

Nit: Please add curly braces on the first condition.

> +/*
> + * This "fakes" a #GP from userspace upon returning (iret'ing)
> + * from this double fault.
> + */
> +void setup_fake_gp_at_iret(struct pt_regs *regs)
> +{
> +	unsigned long *new_stack_top = (unsigned long *)
> +		(this_cpu_read(cpu_tss.x86_tss.ist[0]) - 0x1500);

0x1500? No magic numbers. Please use defines with a proper explanation.

> +	/*
> +	 * Set up a stack just like the hardware would for a #GP.
> +	 *
> +	 * This format is an "iret frame", plus the error code
> +	 * that the hardware puts on the stack for us for
> +	 * exceptions.  (see struct pt_regs).
> +	 */
> +	new_stack_top[-1] = regs->ss;
> +	new_stack_top[-2] = regs->sp;
> +	new_stack_top[-3] = regs->flags;
> +	new_stack_top[-4] = regs->cs;
> +	new_stack_top[-5] = regs->ip;
> +	new_stack_top[-6] = 0;	/* faked #GP error code */
> +
> +	/*
> +	 * 'regs' points to the "iret frame" for *this*
> +	 * exception, *not* the #GP we are faking.  Here,
> +	 * we are telling 'iret' to jump to general_protection
> +	 * when returning from this double fault.
> +	 */
> +	regs->ip = (unsigned long)general_protection;
> +	/*
> +	 * Make iret move the stack to the "fake #GP" stack
> +	 * we created above.
> +	 */
> +	regs->sp = (unsigned long)&new_stack_top[-6];
> +}
> +
>  #ifdef CONFIG_X86_64
>  /* Runs on IST stack */
>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
> @@ -354,14 +391,7 @@ dotraplinkage void do_double_fault(struc
>  		regs->cs == __KERNEL_CS &&
>  		regs->ip == (unsigned long)native_irq_return_iret)
>  	{
> -		struct pt_regs *normal_regs = task_pt_regs(current);
> -
> -		/* Fake a #GP(0) from userspace. */
> -		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
> -		normal_regs->orig_ax = 0;  /* Missing (lost) #GP error code */
> -		regs->ip = (unsigned long)general_protection;
> -		regs->sp = (unsigned long)&normal_regs->orig_ax;
> -
> +		setup_fake_gp_at_iret(regs);

Please split that out into a preparatory patch and explain the difference
between the original magic and the new one which puts the fake stack at
offset 0x1500.
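As a rough illustration of what a named offset plus a documented frame layout could look like, here is a userspace sketch. It only mirrors the word order used in the quoted setup_fake_gp_at_iret(); the define name, the struct and both helpers are hypothetical, and the sketch does not explain why 0x1500 was chosen.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; the quoted patch uses a bare 0x1500. */
#define SKETCH_FAKE_GP_STACK_OFFSET 0x1500ULL

/* The five values that make up an iret frame (illustrative subset of pt_regs). */
struct sketch_iret_regs {
	uint64_t ip;
	uint64_t cs;
	uint64_t flags;
	uint64_t sp;
	uint64_t ss;
};

/* Top of the fake #GP stack, derived from an IST stack top address. */
static inline uint64_t sketch_fake_gp_stack_top(uint64_t ist_top)
{
	assert(ist_top >= SKETCH_FAKE_GP_STACK_OFFSET);

	return ist_top - SKETCH_FAKE_GP_STACK_OFFSET;
}

/*
 * Lay out ss, sp, flags, cs, ip and a zero error code below 'top'.
 * Returns the address of the error code slot, i.e. the new stack pointer.
 */
static inline uint64_t *sketch_build_fake_gp_frame(uint64_t *top,
						   const struct sketch_iret_regs *regs)
{
	assert(top && regs);

	top[-1] = regs->ss;
	top[-2] = regs->sp;
	top[-3] = regs->flags;
	top[-4] = regs->cs;
	top[-5] = regs->ip;
	top[-6] = 0;	/* faked #GP error code */

	return &top[-6];
}
```

The frame occupies six 8-byte words, so the returned pointer is 48 bytes below 'top'.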

> +/*
> + * This is only for walking kernel addresses.  We use it too help

s/too/to/ ?

> + * recreate the "shadow" page tables which are used while we are in
> + * userspace.
> + *
> + * This can be called on any kernel memory addresses and will work
> + * with any page sizes and any types: normal linear map memory,
> + * vmalloc(), even kmap().
> + *
> + * Note: this is only used when mapping new *kernel* entries into
> + * the user/shadow page tables.  It is never used for userspace
> + * addresses.
> + *
> + * Returns -1 on error.
> + */
> +static inline unsigned long get_pa_from_kernel_map(unsigned long vaddr)
> +{
> +/*

> + * Walk the shadow copy of the page tables (optionally) trying to
> + * allocate page table pages on the way down.  Does not support
> + * large pages since the data we are mapping is (generally) not
> + * large enough or aligned to 2MB.
> + *
> + * Note: this is only used when mapping *new* kernel data into the
> + * user/shadow page tables.  It is never used for userspace data.
> + *
> + * Returns a pointer to a PTE on success, or NULL on failure.
> + */
> +#define KAISER_WALK_ATOMIC  0x1
> +static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
> +					   unsigned long flags)

Please do not glue defines right before the function definition. That's
really hard to read. That define is used at the callsite as well, so please
put that on top of the file.

> +{
> +	pte_t *pte;
> +	pmd_t *pmd;
> +	pud_t *pud;
> +	p4d_t *p4d;
> +	pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(address));
> +	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
> +
> +	if (flags & KAISER_WALK_ATOMIC) {
> +		gfp &= ~GFP_KERNEL;
> +		gfp |= __GFP_HIGH | __GFP_ATOMIC;
> +	}
> +
> +	if (address < PAGE_OFFSET) {
> +		WARN_ONCE(1, "attempt to walk user address\n");
> +		return NULL;
> +	}
> +
> +	if (pgd_none(*pgd)) {
> +		WARN_ONCE(1, "All shadow pgds should have been populated\n");
> +		return NULL;
> +	}
> +	BUILD_BUG_ON(pgd_large(*pgd) != 0);

So in get_pa_from_kernel_map() you use a WARN(). Here you use a
BUILD_BUG_ON(). Can we use one of those consistently, please?

> +	p4d = p4d_offset(pgd, address);
> +	BUILD_BUG_ON(p4d_large(*p4d) != 0);

> +/*
> + * Given a kernel address, @__start_addr, copy that mapping into
> + * the user (shadow) page tables.  This may need to allocate page
> + * table pages.
> + */
> +int kaiser_add_user_map(const void *__start_addr, unsigned long size,
> +			unsigned long flags)
> +{
> +	pte_t *pte;
> +	unsigned long start_addr = (unsigned long)__start_addr;
> +	unsigned long address = start_addr & PAGE_MASK;
> +	unsigned long end_addr = PAGE_ALIGN(start_addr + size);
> +	unsigned long target_address;
> +
> +	for (; address < end_addr; address += PAGE_SIZE) {
> +		target_address = get_pa_from_kernel_map(address);
> +		if (target_address == -1)
> +			return -EIO;
> +
> +		pte = kaiser_shadow_pagetable_walk(address, false);
> +		/*
> +		 * Errors come from either -ENOMEM for a page
> +		 * table page, or something screwy that did a
> +		 * WARN_ON().  Just return -ENOMEM.
> +		 */
> +		if (!pte)
> +			return -ENOMEM;
> +		if (pte_none(*pte)) {
> +			set_pte(pte, __pte(flags | target_address));
> +		} else {
> +			pte_t tmp;
> +			set_pte(&tmp, __pte(flags | target_address));
> +			WARN_ON_ONCE(!pte_same(*pte, tmp));

So the warning is here because these tables should only be populated once,
right? A comment to that effect would be helpful.

> +		}
> +	}
> +	return 0;
> +}
> +
> +int kaiser_add_user_map_ptrs(const void *__start_addr,
> +			     const void *__end_addr,
> +			     unsigned long flags)
> +{
> +	return kaiser_add_user_map(__start_addr,
> +				   __end_addr - __start_addr,
> +				   flags);
> +}
> +
> +/*
> + * Ensure that the top level of the (shadow) page tables are
> + * entirely populated.  This ensures that all processes that get
> + * forked have the same entries.  This way, we do not have to
> + * ever go set up new entries in older processes.
> + *
> + * Note: we never free these, so there are no updates to them
> + * after this.
> + */
> +static void __init kaiser_init_all_pgds(void)
> +{
> +	pgd_t *pgd;
> +	int i = 0;

Initializing i is pointless

> +
> +	pgd = native_get_shadow_pgd(pgd_offset_k(0UL));
> +	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
> +		unsigned long addr = PAGE_OFFSET + i * PGDIR_SIZE;

This looks wrong. The kernel address space gets incremented by PGDIR_SIZE
and does not make a jump from PAGE_OFFSET to PAGE_OFFSET + 256 * PGDIR_SIZE

	int i, j;

	for (i = PTRS_PER_PGD / 2, j = 0; i < PTRS_PER_PGD; i++, j++) {
		unsigned long addr = PAGE_OFFSET + j * PGDIR_SIZE;

Not that it has any effect right now. Neither p4d_alloc_one() nor
pud_alloc_one() are using the 'addr' argument.
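To make the jump concrete, the sketch below computes 'addr' for the first loop iteration both ways. The constants are assumptions for illustration (a 4-level layout with PAGE_OFFSET at 0xffff880000000000 and a 39-bit PGDIR_SHIFT), and the sketch_* helpers are hypothetical. With these values the index-based form wraps around 64 bits on the first iteration, while the offset-based form starts at PAGE_OFFSET. The sketch only covers the start of the loop and makes no claim about later iterations.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4-level x86-64 values, for illustration only. */
#define SKETCH_PAGE_OFFSET	0xffff880000000000ULL
#define SKETCH_PGDIR_SHIFT	39
#define SKETCH_PGDIR_SIZE	(1ULL << SKETCH_PGDIR_SHIFT)
#define SKETCH_PTRS_PER_PGD	512

/* 'addr' as computed in the quoted patch: scaled by the pgd index i. */
static inline uint64_t sketch_addr_from_index(unsigned int i)
{
	assert(i < SKETCH_PTRS_PER_PGD);

	return SKETCH_PAGE_OFFSET + (uint64_t)i * SKETCH_PGDIR_SIZE;
}

/* 'addr' as suggested in the review: scaled by a counter j that starts at 0. */
static inline uint64_t sketch_addr_from_counter(unsigned int j)
{
	assert(j < SKETCH_PTRS_PER_PGD / 2);

	return SKETCH_PAGE_OFFSET + (uint64_t)j * SKETCH_PGDIR_SIZE;
}
```

For i = PTRS_PER_PGD / 2 = 256 the first helper yields 0x0000080000000000, which is below PAGE_OFFSET; for j = 0 and j = 1 the second yields 0xffff880000000000 and 0xffff888000000000.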

> +#if CONFIG_PGTABLE_LEVELS > 4
> +		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
> +		if (!p4d) {
> +			WARN_ON(1);
> +			break;
> +		}
> +		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
> +#else /* CONFIG_PGTABLE_LEVELS <= 4 */
> +		pud_t *pud = pud_alloc_one(&init_mm, addr);
> +		if (!pud) {
> +			WARN_ON(1);
> +			break;
> +		}
> +		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
> +#endif /* CONFIG_PGTABLE_LEVELS */
> +	}
> +}
> +
> +/*
> + * The page table allocations in here can theoretically fail, but
> + * we can not do much about it in early boot.  Do the checking
> + * and warning in a macro to make it more readable.
> + */
> +#define kaiser_add_user_map_early(start, size, flags) do {	\
> +	int __ret = kaiser_add_user_map(start, size, flags);	\
> +	WARN_ON(__ret);						\
> +} while (0)
> +
> +#define kaiser_add_user_map_ptrs_early(start, end, flags) do {		\
> +	int __ret = kaiser_add_user_map_ptrs(start, end, flags);	\
> +	WARN_ON(__ret);							\
> +} while (0)

Any reason why this cannot be an inline?
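A sketch of the inline variant, in userspace C with stand-ins: sketch_add_user_map() is a hypothetical replacement for kaiser_add_user_map(), and a counter plus fprintf() stands in for WARN_ON(). One trade-off to keep in mind is that WARN_ON() records the file and line where it is expanded, so a macro reports the caller's location while an inline function reports its own.

```c
#include <assert.h>
#include <stdio.h>

/* Counts how often the stand-in warning fired. */
static int sketch_warn_count;

/* Hypothetical stand-in for kaiser_add_user_map(): 0 on success, negative on failure. */
static int sketch_add_user_map(const void *start, unsigned long size,
			       unsigned long flags)
{
	(void)flags;
	assert(size > 0);

	return start ? 0 : -12;
}

/* Inline replacement for the early-boot macro: check the result and warn. */
static inline void sketch_add_user_map_early(const void *start,
					     unsigned long size,
					     unsigned long flags)
{
	int ret = sketch_add_user_map(start, size, flags);

	if (ret) {
		sketch_warn_count++;
		fprintf(stderr, "early shadow mapping failed: %d\n", ret);
	}
}
```

Unlike the macro, the inline version type-checks its arguments and evaluates each of them exactly once.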

> +void kaiser_remove_mapping(unsigned long start, unsigned long size)
> +{
> +	unsigned long addr;
> +
> +	/* The shadow page tables always use small pages: */
> +	for (addr = start; addr < start + size; addr += PAGE_SIZE) {
> +		/*
> +		 * Do an "atomic" walk in case this got called from an atomic
> +		 * context.  This should not do any allocations because we
> +		 * should only be walking things that are known to be mapped.
> +		 */
> +		pte_t *pte = kaiser_shadow_pagetable_walk(addr, KAISER_WALK_ATOMIC);
> +
> +		/*
> +		 * We are removing a mapping that should
> +		 * exist.  WARN if it was not there:
> +		 */
> +		if (!pte) {
> +			WARN_ON_ONCE(1);
> +			continue;
> +		}
> +
> +		pte_clear(&init_mm, addr, pte);
> +	}
> +	/*
> +	 * This ensures that the TLB entries used to map this data are
> +	 * no longer usable on *this* CPU.  We theoretically want to
> +	 * flush the entries on all CPUs here, but that's too
> +	 * expensive right now: this is called to unmap process
> +	 * stacks in the exit() path path.

s/path path/path/

> +	 *
> +	 * This can change if we get to the point where this is not
> +	 * in a remotely hot path, like only called via write_ldt().
> +	 *
> +	 * Note: we could probably also just invalidate the individual
> +	 * addresses to take care of *this* PCID and then do a
> +	 * tlb_flush_shared_nonglobals() to ensure that all other
> +	 * PCIDs get flushed before being used again.
> +	 */
> +	__native_flush_tlb_global();
> +}

> --- a/arch/x86/mm/pageattr.c~kaiser-base	2017-11-10 11:22:09.020244950 -0800
> +++ b/arch/x86/mm/pageattr.c	2017-11-10 11:22:09.035244950 -0800
> @@ -859,7 +859,7 @@ static void unmap_pmd_range(pud_t *pud,
>  			pud_clear(pud);
>  }
>  
> -static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
> +void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)

Should go into a preparatory patch.

>  {
>  	pud_t *pud = pud_offset(p4d, start);
>  

> diff -puN /dev/null Documentation/x86/kaiser.txt
> --- /dev/null	2017-11-06 07:51:38.702108459 -0800
> +++ b/Documentation/x86/kaiser.txt	2017-11-10 11:22:09.035244950 -0800
> @@ -0,0 +1,160 @@
> +Overview
> +========
> +
> +KAISER is a countermeasure against attacks on kernel address
> +information.  There are at least three existing, published,
> +approaches using the shared user/kernel mapping and hardware features
> +to defeat KASLR.  One approach referenced in the paper locates the
> +kernel by observing differences in page fault timing between
> +present-but-inaccessable kernel pages and non-present pages.
> +
> +When we enter the kernel via syscalls, interrupts or exceptions,

When the kernel is entered via ...

> +page tables are switched to the full "kernel" copy.  When the
> +system switches back to user mode, the user/shadow copy is used.
> +
> +The minimalistic kernel portion of the user page tables try to
> +map only what is needed to enter/exit the kernel such as the
> +entry/exit functions themselves and the interrupt descriptor
> +table (IDT).

s/try to//

> +This helps ensure that side-channel attacks that leverage the

helps to ensure

> +paging structures do not function when KAISER is enabled.  It
> +can be enabled by setting CONFIG_KAISER=y
> +
> +Page Table Management
> +=====================
> +
> +KAISER logically keeps a "copy" of the page tables which unmap
> +the kernel while in userspace.  The kernel manages the page
> +tables as normal, but the "copying" is done with a few tricks
> +that mean that we do not have to manage two full copies.
> +The first trick is that for any any new kernel mapping, we
> +presume that we do not want it mapped to userspace.  That means
> +we normally have no copying to do.  We only copy the kernel
> +entries over to the shadow in response to a kaiser_add_*()
> +call which is rare.

 When KAISER is enabled, the kernel manages two page tables for the kernel
 mappings: the regular page table, which is used while executing in kernel
 space, and a shadow copy, which contains only the mapping entries that are
 required for the kernel-userspace transition. These mappings have to be
 copied into the shadow page tables explicitly with the kaiser_add_*()
 functions.

Hmm?

> +For a new userspace mapping, the kernel makes the entries in
> +its page tables like normal.  The only difference is when the
> +kernel makes entries in the top (PGD) level.  In addition to
> +setting the entry in the main kernel PGD, a copy if the entry
> +is made in the shadow PGD.
> +PGD entries always point to another page table.  Two PGD
> +entries pointing to the same thing gives us shared page tables
> +for all the lower entries.  This leaves a single, shared set of
> +userspace page tables to manage.  One PTE to lock, one set set
> +of accessed bits, dirty bits, etc...

  For user space mappings the kernel creates an entry in the kernel PGD and
  the same entry in the shadow PGD, so the underlying page table to which
  the PGD entry points is shared down to the PTE level. This leaves a
  single, shared set of userspace page tables to manage.  One PTE to
  lock, one set of accessed bits, dirty bits, etc...
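The sharing described above can be modelled in a few lines of userspace C. This is a toy model under stated assumptions, not the kernel's data structures: top-level entries are plain pointers, a single struct stands in for all lower levels, and every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PTRS_PER_PGD 512

/* Stands in for everything below the top level (down to the PTEs). */
struct sketch_lower_table {
	uint64_t pte[512];
};

/* Kernel and shadow top levels, modelled as two arrays of pointers. */
struct sketch_pgd_pair {
	struct sketch_lower_table *kernel[SKETCH_PTRS_PER_PGD];
	struct sketch_lower_table *shadow[SKETCH_PTRS_PER_PGD];
};

/* Install a userspace entry: the same lower table goes into both top levels. */
static inline void sketch_set_user_pgd(struct sketch_pgd_pair *pair,
				       unsigned int idx,
				       struct sketch_lower_table *table)
{
	assert(idx < SKETCH_PTRS_PER_PGD / 2);	/* lower half maps userspace */

	pair->kernel[idx] = table;
	pair->shadow[idx] = table;
}
```

Because both entries point at the same table, a PTE update made through the kernel copy is immediately visible through the shadow copy, which is why only one set of lower-level tables needs to be managed.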

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 08/30] x86, kaiser: unmap kernel from userspace page tables (core patch)
@ 2017-11-20 17:21     ` Thomas Gleixner
  0 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-20 17:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, richard.fellner, moritz.lipp,
	daniel.gruss, michael.schwarz, luto, torvalds, keescook, hughd,
	x86

On Fri, 10 Nov 2017, Dave Hansen wrote:
> diff -puN arch/x86/entry/entry_64.S~kaiser-base arch/x86/entry/entry_64.S
> --- a/arch/x86/entry/entry_64.S~kaiser-base	2017-11-10 11:22:09.007244950 -0800
> +++ b/arch/x86/entry/entry_64.S	2017-11-10 11:22:09.031244950 -0800
> @@ -145,6 +145,16 @@ ENTRY(entry_SYSCALL_64)
>  
>  	swapgs
>  	movq	%rsp, PER_CPU_VAR(rsp_scratch)
> +
> +	/*
> +	 * We need a good kernel CR3 to be able to map the process
> +	 * stack, but we need a scratch register to be able to load
> +	 * CR3.  We could create another PER_CPU_VAR(), but %rsp is
> +	 * actually clobberable right now.  Just use it.  It will only
> +	 * be insane for one a couple instructions.
> +	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp

Shouldn't this be in the patch which introduces all that SWITCH macro magic?

>  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>  
>  	/* Construct struct pt_regs on stack */
> @@ -169,8 +179,6 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>  
>  	/* NB: right here, all regs except r11 are live. */

Stale comment

>  
> -	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
> -
>  	/* Must wait until we have the kernel CR3 to call C functions: */
>  	TRACE_IRQS_OFF
>  
> @@ -1269,6 +1277,7 @@ ENTRY(error_entry)
>  	 * gsbase and proceed.  We'll fix up the exception and land in
>  	 * .Lgs_change's error handler with kernel gsbase.
>  	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax

See above.

>  	SWAPGS
>  	jmp .Lerror_entry_done
>  
> @@ -1382,6 +1391,7 @@ ENTRY(nmi)
>  
>  	swapgs
>  	cld
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
>  	movq	%rsp, %rdx
>  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>  	UNWIND_HINT_IRET_REGS base=%rdx offset=8
> @@ -1410,7 +1420,6 @@ ENTRY(nmi)
>  	UNWIND_HINT_REGS
>  	ENCODE_FRAME_POINTER
>  
> -	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi

Ditto

> +#ifdef CONFIG_KAISER
> +/*
> + * All top-level KAISER page tables are order-1 pages (8k-aligned
> + * and 8k in size).  The kernel one is at the beginning 4k and
> + * the user (shadow) one is in the last 4k.  To switch between
> + * them, you just need to flip the 12th bit in their addresses.
> + */
> +#define KAISER_PGTABLE_SWITCH_BIT	PAGE_SHIFT
> +
> +/*
> + * This generates better code than the inline assembly in
> + * __set_bit().
> + */
> +static inline void *ptr_set_bit(void *ptr, int bit)
> +{
> +	unsigned long __ptr = (unsigned long)ptr;

Newline between declaration and code please.

> +	__ptr |= (1<<bit);

  __ptr |= 1UL << bit;

> +	return (void *)__ptr;
> +}
> +static inline void *ptr_clear_bit(void *ptr, int bit)
> +{
> +	unsigned long __ptr = (unsigned long)ptr;
> +	__ptr &= ~(1<<bit);
> +	return (void *)__ptr;

Ditto

> +}

> +/*
> + * Page table pages are page-aligned.  The lower half of the top
> + * level is used for userspace and the top half for the kernel.
> + * This returns true for user pages that need to get copied into
> + * both the user and kernel copies of the page tables, and false
> + * for kernel pages that should only be in the kernel copy.
> + */
> +static inline bool is_userspace_pgd(void *__ptr)
> +{
> +	unsigned long ptr = (unsigned long)__ptr;
> +
> +	return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2));

The outer brackets are not required and the obvious way to write that is:

  	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);

I guess the compiler is smart enought to figure that out itself, but ...

> +}
> +
>  static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
>  {
> +#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
> +	/*
> +	 * set_pgd() does not get called when we are running
> +	 * CONFIG_X86_5LEVEL=y.  So, just hack around it.  We
> +	 * know here that we have a p4d but that it is really at
> +	 * the top level of the page tables; it is really just a
> +	 * pgd.
> +	 */
> +	/* Do we need to also populate the shadow p4d? */
> +	if (is_userspace_pgd(p4dp))
> +		native_get_shadow_p4d(p4dp)->pgd = p4d.pgd;

native_get_shadow_p4d() is kinda confusing, as it suggest that we get the
entry not the pointer to it. native_get_shadow_p4d_ptr() is what it
actually wants to be, but a setter e.g. native_set_shadow...(), we also
have set_pgd() would be more obvious I think.

> +	/*
> +	 * Even if the entry is *mapping* userspace, ensure
> +	 * that userspace can not use it.  This way, if we
> +	 * get out to userspace with the wrong CR3 value,
> +	 * userspace will crash instead of running.
> +	 */
> +	if (!p4d.pgd.pgd)
> +		p4dp->pgd.pgd = p4d.pgd.pgd | _PAGE_NX;

Confused. Contrary to the comment this sets the NX bit on every non null
entry.

> +#else /* CONFIG_KAISER */
>  	*p4dp = p4d;
> +#endif
>  }

>  static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
>  {
>         memcpy(dst, src, count * sizeof(pgd_t));
> +#ifdef CONFIG_KAISER
> +	/* Clone the shadow pgd part as well */
> +	memcpy(native_get_shadow_pgd(dst),
> +	       native_get_shadow_pgd(src),
> +	       count * sizeof(pgd_t));

Nitpick: this fits in two lines

> +#endif
>  }

>  /*
>   * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
> @@ -128,6 +129,22 @@ void __init init_espfix_bsp(void)
>  	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
>  	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
>  	p4d_populate(&init_mm, p4d, espfix_pud_page);
> +	/*
> +	 * Just copy the top-level PGD that is mapping the espfix
> +	 * area to ensure it is mapped into the shadow user page
> +	 * tables.
> +	 *
> +	 * For 5-level paging, we should have already populated

should we have it populated or is it de facto populated?

> +	 * the espfix pgd when kaiser_init() pre-populated all
> +	 * the pgd entries.  The above p4d_alloc() would never do
> +	 * anything and the p4d_populate() would be done to a p4d
> +	 * already mapped in the userspace pgd.
> +	 */
> +#ifdef CONFIG_KAISER
> +	if (CONFIG_PGTABLE_LEVELS <= 4)
> +		set_pgd(native_get_shadow_pgd(pgd),
> +			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));

Nit: Please add curly braces on the first condition.

> +/*
> + * This "fakes" a #GP from userspace upon returning (iret'ing)
> + * from this double fault.
> + */
> +void setup_fake_gp_at_iret(struct pt_regs *regs)
> +{
> +	unsigned long *new_stack_top = (unsigned long *)
> +		(this_cpu_read(cpu_tss.x86_tss.ist[0]) - 0x1500);

0x1500? No magic numbers. Please use defines with a proper explanation.

> +	/*
> +	 * Set up a stack just like the hardware would for a #GP.
> +	 *
> +	 * This format is an "iret frame", plus the error code
> +	 * that the hardware puts on the stack for us for
> +	 * exceptions.  (see struct pt_regs).
> +	 */
> +	new_stack_top[-1] = regs->ss;
> +	new_stack_top[-2] = regs->sp;
> +	new_stack_top[-3] = regs->flags;
> +	new_stack_top[-4] = regs->cs;
> +	new_stack_top[-5] = regs->ip;
> +	new_stack_top[-6] = 0;	/* faked #GP error code */
> +
> +	/*
> +	 * 'regs' points to the "iret frame" for *this*
> +	 * exception, *not* the #GP we are faking.  Here,
> +	 * we are telling 'iret' to jump to general_protection
> +	 * when returning from this double fault.
> +	 */
> +	regs->ip = (unsigned long)general_protection;
> +	/*
> +	 * Make iret move the stack to the "fake #GP" stack
> +	 * we created above.
> +	 */
> +	regs->sp = (unsigned long)&new_stack_top[-6];
> +}
> +
>  #ifdef CONFIG_X86_64
>  /* Runs on IST stack */
>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
> @@ -354,14 +391,7 @@ dotraplinkage void do_double_fault(struc
>  		regs->cs == __KERNEL_CS &&
>  		regs->ip == (unsigned long)native_irq_return_iret)
>  	{
> -		struct pt_regs *normal_regs = task_pt_regs(current);
> -
> -		/* Fake a #GP(0) from userspace. */
> -		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
> -		normal_regs->orig_ax = 0;  /* Missing (lost) #GP error code */
> -		regs->ip = (unsigned long)general_protection;
> -		regs->sp = (unsigned long)&normal_regs->orig_ax;
> -
> +		setup_fake_gp_at_iret(regs);

Please split that out into a preparatory patch and explain the difference
between the original magic and the new one which puts the fake stake at
offset 0x1500.

> +/*
> + * This is only for walking kernel addresses.  We use it too help

s/too/to/ ?

> + * recreate the "shadow" page tables which are used while we are in
> + * userspace.
> + *
> + * This can be called on any kernel memory addresses and will work
> + * with any page sizes and any types: normal linear map memory,
> + * vmalloc(), even kmap().
> + *
> + * Note: this is only used when mapping new *kernel* entries into
> + * the user/shadow page tables.  It is never used for userspace
> + * addresses.
> + *
> + * Returns -1 on error.
> + */
> +static inline unsigned long get_pa_from_kernel_map(unsigned long vaddr)
> +{
> +/*

> + * Walk the shadow copy of the page tables (optionally) trying to
> + * allocate page table pages on the way down.  Does not support
> + * large pages since the data we are mapping is (generally) not
> + * large enough or aligned to 2MB.
> + *
> + * Note: this is only used when mapping *new* kernel data into the
> + * user/shadow page tables.  It is never used for userspace data.
> + *
> + * Returns a pointer to a PTE on success, or NULL on failure.
> + */
> +#define KAISER_WALK_ATOMIC  0x1
> +static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
> +					   unsigned long flags)

Please do not glue defines right before the function definition. That's
really hard to read. That define is used at the callsite as well, so please
put that on top of the file.

> +{
> +	pte_t *pte;
> +	pmd_t *pmd;
> +	pud_t *pud;
> +	p4d_t *p4d;
> +	pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(address));
> +	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
> +
> +	if (flags & KAISER_WALK_ATOMIC) {
> +		gfp &= ~GFP_KERNEL;
> +		gfp |= __GFP_HIGH | __GFP_ATOMIC;
> +	}
> +
> +	if (address < PAGE_OFFSET) {
> +		WARN_ONCE(1, "attempt to walk user address\n");
> +		return NULL;
> +	}
> +
> +	if (pgd_none(*pgd)) {
> +		WARN_ONCE(1, "All shadow pgds should have been populated\n");
> +		return NULL;
> +	}
> +	BUILD_BUG_ON(pgd_large(*pgd) != 0);

So in get_pa_from_kernel_map() you use a WARN(). Here you use a
BUILD_BUG_ON(). Can we use one of those consistently, please?

> +	p4d = p4d_offset(pgd, address);
> +	BUILD_BUG_ON(p4d_large(*p4d) != 0);

> +/*
> + * Given a kernel address, @__start_addr, copy that mapping into
> + * the user (shadow) page tables.  This may need to allocate page
> + * table pages.
> + */
> +int kaiser_add_user_map(const void *__start_addr, unsigned long size,
> +			unsigned long flags)
> +{
> +	pte_t *pte;
> +	unsigned long start_addr = (unsigned long)__start_addr;
> +	unsigned long address = start_addr & PAGE_MASK;
> +	unsigned long end_addr = PAGE_ALIGN(start_addr + size);
> +	unsigned long target_address;
> +
> +	for (; address < end_addr; address += PAGE_SIZE) {
> +		target_address = get_pa_from_kernel_map(address);
> +		if (target_address == -1)
> +			return -EIO;
> +
> +		pte = kaiser_shadow_pagetable_walk(address, false);
> +		/*
> +		 * Errors come from either -ENOMEM for a page
> +		 * table page, or something screwy that did a
> +		 * WARN_ON().  Just return -ENOMEM.
> +		 */
> +		if (!pte)
> +			return -ENOMEM;
> +		if (pte_none(*pte)) {
> +			set_pte(pte, __pte(flags | target_address));
> +		} else {
> +			pte_t tmp;
> +			set_pte(&tmp, __pte(flags | target_address));
> +			WARN_ON_ONCE(!pte_same(*pte, tmp));

So the warning is here because these tables should only be populated once,
right? A comment to that effect would be helpful.
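
As a rough illustration of the populate-once invariant being asked about,
here is a userspace sketch. The pte_t stand-in, the warnings counter and
shadow_set_pte_once() are hypothetical names made up for this example and
are not code from the patch; the counter merely stands in for
WARN_ON_ONCE().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's pte_t (assumption for this sketch). */
typedef struct { uint64_t pte; } pte_t;

/* Counts mismatches; stands in for WARN_ON_ONCE() in the patch. */
static int warnings;

/*
 * Hypothetical helper: install @val into an empty shadow PTE.
 *
 * Shadow entries are expected to be populated only once, so a later
 * call may only repeat the value that is already there.  Anything else
 * is counted as a warning and the existing entry is left untouched.
 */
static void shadow_set_pte_once(pte_t *ptep, uint64_t val)
{
	assert(ptep != NULL);

	if (ptep->pte == 0) {
		ptep->pte = val;
		return;
	}
	if (ptep->pte != val)
		warnings++;
}
```

Repeating the same value is silent in this sketch; only a conflicting
second value bumps the counter.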

> +		}
> +	}
> +	return 0;
> +}
> +
> +int kaiser_add_user_map_ptrs(const void *__start_addr,
> +			     const void *__end_addr,
> +			     unsigned long flags)
> +{
> +	return kaiser_add_user_map(__start_addr,
> +				   __end_addr - __start_addr,
> +				   flags);
> +}
> +
> +/*
> + * Ensure that the top level of the (shadow) page tables are
> + * entirely populated.  This ensures that all processes that get
> + * forked have the same entries.  This way, we do not have to
> + * ever go set up new entries in older processes.
> + *
> + * Note: we never free these, so there are no updates to them
> + * after this.
> + */
> +static void __init kaiser_init_all_pgds(void)
> +{
> +	pgd_t *pgd;
> +	int i = 0;

Initializing i is pointless

> +
> +	pgd = native_get_shadow_pgd(pgd_offset_k(0UL));
> +	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
> +		unsigned long addr = PAGE_OFFSET + i * PGDIR_SIZE;

This looks wrong. The kernel address space gets incremented by PGDIR_SIZE
and does not make a jump from PAGE_OFFSET to PAGE_OFFSET + 256 * PGDIR_SIZE:

	int i, j;

	for (i = PTRS_PER_PGD / 2, j = 0; i < PTRS_PER_PGD; i++, j++) {
		unsigned long addr = PAGE_OFFSET + j * PGDIR_SIZE;

Not that it has any effect right now. Neither p4d_alloc_one() nor
pud_alloc_one() are using the 'addr' argument.
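
To make the arithmetic concrete, a small userspace sketch of the two
offset computations follows. The constants assume 4-level x86-64 paging
(512 PGD entries, PGDIR_SHIFT of 39); they and the two helper names are
assumptions for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4-level x86-64 values; not taken from the patch itself. */
#define PTRS_PER_PGD	512
#define PGDIR_SHIFT	39
#define PGDIR_SIZE	(UINT64_C(1) << PGDIR_SHIFT)

/* Offset added to PAGE_OFFSET by the loop as posted (uses i directly). */
static uint64_t offset_as_posted(int i)
{
	assert(i >= PTRS_PER_PGD / 2 && i < PTRS_PER_PGD);
	return (uint64_t)i * PGDIR_SIZE;
}

/* Offset added to PAGE_OFFSET when a separate counter j starts at 0. */
static uint64_t offset_with_j(int i)
{
	int j = i - PTRS_PER_PGD / 2;

	assert(j >= 0 && j < PTRS_PER_PGD / 2);
	return (uint64_t)j * PGDIR_SIZE;
}
```

For the first kernel slot (i = 256) the posted expression already adds
256 * PGDIR_SIZE, i.e. 1 << 47, while the j-based form starts at offset
0. Both advance by PGDIR_SIZE per iteration.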

> +#if CONFIG_PGTABLE_LEVELS > 4
> +		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
> +		if (!p4d) {
> +			WARN_ON(1);
> +			break;
> +		}
> +		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
> +#else /* CONFIG_PGTABLE_LEVELS <= 4 */
> +		pud_t *pud = pud_alloc_one(&init_mm, addr);
> +		if (!pud) {
> +			WARN_ON(1);
> +			break;
> +		}
> +		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
> +#endif /* CONFIG_PGTABLE_LEVELS */
> +	}
> +}
> +
> +/*
> + * The page table allocations in here can theoretically fail, but
> + * we can not do much about it in early boot.  Do the checking
> + * and warning in a macro to make it more readable.
> + */
> +#define kaiser_add_user_map_early(start, size, flags) do {	\
> +	int __ret = kaiser_add_user_map(start, size, flags);	\
> +	WARN_ON(__ret);						\
> +} while (0)
> +
> +#define kaiser_add_user_map_ptrs_early(start, end, flags) do {		\
> +	int __ret = kaiser_add_user_map_ptrs(start, end, flags);	\
> +	WARN_ON(__ret);							\
> +} while (0)

Any reason why this cannot be an inline?
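
For reference, an inline variant could look roughly like the sketch
below. It is built for userspace: warn_on()/WARN_ON(), warn_count,
stub_ret and the body of kaiser_add_user_map() are stand-ins invented
for the example, not the kernel implementations.

```c
#include <stdio.h>

/* Userspace stand-in for the kernel's WARN_ON() (assumption). */
static int warn_count;

static int warn_on(int cond, const char *file, int line)
{
	if (cond) {
		warn_count++;
		fprintf(stderr, "WARNING at %s:%d\n", file, line);
	}
	return cond;
}
#define WARN_ON(cond) warn_on(!!(cond), __FILE__, __LINE__)

/* Stub for kaiser_add_user_map(); returns whatever stub_ret holds. */
static int stub_ret;

static int kaiser_add_user_map(const void *start, unsigned long size,
			       unsigned long flags)
{
	(void)start;
	(void)size;
	(void)flags;
	return stub_ret;
}

/* The macro from the patch, rewritten as a type-checked inline. */
static inline void kaiser_add_user_map_early(const void *start,
					     unsigned long size,
					     unsigned long flags)
{
	int ret = kaiser_add_user_map(start, size, flags);

	WARN_ON(ret);
}
```

One difference worth keeping in mind: in this sketch the warning
reports the file and line of the inline helper rather than of its
caller, which may or may not matter for the early boot call sites.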

> +void kaiser_remove_mapping(unsigned long start, unsigned long size)
> +{
> +	unsigned long addr;
> +
> +	/* The shadow page tables always use small pages: */
> +	for (addr = start; addr < start + size; addr += PAGE_SIZE) {
> +		/*
> +		 * Do an "atomic" walk in case this got called from an atomic
> +		 * context.  This should not do any allocations because we
> +		 * should only be walking things that are known to be mapped.
> +		 */
> +		pte_t *pte = kaiser_shadow_pagetable_walk(addr, KAISER_WALK_ATOMIC);
> +
> +		/*
> +		 * We are removing a mapping that should
> +		 * exist.  WARN if it was not there:
> +		 */
> +		if (!pte) {
> +			WARN_ON_ONCE(1);
> +			continue;
> +		}
> +
> +		pte_clear(&init_mm, addr, pte);
> +	}
> +	/*
> +	 * This ensures that the TLB entries used to map this data are
> +	 * no longer usable on *this* CPU.  We theoretically want to
> +	 * flush the entries on all CPUs here, but that's too
> +	 * expensive right now: this is called to unmap process
> +	 * stacks in the exit() path path.

s/path path/path/

> +	 *
> +	 * This can change if we get to the point where this is not
> +	 * in a remotely hot path, like only called via write_ldt().
> +	 *
> +	 * Note: we could probably also just invalidate the individual
> +	 * addresses to take care of *this* PCID and then do a
> +	 * tlb_flush_shared_nonglobals() to ensure that all other
> +	 * PCIDs get flushed before being used again.
> +	 */
> +	__native_flush_tlb_global();
> +}

> --- a/arch/x86/mm/pageattr.c~kaiser-base	2017-11-10 11:22:09.020244950 -0800
> +++ b/arch/x86/mm/pageattr.c	2017-11-10 11:22:09.035244950 -0800
> @@ -859,7 +859,7 @@ static void unmap_pmd_range(pud_t *pud,
>  			pud_clear(pud);
>  }
>  
> -static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
> +void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)

Should go into a preparatory patch.

>  {
>  	pud_t *pud = pud_offset(p4d, start);
>  

> diff -puN /dev/null Documentation/x86/kaiser.txt
> --- /dev/null	2017-11-06 07:51:38.702108459 -0800
> +++ b/Documentation/x86/kaiser.txt	2017-11-10 11:22:09.035244950 -0800
> @@ -0,0 +1,160 @@
> +Overview
> +========
> +
> +KAISER is a countermeasure against attacks on kernel address
> +information.  There are at least three existing, published,
> +approaches using the shared user/kernel mapping and hardware features
> +to defeat KASLR.  One approach referenced in the paper locates the
> +kernel by observing differences in page fault timing between
> +present-but-inaccessable kernel pages and non-present pages.
> +
> +When we enter the kernel via syscalls, interrupts or exceptions,

When the kernel is entered via ...

> +page tables are switched to the full "kernel" copy.  When the
> +system switches back to user mode, the user/shadow copy is used.
> +
> +The minimalistic kernel portion of the user page tables try to
> +map only what is needed to enter/exit the kernel such as the
> +entry/exit functions themselves and the interrupt descriptor
> +table (IDT).

s/try to//

> +This helps ensure that side-channel attacks that leverage the

helps to ensure

> +paging structures do not function when KAISER is enabled.  It
> +can be enabled by setting CONFIG_KAISER=y
> +
> +Page Table Management
> +=====================
> +
> +KAISER logically keeps a "copy" of the page tables which unmap
> +the kernel while in userspace.  The kernel manages the page
> +tables as normal, but the "copying" is done with a few tricks
> +that mean that we do not have to manage two full copies.
> +The first trick is that for any any new kernel mapping, we
> +presume that we do not want it mapped to userspace.  That means
> +we normally have no copying to do.  We only copy the kernel
> +entries over to the shadow in response to a kaiser_add_*()
> +call which is rare.

 When KAISER is enabled the kernel manages two page tables for the kernel
 mappings: the regular page table, which is used while executing in kernel
 space, and a shadow copy, which only contains the mapping entries
 required for the kernel-userspace transition. These mappings have to be
 copied into the shadow page tables explicitly with the kaiser_add_*()
 functions.

Hmm?

> +For a new userspace mapping, the kernel makes the entries in
> +its page tables like normal.  The only difference is when the
> +kernel makes entries in the top (PGD) level.  In addition to
> +setting the entry in the main kernel PGD, a copy if the entry
> +is made in the shadow PGD.
> +PGD entries always point to another page table.  Two PGD
> +entries pointing to the same thing gives us shared page tables
> +for all the lower entries.  This leaves a single, shared set of
> +userspace page tables to manage.  One PTE to lock, one set set
> +of accessed bits, dirty bits, etc...

  For user space mappings the kernel creates an entry in the kernel PGD and
  the same entry in the shadow PGD, so the underlying page table to which
  the PGD entry points is shared down to the PTE level. This leaves a
  single, shared set of userspace page tables to manage.  One PTE to
  lock, one set of accessed bits, dirty bits, etc...
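
The sharing described here can be sketched in userspace as two
top-level arrays whose entries point at the same lower-level table.
Everything below the PGD is collapsed into a single table for brevity,
and lower_table_t and set_user_pgd() are hypothetical names for this
example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PGD 512	/* assumed 4-level x86-64 value */

/* All levels below the PGD collapsed into one table for brevity. */
typedef struct { uint64_t pte[512]; } lower_table_t;

/* Two top-level tables: the kernel's and the user/shadow copy. */
static lower_table_t *kernel_pgd[PTRS_PER_PGD];
static lower_table_t *shadow_pgd[PTRS_PER_PGD];

/* Hypothetical helper: a userspace PGD entry is written to both copies. */
static void set_user_pgd(int idx, lower_table_t *table)
{
	/* Only the lower half of the PGD maps userspace. */
	assert(idx >= 0 && idx < PTRS_PER_PGD / 2);
	assert(table != NULL);

	kernel_pgd[idx] = table;
	shadow_pgd[idx] = table;
}
```

After set_user_pgd(), a PTE written through kernel_pgd[idx] is the very
same PTE seen through shadow_pgd[idx], so there is only one copy to
lock and one set of accessed/dirty bits.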

Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 09/30] x86, kaiser: only populate shadow page tables for userspace
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-20 20:12     ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-20 20:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

On Fri, 10 Nov 2017, Dave Hansen wrote:

This should be folded into the previous patch.

>  b/arch/x86/include/asm/pgtable_64.h |   94 +++++++++++++++++++++++-------------
>  1 file changed, 61 insertions(+), 33 deletions(-)
> 
> diff -puN arch/x86/include/asm/pgtable_64.h~kaiser-set-pgd-careful-plus-NX arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h~kaiser-set-pgd-careful-plus-NX	2017-11-10 11:22:09.932244947 -0800
> +++ b/arch/x86/include/asm/pgtable_64.h	2017-11-10 11:22:09.935244947 -0800
> @@ -177,38 +177,76 @@ static inline p4d_t *native_get_normal_p
>  /*
>   * Page table pages are page-aligned.  The lower half of the top
>   * level is used for userspace and the top half for the kernel.
> - * This returns true for user pages that need to get copied into
> - * both the user and kernel copies of the page tables, and false
> - * for kernel pages that should only be in the kernel copy.
> + *
> + * Returns true for parts of the PGD that map userspace and
> + * false for the parts that map the kernel.
>   */
> -static inline bool is_userspace_pgd(void *__ptr)
> +static inline bool pgdp_maps_userspace(void *__ptr)
>  {
>  	unsigned long ptr = (unsigned long)__ptr;
>  
>  	return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2));
>  }
>  
> +/*
> + * Does this PGD allow access via userspace?

s/via/from/

> + */
> +static inline bool pgd_userspace_access(pgd_t pgd)
> +{
> +	return (pgd.pgd & _PAGE_USER);
> +}
> +
> +/*
> + * Returns the pgd_t that the kernel should use in its page tables.

Should? Can the caller still decide to put something different there? I
doubt that.

> +static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
> +{
> +#ifdef CONFIG_KAISER
> +	if (pgd_userspace_access(pgd)) {
> +		if (pgdp_maps_userspace(pgdp)) {
> +			/*
> +			 * The user/shadow page tables get the full
> +			 * PGD, accessible to userspace:

s/to/from/

> +			 */
> +			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
> +			/*
> +			 * For the copy of the pgd that the kernel
> +			 * uses, make it unusable to userspace.  This
> +			 * ensures if we get out to userspace with the
> +			 * wrong CR3 value, userspace will crash
> +			 * instead of running.
> +			 */
> +			pgd.pgd |= _PAGE_NX;
> +		}
> +	} else if (!pgd.pgd) {
> +		/*
> +		 * We are clearing the PGD and can not check  _PAGE_USER
> +		 * in the zero'd PGD.

Just the argument cannot be checked because it's clearing the entry. The
pgd entry itself is not yet modified, so it could be checked.

> +		 * We never do this on the
> +		 * pre-populated kernel PGDs, except for pgd_bad().
> +		 */
> +		if (pgdp_maps_userspace(pgdp)) {
> +			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
> +		} else {
> +			/*
> +			 * Uh, we are very confused.  We have been
> +			 * asked to clear a PGD that is in the kernel
> +			 * part of the address space.  We preallocated
> +			 * all the KAISER PGDs, so this should never
> +			 * happen.
> +			 */
> +			WARN_ON_ONCE(1);
> +		}
> +	}
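
As a worked example of the pgdp_maps_userspace() test quoted above:
assuming 4 KiB pages and 8-byte PGD entries, a PGD page holds 512
entries, and the offset check puts entries 0-255 in the userspace half
and 256-511 in the kernel half. The sketch below reproduces that in
userspace; pgd_page and pgd_index_maps_userspace() are names invented
for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE	4096UL	/* assumed 4 KiB pages */
#define PTRS_PER_PGD	512	/* assumed: 512 entries of 8 bytes each */

typedef struct { uint64_t pgd; } pgd_t;

/* One page-aligned PGD page; the check relies on page alignment. */
static _Alignas(4096) pgd_t pgd_page[PTRS_PER_PGD];

/* Same test as in the patch: first half of the page maps userspace. */
static bool pgdp_maps_userspace(const void *pgdp)
{
	uintptr_t ptr = (uintptr_t)pgdp;

	return (ptr % PAGE_SIZE) < (PAGE_SIZE / 2);
}

/* Hypothetical helper for the example: classify entry @idx of pgd_page. */
static bool pgd_index_maps_userspace(int idx)
{
	assert(idx >= 0 && idx < PTRS_PER_PGD);
	return pgdp_maps_userspace(&pgd_page[idx]);
}
```

Entry 255 sits at byte offset 2040 and counts as userspace; entry 256
sits at offset 2048 and counts as kernel.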

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-20 20:22     ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-20 20:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

On Fri, 10 Nov 2017, Dave Hansen wrote:
>  	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
> +
> +	/* CPU 0's mapping is done in kaiser_init() */
> +	if (cpu) {
> +		int ret;
> +
> +		ret = kaiser_add_mapping((unsigned long) get_cpu_gdt_ro(cpu),
> +					 PAGE_SIZE, __PAGE_KERNEL_RO);
> +		/*
> +		 * We do not have a good way to fail CPU bringup.
> +		 * Just WARN about it and hope we boot far enough
> +		 * to get a good log out.
> +		 */

The GDT fixmap can be set up before the CPU is started. There is no reason
to do that in cpu_init().

> +
> +	/*
> +	 * We could theoretically do this in setup_fixmap_gdt().
> +	 * But, we would need to rewrite the above page table
> +	 * allocation code to use the bootmem allocator.  The
> +	 * buddy allocator is not available at the time that we
> +	 * call setup_fixmap_gdt() for CPU 0.
> +	 */
> +	kaiser_add_user_map_early(get_cpu_gdt_ro(0), PAGE_SIZE,
> +				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);

This one needs to stay.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 17/30] x86, kaiser: map debug IDT tables
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-20 20:40     ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-20 20:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

On Fri, 10 Nov 2017, Dave Hansen wrote:
>  
> +static int kaiser_user_map_ptr_early(const void *start_addr, unsigned long size,
> +				 unsigned long flags)
> +{
> +	int ret = kaiser_add_user_map(start_addr, size, flags);
> +	WARN_ON(ret);
> +	return ret;

What's the point of the return value when it is ignored at the call site?

> +}
> +
>  /*
>   * Ensure that the top level of the (shadow) page tables are
>   * entirely populated.  This ensures that all processes that get
> @@ -374,6 +382,10 @@ void __init kaiser_init(void)
>  				  sizeof(gate_desc) * NR_VECTORS,
>  				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
>  
> +	kaiser_user_map_ptr_early(&debug_idt_table,
> +				  sizeof(gate_desc) * NR_VECTORS,
> +				  __PAGE_KERNEL | _PAGE_GLOBAL);
> +

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 17/30] x86, kaiser: map debug IDT tables
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-20 20:44     ` Andy Lutomirski
  -1 siblings, 0 replies; 149+ messages in thread
From: Andy Lutomirski @ 2017-11-20 20:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, richard.fellner, Andrew Lutomirski,
	Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On Fri, Nov 10, 2017 at 11:31 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The IDT is another structure which the CPU references via a
> virtual address.  It also obviously needs these to handle an
> interrupt in userspace, so these need to be mapped into the user
> copy of the page tables.

Why would the debug IDT ever be used in user mode?  IIRC it's a total
turd related to avoiding crap nesting inside NMI.  Or am I wrong?

If it *is* used in user mode, then we have a bug and it should be in
the IDT to avoid address leaks just like the normal IDT.

--Andy

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-20 20:22     ` Thomas Gleixner
@ 2017-11-20 20:46       ` Andy Lutomirski
  -1 siblings, 0 replies; 149+ messages in thread
From: Andy Lutomirski @ 2017-11-20 20:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, richard.fellner, Andrew Lutomirski,
	Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On Mon, Nov 20, 2017 at 12:22 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 10 Nov 2017, Dave Hansen wrote:
>>       __set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
>> +
>> +     /* CPU 0's mapping is done in kaiser_init() */
>> +     if (cpu) {
>> +             int ret;
>> +
>> +             ret = kaiser_add_mapping((unsigned long) get_cpu_gdt_ro(cpu),
>> +                                      PAGE_SIZE, __PAGE_KERNEL_RO);
>> +             /*
>> +              * We do not have a good way to fail CPU bringup.
>> +              * Just WARN about it and hope we boot far enough
>> +              * to get a good log out.
>> +              */
>
> The GDT fixmap can be set up before the CPU is started. There is no reason
> to do that in cpu_init().
>
>> +
>> +     /*
>> +      * We could theoretically do this in setup_fixmap_gdt().
>> +      * But, we would need to rewrite the above page table
>> +      * allocation code to use the bootmem allocator.  The
>> +      * buddy allocator is not available at the time that we
>> +      * call setup_fixmap_gdt() for CPU 0.
>> +      */
>> +     kaiser_add_user_map_early(get_cpu_gdt_ro(0), PAGE_SIZE,
>> +                               __PAGE_KERNEL_RO | _PAGE_GLOBAL);
>
> This one is needs to stay.

When you rebase on to my latest version, this should change to mapping
the entire cpu_entry_area.

--Andy

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 20/30] x86, mm: remove hard-coded ASID limit checks
  2017-11-10 19:31   ` Dave Hansen
@ 2017-11-20 20:47     ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-20 20:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

On Fri, 10 Nov 2017, Dave Hansen wrote:
>  
> +/* There are 12 bits of space for ASIDS in CR3 */
> +#define CR3_HW_ASID_BITS 12
> +/* When enabled, KAISER consumes a single bit for user/kernel switches */
> +#define KAISER_CONSUMED_ASID_BITS 0
> +
> +#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS-KAISER_CONSUMED_ASID_BITS)

Spaces around '-' please. Same for other operators.

> +/*
> + * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid.  -1 below
> + * to account for them being zero-absed.  Another -1 is because ASID 0

s/absed/based/

> + * is reserved for use by non-PCID-aware users.
> + */
> +#define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
> +
>  /*
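
The arithmetic behind MAX_ASID_AVAILABLE, written out as a small
sketch. max_asid_available() is a hypothetical helper for illustration;
only the constants and the formula come from the quoted patch.

```c
#include <assert.h>

/* There are 12 bits of space for ASIDs in CR3 (from the patch). */
#define CR3_HW_ASID_BITS 12

/*
 * Hypothetical helper: highest usable ASID for a given number of ASID
 * bits consumed by KAISER.  One value is lost because ASIDs are
 * zero-based, another because ASID 0 is reserved for non-PCID-aware
 * users, hence the "- 2".
 */
static int max_asid_available(int kaiser_consumed_bits)
{
	int avail_bits = CR3_HW_ASID_BITS - kaiser_consumed_bits;

	assert(avail_bits > 1);
	return (1 << avail_bits) - 2;
}
```

With 0 consumed bits, as in this patch, the result is 4094; with the
single bit that the comment says KAISER consumes when enabled, it
would be 2046.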

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 17/30] x86, kaiser: map debug IDT tables
  2017-11-20 20:44     ` Andy Lutomirski
@ 2017-11-20 20:54       ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-20 20:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, richard.fellner, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On Mon, 20 Nov 2017, Andy Lutomirski wrote:

> On Fri, Nov 10, 2017 at 11:31 AM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
> >
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> >
> > The IDT is another structure which the CPU references via a
> > virtual address.  It also obviously needs these to handle an
> > interrupt in userspace, so these need to be mapped into the user
> > copy of the page tables.
> 
> Why would the debug IDT ever be used in user mode?  IIRC it's a total
> turd related to avoiding crap nesting inside NMI.  Or am I wrong?

No. It's called from the TRACE_IRQS macros in the ASM entry code and from
do_nmi().

> If it *is* used in user mode, then we have a bug and it should be in
> the IDT to avoid address leaks just like the normal IDT.

It's not, so this can go away. Good catch.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-20 20:46       ` Andy Lutomirski
@ 2017-11-20 20:55         ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-20 20:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, richard.fellner, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On Mon, 20 Nov 2017, Andy Lutomirski wrote:
> On Mon, Nov 20, 2017 at 12:22 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 10 Nov 2017, Dave Hansen wrote:
> >>       __set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
> >> +
> >> +     /* CPU 0's mapping is done in kaiser_init() */
> >> +     if (cpu) {
> >> +             int ret;
> >> +
> >> +             ret = kaiser_add_mapping((unsigned long) get_cpu_gdt_ro(cpu),
> >> +                                      PAGE_SIZE, __PAGE_KERNEL_RO);
> >> +             /*
> >> +              * We do not have a good way to fail CPU bringup.
> >> +              * Just WARN about it and hope we boot far enough
> >> +              * to get a good log out.
> >> +              */
> >
> > The GDT fixmap can be set up before the CPU is started. There is no reason
> > to do that in cpu_init().
> >
> >> +
> >> +     /*
> >> +      * We could theoretically do this in setup_fixmap_gdt().
> >> +      * But, we would need to rewrite the above page table
> >> +      * allocation code to use the bootmem allocator.  The
> >> +      * buddy allocator is not available at the time that we
> >> +      * call setup_fixmap_gdt() for CPU 0.
> >> +      */
> >> +     kaiser_add_user_map_early(get_cpu_gdt_ro(0), PAGE_SIZE,
> >> +                               __PAGE_KERNEL_RO | _PAGE_GLOBAL);
> >
> > This one needs to stay.
> 
> When you rebase on to my latest version, this should change to mapping
> the entire cpu_entry_area.

Too much flux left and right :)

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 09/30] x86, kaiser: only populate shadow page tables for userspace
  2017-11-20 20:12     ` Thomas Gleixner
@ 2017-11-21  7:05       ` Ingo Molnar
  -1 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2017-11-21  7:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86


* Thomas Gleixner <tglx@linutronix.de> wrote:

> > + */
> > +static inline bool pgd_userspace_access(pgd_t pgd)
> > +{
> > +	return (pgd.pgd & _PAGE_USER);
> > +}

Also a nit: the parentheses are superfluous - these aren't macros.

Thanks,

	Ingo
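
A minimal userspace sketch of the helper with the parentheses dropped.
The pgd_t stand-in and the _PAGE_USER value (bit 2, the x86
User/Supervisor bit) are defined locally for illustration and are not
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the kernel definitions. */
typedef struct { uint64_t pgd; } pgd_t;
#define _PAGE_USER (UINT64_C(1) << 2)

/*
 * Same check as the quoted helper, without the extra parentheses.
 * The conversion to bool turns any non-zero result into true, so no
 * explicit "!= 0" is needed either.
 */
static inline bool pgd_userspace_access(pgd_t pgd)
{
	return pgd.pgd & _PAGE_USER;
}
```

An entry with bit 2 set reports true, one without it reports false.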

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-20 20:46       ` Andy Lutomirski
@ 2017-11-21 21:19         ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-21 21:19 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, richard.fellner, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On 11/20/2017 12:46 PM, Andy Lutomirski wrote:
>>> +     /*
>>> +      * We could theoretically do this in setup_fixmap_gdt().
>>> +      * But, we would need to rewrite the above page table
>>> +      * allocation code to use the bootmem allocator.  The
>>> +      * buddy allocator is not available at the time that we
>>> +      * call setup_fixmap_gdt() for CPU 0.
>>> +      */
>>> +     kaiser_add_user_map_early(get_cpu_gdt_ro(0), PAGE_SIZE,
>>> +                               __PAGE_KERNEL_RO | _PAGE_GLOBAL);
>> This one needs to stay.
> When you rebase on to my latest version, this should change to mapping
> the entire cpu_entry_area.

I did this, but unfortunately it ends up having to individually map all
four pieces of cpu_entry_area.  They all need different permissions and
while theoretically we could do TSS+exception-stacks in the same call,
they're not next to each other:

 GDT: R/O
 TSS: R/W at least because of trampoline stack
 entry code: EXEC+R/O
 exception stacks: R/W
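
The point about adjacency can be illustrated with a small sketch that
counts how many mapping calls are needed when only neighbouring pieces
with equal permissions may share a call.  The piece order below simply
follows the list above and is an assumption; the real order is whatever
struct cpu_entry_area defines.  The enum and struct names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative permission classes, not kernel pgprot values. */
enum area_prot { AREA_RO, AREA_RW, AREA_RX };

struct area_piece {
	const char *name;
	enum area_prot prot;
};

/* Assumed layout order, taken from the list in the mail. */
static const struct area_piece cpu_entry_area_pieces[] = {
	{ "GDT",              AREA_RO },
	{ "TSS",              AREA_RW },
	{ "entry code",       AREA_RX },
	{ "exception stacks", AREA_RW },
};

/*
 * Count the mapping calls needed if a call can only cover pieces that
 * are adjacent and have the same protection.
 */
static size_t count_mapping_calls(const struct area_piece *p, size_t n)
{
	size_t calls = 0;

	for (size_t i = 0; i < n; i++) {
		if (i == 0 || p[i].prot != p[i - 1].prot)
			calls++;
	}
	return calls;
}
```

With this order the result is four calls: TSS and exception stacks share
R/W but are separated by the entry code, so they cannot be merged.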

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 09/30] x86, kaiser: only populate shadow page tables for userspace
  2017-11-20 20:12     ` Thomas Gleixner
@ 2017-11-21 22:09       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-21 22:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

On 11/20/2017 12:12 PM, Thomas Gleixner wrote:
>> +			 */
>> +			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
>> +			/*
>> +			 * For the copy of the pgd that the kernel
>> +			 * uses, make it unusable to userspace.  This
>> +			 * ensures if we get out to userspace with the
>> +			 * wrong CR3 value, userspace will crash
>> +			 * instead of running.
>> +			 */
>> +			pgd.pgd |= _PAGE_NX;
>> +		}
>> +	} else if (!pgd.pgd) {
>> +		/*
>> +		 * We are clearing the PGD and can not check  _PAGE_USER
>> +		 * in the zero'd PGD.
> 
> Just the argument cannot be checked because it's clearing the entry. The
> pgd entry itself is not yet modified, so it could be checked.

So, I guess we could enforce that only PGDs with _PAGE_USER set can ever
be cleared.  That has a nice symmetry to it because we set the shadow
when we see _PAGE_USER and we would then clear the shadow when we see
_PAGE_USER.
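
A simplified userspace model of that symmetric rule, for illustration
only.  struct pgd_pair and model_set_pgd() are hypothetical; the real
code works on a pgd_t pointer and finds the shadow entry via
native_get_shadow_pgd().  The bit positions are the x86 ones
(User/Supervisor is bit 2, NX is bit 63).

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t pgd; } pgd_t;

#define _PAGE_USER (UINT64_C(1) << 2)	/* x86 User/Supervisor bit */
#define _PAGE_NX   (UINT64_C(1) << 63)	/* x86 no-execute bit */

/* Hypothetical model: one kernel PGD slot and its shadow (user) slot. */
struct pgd_pair {
	pgd_t kernel;
	pgd_t shadow;
};

/*
 * The shadow entry is written when the new value has _PAGE_USER and is
 * cleared when the entry being zapped had _PAGE_USER.  Entries without
 * _PAGE_USER never touch the shadow.
 */
static void model_set_pgd(struct pgd_pair *p, pgd_t pgd)
{
	if (pgd.pgd & _PAGE_USER) {
		p->shadow.pgd = pgd.pgd;
		/* Kernel copy: userspace must not run from it. */
		pgd.pgd |= _PAGE_NX;
	} else if (!pgd.pgd && (p->kernel.pgd & _PAGE_USER)) {
		/* Check the old entry, which is not modified yet. */
		p->shadow.pgd = 0;
	}
	p->kernel = pgd;
}
```

Setting a user entry populates both slots (the kernel copy with NX set);
clearing it later clears both again.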

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-20 20:22     ` Thomas Gleixner
@ 2017-11-21 22:12       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-21 22:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

On 11/20/2017 12:22 PM, Thomas Gleixner wrote:
> On Fri, 10 Nov 2017, Dave Hansen wrote:
>>  	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
>> +
>> +	/* CPU 0's mapping is done in kaiser_init() */
>> +	if (cpu) {
>> +		int ret;
>> +
>> +		ret = kaiser_add_mapping((unsigned long) get_cpu_gdt_ro(cpu),
>> +					 PAGE_SIZE, __PAGE_KERNEL_RO);
>> +		/*
>> +		 * We do not have a good way to fail CPU bringup.
>> +		 * Just WARN about it and hope we boot far enough
>> +		 * to get a good log out.
>> +		 */
> 
> The GDT fixmap can be set up before the CPU is started. There is no reason
> to do that in cpu_init().

Do you mean the __set_fixmap(), or my call to kaiser_add_mapping()?

Where would you suggest we move it?  Here seems kinda nice because it's
right next to where the get_cpu_gdt_ro() mapping is created.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 17/30] x86, kaiser: map debug IDT tables
  2017-11-20 20:40     ` Thomas Gleixner
@ 2017-11-21 22:16       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-21 22:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, richard.fellner, luto, torvalds, keescook,
	hughd, x86

On 11/20/2017 12:40 PM, Thomas Gleixner wrote:
> On Fri, 10 Nov 2017, Dave Hansen wrote:
>>  
>> +static int kaiser_user_map_ptr_early(const void *start_addr, unsigned long size,
>> +				 unsigned long flags)
>> +{
>> +	int ret = kaiser_add_user_map(start_addr, size, flags);
>> +	WARN_ON(ret);
>> +	return ret;
> What's the point of the return value when it is ignored at the call site?

I'm dropping this patch, btw.  It was unnecessary.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-21 21:19         ` Dave Hansen
@ 2017-11-21 22:46           ` Andy Lutomirski
  -1 siblings, 0 replies; 149+ messages in thread
From: Andy Lutomirski @ 2017-11-21 22:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, richard.fellner,
	Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML



> On Nov 21, 2017, at 2:19 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> 
> On 11/20/2017 12:46 PM, Andy Lutomirski wrote:
>>>> +     /*
>>>> +      * We could theoretically do this in setup_fixmap_gdt().
>>>> +      * But, we would need to rewrite the above page table
>>>> +      * allocation code to use the bootmem allocator.  The
>>>> +      * buddy allocator is not available at the time that we
>>>> +      * call setup_fixmap_gdt() for CPU 0.
>>>> +      */
>>>> +     kaiser_add_user_map_early(get_cpu_gdt_ro(0), PAGE_SIZE,
>>>> +                               __PAGE_KERNEL_RO | _PAGE_GLOBAL);
>>> This one needs to stay.
>> When you rebase on to my latest version, this should change to mapping
>> the entire cpu_entry_area.
> 
> I did this, but unfortunately it ends up having to individually map all
> four pieces of cpu_entry_area.  They all need different permissions and
> while theoretically we could do TSS+exception-stacks in the same call,
> they're not next to each other:
> 
> GDT: R/O
> TSS: R/W at least because of trampoline stack
> entry code: EXEC+R/O
> exception stacks: R/W

Can you avoid code duplication by adding some logic right after the
kernel cpu_entry_area is set up to iterate page by page over the PTEs
in the cpu_entry_area for that CPU and just install exactly the same
PTEs into the kaiser table?  E.g. just call kaiser_add_mapping once
per page but with the parameters read out from the fixmap PTEs
instead of hard coded?
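
The per-page copy suggested above might look roughly like this in a
userspace model.  The flat arrays stand in for the kernel and kaiser
page tables, and copy_ptes_to_user() is a hypothetical name; no real
kernel page-table walking API is used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pte_t;		/* stand-in for the kernel pte_t */

/*
 * Copy npages PTEs, starting at index first, from the kernel table
 * into the user (shadow) table.  Frame and permission bits travel
 * together, so the caller never spells out R/O, R/W or EXEC.
 */
static void copy_ptes_to_user(const pte_t *kernel_ptes, pte_t *user_ptes,
			      size_t first, size_t npages)
{
	for (size_t i = first; i < first + npages; i++)
		user_ptes[i] = kernel_ptes[i];
}
```

Because the permissions are read from the source entries, entries
outside the requested range stay untouched and nothing is hard coded.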

As a fancier but maybe better option, we could fiddle with the fixmap
indices so that the whole cpu_entry_area range is aligned to a PMD
boundary or higher.  We'd preallocate all the page tables for this
range before booting any APs.  Then the kaiser tables could just
reference the same page tables, and we don't need any AP kaiser setup
at all.

This should be a wee bit faster, too, since we reduce the number of
cache lines needed to refill the TLB when needed.
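
A sketch of the shared-page-table idea, again as a userspace model.
PMD_SHIFT and PTRS_PER_PTE carry their x86-64 values; struct pmd_model,
fits_one_pmd() and share_pte_page() are hypothetical and only show why
no per-AP kaiser setup would be needed once both tables reference the
same PTE page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_SHIFT	21			/* 2 MiB on x86-64 */
#define PMD_SIZE	(UINT64_C(1) << PMD_SHIFT)
#define PTRS_PER_PTE	512

typedef uint64_t pte_t;

/* True if the range starts on a PMD boundary and fits in one PMD. */
static bool fits_one_pmd(uint64_t start, uint64_t size)
{
	return (start & (PMD_SIZE - 1)) == 0 && size <= PMD_SIZE;
}

/* Hypothetical model: a PMD entry is just a pointer to a PTE page. */
struct pmd_model {
	pte_t *pte_page;
};

/* Point the user PMD slot at the same PTE page as the kernel slot. */
static void share_pte_page(struct pmd_model *user,
			   const struct pmd_model *kernel)
{
	user->pte_page = kernel->pte_page;
}
```

After sharing, any PTE the kernel installs in that page is visible
through the user table as well, without a second mapping call.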

I'm really hoping we can get rid of kaiser_add_mapping entirely.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-21 22:46           ` Andy Lutomirski
@ 2017-11-21 23:17             ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-21 23:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Thomas Gleixner, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, richard.fellner,
	Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On 11/21/2017 02:46 PM, Andy Lutomirski wrote:
>> GDT: R/O
>> TSS: R/W at least because of trampoline stack
>> entry code: EXEC+R/O
>> exception stacks: R/W
> Can you avoid code duplication by adding some logic right after the
> kernel cpu_entry_area is set up to iterate page by page over the PTEs
> in the cpu_entry_area for that CPU and just install exactly the same
> PTEs into the kaiser table?  E.g. just call kaiser_add_mapping once
> per page but with the parameters read out from the fixmap PTEs
> instead of hard coded?

Yes, we could do that.  But, what's the gain?  We end up removing
effectively three (long) lines of code from three kaiser_add_mapping()
calls.

To do this, we need to special-case the kernel page table walker to deal
with PTEs only since we can't just grab PMD or PUD flags and stick them
in a PTE.  We would only be able to use this path when populating things
that we know are 4k-mapped in the kernel.

I guess the upside is that we don't open-code the permissions in the
KAISER code that *have* to match the permissions that the kernel itself
established.

It also means that theoretically you could not touch the KAISER code the
next time we expand the cpu entry area.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-21 23:17             ` Dave Hansen
@ 2017-11-21 23:32               ` Andy Lutomirski
  -1 siblings, 0 replies; 149+ messages in thread
From: Andy Lutomirski @ 2017-11-21 23:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, richard.fellner,
	Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On Tue, Nov 21, 2017 at 3:17 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 11/21/2017 02:46 PM, Andy Lutomirski wrote:
>>> GDT: R/O
>>> TSS: R/W at least because of trampoline stack
>>> entry code: EXEC+R/O
>>> exception stacks: R/W
>> Can you avoid code duplication by adding some logic right after the
>> kernel cpu_entry_area is set up to iterate page by page over the PTEs
>> in the cpu_entry_area for that CPU and just install exactly the same
>> PTEs into the kaiser table?  E.g. just call kaiser_add_mapping once
>> per page but with the parameters read out from the fixmap PTEs
>> instead of hard coded?
>
> Yes, we could do that.  But, what's the gain?  We end up removing
> effectively three (long) lines of code from three kaiser_add_mapping()
> calls.

I'm hoping we can remove kaiser_add_mapping() entirely.  Maybe that's
silly optimism.

>
> To do this, we need to special-case the kernel page table walker to deal
> with PTEs only since we can't just grab PMD or PUD flags and stick them
> in a PTE.  We would only be able to use this path when populating things
> that we know are 4k-mapped in the kernel.

I'm not sure I'm understanding the issue.  We'd promise to map the
cpu_entry_area without using large pages, but I'm not sure I know what
you're referring to.  The only issue I see is that we'd have to be
quite careful when tearing down the user tables to avoid freeing the
shared part.

>
> I guess the upside is that we don't open-code the permissions in the
> KAISER code that *have* to match the permissions that the kernel itself
> established.
>
> It also means that theoretically you could not touch the KAISER code the
> next time we expand the cpu entry area.

I definitely like that part.

--Andy

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-21 23:32               ` Andy Lutomirski
@ 2017-11-21 23:42                 ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-21 23:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, richard.fellner, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML

On 11/21/2017 03:32 PM, Andy Lutomirski wrote:
>> To do this, we need to special-case the kernel page table walker to deal
>> with PTEs only since we can't just grab PMD or PUD flags and stick them
>> in a PTE.  We would only be able to use this path when populating things
>> that we know are 4k-mapped in the kernel.
> I'm not sure I'm understanding the issue.  We'd promise to map the
> cpu_entry_area without using large pages, but I'm not sure I know what
> you're referring to.  The only issue I see is that we'd have to be
> quite careful when tearing down the user tables to avoid freeing the
> shared part.

It's just that the walker currently handles both large and small pages
in the kernel mapping that it's copying.  If we want it to just copy the
PTE, we've got to refactor things a bit to separate the PTE flags from
the paddr being targeted, and also make sure we don't munge the flags
when converting large-page entries to 4k PTEs.  The PAT and PSE bits
cause a bit of trouble here.

IOW, it would make the call sites look cleaner, but it largely just
shifts the complexity elsewhere.  Either way, it's all contained in
kaiser.c.
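
To illustrate the PAT/PSE wrinkle, here is a minimal userspace sketch
(not code from this series).  On x86, bit 7 means PSE in a large-page
PMD/PUD entry but PAT in a 4k PTE, and the large-page PAT bit sits at
bit 12.  The helper name flags_large_to_4k() and the BIT_* macros are
made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define BIT_PSE        7	/* large-page entry: "maps a huge page" */
#define BIT_PAT        7	/* 4k PTE: PAT selector (same bit as PSE) */
#define BIT_PAT_LARGE 12	/* large-page entry: PAT selector */

/* Convert the flag bits of a large-page entry to their 4k PTE form. */
uint64_t flags_large_to_4k(uint64_t flags)
{
	uint64_t pat;

	/* Only meaningful for entries that really are large pages. */
	assert(flags & (1ULL << BIT_PSE));

	pat = (flags >> BIT_PAT_LARGE) & 1;
	/* Drop PSE and the large-page PAT bit ... */
	flags &= ~((1ULL << BIT_PSE) | (1ULL << BIT_PAT_LARGE));
	/* ... and re-insert PAT where a 4k PTE expects it. */
	return flags | (pat << BIT_PAT);
}
```

Copying large-page flags into a PTE unchanged would leave bit 7 set and
thereby select a different PAT entry, which is why the flags have to be
split from the paddr and converted.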

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-21 23:42                 ` Dave Hansen
@ 2017-11-22  0:17                   ` Andy Lutomirski
  -1 siblings, 0 replies; 149+ messages in thread
From: Andy Lutomirski @ 2017-11-22  0:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, richard.fellner,
	Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On Tue, Nov 21, 2017 at 3:42 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 11/21/2017 03:32 PM, Andy Lutomirski wrote:
>>> To do this, we need to special-case the kernel page table walker to deal
>>> with PTEs only since we can't just grab PMD or PUD flags and stick them
>>> in a PTE.  We would only be able to use this path when populating things
>>> that we know are 4k-mapped in the kernel.
>> I'm not sure I'm understanding the issue.  We'd promise to map the
>> cpu_entry_area without using large pages, but I'm not sure I know what
>> you're referring to.  The only issue I see is that we'd have to be
>> quite careful when tearing down the user tables to avoid freeing the
>> shared part.
>
> It's just that it currently handles large and small pages in the kernel
> mapping that it's copying.  If we want to have it just copy the PTE,
> we've got to refactor things a bit to separate out the PTE flags from
> the paddr being targeted, and also make sure we don't munge the flags
> conversion from the large-page entries to 4k PTEs.  The PAT and PSE bits
> cause a bit of trouble here.

I'm confused.  I mean something like:

unsigned long start = (unsigned long)get_cpu_entry_area(cpu);
for (unsigned long addr = start;
     addr < start + sizeof(struct cpu_entry_area);
     addr += PAGE_SIZE) {
  pte_t pte = *pte_offset_k(addr);  /* or however you do this */
  kaiser_add_mapping(pte_pfn(pte), pte_prot(pte));
}

modulo the huge pile of typos in there that surely exist.

But I still prefer my approach of just sharing the cpu_entry_area pmd
entries between the user and kernel tables.
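
The page-by-page loop can be modelled in userspace as follows.  This is
only a toy sketch under loud assumptions: a single-level table stands in
for the real page tables, and struct toy_pte, toy_index() and
toy_mirror_range() are invented names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_ENTRIES   512

struct toy_pte {
	unsigned long pfn;
	unsigned long prot;
};

/* Hypothetical single-level table index for an address. */
static size_t toy_index(unsigned long addr)
{
	return (addr / TOY_PAGE_SIZE) % TOY_ENTRIES;
}

/*
 * Copy every PTE covering [start, start + size) from the kernel table
 * into the shadow table.  Returns the number of pages mirrored.
 */
size_t toy_mirror_range(const struct toy_pte *kernel, struct toy_pte *shadow,
			unsigned long start, unsigned long size)
{
	size_t copied = 0;

	assert(start % TOY_PAGE_SIZE == 0);

	for (unsigned long addr = start; addr < start + size;
	     addr += TOY_PAGE_SIZE) {
		/* pfn and prot travel together, so nothing is hard coded. */
		shadow[toy_index(addr)] = kernel[toy_index(addr)];
		copied++;
	}
	return copied;
}
```

Because the shadow entry is read back from the kernel entry, growing the
mirrored structure only changes the size argument.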

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 12/30] x86, kaiser: map GDT into user page tables
  2017-11-22  0:17                   ` Andy Lutomirski
@ 2017-11-22  0:37                     ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-22  0:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, richard.fellner, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML

On 11/21/2017 04:17 PM, Andy Lutomirski wrote:
> On Tue, Nov 21, 2017 at 3:42 PM, Dave Hansen
> unsigned long start = (unsigned long)get_cpu_entry_area(cpu);
> for (unsigned long addr = start;
>      addr < start + sizeof(struct cpu_entry_area);
>      addr += PAGE_SIZE) {
>   pte_t pte = *pte_offset_k(addr);  /* or however you do this */
>   kaiser_add_mapping(pte_pfn(pte), pte_prot(pte));
> }
> 
> modulo the huge pile of typos in there that surely exist.

That would work.  I just need to find a suitable pte_offset_k() in the
kernel and make sure it works for these purposes.  We probably have one.

> But I still prefer my approach of just sharing the cpu_entry_area pmd
> entries between the user and kernel tables.

That would be spiffy.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 09/30] x86, kaiser: only populate shadow page tables for userspace
  2017-11-21 22:09       ` Dave Hansen
@ 2017-11-22  3:44         ` Andy Lutomirski
  -1 siblings, 0 replies; 149+ messages in thread
From: Andy Lutomirski @ 2017-11-22  3:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, richard.fellner,
	Andrew Lutomirski, Linus Torvalds, Kees Cook, Hugh Dickins,
	X86 ML

On Tue, Nov 21, 2017 at 2:09 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 11/20/2017 12:12 PM, Thomas Gleixner wrote:
>>> +                     */
>>> +                    native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
>>> +                    /*
>>> +                     * For the copy of the pgd that the kernel
>>> +                     * uses, make it unusable to userspace.  This
>>> +                     * ensures if we get out to userspace with the
>>> +                     * wrong CR3 value, userspace will crash
>>> +                     * instead of running.
>>> +                     */
>>> +                    pgd.pgd |= _PAGE_NX;
>>> +            }
>>> +    } else if (!pgd.pgd) {
>>> +            /*
>>> +             * We are clearing the PGD and can not check  _PAGE_USER
>>> +             * in the zero'd PGD.
>>
>> Just the argument cannot be checked because it's clearing the entry. The
>> pgd entry itself is not yet modified, so it could be checked.
>
> So, I guess we could enforce that only PGDs with _PAGE_USER set can ever
> be cleared.  That has a nice symmetry to it because we set the shadow
> when we see _PAGE_USER and we would then clear the shadow when we see
> _PAGE_USER.

Is this code path ever hit in any case other than tearing down an LDT?

I'm tempted to suggest that KAISER just disable the MODIFY_LDT config
option for now...
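
For readers following the quoted hunk, the set/clear symmetry discussed
above can be sketched in userspace like this.  It is a toy model, not
the patch: struct toy_pgd_slot and toy_set_pgd() are invented, and only
the bit positions of the x86 U/S and NX flags are taken as given.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_USER (1ULL << 2)	/* x86 U/S bit */
#define TOY_PAGE_NX   (1ULL << 63)	/* x86 no-execute bit */

/* One top-level slot: the kernel's entry and its shadow twin. */
struct toy_pgd_slot {
	uint64_t kernel;
	uint64_t shadow;
};

void toy_set_pgd(struct toy_pgd_slot *slot, uint64_t val)
{
	assert(slot != NULL);

	if (val & TOY_PAGE_USER) {
		/* Userspace mapping: mirror it into the shadow ... */
		slot->shadow = val;
		/* ... and make the kernel's copy non-executable. */
		val |= TOY_PAGE_NX;
	} else if (!val && (slot->kernel & TOY_PAGE_USER)) {
		/*
		 * Clearing: the new value carries no flags, so look at
		 * the entry that is still in place.
		 */
		slot->shadow = 0;
	}
	slot->kernel = val;
}
```

The shadow is written when _PAGE_USER is seen in the new value and
cleared when it is seen in the old one; kernel-only entries never touch
the shadow.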

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 08/30] x86, kaiser: unmap kernel from userspace page tables (core patch)
  2017-11-20 17:21     ` Thomas Gleixner
@ 2017-11-22 22:45       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-22 22:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, richard.fellner, moritz.lipp,
	daniel.gruss, michael.schwarz, luto, torvalds, keescook, hughd,
	x86

On 11/20/2017 09:21 AM, Thomas Gleixner wrote:
>> +	pgd = native_get_shadow_pgd(pgd_offset_k(0UL));
>> +	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
>> +		unsigned long addr = PAGE_OFFSET + i * PGDIR_SIZE;
> This looks wrong. The kernel address space gets incremented by PGDIR_SIZE
> and does not make a jump from PAGE_OFFSET to PAGE_OFFSET + 256 * PGDIR_SIZE
> 
> 	int i, j;
> 
> 	for (i = PTRS_PER_PGD / 2, j = 0; i < PTRS_PER_PGD; i++, j++) {
> 		unsigned long addr = PAGE_OFFSET + j * PGDIR_SIZE;
> 
> Not that it has any effect right now. Neither p4d_alloc_one() nor
> pud_alloc_one() uses the 'addr' argument.

Ahh, you're saying that 'i' effectively starts *at* PAGE_OFFSET, since
beginning the loop at PTRS_PER_PGD/2 already puts it halfway up the
address space.  Adding PAGE_OFFSET on top of that is nonsense.

Would it just be simpler to do:

>         for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
>                 unsigned long addr = i * PGDIR_SIZE;

?
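
For reference, a sketch of how a PGD index maps to a virtual address,
assuming 4-level paging (PGDIR_SHIFT of 39, 512 entries, 48-bit virtual
addresses).  pgd_index_to_addr() is a made-up name.  A plain
i * PGDIR_SIZE in the upper half yields a non-canonical value unless
bit 47 is sign-extended; as noted above, this makes no difference while
the 'addr' argument is unused.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT  39
#define PTRS_PER_PGD 512

/* Return the first virtual address covered by PGD slot 'i'. */
uint64_t pgd_index_to_addr(unsigned int i)
{
	uint64_t addr;

	assert(i < PTRS_PER_PGD);

	addr = (uint64_t)i << PGDIR_SHIFT;
	/* Upper half: sign-extend bit 47 to get a canonical address. */
	if (addr & (1ULL << 47))
		addr |= 0xffff000000000000ULL;
	return addr;
}
```

Slot 256 is the first kernel-half slot, which is why the loop starts at
PTRS_PER_PGD / 2.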

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 08/30] x86, kaiser: unmap kernel from userspace page tables (core patch)
  2017-11-20 17:21     ` Thomas Gleixner
@ 2017-11-22 22:50       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-22 22:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, richard.fellner, moritz.lipp,
	daniel.gruss, michael.schwarz, luto, torvalds, keescook, hughd,
	x86

On 11/20/2017 09:21 AM, Thomas Gleixner wrote:
>> +}
>> +
>>  static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
>>  {
>> +#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
>> +	/*
>> +	 * set_pgd() does not get called when we are running
>> +	 * CONFIG_X86_5LEVEL=y.  So, just hack around it.  We
>> +	 * know here that we have a p4d but that it is really at
>> +	 * the top level of the page tables; it is really just a
>> +	 * pgd.
>> +	 */
>> +	/* Do we need to also populate the shadow p4d? */
>> +	if (is_userspace_pgd(p4dp))
>> +		native_get_shadow_p4d(p4dp)->pgd = p4d.pgd;
> native_get_shadow_p4d() is kinda confusing, as it suggests that we get
> the entry, not the pointer to it. native_get_shadow_p4d_ptr() is what it
> actually wants to be, but a setter, e.g. native_set_shadow...() (we also
> have set_pgd()), would be more obvious I think.

How about "kernel_to_shadow_pgdp()"? ... and friends
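
To show what a pointer-to-pointer helper with that name might look like,
here is a userspace sketch.  It assumes (this is not shown in this part
of the thread) that the kernel and shadow PGDs are the first and second
page of one 8k-aligned block, so the shadow slot is found by setting the
PAGE_SIZE bit in the pointer.  The pgd_t typedef and TOY_PAGE_SIZE are
local stand-ins.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096UL

typedef struct { uint64_t pgd; } pgd_t;

/* Map a pointer into the kernel PGD to the matching shadow PGD slot. */
pgd_t *kernel_to_shadow_pgdp(pgd_t *pgdp)
{
	/* The input must point into the first (kernel) page. */
	assert(((uintptr_t)pgdp & TOY_PAGE_SIZE) == 0);
	return (pgd_t *)((uintptr_t)pgdp | TOY_PAGE_SIZE);
}

/* Slots in the lower half of a PGD page cover userspace addresses. */
int is_userspace_pgd(const pgd_t *pgdp)
{
	return ((uintptr_t)pgdp % TOY_PAGE_SIZE) < TOY_PAGE_SIZE / 2;
}
```

With this naming the call site reads as "convert pointer, then store",
e.g. kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd, which makes it clear
that a pointer and not an entry is returned.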

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 08/30] x86, kaiser: unmap kernel from userspace page tables (core patch)
  2017-11-20 17:21     ` Thomas Gleixner
@ 2017-11-22 22:54       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-22 22:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, richard.fellner, moritz.lipp,
	daniel.gruss, michael.schwarz, luto, torvalds, keescook, hughd,
	x86

On 11/20/2017 09:21 AM, Thomas Gleixner wrote:
>> +page tables are switched to the full "kernel" copy.  When the
>> +system switches back to user mode, the user/shadow copy is used.
>> +
>> +The minimalistic kernel portion of the user page tables try to
>> +map only what is needed to enter/exit the kernel such as the
>> +entry/exit functions themselves and the interrupt descriptor
>> +table (IDT).
> s/try to//

Actually, they do _aspire_ "to map only what is needed".  But, there
*is* some non-necessary cruft (like the first C function in an
interrupt).  So, removing this language actually makes the description
less precise.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 08/30] x86, kaiser: unmap kernel from userspace page tables (core patch)
  2017-11-20 17:21     ` Thomas Gleixner
@ 2017-11-22 23:11       ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-22 23:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, richard.fellner, moritz.lipp,
	daniel.gruss, michael.schwarz, luto, torvalds, keescook, hughd,
	x86

On 11/20/2017 09:21 AM, Thomas Gleixner wrote:
>> +KAISER logically keeps a "copy" of the page tables which unmap
>> +the kernel while in userspace.  The kernel manages the page
>> +tables as normal, but the "copying" is done with a few tricks
>> +that mean that we do not have to manage two full copies.
>> +The first trick is that for any any new kernel mapping, we
>> +presume that we do not want it mapped to userspace.  That means
>> +we normally have no copying to do.  We only copy the kernel
>> +entries over to the shadow in response to a kaiser_add_*()
>> +call which is rare.
>  When KAISER is enabled the kernel manages two page tables for the kernel
>  mappings. The regular page table which is used while executing in kernel
>  space and a shadow copy which only contains the mapping entries which are
>  required for the kernel-userspace transition. These mappings have to be
>  copied into the shadow page tables explicitly with the kaiser_add_*()
>  functions.

This misses a few important points that I think the original text
touches on.  I gave it another go:

> Page Table Management
> =====================
> 
> When KAISER is enabled, the kernel manages two sets of page
> tables.  The first copy is very similar to what would be present
> for a kernel without KAISER.  This includes a complete mapping of
> userspace that the kernel can use for things like copy_to_user().
> 
> The second (shadow) is used when running userspace and mirrors the
> mapping of userspace present in the kernel copy.  It maps only
> the kernel data needed to enter and exit the kernel.
> 
> The shadow is populated by the kaiser_add_*() functions.  Only
> kernel data which has been explicitly mapped will appear in the
> shadow copy.  These calls are rare at runtime.
> 
> For a new userspace mapping, the kernel makes the entries in its
> page tables like normal.  The only difference is when the kernel
> makes entries in the top (PGD) level.  In addition to setting the
> entry in the main kernel PGD, a copy of the entry is made in the
> shadow PGD.
> 
> For userspace mappings the kernel creates an entry in the kernel
> PGD and the same entry in the shadow PGD, so the underlying page
> table to which the PGD entry points is shared down to the PTE
> level.  This leaves a single, shared set of userspace page tables
> to manage.  One PTE to lock, one set of accessed bits, dirty
> bits, etc...
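
The sharing described in the last paragraph can be modelled with two
top-level arrays that point at the same lower-level table.  This is a
toy userspace sketch; struct toy_mm, struct toy_table and the toy_map_*()
helpers are invented for illustration, and only the position of the x86
dirty bit (bit 6) is taken as given.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ENTRIES   512
#define TOY_PTE_DIRTY (1ULL << 6)	/* x86 dirty bit */

struct toy_table {
	uint64_t pte[TOY_ENTRIES];
};

struct toy_mm {
	struct toy_table *kernel_pgd[TOY_ENTRIES];
	struct toy_table *shadow_pgd[TOY_ENTRIES];
};

/* Userspace slot: both PGDs reference the same lower-level table. */
void toy_map_user(struct toy_mm *mm, unsigned int idx, struct toy_table *lower)
{
	assert(idx < TOY_ENTRIES / 2);
	mm->kernel_pgd[idx] = lower;
	mm->shadow_pgd[idx] = lower;
}

/* Kernel slot: by default only the kernel PGD gets the entry. */
void toy_map_kernel(struct toy_mm *mm, unsigned int idx, struct toy_table *lower)
{
	assert(idx >= TOY_ENTRIES / 2 && idx < TOY_ENTRIES);
	mm->kernel_pgd[idx] = lower;
}
```

A dirty bit set through one PGD is visible through the other because
there is only one PTE, while a kernel slot stays absent from the shadow
unless it is added explicitly.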

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 09/30] x86, kaiser: only populate shadow page tables for userspace
  2017-11-22  3:44         ` Andy Lutomirski
@ 2017-11-22 23:30           ` Dave Hansen
  -1 siblings, 0 replies; 149+ messages in thread
From: Dave Hansen @ 2017-11-22 23:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, richard.fellner, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML

On 11/21/2017 07:44 PM, Andy Lutomirski wrote:
>> So, I guess we could enforce that only PGDs with _PAGE_USER set can ever
>> be cleared.  That has a nice symmetry to it because we set the shadow
>> when we see _PAGE_USER and we would then clear the shadow when we see
>> _PAGE_USER.
> Is this code path ever hit in any case other than tearing down an LDT?

Do you mean the PGD clearing?  We use it for tearing down userspace
PGDs, but that's it.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 04/30] x86, kaiser: disable global pages by default with KAISER
  2017-11-14 19:38   ` Rik van Riel
@ 2017-11-26 14:48       ` Ingo Molnar
  0 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2017-11-26 14:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dave Hansen, linux-kernel, linux-mm, bp, tglx, moritz.lipp,
	daniel.gruss, michael.schwarz, richard.fellner, luto, torvalds,
	keescook, hughd, x86


* Rik van Riel <riel@redhat.com> wrote:

> On Fri, 2017-11-10 at 11:31 -0800, Dave Hansen wrote:
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> > 
> > Global pages stay in the TLB across context switches.  Since all
> > contexts share the same kernel mapping, these mappings are marked as
> > global pages so kernel entries in the TLB are not flushed out on a
> > context switch.
> > 
> > But, even having these entries in the TLB opens up something that an
> > attacker can use [1].
> > 
> > That means that even when KAISER switches page tables on return to
> > user space the global pages would stay in the TLB cache.
> > 
> > Disable global pages so that kernel TLB entries can be flushed before
> > returning to user space. This way, all accesses to kernel addresses
> > from userspace result in a TLB miss independent of the existence of a
> > kernel mapping.
> > 
> > Replace _PAGE_GLOBAL by __PAGE_KERNEL_GLOBAL and keep _PAGE_GLOBAL
> > available so that it can still be used for a few selected kernel
> > mappings which must be visible to userspace, when KAISER is enabled,
> > like the entry/exit code and data.
> 
> Nice changelog.
> 
> Why am I pointing this out?
> 
> > +++ b/arch/x86/include/asm/pgtable_types.h	2017-11-10 11:22:06.626244956 -0800
> > @@ -179,8 +179,20 @@ enum page_cache_mode {
> >  #define PAGE_READONLY_EXEC	__pgprot(_PAGE_PRESENT | _PAGE_USER |	\
> >  					 _PAGE_ACCESSED)
> >  
> > +/*
> > + * Disable global pages for anything using the default
> > + * __PAGE_KERNEL* macros.  PGE will still be enabled
> > + * and _PAGE_GLOBAL may still be used carefully.
> > + */
> > +#ifdef CONFIG_KAISER
> > +#define __PAGE_KERNEL_GLOBAL	0
> > +#else
> > +#define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
> > +#endif
> > +					
> 
> The comment above could use a little more info
> on why things are done that way, though :)

Good point - I've updated these comments to say:

/*
 * Disable global pages for anything using the default
 * __PAGE_KERNEL* macros.
 *
 * PGE will still be enabled and _PAGE_GLOBAL may still be used carefully
 * for a few selected kernel mappings which must be visible to userspace,
 * when KAISER is enabled, like the entry/exit code and data.
 */
#ifdef CONFIG_KAISER
#define __PAGE_KERNEL_GLOBAL	0
#else
#define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
#endif

... and I've added your Reviewed-by tag which I assume now applies?

Thanks,

	Ingo
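
For readers following the thread, the effect of the macro above can be
sketched in plain userspace C. This is only a simplified model, not the
kernel header: MODEL_PAGE_KERNEL and model_is_global() are hypothetical
names, and the composite protection value is reduced to a handful of bits.
The bit positions are the architectural x86 PTE bits.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural x86 PTE bits (only the ones this sketch needs). */
#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_RW	(1ULL << 1)
#define _PAGE_ACCESSED	(1ULL << 5)
#define _PAGE_DIRTY	(1ULL << 6)
#define _PAGE_GLOBAL	(1ULL << 8)

/* Same shape as the macro discussed above. */
#ifdef CONFIG_KAISER
#define __PAGE_KERNEL_GLOBAL	0
#else
#define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
#endif

/*
 * Hypothetical, simplified stand-in for a default __PAGE_KERNEL*
 * composite: it picks up the global bit only through
 * __PAGE_KERNEL_GLOBAL, never through _PAGE_GLOBAL directly.
 */
#define MODEL_PAGE_KERNEL	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
				 _PAGE_DIRTY | __PAGE_KERNEL_GLOBAL)

/* Returns 1 if the protection value carries the global bit. */
static int model_is_global(uint64_t prot)
{
	return (prot & _PAGE_GLOBAL) != 0;
}
```

Built with -DCONFIG_KAISER, model_is_global(MODEL_PAGE_KERNEL) returns 0,
while a mapping that ORs in _PAGE_GLOBAL explicitly still gets the bit.
That is the "may still be used carefully" case from the comment.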

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 04/30] x86, kaiser: disable global pages by default with KAISER
  2017-11-26 14:48       ` Ingo Molnar
@ 2017-11-27 11:37         ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-27 11:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Dave Hansen, linux-kernel, linux-mm, bp,
	moritz.lipp, daniel.gruss, michael.schwarz, richard.fellner,
	luto, torvalds, keescook, hughd, x86

On Sun, 26 Nov 2017, Ingo Molnar wrote:
>  * Disable global pages for anything using the default
>  * __PAGE_KERNEL* macros.
>  *
>  * PGE will still be enabled and _PAGE_GLOBAL may still be used carefully
>  * for a few selected kernel mappings which must be visible to userspace,
>  * when KAISER is enabled, like the entry/exit code and data.
>  */
> #ifdef CONFIG_KAISER
> #define __PAGE_KERNEL_GLOBAL	0
> #else
> #define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
> #endif
> 
> ... and I've added your Reviewed-by tag which I assume now applies?

Ideally we replace the whole patch with the __supported_pte_mask one which
I posted as a delta patch.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2] x86/mm/kaiser: Disable global pages by default with KAISER
  2017-11-27 11:37         ` Thomas Gleixner
@ 2017-11-27 13:20           ` Ingo Molnar
  -1 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2017-11-27 13:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Rik van Riel, Dave Hansen, linux-kernel, linux-mm, bp,
	moritz.lipp, daniel.gruss, michael.schwarz, richard.fellner,
	luto, torvalds, keescook, hughd, x86


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Sun, 26 Nov 2017, Ingo Molnar wrote:
> >  * Disable global pages for anything using the default
> >  * __PAGE_KERNEL* macros.
> >  *
> >  * PGE will still be enabled and _PAGE_GLOBAL may still be used carefully
> >  * for a few selected kernel mappings which must be visible to userspace,
> >  * when KAISER is enabled, like the entry/exit code and data.
> >  */
> > #ifdef CONFIG_KAISER
> > #define __PAGE_KERNEL_GLOBAL	0
> > #else
> > #define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
> > #endif
> > 
> > ... and I've added your Reviewed-by tag which I assume now applies?
> 
> Ideally we replace the whole patch with the __supported_pte_mask one which
> I posted as a delta patch.

Yeah, so I squashed these two patches:

  09d76fc407e0: x86/mm/kaiser: Disable global pages by default with KAISER
  bac79112ee4a: x86/mm/kaiser: Simplify disabling of global pages

into a single patch, which results in the single patch below, with an updated 
changelog that reflects the cleanups. I kept Dave's authorship and credited you 
for the simplification.

Note that the squashed commit had some whitespace noise which I skipped, further 
simplifying the patch.

Is it OK this way? If yes then I'll reshuffle the tree with this variant.

Thanks,

	Ingo

====================>
From 12cffe1598c3ebdad76453c72acb8c606f22a747 Mon Sep 17 00:00:00 2001
From: Dave Hansen <dave.hansen@linux.intel.com>
Date: Wed, 22 Nov 2017 16:34:41 -0800
Subject: [PATCH] x86/mm/kaiser: Disable global pages by default with KAISER

Global pages stay in the TLB across context switches.  Since all contexts
share the same kernel mapping, these mappings are marked as global pages
so kernel entries in the TLB are not flushed out on a context switch.

But, even having these entries in the TLB opens up something that an
attacker can use, such as the double-page-fault attack:

   http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf

That means that even when KAISER switches page tables on return to user
space the global pages would stay in the TLB cache.

Disable global pages so that kernel TLB entries can be flushed before
returning to user space. This way, all accesses to kernel addresses from
userspace result in a TLB miss independent of the existence of a kernel
mapping.

Suppress global pages via the __supported_pte_mask. The shadow mappings
set _PAGE_GLOBAL for the minimal kernel mappings which are required
for entry/exit. These mappings are set up manually so the filtering does not
take place.

[ The __supported_pte_mask simplification was written by Thomas Gleixner. ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003441.63DDFC6F@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index a22c2b95e513..4a2df8babd29 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,6 +161,13 @@ struct map_range {
 
 static int page_size_mask;
 
+static void enable_global_pages(void)
+{
+#ifndef CONFIG_KAISER
+	__supported_pte_mask |= _PAGE_GLOBAL;
+#endif
+}
+
 static void __init probe_page_size_mask(void)
 {
 	/*
@@ -179,11 +186,11 @@ static void __init probe_page_size_mask(void)
 		cr4_set_bits_and_update_boot(X86_CR4_PSE);
 
 	/* Enable PGE if available */
+	__supported_pte_mask &= ~_PAGE_GLOBAL;
 	if (boot_cpu_has(X86_FEATURE_PGE)) {
 		cr4_set_bits_and_update_boot(X86_CR4_PGE);
-		__supported_pte_mask |= _PAGE_GLOBAL;
-	} else
-		__supported_pte_mask &= ~_PAGE_GLOBAL;
+		enable_global_pages();
+	}
 
 	/* Enable 1 GB linear kernel mappings if available: */
 	if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) {
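
The mask-based approach in the patch above can be illustrated with a small
userspace model. This is a sketch under stated assumptions, not kernel
code: every model_* name is hypothetical, KAISER is a runtime flag here
whereas the patch decides at build time via CONFIG_KAISER, and the
"filtered" and "manual" helpers only stand in for the generic mapping path
and the hand-built shadow mappings described in the changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 PTE bits used by this sketch. */
#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_RW	(1ULL << 1)
#define _PAGE_GLOBAL	(1ULL << 8)

/* Model of __supported_pte_mask: every bit is allowed initially. */
static uint64_t model_supported_pte_mask = ~0ULL;

/*
 * Mirrors the control flow of the patched probe_page_size_mask():
 * clear _PAGE_GLOBAL unconditionally, then re-enable it only when the
 * CPU has PGE and KAISER is off.
 */
static void model_probe_global(bool cpu_has_pge, bool kaiser_enabled)
{
	model_supported_pte_mask &= ~_PAGE_GLOBAL;
	if (cpu_has_pge && !kaiser_enabled)
		model_supported_pte_mask |= _PAGE_GLOBAL;
}

/* Generic path: requested bits are filtered through the mask. */
static uint64_t model_filtered_prot(uint64_t prot)
{
	return prot & model_supported_pte_mask;
}

/* Manual path (shadow entry/exit mappings): bits are used as given. */
static uint64_t model_manual_prot(uint64_t prot)
{
	return prot;
}
```

After model_probe_global(true, true), a request for
_PAGE_PRESENT | _PAGE_GLOBAL comes back from model_filtered_prot() without
the global bit, while model_manual_prot() keeps it. This is the split the
changelog describes between default kernel mappings and the few mappings
that must stay visible to userspace.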

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2] x86/mm/kaiser: Disable global pages by default with KAISER
  2017-11-27 13:20           ` Ingo Molnar
@ 2017-11-27 13:23             ` Thomas Gleixner
  -1 siblings, 0 replies; 149+ messages in thread
From: Thomas Gleixner @ 2017-11-27 13:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Dave Hansen, linux-kernel, linux-mm, bp,
	moritz.lipp, daniel.gruss, michael.schwarz, richard.fellner,
	luto, torvalds, keescook, hughd, x86

On Mon, 27 Nov 2017, Ingo Molnar wrote:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Sun, 26 Nov 2017, Ingo Molnar wrote:
> > >  * Disable global pages for anything using the default
> > >  * __PAGE_KERNEL* macros.
> > >  *
> > >  * PGE will still be enabled and _PAGE_GLOBAL may still be used carefully
> > >  * for a few selected kernel mappings which must be visible to userspace,
> > >  * when KAISER is enabled, like the entry/exit code and data.
> > >  */
> > > #ifdef CONFIG_KAISER
> > > #define __PAGE_KERNEL_GLOBAL	0
> > > #else
> > > #define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
> > > #endif
> > > 
> > > ... and I've added your Reviewed-by tag which I assume now applies?
> > 
> > Ideally we replace the whole patch with the __supported_pte_mask one which
> > I posted as a delta patch.
> 
> Yeah, so I squashed these two patches:
> 
>   09d76fc407e0: x86/mm/kaiser: Disable global pages by default with KAISER
>   bac79112ee4a: x86/mm/kaiser: Simplify disabling of global pages
> 
> into a single patch, which results in the single patch below, with an updated 
> changelog that reflects the cleanups. I kept Dave's authorship and credited you 
> for the simplification.
> 
> Note that the squashed commit had some whitespace noise which I skipped, further 
> simplifying the patch.
> 
> Is it OK this way? If yes then I'll reshuffle the tree with this variant.

Yes.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2] x86/mm/kaiser: Disable global pages by default with KAISER
  2017-11-27 13:23             ` Thomas Gleixner
@ 2017-11-27 13:27               ` Ingo Molnar
  -1 siblings, 0 replies; 149+ messages in thread
From: Ingo Molnar @ 2017-11-27 13:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Rik van Riel, Dave Hansen, linux-kernel, linux-mm, bp,
	moritz.lipp, daniel.gruss, michael.schwarz, richard.fellner,
	luto, torvalds, keescook, hughd, x86


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Mon, 27 Nov 2017, Ingo Molnar wrote:
> > * Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Sun, 26 Nov 2017, Ingo Molnar wrote:
> > > >  * Disable global pages for anything using the default
> > > >  * __PAGE_KERNEL* macros.
> > > >  *
> > > >  * PGE will still be enabled and _PAGE_GLOBAL may still be used carefully
> > > >  * for a few selected kernel mappings which must be visible to userspace,
> > > >  * when KAISER is enabled, like the entry/exit code and data.
> > > >  */
> > > > #ifdef CONFIG_KAISER
> > > > #define __PAGE_KERNEL_GLOBAL	0
> > > > #else
> > > > #define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
> > > > #endif
> > > > 
> > > > ... and I've added your Reviewed-by tag which I assume now applies?
> > > 
> > > Ideally we replace the whole patch with the __supported_pte_mask one which
> > > I posted as a delta patch.
> > 
> > Yeah, so I squashed these two patches:
> > 
> >   09d76fc407e0: x86/mm/kaiser: Disable global pages by default with KAISER
> >   bac79112ee4a: x86/mm/kaiser: Simplify disabling of global pages
> > 
> > into a single patch, which results in the single patch below, with an updated 
> > changelog that reflects the cleanups. I kept Dave's authorship and credited you 
> > for the simplification.
> > 
> > Note that the squashed commit had some whitespace noise which I skipped, further 
> > simplifying the patch.
> > 
> > Is it OK this way? If yes then I'll reshuffle the tree with this variant.
> 
> Yes.

Ok, new version pushed out to -tip:WIP.x86/mm.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 149+ messages in thread

end of thread, other threads:[~2017-11-27 13:27 UTC | newest]

Thread overview: 149+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-10 19:30 [PATCH 00/30] [v3] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
2017-11-10 19:30 ` Dave Hansen
2017-11-10 19:31 ` [PATCH 01/30] x86, mm: do not set _PAGE_USER for init_mm " Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 02/30] x86, tlb: Make CR4-based TLB flushes more robust Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 03/30] x86/mm: Document X86_CR4_PGE toggling behavior Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 04/30] x86, kaiser: disable global pages by default with KAISER Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-14 19:38   ` Rik van Riel
2017-11-26 14:48     ` Ingo Molnar
2017-11-26 14:48       ` Ingo Molnar
2017-11-27 11:37       ` Thomas Gleixner
2017-11-27 11:37         ` Thomas Gleixner
2017-11-27 13:20         ` [PATCH v2] x86/mm/kaiser: Disable " Ingo Molnar
2017-11-27 13:20           ` Ingo Molnar
2017-11-27 13:23           ` Thomas Gleixner
2017-11-27 13:23             ` Thomas Gleixner
2017-11-27 13:27             ` Ingo Molnar
2017-11-27 13:27               ` Ingo Molnar
2017-11-10 19:31 ` [PATCH 05/30] x86, kaiser: prepare assembly for entry/exit CR3 switching Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-20 12:17   ` Thomas Gleixner
2017-11-20 12:17     ` Thomas Gleixner
2017-11-10 19:31 ` [PATCH 06/30] x86, kaiser: introduce user-mapped per-cpu areas Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 07/30] x86, kaiser: mark per-cpu data structures required for entry/exit Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 08/30] x86, kaiser: unmap kernel from userspace page tables (core patch) Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-20 17:21   ` Thomas Gleixner
2017-11-20 17:21     ` Thomas Gleixner
2017-11-22 22:45     ` Dave Hansen
2017-11-22 22:45       ` Dave Hansen
2017-11-22 22:50     ` Dave Hansen
2017-11-22 22:50       ` Dave Hansen
2017-11-22 22:54     ` Dave Hansen
2017-11-22 22:54       ` Dave Hansen
2017-11-22 23:11     ` Dave Hansen
2017-11-22 23:11       ` Dave Hansen
2017-11-10 19:31 ` [PATCH 09/30] x86, kaiser: only populate shadow page tables for userspace Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-20 20:12   ` Thomas Gleixner
2017-11-20 20:12     ` Thomas Gleixner
2017-11-21  7:05     ` Ingo Molnar
2017-11-21  7:05       ` Ingo Molnar
2017-11-21 22:09     ` Dave Hansen
2017-11-21 22:09       ` Dave Hansen
2017-11-22  3:44       ` Andy Lutomirski
2017-11-22  3:44         ` Andy Lutomirski
2017-11-22 23:30         ` Dave Hansen
2017-11-22 23:30           ` Dave Hansen
2017-11-10 19:31 ` [PATCH 10/30] x86, kaiser: allow NX poison to be set in p4d/pgd Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 11/30] x86, kaiser: make sure static PGDs are 8k in size Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 12/30] x86, kaiser: map GDT into user page tables Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-20 20:22   ` Thomas Gleixner
2017-11-20 20:22     ` Thomas Gleixner
2017-11-20 20:46     ` Andy Lutomirski
2017-11-20 20:46       ` Andy Lutomirski
2017-11-20 20:55       ` Thomas Gleixner
2017-11-20 20:55         ` Thomas Gleixner
2017-11-21 21:19       ` Dave Hansen
2017-11-21 21:19         ` Dave Hansen
2017-11-21 22:46         ` Andy Lutomirski
2017-11-21 22:46           ` Andy Lutomirski
2017-11-21 23:17           ` Dave Hansen
2017-11-21 23:17             ` Dave Hansen
2017-11-21 23:32             ` Andy Lutomirski
2017-11-21 23:32               ` Andy Lutomirski
2017-11-21 23:42               ` Dave Hansen
2017-11-21 23:42                 ` Dave Hansen
2017-11-22  0:17                 ` Andy Lutomirski
2017-11-22  0:17                   ` Andy Lutomirski
2017-11-22  0:37                   ` Dave Hansen
2017-11-22  0:37                     ` Dave Hansen
2017-11-21 22:12     ` Dave Hansen
2017-11-21 22:12       ` Dave Hansen
2017-11-10 19:31 ` [PATCH 13/30] x86, kaiser: map dynamically-allocated LDTs Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 14/30] x86, kaiser: map espfix structures Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 15/30] x86, kaiser: map entry stack variables Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 16/30] x86, kaiser: map trace interrupt entry Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 17/30] x86, kaiser: map debug IDT tables Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-20 20:40   ` Thomas Gleixner
2017-11-20 20:40     ` Thomas Gleixner
2017-11-21 22:16     ` Dave Hansen
2017-11-21 22:16       ` Dave Hansen
2017-11-20 20:44   ` Andy Lutomirski
2017-11-20 20:44     ` Andy Lutomirski
2017-11-20 20:54     ` Thomas Gleixner
2017-11-20 20:54       ` Thomas Gleixner
2017-11-10 19:31 ` [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-14 18:20   ` Peter Zijlstra
2017-11-14 18:20     ` Peter Zijlstra
2017-11-14 18:28     ` Dave Hansen
2017-11-14 18:28       ` Dave Hansen
2017-11-14 19:10       ` Hugh Dickins
2017-11-14 19:10         ` Hugh Dickins
2017-11-14 19:24         ` Andy Lutomirski
2017-11-14 19:24           ` Andy Lutomirski
2017-11-15  9:41         ` Peter Zijlstra
2017-11-15  9:41           ` Peter Zijlstra
2017-11-10 19:31 ` [PATCH 19/30] x86, mm: Move CR3 construction functions Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 20/30] x86, mm: remove hard-coded ASID limit checks Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-20 20:47   ` Thomas Gleixner
2017-11-20 20:47     ` Thomas Gleixner
2017-11-10 19:31 ` [PATCH 21/30] x86, mm: put mmu-to-h/w ASID translation in one place Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 22:03   ` Andy Lutomirski
2017-11-10 22:03     ` Andy Lutomirski
2017-11-10 22:09     ` Dave Hansen
2017-11-10 22:09       ` Dave Hansen
2017-11-10 22:10       ` Andy Lutomirski
2017-11-10 22:10         ` Andy Lutomirski
2017-11-10 19:31 ` [PATCH 22/30] x86, pcid, kaiser: allow flushing for future ASID switches Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-16 19:19   ` Andrea Arcangeli
2017-11-16 19:19     ` Andrea Arcangeli
2017-11-16 19:25     ` Dave Hansen
2017-11-16 19:25       ` Dave Hansen
2017-11-10 19:31 ` [PATCH 24/30] x86, kaiser: disable native VSYSCALL Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 25/30] x86, kaiser: add debugfs file to turn KAISER on/off at runtime Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 26/30] x86, kaiser: add a function to check for KAISER being enabled Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 27/30] x86, kaiser: un-poison PGDs at runtime Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:31 ` [PATCH 28/30] x86, kaiser: allow KAISER to be enabled/disabled " Dave Hansen
2017-11-10 19:31   ` Dave Hansen
2017-11-10 19:32 ` [PATCH 29/30] x86, kaiser: add Kconfig Dave Hansen
2017-11-10 19:32   ` Dave Hansen
2017-11-10 19:32 ` [PATCH 30/30] x86, kaiser, xen: Dynamically disable KAISER when running under Xen PV Dave Hansen
2017-11-10 19:32   ` Dave Hansen
2017-11-20 16:02 ` [PATCH 00/30] [v3] KAISER: unmap most of the kernel from userspace page tables Juerg Haefliger
2017-11-20 16:02   ` Juerg Haefliger