linux-mm.kvack.org archive mirror
* [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack
@ 2020-02-05 18:19 Yu-cheng Yu
  2020-02-05 18:19 ` [RFC PATCH v9 01/27] Documentation/x86: Add CET description Yu-cheng Yu
                   ` (27 more replies)
  0 siblings, 28 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

Control-flow Enforcement Technology (CET) is a new Intel processor feature
that blocks return/jump-oriented programming attacks.  Details can be found
in the "Intel 64 and IA-32 Architectures Software Developer's Manual" [1].

This series depends on the XSAVES supervisor state series that was split
out and submitted earlier [2].

Changes from v8 [4]:

- Simplify signal handling code.
- Add guard pages around a Shadow Stack.
- Replace ELF parser with Dave Martin's patch [3].

The goal of this posting is to seek additional comments.

[1] Intel 64 and IA-32 Architectures Software Developer's Manual:

    https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4

[2] XSAVES supervisor states patches:
    https://lkml.kernel.org/r/20200121201843.12047-1-yu-cheng.yu@intel.com/

[3] Dave Martin's ELF program property parsing patch:
    https://lkml.kernel.org/r/20200122212144.6409-3-broonie@kernel.org/

[4] CET patches v8:

    https://lkml.kernel.org/r/20190813205225.12032-1-yu-cheng.yu@intel.com/
    https://lkml.kernel.org/r/20190813205359.12196-1-yu-cheng.yu@intel.com/

Dave Martin (1):
  ELF: Add ELF program property parsing support

Yu-cheng Yu (26):
  Documentation/x86: Add CET description
  x86/cpufeatures: Add CET CPU feature flags for Control-flow
    Enforcement Technology (CET)
  x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states
  x86/cet: Add control-protection fault handler
  x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack
    protection
  mm: Introduce VM_SHSTK for Shadow Stack memory
  Add guard pages around a Shadow Stack.
  x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW
  x86/mm: Introduce _PAGE_DIRTY_SW
  x86/mm: Update pte_modify, pmd_modify, and _PAGE_CHG_MASK for
    _PAGE_DIRTY_SW
  drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for
    _PAGE_DIRTY_SW
  x86/mm: Shadow Stack page fault error checking
  mm: Handle Shadow Stack page fault
  mm: Handle THP/HugeTLB Shadow Stack page fault
  mm: Update can_follow_write_pte() for Shadow Stack
  x86/cet/shstk: User-mode Shadow Stack support
  x86/cet/shstk: Introduce WRUSS instruction
  x86/cet/shstk: Handle signals for Shadow Stack
  ELF: UAPI and Kconfig additions for ELF program properties
  binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND
  ELF: Introduce arch_setup_elf_property()
  x86/cet/shstk: ELF header parsing for Shadow Stack
  x86/cet/shstk: Handle thread Shadow Stack
  mm/mmap: Add Shadow Stack pages to memory accounting
  x86/cet/shstk: Add arch_prctl functions for Shadow Stack

 .../admin-guide/kernel-parameters.txt         |   6 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/intel_cet.rst               | 294 +++++++++++++++
 arch/x86/Kconfig                              |  24 ++
 arch/x86/Makefile                             |   7 +
 arch/x86/entry/entry_64.S                     |   2 +-
 arch/x86/ia32/ia32_signal.c                   |  17 +
 arch/x86/include/asm/cet.h                    |  44 +++
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/elf.h                    |  13 +
 arch/x86/include/asm/fpu/internal.h           |   2 +
 arch/x86/include/asm/fpu/types.h              |  22 ++
 arch/x86/include/asm/fpu/xstate.h             |   5 +-
 arch/x86/include/asm/mmu_context.h            |   3 +
 arch/x86/include/asm/msr-index.h              |  18 +
 arch/x86/include/asm/pgtable.h                | 197 +++++++++-
 arch/x86/include/asm/pgtable_types.h          |  50 ++-
 arch/x86/include/asm/processor.h              |   5 +
 arch/x86/include/asm/special_insns.h          |  32 ++
 arch/x86/include/asm/traps.h                  |   5 +
 arch/x86/include/uapi/asm/prctl.h             |   5 +
 arch/x86/include/uapi/asm/processor-flags.h   |   2 +
 arch/x86/include/uapi/asm/sigcontext.h        |   9 +
 arch/x86/kernel/Makefile                      |   2 +
 arch/x86/kernel/cet.c                         | 344 ++++++++++++++++++
 arch/x86/kernel/cet_prctl.c                   |  84 +++++
 arch/x86/kernel/cpu/common.c                  |  25 ++
 arch/x86/kernel/cpu/cpuid-deps.c              |   2 +
 arch/x86/kernel/fpu/signal.c                  |  89 +++++
 arch/x86/kernel/fpu/xstate.c                  |  25 +-
 arch/x86/kernel/idt.c                         |   4 +
 arch/x86/kernel/process.c                     |  12 +-
 arch/x86/kernel/process_64.c                  |  31 ++
 arch/x86/kernel/relocate_kernel_64.S          |   2 +-
 arch/x86/kernel/signal.c                      |  10 +
 arch/x86/kernel/signal_compat.c               |   2 +-
 arch/x86/kernel/traps.c                       |  59 +++
 arch/x86/kvm/vmx/vmx.c                        |   2 +-
 arch/x86/mm/fault.c                           |  18 +
 arch/x86/mm/mmap.c                            |   2 +
 arch/x86/mm/pgtable.c                         |  41 +++
 drivers/gpu/drm/i915/gvt/gtt.c                |   2 +-
 fs/Kconfig.binfmt                             |   3 +
 fs/binfmt_elf.c                               | 131 +++++++
 fs/compat_binfmt_elf.c                        |   4 +
 fs/proc/task_mmu.c                            |   3 +
 include/asm-generic/pgtable.h                 |  40 ++
 include/linux/elf.h                           |  33 ++
 include/linux/mm.h                            |  28 +-
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/linux/elf.h                      |  12 +
 mm/gup.c                                      |   8 +-
 mm/huge_memory.c                              |  12 +-
 mm/memory.c                                   |   7 +-
 mm/mmap.c                                     |   5 +
 .../arch/x86/include/asm/disabled-features.h  |   8 +-
 57 files changed, 1779 insertions(+), 47 deletions(-)
 create mode 100644 Documentation/x86/intel_cet.rst
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/kernel/cet.c
 create mode 100644 arch/x86/kernel/cet_prctl.c

-- 
2.21.0




* [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-06  0:16   ` Randy Dunlap
                     ` (2 more replies)
  2020-02-05 18:19 ` [RFC PATCH v9 02/27] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) Yu-cheng Yu
                   ` (26 subsequent siblings)
  27 siblings, 3 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

Explain no_cet_shstk/no_cet_ibt kernel parameters, and introduce a new
document on Control-flow Enforcement Technology (CET).

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
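A minimal sketch of the restore-token rules from section [2] of the
intel_cet.rst added below, modelled in plain C.  The helper and macro
names are hypothetical; only the bit positions (Lg = bit 0, Pv = bit 1),
the u64 alignment rule, and the rule that a token records the address
immediately above itself come from the document.  Real hardware performs
further checks that are not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mode bits of a restore token, per notes 4 and 5 (names are illustrative). */
#define TOKEN_LG        0x1ULL  /* bit 0: token belongs to a 64-bit SHSTK */
#define TOKEN_PV        0x2ULL  /* bit 1: valid only for the next SAVEPREVSSP */
#define TOKEN_MODE_MASK (TOKEN_LG | TOKEN_PV)

/*
 * Build the token that would be stored at 'token_addr' on a 64-bit SHSTK.
 * Per note 3, it records the address immediately above the token itself.
 */
static uint64_t make_restore_token(uint64_t token_addr)
{
	return (token_addr + 8) | TOKEN_LG;
}

/* Check whether 'token', found at 'token_addr', is usable by RSTORSSP. */
static bool restore_token_ok(uint64_t token_addr, uint64_t token)
{
	if (token_addr & 7)		/* note 2: must align to u64 */
		return false;
	if (token & TOKEN_PV)		/* note 5: Pv tokens are invalid here */
		return false;
	if (!(token & TOKEN_LG))	/* this sketch handles 64-bit SHSTK only */
		return false;
	/* note 3: recorded pointer must be immediately above the token */
	return (token & ~TOKEN_MODE_MASK) == token_addr + 8;
}
```

For a token slot at 0x7000fff8, make_restore_token() yields 0x70010001,
i.e. the address 0x70010000 with the Lg bit set.
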
 .../admin-guide/kernel-parameters.txt         |   6 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/intel_cet.rst               | 294 ++++++++++++++++++
 3 files changed, 301 insertions(+)
 create mode 100644 Documentation/x86/intel_cet.rst

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ade4e6ec23e0..8b69ebf0baed 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3001,6 +3001,12 @@
 			noexec=on: enable non-executable mappings (default)
 			noexec=off: disable non-executable mappings
 
+	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
+			applications
+
+	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
+			applications
+
 	nosmap		[X86,PPC]
 			Disable SMAP (Supervisor Mode Access Prevention)
 			even if it is supported by processor.
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index a8de2fbc1caa..81f919801765 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -19,6 +19,7 @@ x86-specific Documentation
    tlb
    mtrr
    pat
+   intel_cet
    intel_mpx
    intel-iommu
    intel_txt
diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
new file mode 100644
index 000000000000..71e2462fea5c
--- /dev/null
+++ b/Documentation/x86/intel_cet.rst
@@ -0,0 +1,294 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Control-flow Enforcement Technology (CET)
+=========================================
+
+[1] Overview
+============
+
+Control-flow Enforcement Technology (CET) provides protection against
+return/jump-oriented programming (ROP/JOP) attacks.  It can be set up
+to protect both applications and the kernel.  In the first phase, only
+user-mode protection is implemented in the 64-bit kernel; 32-bit
+applications are supported in compatibility mode.
+
+CET introduces Shadow Stack (SHSTK) and Indirect Branch Tracking
+(IBT).  SHSTK is a secondary stack allocated from memory and cannot
+be directly modified by applications.  When executing a CALL, the
+processor pushes a copy of the return address to SHSTK.  Upon
+function return, the processor pops the SHSTK copy and compares it
+to the one from the program stack.  If the two copies differ, the
+processor raises a control-protection fault.  IBT verifies that indirect
+CALL/JMP targets are intended, as marked by the compiler with 'ENDBR'
+opcodes (see CET instructions below).
+
+There are two kernel configuration options:
+
+    X86_INTEL_SHADOW_STACK_USER, and
+    X86_INTEL_BRANCH_TRACKING_USER.
+
+To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
+are required.  To build a CET-enabled application, GLIBC v2.28 or
+later is also required.
+
+There are two command-line options for disabling CET features::
+
+    no_cet_shstk - disables SHSTK, and
+    no_cet_ibt   - disables IBT.
+
+At run time, /proc/cpuinfo shows the availability of SHSTK and IBT.
+
+[2] CET assembly instructions
+=============================
+
+RDSSP %r
+    Read the SHSTK pointer into %r.
+
+INCSSP %r
+    Unwind (increment) the SHSTK pointer by 0 to 255 steps, as indicated
+    in the operand register.  The GLIBC longjmp uses INCSSP to unwind
+    the SHSTK until it matches the program stack.  When it is
+    necessary to unwind beyond 255 steps, longjmp divides and repeats
+    the process.
+
+RSTORSSP (%r)
+    Switch to the SHSTK indicated in the 'restore token' pointed to by
+    the operand register and replace the 'restore token' with a new
+    token to be saved (with SAVEPREVSSP) for the outgoing SHSTK.
+
+::
+
+                                Before RSTORSSP
+
+               Incoming SHSTK                 Current/Outgoing SHSTK
+
+          |----------------------|           |----------------------|
+   addr=x |                      |     ssp-> |                      |
+          |----------------------|           |----------------------|
+   (%r)-> | rstor_token=(x|Lg)   |  addr=y-8 |                      |
+          |----------------------|           |----------------------|
+
+                                After RSTORSSP
+
+          |----------------------|           |----------------------|
+   addr=x |                      |           |                      |
+          |----------------------|           |----------------------|
+    ssp-> | rstor_token=(y|Pv|Lg)|  addr=y-8 |                      |
+          |----------------------|           |----------------------|
+
+    note:
+        1. Only valid addresses and restore tokens can be on the
+           user-mode SHSTK.
+        2. A token is always of type u64 and must align to u64.
+        3. The incoming SHSTK pointer in a rstor_token must point to
+           immediately above the token.
+        4. 'Lg' is bit[0] of a rstor_token indicating a 64-bit SHSTK.
+        5. 'Pv' is bit[1] of a rstor_token indicating the token is to
+           be used only for the next SAVEPREVSSP and invalid for
+           RSTORSSP.
+
+SAVEPREVSSP
+    Pop the SHSTK 'restore token' pointed to by the current SHSTK pointer
+    and store it at (previous SHSTK pointer - 8).
+
+::
+
+                               After SAVEPREVSSP
+
+          |----------------------|           |----------------------|
+    ssp-> |                      |           |                      |
+          |----------------------|           |----------------------|
+ addr=x-8 | rstor_token=(y|Pv|Lg)|  addr=y-8 | rstor_token=(y|Lg)   |
+          |----------------------|           |----------------------|
+
+WRUSS %r0, (%r1)
+    Write the value in %r0 to the SHSTK address pointed to by (%r1).
+    This is a kernel-mode only instruction.
+
+ENDBR and NOTRACK prefix
+    When IBT is enabled, an indirect CALL/JMP must either::
+
+        have a NOTRACK prefix,
+        reach an ENDBR, or
+        reach an address within a legacy code page;
+
+    or it results in a control-protection fault.
+
+    When the target address is derived from information that cannot
+    be modified, the compiler uses the NOTRACK prefix.  In other
+    cases, the compiler inserts an ENDBR at the target address.
+
+    A legacy code page is designated in the legacy code bitmap, which
+    is explained below in section [8].
+
+[3] Application Enabling
+========================
+
+An application's CET capability is marked in its ELF header and can
+be verified from the following command output, in the
+NT_GNU_PROPERTY_TYPE_0 field:
+
+    readelf -n <application>
+
+If an application supports CET and is statically linked, it will run
+with CET protection.  If the application needs any shared libraries,
+the loader checks all dependencies and enables CET only when all
+requirements are met.
+
+[4] Legacy Libraries
+====================
+
+GLIBC provides a few tunables for backward compatibility.
+
+GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
+    Turn off SHSTK/IBT for the current shell.
+
+GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
+    This controls how dlopen() handles SHSTK legacy libraries::
+
+        on         - continue with SHSTK enabled;
+        permissive - continue with SHSTK off.
+
+[5] CET system calls
+====================
+
+The following arch_prctl() system calls are added for CET:
+
+arch_prctl(ARCH_X86_CET_STATUS, unsigned long *addr)
+    Return CET feature status.
+
+    The parameter 'addr' is a pointer to a user buffer.
+    On returning to the caller, the kernel fills the following
+    information::
+
+        *addr       = SHSTK/IBT status
+        *(addr + 1) = SHSTK base address
+        *(addr + 2) = SHSTK size
+
+arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
+    Disable SHSTK and/or IBT specified in 'features'.  Return -EPERM
+    if CET is locked.
+
+arch_prctl(ARCH_X86_CET_LOCK)
+    Lock in CET feature.
+
+arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, unsigned long *addr)
+    Allocate a new SHSTK and put a restore token at top.
+
+    The parameter 'addr' is a pointer to a user buffer and indicates
+    the desired SHSTK size to allocate.  On returning to the caller,
+    the kernel fills '*addr' with the base address of the new SHSTK.
+
+arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE, unsigned long *addr)
+    Mark an address range as IBT legacy code.
+
+    The parameter 'addr' is a pointer to a user buffer that has the
+    following information::
+
+        *addr       = starting linear address of the legacy code
+        *(addr + 1) = size of the legacy code
+        *(addr + 2) = set (1); clear (0)
+
+Note:
+  There is no CET-enabling arch_prctl function.  By design, CET is
+  enabled automatically if the binary and the system can support it.
+
+  The parameters passed are always unsigned 64-bit.  When an IA32
+  application passes pointers, it should use only the lower 32 bits.
+
+[6] The implementation of the SHSTK
+===================================
+
+SHSTK size
+----------
+
+A task's SHSTK is allocated from memory to a fixed size of
+RLIMIT_STACK.  A compat-mode thread's SHSTK size is 1/4 of
+RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
+share a 32-bit address space.
+
+Signal
+------
+
+The main program and its signal handlers use the same SHSTK.  Because
+the SHSTK stores only return addresses, a large SHSTK will cover the
+condition that both the program stack and the sigaltstack run out.
+
+The kernel creates a restore token at the SHSTK restoring address and
+verifies that token when restoring from the signal handler.
+
+IBT for signal delivering and sigreturn is the same as the main
+program's setup; except for WAIT_ENDBR status, which can be read from
+MSR_IA32_U_CET.  In general, a task is in WAIT_ENDBR after an
+indirect CALL/JMP and before the next instruction starts.
+
+A task's WAIT_ENDBR is reset for its signal handler, but preserved on
+the task's stack; and then restored from sigreturn.
+
+Fork
+----
+
+The SHSTK's vma has the VM_SHSTK flag set; its PTEs are required to be
+read-only and dirty.  Unless a SHSTK PTE is present, RO, and dirty,
+a SHSTK access triggers a page fault with an additional SHSTK bit set
+in the page fault error code.
+
+When a task forks a child, its SHSTK PTEs are copied and both the
+parent's and the child's SHSTK PTEs are cleared of the dirty bit.
+Upon the next SHSTK access, the resulting SHSTK page fault is handled
+by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new SHSTK for
+the new thread.
+
+Setjmp/Longjmp
+--------------
+
+Longjmp unwinds SHSTK until it matches the program stack.
+
+Ucontext
+--------
+
+In GLIBC, getcontext/setcontext are implemented in a similar way to
+setjmp/longjmp.
+
+When makecontext creates a new ucontext, a new SHSTK is allocated for
+that context with the ARCH_X86_CET_ALLOC_SHSTK syscall.  The kernel
+creates a restore token at the top of the new SHSTK and the user-mode
+code switches to the new SHSTK with the RSTORSSP instruction.
+
+[7] The management of read-only & dirty PTEs for SHSTK
+======================================================
+
+A RO and dirty PTE exists in the following cases:
+
+(a) A page is modified and then shared with a fork()'ed child;
+(b) A R/O page that has been COW'ed;
+(c) A SHSTK page.
+
+The processor only checks the dirty bit for (c).  To prevent the use
+of non-SHSTK memory as SHSTK, we use a spare bit of the 64-bit PTE as
+DIRTY_SW for (a) and (b) above.  This results in the following PTE
+settings::
+
+    Modified PTE:             (R/W + DIRTY_HW)
+    Modified and shared PTE:  (R/O + DIRTY_SW)
+    R/O PTE, COW'ed:          (R/O + DIRTY_SW)
+    SHSTK PTE:                (R/O + DIRTY_HW)
+    SHSTK PTE, COW'ed:        (R/O + DIRTY_HW)
+    SHSTK PTE, shared:        (R/O + DIRTY_SW)
+
+Note that DIRTY_SW is only used in R/O PTEs but not R/W PTEs.
+
+[8] The implementation of IBT legacy bitmap
+===========================================
+
+When IBT is active, a non-IBT-capable legacy library can be executed
+if its address ranges are specified in the legacy code bitmap.  The
+bitmap covers the whole user-space address range, which is TASK_SIZE_MAX
+for 64-bit and TASK_SIZE for IA32, and each of its bits indicates a 4-KB
+legacy code page.  It is read-only from an application, and set up by
+the kernel as a special mapping the first time the application
+calls arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE).  The application
+manages the bitmap through the arch_prctl.
-- 
2.21.0




* [RFC PATCH v9 02/27] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
  2020-02-05 18:19 ` [RFC PATCH v9 01/27] Documentation/x86: Add CET description Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:02   ` Kees Cook
  2020-02-05 18:19 ` [RFC PATCH v9 03/27] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states Yu-cheng Yu
                   ` (25 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu, Borislav Petkov

Add CPU feature flags for Control-flow Enforcement Technology (CET).

CPUID.(EAX=7,ECX=0):ECX[bit 7] Shadow stack
CPUID.(EAX=7,ECX=0):EDX[bit 20] Indirect Branch Tracking

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
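For reference, the two CPUID bits named above can also be probed from
user space.  The following is a minimal sketch, assuming a GCC- or
Clang-style <cpuid.h>; the struct and function names are hypothetical.
It reports hardware support only and says nothing about whether the
kernel has enabled CET.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define CPUID7_ECX_SHSTK (1u << 7)	/* CPUID.(EAX=7,ECX=0):ECX[bit 7]  */
#define CPUID7_EDX_IBT   (1u << 20)	/* CPUID.(EAX=7,ECX=0):EDX[bit 20] */

struct cet_cpuid {
	bool shstk;	/* Shadow Stack supported */
	bool ibt;	/* Indirect Branch Tracking supported */
};

/* Decode the CET bits from raw leaf 7, subleaf 0 register values. */
static struct cet_cpuid cet_from_leaf7(uint32_t ecx, uint32_t edx)
{
	struct cet_cpuid c = {
		.shstk = (ecx & CPUID7_ECX_SHSTK) != 0,
		.ibt   = (edx & CPUID7_EDX_IBT) != 0,
	};
	return c;
}

/* Query the running CPU; returns false if leaf 7 cannot be read. */
static bool cet_probe(struct cet_cpuid *out)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	*out = cet_from_leaf7(ecx, edx);
	return true;
#else
	(void)out;
	return false;
#endif
}
```
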
 arch/x86/include/asm/cpufeatures.h | 2 ++
 arch/x86/kernel/cpu/cpuid-deps.c   | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index e9b62498fe75..a2c6b1b5c026 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -336,6 +336,7 @@
 #define X86_FEATURE_OSPKE		(16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_WAITPKG		(16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
 #define X86_FEATURE_AVX512_VBMI2	(16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK		(16*32+ 7) /* Shadow Stack */
 #define X86_FEATURE_GFNI		(16*32+ 8) /* Galois Field New Instructions */
 #define X86_FEATURE_VAES		(16*32+ 9) /* Vector AES */
 #define X86_FEATURE_VPCLMULQDQ		(16*32+10) /* Carry-Less Multiplication Double Quadword */
@@ -361,6 +362,7 @@
 #define X86_FEATURE_MD_CLEAR		(18*32+10) /* VERW clears CPU buffers */
 #define X86_FEATURE_TSX_FORCE_ABORT	(18*32+13) /* "" TSX_FORCE_ABORT */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
+#define X86_FEATURE_IBT			(18*32+20) /* Indirect Branch Tracking */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 3cbe24ca80ab..fec83cc74b9e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -69,6 +69,8 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_CQM_MBM_TOTAL,		X86_FEATURE_CQM_LLC   },
 	{ X86_FEATURE_CQM_MBM_LOCAL,		X86_FEATURE_CQM_LLC   },
 	{ X86_FEATURE_AVX512_BF16,		X86_FEATURE_AVX512VL  },
+	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
+	{ X86_FEATURE_IBT,			X86_FEATURE_XSAVES    },
 	{}
 };
 
-- 
2.21.0




* [RFC PATCH v9 03/27] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
  2020-02-05 18:19 ` [RFC PATCH v9 01/27] Documentation/x86: Add CET description Yu-cheng Yu
  2020-02-05 18:19 ` [RFC PATCH v9 02/27] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:04   ` Kees Cook
  2020-02-05 18:19 ` [RFC PATCH v9 04/27] x86/cet: Add control-protection fault handler Yu-cheng Yu
                   ` (24 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

Control-flow Enforcement Technology (CET) adds five MSRs.  Introduce them
and their XSAVES supervisor states:

    MSR_IA32_U_CET (user-mode CET settings),
    MSR_IA32_PL3_SSP (user-mode Shadow Stack pointer),
    MSR_IA32_PL0_SSP (kernel-mode Shadow Stack pointer),
    MSR_IA32_PL1_SSP (Privilege Level 1 Shadow Stack pointer),
    MSR_IA32_PL2_SSP (Privilege Level 2 Shadow Stack pointer).

v6:
- Remove __packed from struct cet_user_state, struct cet_kernel_state.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
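The MSR_IA32_U_CET / MSR_IA32_S_CET bit layout defined below can be
illustrated with a small decoder.  This is a sketch only: the macro
values mirror the definitions in this patch (with WAIT_ENDBR taken as
bit 11, 0x800), while the struct and function names are hypothetical.
It decodes a value that the caller has already obtained; it does not
read the MSR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit definitions mirroring MSR_IA32_CET_* in msr-index.h of this patch. */
#define CET_SHSTK_EN     0x0000000000000001ULL
#define CET_WRSS_EN      0x0000000000000002ULL
#define CET_ENDBR_EN     0x0000000000000004ULL
#define CET_LEG_IW_EN    0x0000000000000008ULL
#define CET_NO_TRACK_EN  0x0000000000000010ULL
#define CET_WAIT_ENDBR   0x0000000000000800ULL
#define CET_BITMAP_MASK  0xfffffffffffff000ULL

struct cet_msr_fields {
	bool shstk_en;		/* Shadow Stack enabled */
	bool wrss_en;		/* WRSS instruction enabled */
	bool endbr_en;		/* Indirect Branch Tracking enabled */
	bool leg_iw_en;		/* legacy code bitmap in use */
	bool no_track_en;	/* NOTRACK prefix honoured */
	bool wait_endbr;	/* tracker is waiting for an ENDBR */
	uint64_t bitmap_base;	/* page-aligned legacy bitmap address */
};

/* Split a raw CET MSR value into its documented fields. */
static struct cet_msr_fields cet_msr_decode(uint64_t val)
{
	struct cet_msr_fields f = {
		.shstk_en    = (val & CET_SHSTK_EN) != 0,
		.wrss_en     = (val & CET_WRSS_EN) != 0,
		.endbr_en    = (val & CET_ENDBR_EN) != 0,
		.leg_iw_en   = (val & CET_LEG_IW_EN) != 0,
		.no_track_en = (val & CET_NO_TRACK_EN) != 0,
		.wait_endbr  = (val & CET_WAIT_ENDBR) != 0,
		.bitmap_base = val & CET_BITMAP_MASK,
	};
	return f;
}
```
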
 arch/x86/include/asm/fpu/types.h            | 22 ++++++++++++++++++
 arch/x86/include/asm/fpu/xstate.h           |  5 +++--
 arch/x86/include/asm/msr-index.h            | 18 +++++++++++++++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/fpu/xstate.c                | 25 +++++++++++++++++++--
 5 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f098f6cab94b..d7ef4d9c7ad5 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -114,6 +114,9 @@ enum xfeature {
 	XFEATURE_Hi16_ZMM,
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 	XFEATURE_PKRU,
+	XFEATURE_RESERVED,
+	XFEATURE_CET_USER,
+	XFEATURE_CET_KERNEL,
 
 	XFEATURE_MAX,
 };
@@ -128,6 +131,8 @@ enum xfeature {
 #define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
+#define XFEATURE_MASK_CET_USER		(1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
@@ -229,6 +234,23 @@ struct pkru_state {
 	u32				pad;
 } __packed;
 
+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+	u64 user_cet;			/* user control-flow settings */
+	u64 user_ssp;			/* user shadow stack pointer */
+};
+
+/*
+ * State component 12 is Control-flow Enforcement kernel states
+ */
+struct cet_kernel_state {
+	u64 kernel_ssp;			/* kernel shadow stack */
+	u64 pl1_ssp;			/* privilege level 1 shadow stack */
+	u64 pl2_ssp;			/* privilege level 2 shadow stack */
+};
+
 struct xstate_header {
 	u64				xfeatures;
 	u64				xcomp_bv;
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 9ebfdd543576..952d2515dae4 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -33,13 +33,14 @@
 				       XFEATURE_MASK_BNDCSR)
 
 /* All currently supported supervisor features */
-#define SUPPORTED_XFEATURES_MASK_SUPERVISOR (0)
+#define SUPPORTED_XFEATURES_MASK_SUPERVISOR (XFEATURE_MASK_CET_USER)
 
 /*
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
-#define UNSUPPORTED_XFEATURES_MASK_SUPERVISOR (XFEATURE_MASK_PT)
+#define UNSUPPORTED_XFEATURES_MASK_SUPERVISOR (XFEATURE_MASK_PT | \
+					       XFEATURE_MASK_CET_KERNEL)
 
 /* All supervisor states including supported and unsupported states. */
 #define ALL_XFEATURES_MASK_SUPERVISOR (SUPPORTED_XFEATURES_MASK_SUPERVISOR | \
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 084e98da04a7..114e77f5bb6b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -872,4 +872,22 @@
 #define MSR_VM_IGNNE                    0xc0010115
 #define MSR_VM_HSAVE_PA                 0xc0010117
 
+/* Control-flow Enforcement Technology MSRs */
+#define MSR_IA32_U_CET		0x6a0 /* user mode cet setting */
+#define MSR_IA32_S_CET		0x6a2 /* kernel mode cet setting */
+#define MSR_IA32_PL0_SSP	0x6a4 /* kernel shstk pointer */
+#define MSR_IA32_PL1_SSP	0x6a5 /* ring-1 shstk pointer */
+#define MSR_IA32_PL2_SSP	0x6a6 /* ring-2 shstk pointer */
+#define MSR_IA32_PL3_SSP	0x6a7 /* user shstk pointer */
+#define MSR_IA32_INT_SSP_TAB	0x6a8 /* exception shstk table */
+
+/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
+#define MSR_IA32_CET_SHSTK_EN		0x0000000000000001ULL
+#define MSR_IA32_CET_WRSS_EN		0x0000000000000002ULL
+#define MSR_IA32_CET_ENDBR_EN		0x0000000000000004ULL
+#define MSR_IA32_CET_LEG_IW_EN		0x0000000000000008ULL
+#define MSR_IA32_CET_NO_TRACK_EN	0x0000000000000010ULL
+#define MSR_IA32_CET_WAIT_ENDBR	0x0000000000000800ULL
+#define MSR_IA32_CET_BITMAP_MASK	0xfffffffffffff000ULL
+
 #endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..a8df907e8017 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_CET_BIT		23 /* enable Control-flow Enforcement */
+#define X86_CR4_CET		_BITUL(X86_CR4_CET_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 04f7c6b8dbbc..ec08a2b6feca 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -38,6 +38,9 @@ static const char *xfeature_names[] =
 	"Processor Trace (unused)"	,
 	"Protection Keys User registers",
 	"unknown xstate feature"	,
+	"Control-flow User registers"	,
+	"Control-flow Kernel registers"	,
+	"unknown xstate feature"	,
 };
 
 static short xsave_cpuid_features[] __initdata = {
@@ -51,6 +54,9 @@ static short xsave_cpuid_features[] __initdata = {
 	X86_FEATURE_AVX512F,
 	X86_FEATURE_INTEL_PT,
 	X86_FEATURE_PKU,
+	-1,		   /* Unused */
+	X86_FEATURE_SHSTK, /* XFEATURE_CET_USER */
+	X86_FEATURE_SHSTK, /* XFEATURE_CET_KERNEL */
 };
 
 /*
@@ -316,6 +322,8 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
+	print_xstate_feature(XFEATURE_MASK_CET_USER);
+	print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
 }
 
 /*
@@ -563,6 +571,8 @@ static void check_xstate_against_struct(int nr)
 	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
+	XCHECK_SZ(sz, nr, XFEATURE_CET_USER,   struct cet_user_state);
+	XCHECK_SZ(sz, nr, XFEATURE_CET_KERNEL, struct cet_kernel_state);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
@@ -770,8 +780,19 @@ void __init fpu__init_system_xstate(void)
 	 * Clear XSAVE features that are disabled in the normal CPUID.
 	 */
 	for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
-		if (!boot_cpu_has(xsave_cpuid_features[i]))
-			xfeatures_mask_all &= ~BIT_ULL(i);
+		if (xsave_cpuid_features[i] == X86_FEATURE_SHSTK) {
+			/*
+			 * X86_FEATURE_SHSTK and X86_FEATURE_IBT share
+			 * same states, but can be enabled separately.
+			 */
+			if (!boot_cpu_has(X86_FEATURE_SHSTK) &&
+			    !boot_cpu_has(X86_FEATURE_IBT))
+				xfeatures_mask_all &= ~BIT_ULL(i);
+		} else {
+			if ((xsave_cpuid_features[i] == -1) ||
+			    !boot_cpu_has(xsave_cpuid_features[i]))
+				xfeatures_mask_all &= ~BIT_ULL(i);
+		}
 	}
 
 	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
-- 
2.21.0




* [RFC PATCH v9 04/27] x86/cet: Add control-protection fault handler
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (2 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 03/27] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:06   ` Kees Cook
  2020-02-26 17:10   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection Yu-cheng Yu
                   ` (23 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the Shadow Stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works in a similar way to the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.
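
As an illustration only (not part of the patch), the bounds-checked lookup
the handler performs before printing the fault can be modeled in user space
as below.  The helper name cp_err_name() is hypothetical; the strings mirror
the control_protection_err[] table added by this patch.

```c
#include <stddef.h>

/* Same strings as control_protection_err[] in this patch. */
static const char * const cp_err[] = {
	"unknown",
	"near-ret",
	"far-ret/iret",
	"endbranch",
	"rstorssp",
	"setssbsy",
};

/*
 * Hypothetical helper: map a control-protection error code to a name.
 * Any out-of-range value is folded to index 0 ("unknown"), which is
 * what the handler does before indexing the table.
 */
const char *cp_err_name(long error_code)
{
	long max_err = (long)(sizeof(cp_err) / sizeof(cp_err[0])) - 1;

	if (error_code < 0 || error_code > max_err)
		error_code = 0;
	return cp_err[error_code];
}
```

Folding unexpected codes to "unknown" keeps the table index in range even
if the hardware reports an error code that the table does not cover.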

v9:
- Add Shadow Stack pointer to the fault printout.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/entry/entry_64.S          |  2 +-
 arch/x86/include/asm/traps.h       |  3 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 59 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 6 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 76942cbd95a1..6ca77312d008 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1034,7 +1034,7 @@ idtentry spurious_interrupt_bug		do_spurious_interrupt_bug	has_error_code=0
 idtentry coprocessor_error		do_coprocessor_error		has_error_code=0
 idtentry alignment_check		do_alignment_check		has_error_code=1
 idtentry simd_coprocessor_error		do_simd_coprocessor_error	has_error_code=0
-
+idtentry control_protection		do_control_protection		has_error_code=1
 
 	/*
 	 * Reload gs selector with exception handling
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index ffa0dc8a535e..7ac26bbd0bef 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -26,6 +26,7 @@ asmlinkage void invalid_TSS(void);
 asmlinkage void segment_not_present(void);
 asmlinkage void stack_segment(void);
 asmlinkage void general_protection(void);
+asmlinkage void control_protection(void);
 asmlinkage void page_fault(void);
 asmlinkage void async_page_fault(void);
 asmlinkage void spurious_interrupt_bug(void);
@@ -84,6 +85,7 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
 void __init trap_init(void);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code);
+dotraplinkage void do_control_protection(struct pt_regs *regs, long error_code);
 dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
 dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
 dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code);
@@ -154,6 +156,7 @@ enum {
 	X86_TRAP_AC,		/* 17, Alignment Check */
 	X86_TRAP_MC,		/* 18, Machine Check */
 	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
+	X86_TRAP_CP = 21,	/* 21, Control Protection Fault */
 	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
 };
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 87ef69a72c52..8ed406f469e7 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -102,6 +102,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_64
+	INTG(X86_TRAP_CP,		control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf0576cd0..c572a3de1037 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 7);
+	BUILD_BUG_ON(NSIGSEGV != 8);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 05da6b5b167b..99c83ee522ed 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -570,6 +570,65 @@ do_general_protection(struct pt_regs *regs, long error_code)
 }
 NOKPROBE_SYMBOL(do_general_protection);
 
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+};
+
+/*
+ * When a control protection exception occurs, send a signal
+ * to the responsible application.  Currently, control
+ * protection is only enabled for the user mode.  This
+ * exception should not come from the kernel mode.
+ */
+dotraplinkage void
+do_control_protection(struct pt_regs *regs, long error_code)
+{
+	struct task_struct *tsk;
+
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+	if (notify_die(DIE_TRAP, "control protection fault", regs,
+		       error_code, X86_TRAP_CP, SIGSEGV) == NOTIFY_STOP)
+		return;
+	cond_local_irq_enable(regs);
+
+	if (!user_mode(regs))
+		die("kernel control protection fault", regs, error_code);
+
+	if (!static_cpu_has(X86_FEATURE_SHSTK) &&
+	    !static_cpu_has(X86_FEATURE_IBT))
+		WARN_ONCE(1, "CET is disabled but got control protection fault\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    printk_ratelimit()) {
+		unsigned int max_err;
+		unsigned long ssp;
+
+		max_err = ARRAY_SIZE(control_protection_err) - 1;
+		if ((error_code < 0) || (error_code > max_err))
+			error_code = 0;
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_info("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			tsk->comm, task_pid_nr(tsk),
+			regs->ip, regs->sp, ssp, error_code,
+			control_protection_err[error_code]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR,
+			(void __user *)uprobe_get_trap_addr(regs));
+}
+NOKPROBE_SYMBOL(do_control_protection);
+
 dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 {
 #ifdef CONFIG_DYNAMIC_FTRACE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c267181..693071dbe641 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -229,7 +229,8 @@ typedef struct siginfo {
 #define SEGV_ACCADI	5	/* ADI not enabled for mapped object */
 #define SEGV_ADIDERR	6	/* Disrupting MCD error */
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
-#define NSIGSEGV	7
+#define SEGV_CPERR	8	/* Control protection fault */
+#define NSIGSEGV	8
 
 /*
  * SIGBUS si_codes
-- 
2.21.0




* [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (3 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 04/27] x86/cet: Add control-protection fault handler Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:07   ` Kees Cook
                     ` (2 more replies)
  2020-02-05 18:19 ` [RFC PATCH v9 06/27] mm: Introduce VM_SHSTK for Shadow Stack memory Yu-cheng Yu
                   ` (22 subsequent siblings)
  27 siblings, 3 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

Introduce Kconfig option: X86_INTEL_SHADOW_STACK_USER.

Shadow Stack (SHSTK) provides protection against function return address
corruption.  It is active when the kernel has this feature enabled, and
both the processor and the application support it.  When this feature is
enabled, legacy non-SHSTK applications continue to work, but without SHSTK
protection.

The user-mode SHSTK protection is only implemented for the 64-bit kernel.
IA32 applications are supported under the compatibility mode.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/Kconfig  | 22 ++++++++++++++++++++++
 arch/x86/Makefile |  7 +++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5e8949953660..6c34b701c588 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1974,6 +1974,28 @@ config X86_INTEL_TSX_MODE_AUTO
 	  side channel attacks- equals the tsx=auto command line parameter.
 endchoice
 
+config X86_INTEL_CET
+	def_bool n
+
+config ARCH_HAS_SHSTK
+	def_bool n
+
+config X86_INTEL_SHADOW_STACK_USER
+	prompt "Intel Shadow Stack for user-mode"
+	def_bool n
+	depends on CPU_SUP_INTEL && X86_64
+	select ARCH_USES_HIGH_VMA_FLAGS
+	select X86_INTEL_CET
+	select ARCH_HAS_SHSTK
+	---help---
+	  Shadow Stack (SHSTK) provides protection against program
+	  stack corruption.  It is active when the kernel has this
+	  feature enabled, and the processor and the application
+	  support it.  When this feature is enabled, legacy non-SHSTK
+	  applications continue to work, but without SHSTK protection.
+
+	  If unsure, say y.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 94df0868804b..c34f5befa4c8 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -149,6 +149,13 @@ ifdef CONFIG_X86_X32
 endif
 export CONFIG_X86_X32_ABI
 
+# Check assembler Shadow Stack support
+ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+  ifeq ($(call as-instr, saveprevssp, y),)
+      $(error CONFIG_X86_INTEL_SHADOW_STACK_USER not supported by the assembler)
+  endif
+endif
+
 #
 # If the function graph tracer is used with mcount instead of fentry,
 # '-maccumulate-outgoing-args' is needed to prevent a GCC bug
-- 
2.21.0




* [RFC PATCH v9 06/27] mm: Introduce VM_SHSTK for Shadow Stack memory
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (4 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:07   ` Kees Cook
  2020-02-26 18:07   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 07/27] Add guard pages around a Shadow Stack Yu-cheng Yu
                   ` (21 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

A Shadow Stack (SHSTK) PTE must be read-only and have _PAGE_DIRTY set.
However, read-only and Dirty PTEs also exist for copy-on-write (COW) pages.
These two cases are handled differently for page faults and a new VM flag
is necessary for tracking SHSTK VMAs.
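
A minimal user-space sketch of the ambiguity, not taken from the patch: the
PTE bits alone cannot tell a SHSTK page from a COW page, so the VMA flag
decides.  The helper names are hypothetical, the PTE bit positions are the
usual x86 ones, and VM_SHSTK is bit 37 (VM_HIGH_ARCH_5) as defined below.

```c
#include <stdbool.h>
#include <stdint.h>

#define VM_SHSTK	(1ULL << 37)	/* VM_HIGH_ARCH_5 in this patch */
#define PG_RW		(1ULL << 1)	/* models _PAGE_RW */
#define PG_DIRTY	(1ULL << 6)	/* models _PAGE_DIRTY */

/* True for a read-only, Dirty PTE; ambiguous on its own. */
bool pte_ro_dirty(uint64_t pte)
{
	return !(pte & PG_RW) && (pte & PG_DIRTY);
}

/* Hypothetical check: only the VMA flag identifies a SHSTK mapping. */
bool is_shstk_mapping(uint64_t vm_flags, uint64_t pte)
{
	return (vm_flags & VM_SHSTK) && pte_ro_dirty(pte);
}
```

The same PTE value yields a different answer depending on the VMA flags,
which is the reason the flag is tracked per VMA.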

v9:
- Add VM_SHSTK case to arch_vma_name().
- Revise the commit log to explain why a new VM flag is needed.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/mm/mmap.c | 2 ++
 fs/proc/task_mmu.c | 3 +++
 include/linux/mm.h | 8 ++++++++
 3 files changed, 13 insertions(+)

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index aae9a933dfd4..482813b4c659 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -165,6 +165,8 @@ const char *arch_vma_name(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & VM_MPX)
 		return "[mpx]";
+	else if (vma->vm_flags & VM_SHSTK)
+		return "[shadow stack]";
 	return NULL;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9442631fd4af..590b58ee008a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -687,6 +687,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_PKEY_BIT4)]	= "",
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+		[ilog2(VM_SHSTK)]	= "ss",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cfaa8feecfe8..b5145fbe102e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_NONE
 #endif
 
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+# define VM_SHSTK	VM_HIGH_ARCH_5
+#else
+# define VM_SHSTK	VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
-- 
2.21.0




* [RFC PATCH v9 07/27] Add guard pages around a Shadow Stack
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (5 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 06/27] mm: Introduce VM_SHSTK for Shadow Stack memory Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:11   ` Kees Cook
  2020-02-26 18:17   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 08/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW Yu-cheng Yu
                   ` (20 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

The INCSSPD/INCSSPQ instruction is used to unwind a Shadow Stack (SHSTK).
It performs a 'pop and discard' of the first and last elements of the SHSTK
in the range specified by the operand.  The maximum value of the operand is
255, so the maximum moving distance of the SHSTK pointer is 255 * 4 bytes
for INCSSPD and 255 * 8 bytes for INCSSPQ.

Since a SHSTK has a fixed size, creating a guard page above it prevents
INCSSP/RET from moving beyond its end.  Likewise, creating a guard page
below it prevents CALL from underflowing the SHSTK.
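
The arithmetic above and the gap clamping in the diff below can be sketched
in user space as follows.  This is an illustration under stated assumptions,
not kernel code: the function names are hypothetical and PAGE_SIZE is
assumed to be 4 KiB.

```c
#define PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Largest SSP movement of one INCSSP: 255 entries of 4 or 8 bytes. */
unsigned long incssp_max_distance(unsigned int entry_size)
{
	return 255UL * entry_size;
}

/* Model of vm_start_gap(): extend downward, clamp to 0 on wrap-around. */
unsigned long start_with_gap(unsigned long vm_start, unsigned long gap)
{
	unsigned long start = vm_start - gap;

	if (start > vm_start)
		start = 0;
	return start;
}

/* Model of vm_end_gap(): extend upward, clamp to -PAGE_SIZE on wrap. */
unsigned long end_with_gap(unsigned long vm_end, unsigned long gap)
{
	unsigned long end = vm_end + gap;

	if (end < vm_end)
		end = -PAGE_SIZE;
	return end;
}
```

Both maximum distances (1020 and 2040 bytes) are smaller than one 4 KiB
page, which is consistent with using a PAGE_SIZE gap for VM_SHSTK.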

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 include/linux/mm.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b5145fbe102e..75de07674649 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2464,9 +2464,15 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_start = vma->vm_start;
+	unsigned long gap = 0;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
+	if (vma->vm_flags & VM_GROWSDOWN)
+		gap = stack_guard_gap;
+	else if (vma->vm_flags & VM_SHSTK)
+		gap = PAGE_SIZE;
+
+	if (gap != 0) {
+		vm_start -= gap;
 		if (vm_start > vma->vm_start)
 			vm_start = 0;
 	}
@@ -2476,9 +2482,15 @@ static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_end = vma->vm_end;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSUP)
+		gap = stack_guard_gap;
+	else if (vma->vm_flags & VM_SHSTK)
+		gap = PAGE_SIZE;
 
-	if (vma->vm_flags & VM_GROWSUP) {
-		vm_end += stack_guard_gap;
+	if (gap != 0) {
+		vm_end += gap;
 		if (vm_end < vma->vm_end)
 			vm_end = -PAGE_SIZE;
 	}
-- 
2.21.0




* [RFC PATCH v9 08/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (6 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 07/27] Add guard pages around a Shadow Stack Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:12   ` Kees Cook
  2020-02-26 18:20   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW Yu-cheng Yu
                   ` (19 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

Before introducing _PAGE_DIRTY_SW for non-hardware memory management
purposes in the next patch, rename _PAGE_DIRTY to _PAGE_DIRTY_HW and
_PAGE_BIT_DIRTY to _PAGE_BIT_DIRTY_HW to make these PTE dirty bits
clearer.  This patch makes no functional changes.

v9:
- In some places _PAGE_DIRTY was not changed to _PAGE_DIRTY_HW, because
  it would be changed again in the next patch to _PAGE_DIRTY_BITS.
  However, this caused compile issues if the next patch was not yet applied.
  Fix it by changing all _PAGE_DIRTY to _PAGE_DIRTY_HW.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/pgtable.h       | 18 +++++++++---------
 arch/x86/include/asm/pgtable_types.h | 17 +++++++++--------
 arch/x86/kernel/relocate_kernel_64.S |  2 +-
 arch/x86/kvm/vmx/vmx.c               |  2 +-
 4 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ad97dc155195..ab50d25f9afc 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -122,7 +122,7 @@ extern pmdval_t early_pmd_flags;
  */
 static inline int pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	return pte_flags(pte) & _PAGE_DIRTY_HW;
 }
 
 
@@ -161,7 +161,7 @@ static inline int pte_young(pte_t pte)
 
 static inline int pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	return pmd_flags(pmd) & _PAGE_DIRTY_HW;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -171,7 +171,7 @@ static inline int pmd_young(pmd_t pmd)
 
 static inline int pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	return pud_flags(pud) & _PAGE_DIRTY_HW;
 }
 
 static inline int pud_young(pud_t pud)
@@ -312,7 +312,7 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_HW);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -332,7 +332,7 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -396,7 +396,7 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
@@ -406,7 +406,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -450,7 +450,7 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_HW);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
@@ -460,7 +460,7 @@ static inline pud_t pud_wrprotect(pud_t pud)
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b5e49e6bac63..e647e3c75578 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -15,7 +15,7 @@
 #define _PAGE_BIT_PWT		3	/* page write through */
 #define _PAGE_BIT_PCD		4	/* page cache disabled */
 #define _PAGE_BIT_ACCESSED	5	/* was accessed (raised by CPU) */
-#define _PAGE_BIT_DIRTY		6	/* was written to (raised by CPU) */
+#define _PAGE_BIT_DIRTY_HW	6	/* was written to (raised by CPU) */
 #define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
 #define _PAGE_BIT_PAT		7	/* on 4KB pages */
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
@@ -45,7 +45,7 @@
 #define _PAGE_PWT	(_AT(pteval_t, 1) << _PAGE_BIT_PWT)
 #define _PAGE_PCD	(_AT(pteval_t, 1) << _PAGE_BIT_PCD)
 #define _PAGE_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)
-#define _PAGE_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
+#define _PAGE_DIRTY_HW	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY_HW)
 #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
 #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
 #define _PAGE_SOFTW1	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1)
@@ -73,7 +73,7 @@
 			 _PAGE_PKEY_BIT3)
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
-#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
+#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY_HW | _PAGE_ACCESSED)
 #else
 #define _PAGE_KNL_ERRATUM_MASK 0
 #endif
@@ -111,9 +111,9 @@
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 #define _PAGE_TABLE_NOENC	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |\
-				 _PAGE_ACCESSED | _PAGE_DIRTY)
+				 _PAGE_ACCESSED | _PAGE_DIRTY_HW)
 #define _KERNPG_TABLE_NOENC	(_PAGE_PRESENT | _PAGE_RW |		\
-				 _PAGE_ACCESSED | _PAGE_DIRTY)
+				 _PAGE_ACCESSED | _PAGE_DIRTY_HW)
 
 /*
  * Set of bits not changed in pte_modify.  The pte's
@@ -122,7 +122,7 @@
  * pte_modify() does modify it.
  */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
-			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
+			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_HW |	\
 			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
@@ -167,7 +167,8 @@ enum page_cache_mode {
 					 _PAGE_ACCESSED)
 
 #define __PAGE_KERNEL_EXEC						\
-	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL)
+	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY_HW | _PAGE_ACCESSED | \
+	 _PAGE_GLOBAL)
 #define __PAGE_KERNEL		(__PAGE_KERNEL_EXEC | _PAGE_NX)
 
 #define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
@@ -186,7 +187,7 @@ enum page_cache_mode {
 #define _PAGE_ENC	(_AT(pteval_t, sme_me_mask))
 
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
-			 _PAGE_DIRTY | _PAGE_ENC)
+			 _PAGE_DIRTY_HW | _PAGE_ENC)
 #define _PAGE_TABLE	(_KERNPG_TABLE | _PAGE_USER)
 
 #define __PAGE_KERNEL_ENC	(__PAGE_KERNEL | _PAGE_ENC)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ef3ba99068d3..3acd75f97b61 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -15,7 +15,7 @@
  */
 
 #define PTR(x) (x << 3)
-#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY_HW)
 
 /*
  * control_page + KEXEC_CONTROL_CODE_MAX_SIZE
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e3394c839dea..fbbbf621b0d9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3503,7 +3503,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
 	/* Set up identity-mapping pagetable for EPT in real mode */
 	for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
 		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
-			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
+			_PAGE_ACCESSED | _PAGE_DIRTY_HW | _PAGE_PSE);
 		r = kvm_write_guest_page(kvm, identity_map_pfn,
 				&tmp, i * sizeof(tmp), sizeof(tmp));
 		if (r < 0)
-- 
2.21.0




* [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (7 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 08/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:12   ` Kees Cook
  2020-02-26 21:35   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 10/27] x86/mm: Update pte_modify, pmd_modify, and _PAGE_CHG_MASK for _PAGE_DIRTY_SW Yu-cheng Yu
                   ` (18 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

When Shadow Stack (SHSTK) is introduced, a R/O and Dirty PTE exists in the
following cases:

(a) A modified, copy-on-write (COW) page;
(b) A R/O page that has been COW'ed;
(c) A SHSTK page.

To separate non-SHSTK memory from SHSTK, use a spare bit of the 64-bit
PTE as _PAGE_BIT_DIRTY_SW and apply it to cases (a) and (b).
This results in the following possible settings:

Modified PTE:         (R/W + DIRTY_HW)
Modified and COW PTE: (R/O + DIRTY_SW)
R/O PTE COW'ed:       (R/O + DIRTY_SW)
SHSTK PTE:            (R/O + DIRTY_HW)
SHSTK shared PTE[1]:  (R/O + DIRTY_SW)
SHSTK PTE COW'ed:     (R/O + DIRTY_HW)

[1] When a SHSTK page is being shared among threads, its PTE is cleared of
    _PAGE_DIRTY_HW, so the next SHSTK access causes a fault, and the page
    is duplicated and _PAGE_DIRTY_HW is set again.

With this, in pte_wrprotect(), if SHSTK is active, use _PAGE_DIRTY_SW for
the Dirty bit, and in pte_mkwrite() use _PAGE_DIRTY_HW.  The same changes
apply to pmd and pud.
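
The Dirty-bit shuffle described above can be modeled in user space as
follows.  This is only a sketch: it assumes SHSTK is active (the
static_cpu_has() branch is taken), the function names are hypothetical, and
the position chosen for the software Dirty bit is illustrative rather than
the value used by the patch.

```c
#include <stdint.h>

typedef uint64_t pteval_t;

#define PG_RW		(1ULL << 1)	/* models _PAGE_RW */
#define PG_DIRTY_HW	(1ULL << 6)	/* models _PAGE_DIRTY_HW */
#define PG_DIRTY_SW	(1ULL << 57)	/* assumption: illustrative spare bit */

/* Write-protect: move a hardware Dirty bit to the software bit. */
pteval_t model_wrprotect(pteval_t pte)
{
	if (pte & PG_DIRTY_HW) {
		pte &= ~PG_DIRTY_HW;
		pte |= PG_DIRTY_SW;
	}
	return pte & ~PG_RW;
}

/* Make writable: move a software Dirty bit back to the hardware bit. */
pteval_t model_mkwrite(pteval_t pte)
{
	if (pte & PG_DIRTY_SW) {
		pte &= ~PG_DIRTY_SW;
		pte |= PG_DIRTY_HW;
	}
	return pte | PG_RW;
}
```

With this model a write-protected dirty PTE never ends up as
(R/O + DIRTY_HW), so that combination stays reserved for SHSTK pages.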

When this patch is applied, there are six free bits left in the 64-bit PTE.
There are no more free bits in the 32-bit PTE (except for PAE) and SHSTK is
not implemented for the 32-bit kernel.

v9:
- Remove pte_move_flags() etc. and put the logic directly in
  pte_wrprotect()/pte_mkwrite() etc.
- Change compile-time conditionals to run-time checks.
- Split out pte_modify()/pmd_modify() to a new patch.
- Update comments.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/pgtable.h       | 111 ++++++++++++++++++++++++---
 arch/x86/include/asm/pgtable_types.h |  31 +++++++-
 2 files changed, 131 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ab50d25f9afc..62aeb118bc36 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -120,9 +120,9 @@ extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY_HW;
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
 }
 
 
@@ -159,9 +159,9 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY_HW;
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -169,9 +169,9 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY_HW;
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -312,7 +312,7 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY_HW);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -322,6 +322,17 @@ static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
+	/*
+	 * Use _PAGE_DIRTY_SW on a R/O PTE to set it apart from
+	 * a Shadow Stack PTE, which is R/O + _PAGE_DIRTY_HW.
+	 */
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		if (pte_flags(pte) & _PAGE_DIRTY_HW) {
+			pte = pte_clear_flags(pte, _PAGE_DIRTY_HW);
+			pte = pte_set_flags(pte, _PAGE_DIRTY_SW);
+		}
+	}
+
 	return pte_clear_flags(pte, _PAGE_RW);
 }
 
@@ -332,9 +343,25 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
+	pteval_t dirty = _PAGE_DIRTY_HW;
+
+	if (static_cpu_has(X86_FEATURE_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_DIRTY_SW;
+
+	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkdirty_shstk(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_DIRTY_SW);
 	return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
+static inline bool pte_dirty_hw(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_DIRTY_HW;
+}
+
 static inline pte_t pte_mkyoung(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_ACCESSED);
@@ -342,6 +369,13 @@ static inline pte_t pte_mkyoung(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		if (pte_flags(pte) & _PAGE_DIRTY_SW) {
+			pte = pte_clear_flags(pte, _PAGE_DIRTY_SW);
+			pte = pte_set_flags(pte, _PAGE_DIRTY_HW);
+		}
+	}
+
 	return pte_set_flags(pte, _PAGE_RW);
 }
 
@@ -396,19 +430,46 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
+	/*
+	 * Use _PAGE_DIRTY_SW on a R/O PMD to set it apart from
+	 * a Shadow Stack PMD, which is R/O + _PAGE_DIRTY_HW.
+	 */
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		if (pmd_flags(pmd) & _PAGE_DIRTY_HW) {
+			pmd = pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
+			pmd = pmd_set_flags(pmd, _PAGE_DIRTY_SW);
+		}
+	}
+
 	return pmd_clear_flags(pmd, _PAGE_RW);
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
+	pmdval_t dirty = _PAGE_DIRTY_HW;
+
+	if (static_cpu_has(X86_FEATURE_SHSTK) && !(pmd_flags(pmd) & _PAGE_RW))
+		dirty = _PAGE_DIRTY_SW;
+
+	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkdirty_shstk(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY_SW);
 	return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
+static inline bool pmd_dirty_hw(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_DIRTY_HW;
+}
+
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_DEVMAP);
@@ -426,6 +487,13 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd)
 {
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		if (pmd_flags(pmd) & _PAGE_DIRTY_SW) {
+			pmd = pmd_clear_flags(pmd, _PAGE_DIRTY_SW);
+			pmd = pmd_set_flags(pmd, _PAGE_DIRTY_HW);
+		}
+	}
+
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
 
@@ -450,17 +518,33 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY_HW);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
+	/*
+	 * Use _PAGE_DIRTY_SW on a R/O PUD to set it apart from
+	 * a Shadow Stack PUD, which is R/O + _PAGE_DIRTY_HW.
+	 */
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		if (pud_flags(pud) & _PAGE_DIRTY_HW) {
+			pud = pud_clear_flags(pud, _PAGE_DIRTY_HW);
+			pud = pud_set_flags(pud, _PAGE_DIRTY_SW);
+		}
+	}
+
 	return pud_clear_flags(pud, _PAGE_RW);
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY_HW;
+
+	if (static_cpu_has(X86_FEATURE_SHSTK) && !(pud_flags(pud) & _PAGE_RW))
+		dirty = _PAGE_DIRTY_SW;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -480,6 +564,13 @@ static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		if (pud_flags(pud) & _PAGE_DIRTY_SW) {
+			pud = pud_clear_flags(pud, _PAGE_DIRTY_SW);
+			pud = pud_set_flags(pud, _PAGE_DIRTY_HW);
+		}
+	}
+
 	return pud_set_flags(pud, _PAGE_RW);
 }
 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index e647e3c75578..826823df917f 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,7 +23,8 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -35,6 +36,12 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * This bit indicates a copy-on-write page, and is different from
+ * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
+ */
+#define _PAGE_BIT_DIRTY_SW	_PAGE_BIT_SOFTW5 /* was written to */
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -108,6 +115,28 @@
 #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
 #endif
 
+/* A R/O and dirty PTE exists in the following cases:
+ *	(a) A modified, copy-on-write (COW) page;
+ *	(b) A R/O page that has been COW'ed;
+ *	(c) A SHSTK page.
+ * _PAGE_DIRTY_SW is used to separate case (c) from others.
+ * This results in the following settings:
+ *
+ *	Modified PTE:         (R/W + DIRTY_HW)
+ *	Modified and COW PTE: (R/O + DIRTY_SW)
+ *	R/O PTE COW'ed:       (R/O + DIRTY_SW)
+ *	SHSTK PTE:            (R/O + DIRTY_HW)
+ *	SHSTK PTE COW'ed:     (R/O + DIRTY_HW)
+ *	SHSTK PTE being shared among threads: (R/O + DIRTY_SW)
+ */
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define _PAGE_DIRTY_SW	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY_SW)
+#else
+#define _PAGE_DIRTY_SW	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY_HW | _PAGE_DIRTY_SW)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 #define _PAGE_TABLE_NOENC	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |\
-- 
2.21.0




* [RFC PATCH v9 10/27] x86/mm: Update pte_modify, pmd_modify, and _PAGE_CHG_MASK for _PAGE_DIRTY_SW
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (8 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-26 22:02   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 11/27] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
                   ` (17 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

After the introduction of _PAGE_DIRTY_SW, pte_modify() and pmd_modify()
need to set the Dirty bit accordingly: if Shadow Stack is enabled and
_PAGE_RW is cleared, use _PAGE_DIRTY_SW; otherwise use _PAGE_DIRTY_HW.

Since the Dirty bit is modified by pte_modify(), remove _PAGE_DIRTY_HW from
_PAGE_CHG_MASK.
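
The Dirty-bit selection described above can be modelled in user space as
follows.  This is only a sketch: the MODEL_* constants and
model_fixup_dirty() are hypothetical names, with bit positions taken from
the x86 PTE layout and from _PAGE_BIT_SOFTW5 in this series, and the
"shstk" argument stands in for static_cpu_has(X86_FEATURE_SHSTK).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model bit positions (assumed): RW = bit 1, Dirty = bit 6, SOFTW5 = bit 58. */
#define MODEL_PAGE_RW		(UINT64_C(1) << 1)
#define MODEL_PAGE_DIRTY_HW	(UINT64_C(1) << 6)
#define MODEL_PAGE_DIRTY_SW	(UINT64_C(1) << 58)
#define MODEL_PAGE_DIRTY_BITS	(MODEL_PAGE_DIRTY_HW | MODEL_PAGE_DIRTY_SW)

/*
 * Hypothetical helper: "newval" is the PTE value rebuilt from the new
 * protection.  Drop both Dirty bits from it, then re-add the one that
 * matches the new write permission if the old PTE was dirty.
 */
static uint64_t model_fixup_dirty(uint64_t oldval, uint64_t newval, bool shstk)
{
	newval &= ~MODEL_PAGE_DIRTY_BITS;

	if (oldval & MODEL_PAGE_DIRTY_BITS) {
		if (shstk && !(newval & MODEL_PAGE_RW))
			newval |= MODEL_PAGE_DIRTY_SW;	/* R/O: software Dirty */
		else
			newval |= MODEL_PAGE_DIRTY_HW;	/* R/W or no SHSTK */
	}

	return newval;
}
```

In this model a dirty PTE that becomes read-only on a Shadow Stack capable
CPU never ends up as (R/O + Dirty HW), which is the encoding reserved for
Shadow Stack pages.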

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/pgtable.h       | 16 ++++++++++++++++
 arch/x86/include/asm/pgtable_types.h |  4 ++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62aeb118bc36..2733e7ec16b3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -702,6 +702,14 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	val &= _PAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
 	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
+
+	if (pte_dirty(pte)) {
+		if (static_cpu_has(X86_FEATURE_SHSTK) && !(val & _PAGE_RW))
+			val |= _PAGE_DIRTY_SW;
+		else
+			val |= _PAGE_DIRTY_HW;
+	}
+
 	return __pte(val);
 }
 
@@ -712,6 +720,14 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	val &= _HPAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
 	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
+
+	if (pmd_dirty(pmd)) {
+		if (static_cpu_has(X86_FEATURE_SHSTK) && !(val & _PAGE_RW))
+			val |= _PAGE_DIRTY_SW;
+		else
+			val |= _PAGE_DIRTY_HW;
+	}
+
 	return __pmd(val);
 }
 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 826823df917f..e7e28bf7e919 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -150,8 +150,8 @@
  * instance, and is *not* included in this mask since
  * pte_modify() does modify it.
  */
-#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
-			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_HW |	\
+#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |	\
+			 _PAGE_SPECIAL | _PAGE_ACCESSED |	\
 			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
-- 
2.21.0




* [RFC PATCH v9 11/27] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (9 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 10/27] x86/mm: Update pte_modify, pmd_modify, and _PAGE_CHG_MASK for _PAGE_DIRTY_SW Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:13   ` Kees Cook
  2020-02-26 22:04   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 12/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW Yu-cheng Yu
                   ` (16 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

After the introduction of _PAGE_DIRTY_SW, a dirty PTE can have either
_PAGE_DIRTY_HW or _PAGE_DIRTY_SW.  Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
where split_2MB_gtt_entry() clears the dirty field.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
index 4b04af569c05..e467ca182633 100644
--- a/drivers/gpu/drm/i915/gvt/gtt.c
+++ b/drivers/gpu/drm/i915/gvt/gtt.c
@@ -1201,7 +1201,7 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
 	}
 
 	/* Clear dirty field. */
-	se->val64 &= ~_PAGE_DIRTY;
+	se->val64 &= ~_PAGE_DIRTY_BITS;
 
 	ops->clear_pse(se);
 	ops->clear_ips(se);
-- 
2.21.0




* [RFC PATCH v9 12/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (10 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 11/27] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:14   ` Kees Cook
  2020-02-26 22:20   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 13/27] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
                   ` (15 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

When Shadow Stack (SHSTK) is enabled, the [R/O + _PAGE_DIRTY_HW] setting is
reserved for SHSTK.  Dirty, non-Shadow Stack R/O PTEs are
[R/O + _PAGE_DIRTY_SW].

When a PTE goes from [R/W + _PAGE_DIRTY_HW] to [R/O + _PAGE_DIRTY_SW], it
could become a transient SHSTK PTE in two cases.

The first case is that some processors can start a write but end up seeing
a read-only PTE by the time they get to the Dirty bit, creating a transient
SHSTK PTE.  However, this does not occur on processors that support SHSTK,
so a TLB flush is not needed here.

The second case is that when software tests and replaces _PAGE_DIRTY_HW
with _PAGE_DIRTY_SW non-atomically, a transient SHSTK PTE can exist.  This
is prevented by doing the update with cmpxchg.
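
The cmpxchg approach can be illustrated with a user-space sketch built on
C11 atomics.  It is not the kernel code: atomic_compare_exchange_weak()
stands in for try_cmpxchg(), and the MODEL_* constants and
model_set_wrprotect() are hypothetical names with assumed bit positions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Model bit positions (assumed): RW = bit 1, Dirty = bit 6, SOFTW5 = bit 58. */
#define MODEL_PAGE_RW		(UINT64_C(1) << 1)
#define MODEL_PAGE_DIRTY_HW	(UINT64_C(1) << 6)
#define MODEL_PAGE_DIRTY_SW	(UINT64_C(1) << 58)

/*
 * Write-protect a PTE and move Dirty HW to Dirty SW in one atomic step,
 * so no observer ever sees the intermediate (R/O + Dirty HW) value.
 */
static void model_set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t oldval = atomic_load(ptep);
	uint64_t newval;

	do {
		newval = oldval & ~MODEL_PAGE_RW;
		if (newval & MODEL_PAGE_DIRTY_HW) {
			newval &= ~MODEL_PAGE_DIRTY_HW;
			newval |= MODEL_PAGE_DIRTY_SW;
		}
		/* On failure, oldval is reloaded with the current PTE value. */
	} while (!atomic_compare_exchange_weak(ptep, &oldval, newval));
}
```

If another agent changes the entry between the load and the exchange, for
example by setting the Dirty bit, the exchange fails and the loop recomputes
the new value from the updated entry.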

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights into the issue.  Jann Horn provided the cmpxchg solution.

v9:
- Change compile-time conditionals to runtime checks.
- Fix parameters of try_cmpxchg(): change pte_t/pmd_t to
  pte_t.pte/pmd_t.pmd.

v4:
- Implement try_cmpxchg().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/pgtable.h | 66 ++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2733e7ec16b3..43cb27379208 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1253,6 +1253,39 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+	/*
+	 * Some processors can start a write, but end up seeing a read-only
+	 * PTE by the time they get to the Dirty bit.  In this case, they
+	 * will set the Dirty bit, leaving a read-only, Dirty PTE which
+	 * looks like a Shadow Stack PTE.
+	 *
+	 * However, this behavior has been improved and will not occur on
+	 * processors supporting Shadow Stack.  Without this guarantee, a
+	 * transition to a non-present PTE and a TLB flush would be
+	 * needed.
+	 *
+	 * When changing a writable PTE to read-only and if the PTE has
+	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so that
+	 * the PTE is not a valid Shadow Stack PTE.
+	 */
+#ifdef CONFIG_X86_64
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		pte_t new_pte, pte = READ_ONCE(*ptep);
+
+		do {
+			/*
+			 * This is the same as moving _PAGE_DIRTY_HW
+			 * to _PAGE_DIRTY_SW.
+			 */
+			new_pte = pte_wrprotect(pte);
+			new_pte.pte |= (new_pte.pte & _PAGE_DIRTY_HW) >>
+					_PAGE_BIT_DIRTY_HW << _PAGE_BIT_DIRTY_SW;
+			new_pte.pte &= ~_PAGE_DIRTY_HW;
+		} while (!try_cmpxchg(&ptep->pte, &pte.pte, new_pte.pte));
+
+		return;
+	}
+#endif
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
@@ -1303,6 +1336,39 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+	/*
+	 * Some processors can start a write, but end up seeing a read-only
+	 * PMD by the time they get to the Dirty bit.  In this case, they
+	 * will set the Dirty bit, leaving a read-only, Dirty PMD which
+	 * looks like a Shadow Stack PMD.
+	 *
+	 * However, this behavior has been improved and will not occur on
+	 * processors supporting Shadow Stack.  Without this guarantee, a
+	 * transition to a non-present PMD and a TLB flush would be
+	 * needed.
+	 *
+	 * When changing a writable PMD to read-only and if the PMD has
+	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so that
+	 * the PMD is not a valid Shadow Stack PMD.
+	 */
+#ifdef CONFIG_X86_64
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		pmd_t new_pmd, pmd = READ_ONCE(*pmdp);
+
+		do {
+			/*
+			 * This is the same as moving _PAGE_DIRTY_HW
+			 * to _PAGE_DIRTY_SW.
+			 */
+			new_pmd = pmd_wrprotect(pmd);
+			new_pmd.pmd |= (new_pmd.pmd & _PAGE_DIRTY_HW) >>
+					_PAGE_BIT_DIRTY_HW << _PAGE_BIT_DIRTY_SW;
+			new_pmd.pmd &= ~_PAGE_DIRTY_HW;
+		} while (!try_cmpxchg(&pmdp->pmd, &pmd.pmd, new_pmd.pmd));
+
+		return;
+	}
+#endif
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-- 
2.21.0




* [RFC PATCH v9 13/27] x86/mm: Shadow Stack page fault error checking
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (11 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 12/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:16   ` Kees Cook
  2020-02-26 22:47   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault Yu-cheng Yu
                   ` (14 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

If a page fault is triggered by a Shadow Stack (SHSTK) access
(e.g. CALL/RET) or SHSTK management instructions (e.g. WRUSSQ), then bit[6]
of the page fault error code is set.

In access_error(), verify a SHSTK page fault is within a SHSTK memory area.
It is always an error otherwise.

For a valid SHSTK access, set FAULT_FLAG_WRITE to effect copy-on-write.
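
The two checks above can be sketched in isolation as follows.  This is a
simplified model, not the kernel functions: it ignores protection keys and
read/exec checks, MODEL_VM_SHSTK is a placeholder bit rather than the real
VM_SHSTK value, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Model values; MODEL_VM_SHSTK is a placeholder, not the kernel's bit. */
#define MODEL_PF_WRITE		(1UL << 1)
#define MODEL_PF_SHSTK		(1UL << 6)
#define MODEL_VM_WRITE		(1UL << 1)
#define MODEL_VM_SHSTK		(1UL << 8)
#define MODEL_FAULT_FLAG_WRITE	(1U << 0)

/* Returns true when the access must be rejected. */
static bool model_access_error(unsigned long error_code, unsigned long vm_flags)
{
	/* A Shadow Stack fault is valid only inside a Shadow Stack VMA. */
	if (error_code & MODEL_PF_SHSTK)
		return !(vm_flags & MODEL_VM_SHSTK);

	/* An ordinary write needs a writable VMA. */
	if (error_code & MODEL_PF_WRITE)
		return !(vm_flags & MODEL_VM_WRITE);

	return false;
}

/* Shadow Stack faults are handled as writes so that copy-on-write runs. */
static unsigned int model_fault_flags(unsigned long error_code)
{
	unsigned int flags = 0;

	if (error_code & (MODEL_PF_SHSTK | MODEL_PF_WRITE))
		flags |= MODEL_FAULT_FLAG_WRITE;

	return flags;
}
```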

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/traps.h |  2 ++
 arch/x86/mm/fault.c          | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7ac26bbd0bef..8023d177fcd8 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -169,6 +169,7 @@ enum {
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
  *   bit 5 ==				1: protection keys block access
+ *   bit 6 ==				1: shadow stack access fault
  */
 enum x86_pf_error_code {
 	X86_PF_PROT	=		1 << 0,
@@ -177,5 +178,6 @@ enum x86_pf_error_code {
 	X86_PF_RSVD	=		1 << 3,
 	X86_PF_INSTR	=		1 << 4,
 	X86_PF_PK	=		1 << 5,
+	X86_PF_SHSTK	=		1 << 6,
 };
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 304d31d8cbbc..9c1243302663 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1187,6 +1187,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
+	/*
+	 * Verify X86_PF_SHSTK is within a Shadow Stack VMA.
+	 * It is always an error if there is a Shadow Stack
+	 * fault outside a Shadow Stack VMA.
+	 */
+	if (error_code & X86_PF_SHSTK) {
+		if (!(vma->vm_flags & VM_SHSTK))
+			return 1;
+		return 0;
+	}
+
 	if (error_code & X86_PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
@@ -1344,6 +1355,13 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
+	/*
+	 * If the fault is caused by a Shadow Stack access,
+	 * e.g. CALL/RET/SAVEPREVSSP/RSTORSSP, then set
+	 * FAULT_FLAG_WRITE to effect copy-on-write.
+	 */
+	if (hw_error_code & X86_PF_SHSTK)
+		flags |= FAULT_FLAG_WRITE;
 	if (hw_error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (hw_error_code & X86_PF_INSTR)
-- 
2.21.0




* [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (12 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 13/27] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:20   ` Kees Cook
  2020-02-27  0:08   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 15/27] mm: Handle THP/HugeTLB " Yu-cheng Yu
                   ` (13 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

When a task does fork(), its Shadow Stack (SHSTK) must be duplicated for
the child.  This patch implements a flow similar to copy-on-write of an
anonymous page, but for SHSTK.

A SHSTK PTE must be R/O and Dirty (_PAGE_DIRTY_HW).  This Dirty bit
requirement is used to effect the copying.  In copy_one_pte(), clear
_PAGE_DIRTY_HW from a SHSTK PTE to cause a page fault upon the next SHSTK
access.  At that time, fix the PTE and copy/re-use the page.
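
The PTE state changes across fork() and the following fault can be traced
with a small user-space model.  The MODEL_* constants and model_*()
helpers are hypothetical; they mirror what pte_wrprotect() and
pte_mkdirty_shstk() do to the flag bits, with assumed bit positions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model bit positions (assumed): RW = 1, Dirty = 6, SOFTW3 = 11, SOFTW5 = 58. */
#define MODEL_PAGE_RW		(UINT64_C(1) << 1)
#define MODEL_PAGE_DIRTY_HW	(UINT64_C(1) << 6)
#define MODEL_PAGE_SOFT_DIRTY	(UINT64_C(1) << 11)
#define MODEL_PAGE_DIRTY_SW	(UINT64_C(1) << 58)

/* A Shadow Stack PTE is read-only with the hardware Dirty bit set. */
static bool model_is_shstk_pte(uint64_t pte)
{
	return !(pte & MODEL_PAGE_RW) && (pte & MODEL_PAGE_DIRTY_HW);
}

/* fork(): write-protect and move Dirty HW to Dirty SW. */
static uint64_t model_wrprotect(uint64_t pte)
{
	if (pte & MODEL_PAGE_DIRTY_HW) {
		pte &= ~MODEL_PAGE_DIRTY_HW;
		pte |= MODEL_PAGE_DIRTY_SW;
	}
	return pte & ~MODEL_PAGE_RW;
}

/* Fault fix-up in a Shadow Stack VMA: restore Dirty HW. */
static uint64_t model_mkdirty_shstk(uint64_t pte)
{
	pte &= ~MODEL_PAGE_DIRTY_SW;
	return pte | MODEL_PAGE_DIRTY_HW | MODEL_PAGE_SOFT_DIRTY;
}
```

After model_wrprotect() the entry no longer qualifies as a Shadow Stack
PTE, so the next Shadow Stack access faults; model_mkdirty_shstk() then
makes it valid again for the page that was copied or re-used.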

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/mm/pgtable.c         | 15 +++++++++++++++
 include/asm-generic/pgtable.h | 17 +++++++++++++++++
 mm/memory.c                   |  7 ++++++-
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7bd2c3a52297..2eb33794c08d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -872,3 +872,18 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #endif /* CONFIG_X86_64 */
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+inline bool arch_copy_pte_mapping(vm_flags_t vm_flags)
+{
+	return (vm_flags & VM_SHSTK);
+}
+
+inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHSTK)
+		return pte_mkdirty_shstk(pte);
+	else
+		return pte;
+}
+#endif /* CONFIG_X86_INTEL_SHADOW_STACK_USER */
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 798ea36a0549..9cb2f9ba5895 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1190,6 +1190,23 @@ static inline bool arch_has_pfn_modify_check(void)
 }
 #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
 
+#ifdef CONFIG_MMU
+#ifndef CONFIG_ARCH_HAS_SHSTK
+static inline bool arch_copy_pte_mapping(vm_flags_t vm_flags)
+{
+	return false;
+}
+
+static inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
+{
+	return pte;
+}
+#else
+bool arch_copy_pte_mapping(vm_flags_t vm_flags);
+pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma);
+#endif
+#endif /* CONFIG_MMU */
+
 /*
  * Architecture PAGE_KERNEL_* fallbacks
  *
diff --git a/mm/memory.c b/mm/memory.c
index 45442d9a4f52..6daa28614327 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -772,7 +772,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
 	 */
-	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
+	if ((is_cow_mapping(vm_flags) && pte_write(pte)) ||
+	    arch_copy_pte_mapping(vm_flags)) {
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
@@ -2417,6 +2418,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 	entry = pte_mkyoung(vmf->orig_pte);
 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	entry = pte_set_vma_features(entry, vma);
 	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
 		update_mmu_cache(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2504,6 +2506,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		entry = pte_set_vma_features(entry, vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
 		 * pte with the new entry. This will avoid a race condition
@@ -3023,6 +3026,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	pte = mk_pte(page, vma->vm_page_prot);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = pte_set_vma_features(pte, vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = RMAP_EXCLUSIVE;
@@ -3165,6 +3169,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	entry = mk_pte(page, vma->vm_page_prot);
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
+	entry = pte_set_vma_features(entry, vma);
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
-- 
2.21.0




* [RFC PATCH v9 15/27] mm: Handle THP/HugeTLB Shadow Stack page fault
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (13 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 20:59   ` Kees Cook
  2020-02-05 18:19 ` [RFC PATCH v9 16/27] mm: Update can_follow_write_pte() for Shadow Stack Yu-cheng Yu
                   ` (12 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

This patch implements THP Shadow Stack (SHSTK) copying in the same way as
the previous patch does for regular PTEs.

In copy_huge_pmd(), clear the dirty bit from the PMD to cause a page fault
upon the next SHSTK access to the PMD.  At that time, fix the PMD and
copy/re-use the page.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/mm/pgtable.c         |  8 ++++++++
 include/asm-generic/pgtable.h | 11 +++++++++++
 mm/huge_memory.c              |  4 ++++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 2eb33794c08d..3340b1d4e9da 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -886,4 +886,12 @@ inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
 	else
 		return pte;
 }
+
+inline pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHSTK)
+		return pmd_mkdirty_shstk(pmd);
+	else
+		return pmd;
+}
 #endif /* CONFIG_X86_INTEL_SHADOW_STACK_USER */
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 9cb2f9ba5895..a9df093fdf45 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1201,9 +1201,20 @@ static inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
 {
 	return pte;
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma)
+{
+	return pmd;
+}
+#endif
 #else
 bool arch_copy_pte_mapping(vm_flags_t vm_flags);
 pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma);
+#endif
 #endif
 #endif /* CONFIG_MMU */
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a88093213674..93ef368df2dd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -636,6 +636,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_set_vma_features(entry, vma);
 		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
@@ -1278,6 +1279,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 		pte_t entry;
 		entry = mk_pte(pages[i], vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		entry = pte_set_vma_features(entry, vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
@@ -1360,6 +1362,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_set_vma_features(entry, vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry,  1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		ret |= VM_FAULT_WRITE;
@@ -1432,6 +1435,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		pmd_t entry;
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_set_vma_features(entry, vma);
 		pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
-- 
2.21.0




* [RFC PATCH v9 16/27] mm: Update can_follow_write_pte() for Shadow Stack
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (14 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 15/27] mm: Handle THP/HugeTLB " Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-27  0:34   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 17/27] x86/cet/shstk: User-mode Shadow Stack support Yu-cheng Yu
                   ` (11 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

can_follow_write_pte() verifies that a read-only page is the task's own
copy by ensuring the page has gone through faultin_page() and the PTE is
Dirty.

A Shadow Stack (SHSTK) PTE must be (read-only + _PAGE_DIRTY_HW).  When a
task does fork(), its SHSTK PTEs become (read-only + _PAGE_DIRTY_SW).  This
causes the next SHSTK access (i.e. CALL, RET, INCSSP) to trigger a fault;
the page is then copied, and (read-only + _PAGE_DIRTY_HW) is restored.

To update can_follow_write_pte() for SHSTK, introduce pte_exclusive().  For
a data PTE it checks that the PTE is Dirty; for a SHSTK PTE it checks that
_PAGE_DIRTY_HW is set.

Also rename can_follow_write_pte() to can_follow_write() to make its
meaning clear; i.e. "Can we write to the page?", not "Is the PTE writable?"

Also apply the same changes to the huge memory case, where
can_follow_write_pmd() is renamed likewise and uses a new pmd_exclusive().
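
For illustration only (not part of this patch), the PTE state changes
described above can be modeled in user space.  The bit positions and all
model_* names are made up for this sketch, and it assumes that "Dirty" for a
data PTE means either dirty bit is set:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for this sketch only. */
#define MODEL_PAGE_RW		(1u << 1)
#define MODEL_PAGE_DIRTY_HW	(1u << 6)
#define MODEL_PAGE_DIRTY_SW	(1u << 11)
#define MODEL_VM_SHSTK		(1u << 0)

typedef uint32_t model_pte_t;

/* Assumption: a data PTE counts as Dirty if either dirty bit is set. */
static bool model_pte_dirty(model_pte_t pte)
{
	return (pte & (MODEL_PAGE_DIRTY_HW | MODEL_PAGE_DIRTY_SW)) != 0;
}

/* Mirrors the shape of pte_exclusive(): SHSTK needs the hardware bit. */
static bool model_pte_exclusive(model_pte_t pte, unsigned int vm_flags)
{
	if (vm_flags & MODEL_VM_SHSTK)
		return (pte & MODEL_PAGE_DIRTY_HW) != 0;
	return model_pte_dirty(pte);
}

/* fork(): drop write permission and move the hardware dirty bit to software. */
static model_pte_t model_fork_wrprotect(model_pte_t pte)
{
	pte &= ~MODEL_PAGE_RW;
	if (pte & MODEL_PAGE_DIRTY_HW) {
		pte &= ~MODEL_PAGE_DIRTY_HW;
		pte |= MODEL_PAGE_DIRTY_SW;
	}
	return pte;
}
```

In this model a SHSTK PTE stops being "exclusive" right after fork() and
becomes exclusive again only once the fault path has copied the page and
restored _PAGE_DIRTY_HW, which is the condition FOLL_FORCE writes wait for.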

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/mm/pgtable.c         | 18 ++++++++++++++++++
 include/asm-generic/pgtable.h | 12 ++++++++++++
 mm/gup.c                      |  8 +++++---
 mm/huge_memory.c              |  8 +++++---
 4 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3340b1d4e9da..fa8133f37918 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -887,6 +887,15 @@ inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
 		return pte;
 }
 
+inline bool pte_exclusive(pte_t pte, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHSTK)
+		return pte_dirty_hw(pte);
+	else
+		return pte_dirty(pte);
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 inline pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & VM_SHSTK)
@@ -894,4 +903,13 @@ inline pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma)
 	else
 		return pmd;
 }
+
+inline bool pmd_exclusive(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHSTK)
+		return pmd_dirty_hw(pmd);
+	else
+		return pmd_dirty(pmd);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* CONFIG_X86_INTEL_SHADOW_STACK_USER */
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index a9df093fdf45..ae9a84fffc25 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1202,18 +1202,30 @@ static inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
 	return pte;
 }
 
+static inline bool pte_exclusive(pte_t pte, struct vm_area_struct *vma)
+{
+	return pte_dirty(pte);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma)
 {
 	return pmd;
 }
+
+static inline bool pmd_exclusive(pmd_t pmd, struct vm_area_struct *vma)
+{
+	return pmd_dirty(pmd);
+}
 #endif
 #else
 bool arch_copy_pte_mapping(vm_flags_t vm_flags);
 pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma);
+bool pte_exclusive(pte_t pte, struct vm_area_struct *vma);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma);
+bool pmd_exclusive(pmd_t pmd, struct vm_area_struct *vma);
 #endif
 #endif
 #endif /* CONFIG_MMU */
diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..d1dbfbde8443 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -164,10 +164,12 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
  * FOLL_FORCE can write to even unwritable pte's, but only
  * after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+static inline bool can_follow_write(pte_t pte, unsigned int flags,
+				    struct vm_area_struct *vma)
 {
 	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+		((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
+		 pte_exclusive(pte, vma));
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -205,7 +207,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+	if ((flags & FOLL_WRITE) && !can_follow_write(pte, flags, vma)) {
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 93ef368df2dd..baad346e9f4a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1469,10 +1469,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
  * FOLL_FORCE can write to even unwritable pmd's, but only
  * after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+static inline bool can_follow_write(pmd_t pmd, unsigned int flags,
+				    struct vm_area_struct *vma)
 {
 	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
+		pmd_exclusive(pmd, vma));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1485,7 +1487,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+	if (flags & FOLL_WRITE && !can_follow_write(*pmd, flags, vma))
 		goto out;
 
 	/* Avoid dumping huge zero page */
-- 
2.21.0




* [RFC PATCH v9 17/27] x86/cet/shstk: User-mode Shadow Stack support
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (15 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 16/27] mm: Update can_follow_write_pte() for Shadow Stack Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 21:07   ` Kees Cook
  2020-02-27  0:55   ` Dave Hansen
  2020-02-05 18:19 ` [RFC PATCH v9 18/27] x86/cet/shstk: Introduce WRUSS instruction Yu-cheng Yu
                   ` (10 subsequent siblings)
  27 siblings, 2 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

Add basic Shadow Stack (SHSTK) enabling/disabling routines.  A task's SHSTK
is allocated from anonymous memory with the VM_SHSTK flag and read-only
protection.  Its size is fixed at the task's RLIMIT_STACK value at setup.
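
For illustration only, the resulting geometry can be modeled in user space.
The struct and the model_* helpers below are hypothetical.  The one detail
taken from this patch is that the shadow stack pointer starts at base + size,
with no empty slot at the top.  Real hardware faults on an out-of-range
access; the model returns false instead:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the fields kept in struct cet_status. */
struct shstk_model {
	uint64_t base;	/* lowest address of the shadow stack mapping */
	uint64_t size;	/* fixed at setup time (RLIMIT_STACK in the patch) */
	uint64_t ssp;	/* modeled shadow stack pointer */
};

/* Setup: the pointer starts at the very top of the mapping. */
static void model_setup(struct shstk_model *s, uint64_t base, uint64_t size)
{
	s->base = base;
	s->size = size;
	s->ssp = base + size;
}

/* A 64-bit CALL pushes one 8-byte return address; fail when full. */
static bool model_call(struct shstk_model *s)
{
	if (s->ssp - s->base < 8)
		return false;
	s->ssp -= 8;
	return true;
}

/* A 64-bit RET pops it again; fail when nothing is left to pop. */
static bool model_ret(struct shstk_model *s)
{
	if (s->ssp >= s->base + s->size)
		return false;
	s->ssp += 8;
	return true;
}
```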

v9:
- Change cpu_feature_enabled() to static_cpu_has().
- Merge cet_disable_shstk into cet_disable_free_shstk.
- Remove the empty slot at the top of the SHSTK, as it is not needed.
- Move do_mmap_locked() to alloc_shstk(), which is a static function.

v6:
- Create a function do_mmap_locked() for SHSTK allocation.

v2:
- Change noshstk to no_cet_shstk.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/cet.h                    |  31 +++++
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/processor.h              |   5 +
 arch/x86/kernel/Makefile                      |   2 +
 arch/x86/kernel/cet.c                         | 121 ++++++++++++++++++
 arch/x86/kernel/cpu/common.c                  |  25 ++++
 arch/x86/kernel/process.c                     |   1 +
 .../arch/x86/include/asm/disabled-features.h  |   8 +-
 8 files changed, 199 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/kernel/cet.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
new file mode 100644
index 000000000000..c44c991ca91f
--- /dev/null
+++ b/arch/x86/include/asm/cet.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CET_H
+#define _ASM_X86_CET_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+/*
+ * Per-thread CET status
+ */
+struct cet_status {
+	unsigned long	shstk_base;
+	unsigned long	shstk_size;
+	unsigned int	shstk_enabled:1;
+};
+
+#ifdef CONFIG_X86_INTEL_CET
+int cet_setup_shstk(void);
+void cet_disable_free_shstk(struct task_struct *p);
+#else
+static inline void cet_disable_free_shstk(struct task_struct *p) {}
+#endif
+
+#define cpu_x86_cet_enabled() \
+	(static_cpu_has(X86_FEATURE_SHSTK) || \
+	 static_cpu_has(X86_FEATURE_IBT))
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8e1d0bb46361..e1454509ad83 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -62,6 +62,12 @@
 # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
 #endif
 
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define DISABLE_SHSTK	0
+#else
+#define DISABLE_SHSTK	(1<<(X86_FEATURE_SHSTK & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -81,7 +87,7 @@
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
-#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
+#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 0340aad3f2fc..793d210e64da 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -25,6 +25,7 @@ struct vm86;
 #include <asm/special_insns.h>
 #include <asm/fpu/types.h>
 #include <asm/unwind_hints.h>
+#include <asm/cet.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -539,6 +540,10 @@ struct thread_struct {
 	unsigned int		sig_on_uaccess_err:1;
 	unsigned int		uaccess_err:1;	/* uaccess failed */
 
+#ifdef CONFIG_X86_INTEL_CET
+	struct cet_status	cet;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 6175e370ee4a..b8c1ea4ab7eb 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -142,6 +142,8 @@ obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
 obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
 obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
+obj-$(CONFIG_X86_INTEL_CET)		+= cet.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..b4c7d88e9a8f
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * cet.c - Control-flow Enforcement (CET)
+ *
+ * Copyright (c) 2019, Intel Corporation.
+ * Yu-cheng Yu <yu-cheng.yu@intel.com>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <asm/msr.h>
+#include <asm/user.h>
+#include <asm/fpu/internal.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/cet.h>
+
+static void start_update_msrs(void)
+{
+	fpregs_lock();
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		__fpregs_load_activate();
+}
+
+static void end_update_msrs(void)
+{
+	fpregs_unlock();
+}
+
+static unsigned long cet_get_shstk_addr(void)
+{
+	struct fpu *fpu = &current->thread.fpu;
+	unsigned long ssp = 0;
+
+	fpregs_lock();
+
+	if (fpregs_state_valid(fpu, smp_processor_id())) {
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+	} else {
+		struct cet_user_state *p;
+
+		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
+		if (p)
+			ssp = p->user_ssp;
+	}
+
+	fpregs_unlock();
+	return ssp;
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, populate;
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap(NULL, 0, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
+		       VM_SHSTK, 0, &populate, NULL);
+	up_write(&mm->mmap_sem);
+
+	if (populate)
+		mm_populate(addr, populate);
+
+	return addr;
+}
+
+int cet_setup_shstk(void)
+{
+	unsigned long addr, size;
+	struct cet_status *cet = &current->thread.cet;
+
+	if (!static_cpu_has(X86_FEATURE_SHSTK))
+		return -EOPNOTSUPP;
+
+	size = rlimit(RLIMIT_STACK);
+	addr = alloc_shstk(size);
+
+	if (IS_ERR((void *)addr))
+		return PTR_ERR((void *)addr);
+
+	cet->shstk_base = addr;
+	cet->shstk_size = size;
+	cet->shstk_enabled = 1;
+
+	start_update_msrs();
+	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+	wrmsrl(MSR_IA32_U_CET, MSR_IA32_CET_SHSTK_EN);
+	end_update_msrs();
+	return 0;
+}
+
+void cet_disable_free_shstk(struct task_struct *tsk)
+{
+	struct cet_status *cet = &tsk->thread.cet;
+
+	if (!static_cpu_has(X86_FEATURE_SHSTK) ||
+	    !cet->shstk_enabled || !cet->shstk_base)
+		return;
+
+	if (!tsk->mm || (tsk->mm != current->mm))
+		return;
+
+	if (tsk == current) {
+		u64 msr_val;
+
+		start_update_msrs();
+		rdmsrl(MSR_IA32_U_CET, msr_val);
+		wrmsrl(MSR_IA32_U_CET, msr_val & ~MSR_IA32_CET_SHSTK_EN);
+		end_update_msrs();
+	}
+
+	vm_munmap(cet->shstk_base, cet->shstk_size);
+	cet->shstk_base = 0;
+	cet->shstk_size = 0;
+	cet->shstk_enabled = 0;
+}
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 2e4d90294fe6..40498ec72fda 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -54,6 +54,7 @@
 #include <asm/microcode_intel.h>
 #include <asm/intel-family.h>
 #include <asm/cpu_device_id.h>
+#include <asm/cet.h>
 #include <asm/uv/uv.h>
 
 #include "cpu.h"
@@ -486,6 +487,29 @@ static __init int setup_disable_pku(char *arg)
 __setup("nopku", setup_disable_pku);
 #endif /* CONFIG_X86_64 */
 
+static __always_inline void setup_cet(struct cpuinfo_x86 *c)
+{
+	if (cpu_x86_cet_enabled())
+		cr4_set_bits(X86_CR4_CET);
+}
+
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+static __init int setup_disable_shstk(char *s)
+{
+	/* require an exact match without trailing characters */
+	if (s[0] != '\0')
+		return 0;
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		return 1;
+
+	setup_clear_cpu_cap(X86_FEATURE_SHSTK);
+	pr_info("x86: 'no_cet_shstk' specified, disabling Shadow Stack\n");
+	return 1;
+}
+__setup("no_cet_shstk", setup_disable_shstk);
+#endif
+
 /*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
@@ -1510,6 +1534,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	x86_init_rdrand(c);
 	x86_init_cache_qos(c);
 	setup_pku(c);
+	setup_cet(c);
 
 	/*
 	 * Clear/Set all flags overridden by options, need do it
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 8d0b9442202e..e102e63de641 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -43,6 +43,7 @@
 #include <asm/spec-ctrl.h>
 #include <asm/io_bitmap.h>
 #include <asm/proto.h>
+#include <asm/cet.h>
 
 #include "process.h"
 
diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h
index 8e1d0bb46361..e1454509ad83 100644
--- a/tools/arch/x86/include/asm/disabled-features.h
+++ b/tools/arch/x86/include/asm/disabled-features.h
@@ -62,6 +62,12 @@
 # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
 #endif
 
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define DISABLE_SHSTK	0
+#else
+#define DISABLE_SHSTK	(1<<(X86_FEATURE_SHSTK & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -81,7 +87,7 @@
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
-#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
+#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
-- 
2.21.0




* [RFC PATCH v9 18/27] x86/cet/shstk: Introduce WRUSS instruction
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (16 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 17/27] x86/cet/shstk: User-mode Shadow Stack support Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 21:10   ` Kees Cook
  2020-02-05 18:19 ` [RFC PATCH v9 19/27] x86/cet/shstk: Handle signals for Shadow Stack Yu-cheng Yu
                   ` (9 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

WRUSS is a new kernel-mode instruction that writes directly to user Shadow
Stack (SHSTK) memory.  It is used to construct a return address on the SHSTK
for the signal handler.

WRUSS can fault if the target address is not valid SHSTK memory.  In that
case, the exception-table fixup makes the helper return an error.

v4:
- Change to asm goto.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/special_insns.h | 32 ++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 6d37b8fcfc77..1b9b2e79c353 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -222,6 +222,38 @@ static inline void clwb(volatile void *__p)
 		: [pax] "a" (p));
 }
 
+#ifdef CONFIG_X86_INTEL_CET
+#if defined(CONFIG_IA32_EMULATION) || defined(CONFIG_X86_X32)
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+	asm_volatile_goto("1: wrussd %1, (%0)\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: "r" (addr), "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EPERM;
+}
+#else
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+	WARN_ONCE(1, "%s used but not supported.\n", __func__);
+	return -EFAULT;
+}
+#endif
+
+static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
+{
+	asm_volatile_goto("1: wrussq %1, (%0)\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: "r" (addr), "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EPERM;
+}
+#endif /* CONFIG_X86_INTEL_CET */
+
 #define nop() asm volatile ("nop")
 
 
-- 
2.21.0




* [RFC PATCH v9 19/27] x86/cet/shstk: Handle signals for Shadow Stack
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (17 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 18/27] x86/cet/shstk: Introduce WRUSS instruction Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 21:17   ` Kees Cook
  2020-02-05 18:19 ` [RFC PATCH v9 20/27] ELF: UAPI and Kconfig additions for ELF program properties Yu-cheng Yu
                   ` (8 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

To deliver a signal, create a Shadow Stack (SHSTK) restore token and put
the token and the signal restorer address on the SHSTK.  For sigreturn,
verify the token and restore the SHSTK pointer.
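
For illustration only, the token arithmetic can be modeled in user space.
The model_* names are hypothetical, and the WRUSS write and the get_user()
read are replaced by plain out-parameters and arguments.  The checks follow
create_rstor_token() and verify_rstor_token() in this patch:

```c
#include <stdbool.h>
#include <stdint.h>

#define TOKEN_MODE_MASK	3ULL
#define TOKEN_MODE_64	1ULL

static bool model_is_aligned(uint64_t v, uint64_t a)
{
	return (v & (a - 1)) == 0;
}

static uint64_t model_align_down(uint64_t v, uint64_t a)
{
	return v & ~(a - 1);
}

/* Compute where the token goes and what value it holds; 0 on success. */
static int model_create_token(bool ia32, uint64_t ssp,
			      uint64_t *token_addr, uint64_t *token)
{
	if ((!ia32 && !model_is_aligned(ssp, 8)) || !model_is_aligned(ssp, 4))
		return -1;

	/* The token sits in the 8-byte slot just below the aligned ssp. */
	*token_addr = model_align_down(ssp, 8) - 8;
	/* Bit 0 marks a 64-bit token; a 32-bit token keeps both low bits clear. */
	*token = ia32 ? ssp : (ssp | TOKEN_MODE_64);
	return 0;
}

/* Check a token read from token_addr; on success return the old ssp. */
static int model_verify_token(bool ia32, uint64_t token_addr, uint64_t token,
			      uint64_t *new_ssp)
{
	if (!model_is_aligned(token_addr, 8))
		return -1;
	if ((token & TOKEN_MODE_MASK) != (ia32 ? 0 : TOKEN_MODE_64))
		return -1;

	token &= ~TOKEN_MODE_MASK;
	if ((!ia32 && !model_is_aligned(token, 8)) || !model_is_aligned(token, 4))
		return -1;
	/* The token must point back to the slot directly above itself. */
	if (model_align_down(token, 8) - 8 != token_addr)
		return -1;

	*new_ssp = token;
	return 0;
}
```

Because the token encodes both the mode and its own expected location, a
token copied to another slot, or one created for the other mode, fails
verification in this model.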

Introduce a signal context extension struct 'sc_ext', which is used to save
the SHSTK restore token address and the WAIT_ENDBR status.  WAIT_ENDBR will
be introduced later in the Indirect Branch Tracking (IBT) series, but it is
added to sc_ext now so that the struct layout stays stable when the IBT
series is applied.
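
For illustration only, the location of sc_ext relative to the fpstate pointer
can be sketched as below.  The two size constants are assumed stand-ins for
sizeof(struct fregs_state) and FP_XSTATE_MAGIC2_SIZE, and the xstate size is
passed in rather than read from fpu_user_xstate_size.  The arithmetic follows
save_cet_to_sigframe() in this patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for the sketch; the kernel takes them from its headers. */
#define MODEL_FREGS_STATE_SIZE		112u
#define MODEL_FP_XSTATE_MAGIC2_SIZE	4u

static uint64_t model_align_up(uint64_t v, uint64_t a)
{
	return (v + a - 1) & ~(a - 1);
}

/* Address of sc_ext for a frame whose fpstate starts at fp. */
static uint64_t model_sc_ext_addr(uint64_t fp, bool ia32, uint64_t xstate_size)
{
	uint64_t p = fp;

	/* ia32 frames carry a legacy fregs_state area first. */
	if (ia32)
		p += MODEL_FREGS_STATE_SIZE;

	/* Skip the xstate area and the trailing magic word. */
	p += xstate_size + MODEL_FP_XSTATE_MAGIC2_SIZE;

	/* sc_ext is aligned up to 8 bytes. */
	return model_align_up(p, 8);
}
```

The alignment step can add up to 7 bytes of padding, which is consistent
with the extra 8 bytes that fpu__alloc_sigcontext_ext() reserves on top of
sizeof(struct sc_ext).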

v9:
- Update CET MSR access according to XSAVES supervisor state changes.
- Add 'wait_endbr' to struct 'sc_ext'.
- Update and simplify signal frame allocation, setup, and restoration.
- Update commit log text.

v2:
- Move CET status from sigcontext to a separate struct sc_ext, which is
  located above the fpstate on the signal frame.
- Add a restore token for sigreturn address.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/ia32/ia32_signal.c            |  17 +++
 arch/x86/include/asm/cet.h             |   7 ++
 arch/x86/include/asm/fpu/internal.h    |   2 +
 arch/x86/include/uapi/asm/sigcontext.h |   9 ++
 arch/x86/kernel/cet.c                  | 153 +++++++++++++++++++++++++
 arch/x86/kernel/fpu/signal.c           |  89 ++++++++++++++
 arch/x86/kernel/signal.c               |  10 ++
 7 files changed, 287 insertions(+)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 30416d7f19d4..c0bb350a3d2d 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -35,6 +35,7 @@
 #include <asm/sigframe.h>
 #include <asm/sighandling.h>
 #include <asm/smap.h>
+#include <asm/cet.h>
 
 /*
  * Do a signal return; undo the signal stack.
@@ -223,6 +224,7 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
 				 void __user **fpstate)
 {
 	unsigned long sp, fx_aligned, math_size;
+	void __user *restorer = NULL;
 
 	/* Default to using normal stack */
 	sp = regs->sp;
@@ -236,8 +238,23 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
 		 ksig->ka.sa.sa_restorer)
 		sp = (unsigned long) ksig->ka.sa.sa_restorer;
 
+	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+		restorer = ksig->ka.sa.sa_restorer;
+	} else if (current->mm->context.vdso) {
+		if (ksig->ka.sa.sa_flags & SA_SIGINFO)
+			restorer = current->mm->context.vdso +
+				vdso_image_32.sym___kernel_rt_sigreturn;
+		else
+			restorer = current->mm->context.vdso +
+				vdso_image_32.sym___kernel_sigreturn;
+	}
+
 	sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size);
 	*fpstate = (struct _fpstate_32 __user *) sp;
+
+	if (save_cet_to_sigframe(*fpstate, (unsigned long)restorer, 1))
+		return (void __user *) -1L;
+
 	if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned,
 				     math_size) < 0)
 		return (void __user *) -1L;
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index c44c991ca91f..409d4f91a0dc 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -6,6 +6,8 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct sc_ext;
+
 /*
  * Per-thread CET status
  */
@@ -18,8 +20,13 @@ struct cet_status {
 #ifdef CONFIG_X86_INTEL_CET
 int cet_setup_shstk(void);
 void cet_disable_free_shstk(struct task_struct *p);
+int cet_restore_signal(bool ia32, struct sc_ext *sc);
+int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
 #else
 static inline void cet_disable_free_shstk(struct task_struct *p) {}
+static inline int cet_restore_signal(bool ia32, struct sc_ext *sc) { return -EINVAL; }
+static inline int cet_setup_signal(bool ia32, unsigned long rstor,
+				   struct sc_ext *sc) { return -EINVAL; }
 #endif
 
 #define cpu_x86_cet_enabled() \
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 42159f45bf9c..241521c0ed02 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -476,6 +476,8 @@ static inline void copy_kernel_to_fpregs(union fpregs_state *fpstate)
 	__copy_kernel_to_fpregs(fpstate, -1);
 }
 
+extern int save_cet_to_sigframe(void __user *fp, unsigned long restorer,
+				int is_ia32);
 extern int copy_fpstate_to_sigframe(void __user *buf, void __user *fp, int size);
 
 /*
diff --git a/arch/x86/include/uapi/asm/sigcontext.h b/arch/x86/include/uapi/asm/sigcontext.h
index 844d60eb1882..cf2d55db3be4 100644
--- a/arch/x86/include/uapi/asm/sigcontext.h
+++ b/arch/x86/include/uapi/asm/sigcontext.h
@@ -196,6 +196,15 @@ struct _xstate {
 	/* New processor state extensions go here: */
 };
 
+/*
+ * Located at the end of sigcontext->fpstate, aligned to 8.
+ */
+struct sc_ext {
+	unsigned long total_size;
+	unsigned long ssp;
+	unsigned long wait_endbr;
+};
+
 /*
  * The 32-bit signal frame:
  */
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index b4c7d88e9a8f..cba5c7656aab 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -19,6 +19,8 @@
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/types.h>
 #include <asm/cet.h>
+#include <asm/special_insns.h>
+#include <uapi/asm/sigcontext.h>
 
 static void start_update_msrs(void)
 {
@@ -69,6 +71,80 @@ static unsigned long alloc_shstk(unsigned long size)
 	return addr;
 }
 
+#define TOKEN_MODE_MASK	3UL
+#define TOKEN_MODE_64	1UL
+#define IS_TOKEN_64(token) ((token & TOKEN_MODE_MASK) == TOKEN_MODE_64)
+#define IS_TOKEN_32(token) ((token & TOKEN_MODE_MASK) == 0)
+
+/*
+ * Verify the restore token at the address of 'ssp' is
+ * valid and then set shadow stack pointer according to the
+ * token.
+ */
+static int verify_rstor_token(bool ia32, unsigned long ssp,
+			      unsigned long *new_ssp)
+{
+	unsigned long token;
+
+	*new_ssp = 0;
+
+	if (!IS_ALIGNED(ssp, 8))
+		return -EINVAL;
+
+	if (get_user(token, (unsigned long __user *)ssp))
+		return -EFAULT;
+
+	/* Is 64-bit mode flag correct? */
+	if (!ia32 && !IS_TOKEN_64(token))
+		return -EINVAL;
+	else if (ia32 && !IS_TOKEN_32(token))
+		return -EINVAL;
+
+	token &= ~TOKEN_MODE_MASK;
+
+	/*
+	 * Restore address properly aligned?
+	 */
+	if ((!ia32 && !IS_ALIGNED(token, 8)) || !IS_ALIGNED(token, 4))
+		return -EINVAL;
+
+	/*
+	 * Token was placed properly?
+	 */
+	if ((ALIGN_DOWN(token, 8) - 8) != ssp)
+		return -EINVAL;
+
+	*new_ssp = token;
+	return 0;
+}
+
+/*
+ * Create a restore token on the shadow stack.
+ * A token is always 8-byte and aligned to 8.
+ */
+static int create_rstor_token(bool ia32, unsigned long ssp,
+			      unsigned long *new_ssp)
+{
+	unsigned long addr;
+
+	*new_ssp = 0;
+
+	if ((!ia32 && !IS_ALIGNED(ssp, 8)) || !IS_ALIGNED(ssp, 4))
+		return -EINVAL;
+
+	addr = ALIGN_DOWN(ssp, 8) - 8;
+
+	/* Is the token for 64-bit? */
+	if (!ia32)
+		ssp |= TOKEN_MODE_64;
+
+	if (write_user_shstk_64(addr, ssp))
+		return -EFAULT;
+
+	*new_ssp = addr;
+	return 0;
+}
+
 int cet_setup_shstk(void)
 {
 	unsigned long addr, size;
@@ -119,3 +195,80 @@ void cet_disable_free_shstk(struct task_struct *tsk)
 	cet->shstk_size = 0;
 	cet->shstk_enabled = 0;
 }
+
+/*
+ * Called from __fpu__restore_sig() and XSAVES buffer is protected by
+ * set_thread_flag(TIF_NEED_FPU_LOAD).
+ */
+int cet_restore_signal(bool ia32, struct sc_ext *sc_ext)
+{
+	struct cet_user_state *cet_user_state;
+	struct cet_status *cet = &current->thread.cet;
+	unsigned long new_ssp = 0;
+	u64 msr_val = 0;
+	int err;
+
+	if (!cet->shstk_enabled)
+		return 0;
+
+	cet_user_state = get_xsave_addr(&current->thread.fpu.state.xsave,
+					XFEATURE_CET_USER);
+	if (!cet_user_state)
+		return -1;
+
+	if (cet->shstk_enabled) {
+		err = verify_rstor_token(ia32, sc_ext->ssp, &new_ssp);
+		if (err)
+			return err;
+
+		cet_user_state->user_ssp = new_ssp;
+		msr_val |= MSR_IA32_CET_SHSTK_EN;
+	}
+
+	cet_user_state->user_cet = msr_val;
+	return 0;
+}
+
+/*
+ * Setup the shadow stack for the signal handler: first,
+ * create a restore token to keep track of the current ssp,
+ * and then the return address of the signal handler.
+ */
+int cet_setup_signal(bool ia32, unsigned long rstor_addr, struct sc_ext *sc_ext)
+{
+	struct cet_status *cet = &current->thread.cet;
+	unsigned long ssp = 0, new_ssp = 0;
+	int err;
+
+	if (!cet->shstk_enabled)
+		return 0;
+
+	if (cet->shstk_enabled) {
+		if (!rstor_addr)
+			return -EINVAL;
+
+		ssp = cet_get_shstk_addr();
+		err = create_rstor_token(ia32, ssp, &new_ssp);
+		if (err)
+			return err;
+
+		if (ia32) {
+			ssp = new_ssp - sizeof(u32);
+			err = write_user_shstk_32(ssp, (unsigned int)rstor_addr);
+		} else {
+			ssp = new_ssp - sizeof(u64);
+			err = write_user_shstk_64(ssp, rstor_addr);
+		}
+
+		if (err)
+			return err;
+
+		sc_ext->ssp = new_ssp;
+	}
+
+	start_update_msrs();
+	if (cet->shstk_enabled)
+		wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	end_update_msrs();
+
+	return 0;
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 0d3e06a772b0..875cc0fadce3 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -52,6 +52,69 @@ static inline int check_for_xstate(struct fxregs_state __user *buf,
 	return 0;
 }
 
+int save_cet_to_sigframe(void __user *fp, unsigned long restorer, int is_ia32)
+{
+	int err = 0;
+
+#ifdef CONFIG_X86_INTEL_CET
+	if (!current->thread.cet.shstk_enabled)
+		return 0;
+
+	if (fp) {
+		struct sc_ext ext = {0, 0, 0};
+
+		err = cet_setup_signal(is_ia32, restorer, &ext);
+		if (!err) {
+			void __user *p = fp;
+
+			ext.total_size = sizeof(ext);
+
+			if (is_ia32)
+				p += sizeof(struct fregs_state);
+
+			p += fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+			p = (void __user *)ALIGN((unsigned long)p, 8);
+
+			if (copy_to_user(p, &ext, sizeof(ext)))
+				return -EFAULT;
+		}
+	}
+#endif
+
+	return err;
+}
+
+static int restore_cet_from_sigframe(int is_ia32, void __user *fp)
+{
+	int err = 0;
+
+#ifdef CONFIG_X86_INTEL_CET
+	if (!current->thread.cet.shstk_enabled)
+		return 0;
+
+	if (fp) {
+		struct sc_ext ext = {0, 0, 0};
+		void __user *p = fp;
+
+		if (is_ia32)
+			p += sizeof(struct fregs_state);
+
+		p += fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+		p = (void __user *)ALIGN((unsigned long)p, 8);
+
+		if (copy_from_user(&ext, p, sizeof(ext)))
+			return -EFAULT;
+
+		if (ext.total_size != sizeof(ext))
+			return -EFAULT;
+
+		err = cet_restore_signal(is_ia32, &ext);
+	}
+#endif
+
+	return err;
+}
+
 /*
  * Signal frame handlers.
  */
@@ -367,6 +430,10 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		pagefault_disable();
 		ret = copy_user_to_fpregs_zeroing(buf_fx, xfeatures_user, fx_only);
 		pagefault_enable();
+
+		if (!ret)
+			ret = restore_cet_from_sigframe(0, buf);
+
 		if (!ret) {
 			if (xfeatures_mask_supervisor())
 				copy_kernel_to_xregs(&fpu->state.xsave,
@@ -397,6 +464,10 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		sanitize_restored_user_xstate(&fpu->state, envp, xfeatures_user,
 					      fx_only);
 
+		ret = restore_cet_from_sigframe((int)ia32_fxstate, buf);
+		if (ret)
+			goto err_out;
+
 		fpregs_lock();
 		if (unlikely(init_bv))
 			copy_kernel_to_xregs(&init_fpstate.xsave, init_bv);
@@ -468,12 +539,30 @@ int fpu__restore_sig(void __user *buf, int ia32_frame)
 	return __fpu__restore_sig(buf, buf_fx, size);
 }
 
+static unsigned long fpu__alloc_sigcontext_ext(unsigned long sp)
+{
+	/*
+	 * sigcontext_ext is at: fpu + fpu_user_xstate_size +
+	 * FP_XSTATE_MAGIC2_SIZE, then aligned to 8.
+	 */
+	if (cpu_x86_cet_enabled()) {
+		struct cet_status *cet = &current->thread.cet;
+
+		if (cet->shstk_enabled)
+			sp -= (sizeof(struct sc_ext) + 8);
+	}
+
+	return sp;
+}
+
 unsigned long
 fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
 		     unsigned long *buf_fx, unsigned long *size)
 {
 	unsigned long frame_size = xstate_sigframe_size();
 
+	sp = fpu__alloc_sigcontext_ext(sp);
+
 	*buf_fx = sp = round_down(sp - frame_size, 64);
 	if (ia32_frame && use_fxsr()) {
 		frame_size += sizeof(struct fregs_state);
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index ce9421ec285f..b26f5084a8a1 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -46,6 +46,7 @@
 
 #include <asm/sigframe.h>
 #include <asm/signal.h>
+#include <asm/cet.h>
 
 #define COPY(x)			do {			\
 	get_user_ex(regs->x, &sc->x);			\
@@ -246,6 +247,9 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
 	unsigned long buf_fx = 0;
 	int onsigstack = on_sig_stack(sp);
 	int ret;
+#ifdef CONFIG_X86_64
+	void __user *restorer = NULL;
+#endif
 
 	/* redzone */
 	if (IS_ENABLED(CONFIG_X86_64))
@@ -277,6 +281,12 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
 	if (onsigstack && !likely(on_sig_stack(sp)))
 		return (void __user *)-1L;
 
+#ifdef CONFIG_X86_64
+	if (ka->sa.sa_flags & SA_RESTORER)
+		restorer = ka->sa.sa_restorer;
+	ret = save_cet_to_sigframe(*fpstate, (unsigned long)restorer, 0);
+#endif
+
 	/* save i387 and extended state */
 	ret = copy_fpstate_to_sigframe(*fpstate, (void __user *)buf_fx, math_size);
 	if (ret < 0)
-- 
2.21.0




* [RFC PATCH v9 20/27] ELF: UAPI and Kconfig additions for ELF program properties
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (18 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 19/27] x86/cet/shstk: Handle signals for Shadow Stack Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-05 18:19 ` [RFC PATCH v9 21/27] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND Yu-cheng Yu
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

Introduce basic ELF definitions relating to the NT_GNU_PROPERTY_TYPE_0
note.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/Kconfig.binfmt        | 3 +++
 include/linux/elf.h      | 8 ++++++++
 include/uapi/linux/elf.h | 1 +
 3 files changed, 12 insertions(+)

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index 62dc4f577ba1..d2cfe0729a73 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -36,6 +36,9 @@ config COMPAT_BINFMT_ELF
 config ARCH_BINFMT_ELF_STATE
 	bool
 
+config ARCH_USE_GNU_PROPERTY
+	bool
+
 config BINFMT_ELF_FDPIC
 	bool "Kernel support for FDPIC ELF binaries"
 	default y if !BINFMT_ELF
diff --git a/include/linux/elf.h b/include/linux/elf.h
index e3649b3e970e..459cddcceaac 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_ELF_H
 #define _LINUX_ELF_H
 
+#include <linux/types.h>
 #include <asm/elf.h>
 #include <uapi/linux/elf.h>
 
@@ -56,4 +57,11 @@ static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) {
 extern int elf_coredump_extra_notes_size(void);
 extern int elf_coredump_extra_notes_write(struct coredump_params *cprm);
 #endif
+
+/* NT_GNU_PROPERTY_TYPE_0 header */
+struct gnu_property {
+	u32 pr_type;
+	u32 pr_datasz;
+};
+
 #endif /* _LINUX_ELF_H */
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 34c02e4290fe..c37731407074 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -36,6 +36,7 @@ typedef __s64	Elf64_Sxword;
 #define PT_LOPROC  0x70000000
 #define PT_HIPROC  0x7fffffff
 #define PT_GNU_EH_FRAME		0x6474e550
+#define PT_GNU_PROPERTY		0x6474e553
 
 #define PT_GNU_STACK	(PT_LOOS + 0x474e551)
 
-- 
2.21.0




* [RFC PATCH v9 21/27] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (19 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 20/27] ELF: UAPI and Kconfig additions for ELF program properties Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 21:18   ` Kees Cook
  2020-02-05 18:19 ` [RFC PATCH v9 22/27] ELF: Add ELF program property parsing support Yu-cheng Yu
                   ` (6 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

An ELF file's .note.gnu.property indicates architecture features of
the file.  Introduce feature definitions for Control-flow Enforcement
Technology (CET): Shadow Stack and Indirect Branch Tracking.
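
As a rough illustration of how a consumer might test these bits, here is
a minimal user-space sketch.  The three constants are copied from the
patch in this message; the helper names (feature_1_has_shstk() and so
on) are hypothetical and are not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the patch in this message. */
#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002
#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002

static_assert((GNU_PROPERTY_X86_FEATURE_1_IBT &
	       GNU_PROPERTY_X86_FEATURE_1_SHSTK) == 0,
	      "feature bits must not overlap");

/* Hypothetical helper: true if the feature word marks the file SHSTK-capable. */
static bool feature_1_has_shstk(uint32_t word)
{
	return (word & GNU_PROPERTY_X86_FEATURE_1_SHSTK) != 0;
}

/* Hypothetical helper: true if the feature word marks the file IBT-capable. */
static bool feature_1_has_ibt(uint32_t word)
{
	return (word & GNU_PROPERTY_X86_FEATURE_1_IBT) != 0;
}

/* Hypothetical helper: keep only the two bits defined by this patch. */
static uint32_t feature_1_known_bits(uint32_t word)
{
	return word & (GNU_PROPERTY_X86_FEATURE_1_IBT |
		       GNU_PROPERTY_X86_FEATURE_1_SHSTK);
}
```

A caller would pass the 32-bit data word of a property whose pr_type is
GNU_PROPERTY_X86_FEATURE_1_AND; bits not defined here are ignored by
feature_1_known_bits().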

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 include/uapi/linux/elf.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index c37731407074..61251ecabdd7 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -444,4 +444,11 @@ typedef struct elf64_note {
   Elf64_Word n_type;	/* Content type */
 } Elf64_Nhdr;
 
+/* .note.gnu.property types */
+#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002
+
+/* Bits of GNU_PROPERTY_X86_FEATURE_1_AND */
+#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001
+#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002
+
 #endif /* _UAPI_LINUX_ELF_H */
-- 
2.21.0




* [RFC PATCH v9 22/27] ELF: Add ELF program property parsing support
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (20 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 21/27] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 21:20   ` Kees Cook
  2020-02-05 18:19 ` [RFC PATCH v9 23/27] ELF: Introduce arch_setup_elf_property() Yu-cheng Yu
                   ` (5 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

From: Dave Martin <Dave.Martin@arm.com>

ELF program properties will be needed for detecting whether to
enable optional architecture or ABI features for a new ELF process.

For now, there are no generic properties that we care about, so do
nothing unless CONFIG_ARCH_USE_GNU_PROPERTY=y.

Otherwise, check for the presence of properties using the
PT_GNU_PROPERTY phdrs entry (if any), and notify the arch code of each
property.

For now, the added code is not used.
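
The walk over the property array can be sketched in user space roughly
as follows.  This is not the kernel code: walk_gnu_properties(),
property_cb and collect_feature_1() are hypothetical names, the
descriptor is assumed to be in host byte order, and 'align' stands in
for ELF_GNU_PROPERTY_ALIGN (4 for ELF32, 8 for ELF64).  The checks
mirror those in parse_elf_property(): bounds, padding, and strictly
increasing pr_type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct gnu_property in the series. */
struct gnu_property {
	uint32_t pr_type;
	uint32_t pr_datasz;
};
static_assert(sizeof(struct gnu_property) == 8, "header is two 32-bit words");

/* Hypothetical callback type: return non-zero to stop the walk. */
typedef int (*property_cb)(uint32_t type, const void *data, size_t datasz,
			   void *ctx);

/*
 * Walk the properties in a note descriptor.  Returns 0 on success, -1 on
 * malformed input, or the first non-zero value returned by the callback.
 */
static int walk_gnu_properties(const unsigned char *desc, size_t descsz,
			       size_t align, property_cb cb, void *ctx)
{
	size_t off = 0;
	int have_prev = 0;
	uint32_t prev_type = 0;

	while (off < descsz) {
		struct gnu_property pr;
		size_t remaining = descsz - off;
		size_t step;
		int ret;

		if (off % align || remaining < sizeof(pr))
			return -1;
		memcpy(&pr, desc + off, sizeof(pr));
		off += sizeof(pr);
		remaining -= sizeof(pr);

		if (pr.pr_datasz > remaining)
			return -1;
		/* The data is padded up to the alignment. */
		step = (pr.pr_datasz + align - 1) / align * align;
		if (step > remaining)
			return -1;

		/* Properties must be unique and sorted on pr_type. */
		if (have_prev && pr.pr_type <= prev_type)
			return -1;
		prev_type = pr.pr_type;
		have_prev = 1;

		ret = cb(pr.pr_type, desc + off, pr.pr_datasz, ctx);
		if (ret)
			return ret;
		off += step;
	}
	return 0;
}

/* Hypothetical callback: store the x86 feature word (type 0xc0000002). */
static int collect_feature_1(uint32_t type, const void *data, size_t datasz,
			     void *ctx)
{
	if (type != 0xc0000002u)
		return 0;
	if (datasz != sizeof(uint32_t))
		return -2;
	memcpy(ctx, data, sizeof(uint32_t));
	return 0;
}
```

The callback plays the role of arch_parse_elf_property(); unknown
property types are skipped rather than rejected.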

Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 fs/binfmt_elf.c          | 127 +++++++++++++++++++++++++++++++++++++++
 fs/compat_binfmt_elf.c   |   4 ++
 include/linux/elf.h      |  19 ++++++
 include/uapi/linux/elf.h |   4 ++
 4 files changed, 154 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index ecd8d2698515..054446f93442 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -39,12 +39,18 @@
 #include <linux/sched/coredump.h>
 #include <linux/sched/task_stack.h>
 #include <linux/sched/cputime.h>
+#include <linux/sizes.h>
+#include <linux/types.h>
 #include <linux/cred.h>
 #include <linux/dax.h>
 #include <linux/uaccess.h>
 #include <asm/param.h>
 #include <asm/page.h>
 
+#ifndef ELF_COMPAT
+#define ELF_COMPAT 0
+#endif
+
 #ifndef user_long_t
 #define user_long_t long
 #endif
@@ -678,6 +684,111 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
  * libraries.  There is no binary dependent code anywhere else.
  */
 
+static int parse_elf_property(const char *data, size_t *off, size_t datasz,
+			      struct arch_elf_state *arch,
+			      bool have_prev_type, u32 *prev_type)
+{
+	size_t o, step;
+	const struct gnu_property *pr;
+	int ret;
+
+	if (*off == datasz)
+		return -ENOENT;
+
+	if (WARN_ON(*off > datasz || *off % ELF_GNU_PROPERTY_ALIGN))
+		return -EIO;
+	o = *off;
+	datasz -= *off;
+
+	if (datasz < sizeof(*pr))
+		return -EIO;
+	pr = (const struct gnu_property *)(data + o);
+	o += sizeof(*pr);
+	datasz -= sizeof(*pr);
+
+	if (pr->pr_datasz > datasz)
+		return -EIO;
+
+	WARN_ON(o % ELF_GNU_PROPERTY_ALIGN);
+	step = round_up(pr->pr_datasz, ELF_GNU_PROPERTY_ALIGN);
+	if (step > datasz)
+		return -EIO;
+
+	/* Properties are supposed to be unique and sorted on pr_type: */
+	if (have_prev_type && pr->pr_type <= *prev_type)
+		return -EIO;
+	*prev_type = pr->pr_type;
+
+	ret = arch_parse_elf_property(pr->pr_type, data + o,
+				      pr->pr_datasz, ELF_COMPAT, arch);
+	if (ret)
+		return ret;
+
+	*off = o + step;
+	return 0;
+}
+
+#define NOTE_DATA_SZ SZ_1K
+#define GNU_PROPERTY_TYPE_0_NAME "GNU"
+#define NOTE_NAME_SZ (sizeof(GNU_PROPERTY_TYPE_0_NAME))
+
+static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
+				struct arch_elf_state *arch)
+{
+	union {
+		struct elf_note nhdr;
+		char data[NOTE_DATA_SZ];
+	} note;
+	loff_t pos;
+	ssize_t n;
+	size_t off, datasz;
+	int ret;
+	bool have_prev_type;
+	u32 prev_type;
+
+	if (!IS_ENABLED(CONFIG_ARCH_USE_GNU_PROPERTY) || !phdr)
+		return 0;
+
+	/* load_elf_binary() shouldn't call us unless this is true... */
+	if (WARN_ON(phdr->p_type != PT_GNU_PROPERTY))
+		return -EIO;
+
+	/* If the properties are crazy large, that's too bad (for now): */
+	if (phdr->p_filesz > sizeof(note))
+		return -ENOEXEC;
+
+	pos = phdr->p_offset;
+	n = kernel_read(f, &note, phdr->p_filesz, &pos);
+
+	BUILD_BUG_ON(sizeof(note) < sizeof(note.nhdr) + NOTE_NAME_SZ);
+	if (n < 0 || n < sizeof(note.nhdr) + NOTE_NAME_SZ)
+		return -EIO;
+
+	if (note.nhdr.n_type != NT_GNU_PROPERTY_TYPE_0 ||
+	    note.nhdr.n_namesz != NOTE_NAME_SZ ||
+	    strncmp(note.data + sizeof(note.nhdr),
+		    GNU_PROPERTY_TYPE_0_NAME, n - sizeof(note.nhdr)))
+		return -EIO;
+
+	off = round_up(sizeof(note.nhdr) + NOTE_NAME_SZ,
+		       ELF_GNU_PROPERTY_ALIGN);
+	if (off > n)
+		return -EIO;
+
+	if (note.nhdr.n_descsz > n - off)
+		return -EIO;
+	datasz = off + note.nhdr.n_descsz;
+
+	have_prev_type = false;
+	do {
+		ret = parse_elf_property(note.data, &off, datasz, arch,
+					 have_prev_type, &prev_type);
+		have_prev_type = true;
+	} while (!ret);
+
+	return ret == -ENOENT ? 0 : ret;
+}
+
 static int load_elf_binary(struct linux_binprm *bprm)
 {
 	struct file *interpreter = NULL; /* to shut gcc up */
@@ -685,6 +796,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	int load_addr_set = 0;
 	unsigned long error;
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
+	struct elf_phdr *elf_property_phdata = NULL;
 	unsigned long elf_bss, elf_brk;
 	int bss_prot = 0;
 	int retval, i;
@@ -731,6 +843,11 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
 		char *elf_interpreter;
 
+		if (elf_ppnt->p_type == PT_GNU_PROPERTY) {
+			elf_property_phdata = elf_ppnt;
+			continue;
+		}
+
 		if (elf_ppnt->p_type != PT_INTERP)
 			continue;
 
@@ -818,9 +935,14 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			goto out_free_dentry;
 
 		/* Pass PT_LOPROC..PT_HIPROC headers to arch code */
+		elf_property_phdata = NULL;
 		elf_ppnt = interp_elf_phdata;
 		for (i = 0; i < loc->interp_elf_ex.e_phnum; i++, elf_ppnt++)
 			switch (elf_ppnt->p_type) {
+			case PT_GNU_PROPERTY:
+				elf_property_phdata = elf_ppnt;
+				break;
+
 			case PT_LOPROC ... PT_HIPROC:
 				retval = arch_elf_pt_proc(&loc->interp_elf_ex,
 							  elf_ppnt, interpreter,
@@ -831,6 +953,11 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			}
 	}
 
+	retval = parse_elf_properties(interpreter ?: bprm->file,
+				      elf_property_phdata, &arch_state);
+	if (retval)
+		goto out_free_dentry;
+
 	/*
 	 * Allow arch code to reject the ELF at this point, whilst it's
 	 * still possible to return an error to the code that invoked
diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c
index aaad4ca1217e..13a087bc816b 100644
--- a/fs/compat_binfmt_elf.c
+++ b/fs/compat_binfmt_elf.c
@@ -17,6 +17,8 @@
 #include <linux/elfcore-compat.h>
 #include <linux/time.h>
 
+#define ELF_COMPAT	1
+
 /*
  * Rename the basic ELF layout types to refer to the 32-bit class of files.
  */
@@ -28,11 +30,13 @@
 #undef	elf_shdr
 #undef	elf_note
 #undef	elf_addr_t
+#undef	ELF_GNU_PROPERTY_ALIGN
 #define elfhdr		elf32_hdr
 #define elf_phdr	elf32_phdr
 #define elf_shdr	elf32_shdr
 #define elf_note	elf32_note
 #define elf_addr_t	Elf32_Addr
+#define ELF_GNU_PROPERTY_ALIGN	ELF32_GNU_PROPERTY_ALIGN
 
 /*
  * Some data types as stored in coredump.
diff --git a/include/linux/elf.h b/include/linux/elf.h
index 459cddcceaac..7bdc6da160c7 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -22,6 +22,9 @@
 	SET_PERSONALITY(ex)
 #endif
 
+#define ELF32_GNU_PROPERTY_ALIGN	4
+#define ELF64_GNU_PROPERTY_ALIGN	8
+
 #if ELF_CLASS == ELFCLASS32
 
 extern Elf32_Dyn _DYNAMIC [];
@@ -32,6 +35,7 @@ extern Elf32_Dyn _DYNAMIC [];
 #define elf_addr_t	Elf32_Off
 #define Elf_Half	Elf32_Half
 #define Elf_Word	Elf32_Word
+#define ELF_GNU_PROPERTY_ALIGN	ELF32_GNU_PROPERTY_ALIGN
 
 #else
 
@@ -43,6 +47,7 @@ extern Elf64_Dyn _DYNAMIC [];
 #define elf_addr_t	Elf64_Off
 #define Elf_Half	Elf64_Half
 #define Elf_Word	Elf64_Word
+#define ELF_GNU_PROPERTY_ALIGN	ELF64_GNU_PROPERTY_ALIGN
 
 #endif
 
@@ -64,4 +69,18 @@ struct gnu_property {
 	u32 pr_datasz;
 };
 
+struct arch_elf_state;
+
+#ifndef CONFIG_ARCH_USE_GNU_PROPERTY
+static inline int arch_parse_elf_property(u32 type, const void *data,
+					  size_t datasz, bool compat,
+					  struct arch_elf_state *arch)
+{
+	return 0;
+}
+#else
+extern int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
+				   bool compat, struct arch_elf_state *arch);
+#endif
+
 #endif /* _LINUX_ELF_H */
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 61251ecabdd7..518651708d8f 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -368,6 +368,7 @@ typedef struct elf64_shdr {
  * Notes used in ET_CORE. Architectures export some of the arch register sets
  * using the corresponding note types via the PTRACE_GETREGSET and
  * PTRACE_SETREGSET requests.
+ * The note name for all these is "LINUX".
  */
 #define NT_PRSTATUS	1
 #define NT_PRFPREG	2
@@ -430,6 +431,9 @@ typedef struct elf64_shdr {
 #define NT_MIPS_FP_MODE	0x801		/* MIPS floating-point mode */
 #define NT_MIPS_MSA	0x802		/* MIPS SIMD registers */
 
+/* Note types with note name "GNU" */
+#define NT_GNU_PROPERTY_TYPE_0	5
+
 /* Note header in a PT_NOTE section */
 typedef struct elf32_note {
   Elf32_Word	n_namesz;	/* Name size */
-- 
2.21.0




* [RFC PATCH v9 23/27] ELF: Introduce arch_setup_elf_property()
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (21 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 22/27] ELF: Add ELF program property parsing support Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-05 18:19 ` [RFC PATCH v9 24/27] x86/cet/shstk: ELF header parsing for Shadow Stack Yu-cheng Yu
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

An ELF file's .note.gnu.property indicates architecture features of the
file.  These features are extracted earlier and stored in the struct
'arch_elf_state'.  Introduce arch_setup_elf_property() to set up and enable
these features.  The first use-case of this function is Shadow Stack and
Indirect Branch Tracking, which are introduced later.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 fs/binfmt_elf.c     | 4 ++++
 include/linux/elf.h | 6 ++++++
 2 files changed, 10 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 054446f93442..56fe6cd437fe 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1213,6 +1213,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 
 	set_binfmt(&elf_format);
 
+	retval = arch_setup_elf_property(&arch_state);
+	if (retval < 0)
+		goto out;
+
 #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
 	retval = arch_setup_additional_pages(bprm, !!interpreter);
 	if (retval < 0)
diff --git a/include/linux/elf.h b/include/linux/elf.h
index 7bdc6da160c7..81f2161fa4a8 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -78,9 +78,15 @@ static inline int arch_parse_elf_property(u32 type, const void *data,
 {
 	return 0;
 }
+
+static inline int arch_setup_elf_property(struct arch_elf_state *arch)
+{
+	return 0;
+}
 #else
 extern int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
 				   bool compat, struct arch_elf_state *arch);
+extern int arch_setup_elf_property(struct arch_elf_state *arch);
 #endif
 
 #endif /* _LINUX_ELF_H */
-- 
2.21.0




* [RFC PATCH v9 24/27] x86/cet/shstk: ELF header parsing for Shadow Stack
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (22 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 23/27] ELF: Introduce arch_setup_elf_property() Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 21:22   ` Kees Cook
  2020-02-05 18:19 ` [RFC PATCH v9 25/27] x86/cet/shstk: Handle thread " Yu-cheng Yu
                   ` (3 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

Check an ELF file's .note.gnu.property, and set up Shadow Stack if the
application supports it.

v9:
- Change cpu_feature_enabled() to static_cpu_has().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/Kconfig             |  2 ++
 arch/x86/include/asm/elf.h   | 13 +++++++++++++
 arch/x86/kernel/process_64.c | 31 +++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6c34b701c588..d1447380e02e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1987,6 +1987,8 @@ config X86_INTEL_SHADOW_STACK_USER
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select X86_INTEL_CET
 	select ARCH_HAS_SHSTK
+	select ARCH_USE_GNU_PROPERTY
+	select ARCH_BINFMT_ELF_STATE
 	---help---
 	  Shadow Stack (SHSTK) provides protection against program
 	  stack corruption.  It is active when the kernel has this
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 69c0f892e310..fac79b621e0a 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -367,6 +367,19 @@ extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
 					      int uses_interp);
 #define compat_arch_setup_additional_pages compat_arch_setup_additional_pages
 
+#ifdef CONFIG_ARCH_BINFMT_ELF_STATE
+struct arch_elf_state {
+	unsigned int gnu_property;
+};
+
+#define INIT_ARCH_ELF_STATE {	\
+	.gnu_property = 0,	\
+}
+
+#define arch_elf_pt_proc(ehdr, phdr, elf, interp, state) (0)
+#define arch_check_elf(ehdr, interp, interp_ehdr, state) (0)
+#endif
+
 /* Do not change the values. See get_align_mask() */
 enum align_flags {
 	ALIGN_VA_32	= BIT(0),
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 506d66830d4d..99548cde0cc6 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -732,3 +732,34 @@ unsigned long KSTK_ESP(struct task_struct *task)
 {
 	return task_pt_regs(task)->sp;
 }
+
+#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
+int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
+			     bool compat, struct arch_elf_state *state)
+{
+	if (type != GNU_PROPERTY_X86_FEATURE_1_AND)
+		return 0;
+
+	if (datasz != sizeof(unsigned int))
+		return -ENOEXEC;
+
+	state->gnu_property = *(unsigned int *)data;
+	return 0;
+}
+
+int arch_setup_elf_property(struct arch_elf_state *state)
+{
+	int r = 0;
+
+	memset(&current->thread.cet, 0, sizeof(struct cet_status));
+
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		if (state->gnu_property & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+			r = cet_setup_shstk();
+		if (r < 0)
+			return r;
+	}
+
+	return r;
+}
+#endif
-- 
2.21.0




* [RFC PATCH v9 25/27] x86/cet/shstk: Handle thread Shadow Stack
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (23 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 24/27] x86/cet/shstk: ELF header parsing for Shadow Stack Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 21:29   ` Kees Cook
  2020-02-05 18:19 ` [RFC PATCH v9 26/27] mm/mmap: Add Shadow Stack pages to memory accounting Yu-cheng Yu
                   ` (2 subsequent siblings)
  27 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

The Shadow Stack (SHSTK) for clone/fork is handled as follows:

(1) If ((clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM),
    the kernel allocates (and frees on thread exit) a new SHSTK for the
    child.

    Alternatively, the kernel could complete the clone syscall with the
    child's SHSTK pointer set to NULL and let the child thread allocate
    its own SHSTK.  That approach has two problems: it is incompatible
    with existing code that issues the syscall inline, and it cannot
    handle signals that arrive before the child has allocated a SHSTK.

(2) For (clone_flags & CLONE_VFORK), the child uses the existing SHSTK.

(3) For all other cases, the SHSTK is copied/reused whenever the parent or
    the child does a call/ret.

This patch handles cases (1) & (2).  Case (3) is handled in the SHSTK page
fault patches.
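
The three cases above can be summarized with a small classification
sketch.  The enum and the function shstk_policy_for_clone() are
hypothetical names used only for illustration; the flag values are the
Linux UAPI values of CLONE_VM and CLONE_VFORK, defined locally so the
snippet stands alone.

```c
#include <assert.h>

/* Linux UAPI values, defined locally so the sketch is self-contained. */
#ifndef CLONE_VM
#define CLONE_VM	0x00000100
#endif
#ifndef CLONE_VFORK
#define CLONE_VFORK	0x00004000
#endif

static_assert((CLONE_VM & CLONE_VFORK) == 0, "flags are distinct bits");

/* Hypothetical names for the three cases in the commit message. */
enum shstk_policy {
	SHSTK_ALLOC_NEW,	/* case (1): kernel allocates a new SHSTK */
	SHSTK_SHARE_PARENT,	/* case (2): child uses the existing SHSTK */
	SHSTK_COPY_ON_ACCESS,	/* case (3): copied/reused on call/ret */
};

static enum shstk_policy shstk_policy_for_clone(unsigned long clone_flags)
{
	/* Same test as in copy_thread_tls(): CLONE_VM without CLONE_VFORK. */
	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM)
		return SHSTK_ALLOC_NEW;

	if (clone_flags & CLONE_VFORK)
		return SHSTK_SHARE_PARENT;

	return SHSTK_COPY_ON_ACCESS;
}
```

A pthread-style clone (CLONE_VM set, CLONE_VFORK clear) lands in case
(1), vfork() in case (2), and a plain fork() in case (3).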

A 64-bit SHSTK has a fixed size of RLIMIT_STACK. A compat-mode thread SHSTK
has a fixed size of 1/4 RLIMIT_STACK.  This allows more threads to share a
32-bit address space.
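
As a worked example of the sizing rule, the sketch below computes the
thread SHSTK size and the initial shadow stack pointer.  The helper
names and COMPAT_SHSTK_DIVISOR are hypothetical; the 8 MiB figure used
in the note after the code is only an assumed RLIMIT_STACK value, not
something this patch sets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the "1/4" factor described in the text. */
#define COMPAT_SHSTK_DIVISOR	4
static_assert(COMPAT_SHSTK_DIVISOR > 0, "divisor must be positive");

/* Size of a new thread SHSTK for a given RLIMIT_STACK value. */
static uint64_t thread_shstk_size(uint64_t rlimit_stack, bool compat)
{
	return compat ? rlimit_stack / COMPAT_SHSTK_DIVISOR : rlimit_stack;
}

/*
 * The shadow stack grows down, so the initial pointer is the end of the
 * allocation (addr + size in cet_setup_thread_shstk()).
 */
static uint64_t thread_shstk_initial_ssp(uint64_t base, uint64_t size)
{
	return base + size;
}
```

With an assumed RLIMIT_STACK of 8 MiB, a 64-bit thread gets an 8 MiB
SHSTK and a compat-mode thread gets 2 MiB.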

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/cet.h         |  2 ++
 arch/x86/include/asm/mmu_context.h |  3 +++
 arch/x86/kernel/cet.c              | 41 ++++++++++++++++++++++++++++++
 arch/x86/kernel/process.c          |  7 +++++
 4 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 409d4f91a0dc..9a3e2da9c1c4 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -19,10 +19,12 @@ struct cet_status {
 
 #ifdef CONFIG_X86_INTEL_CET
 int cet_setup_shstk(void);
+int cet_setup_thread_shstk(struct task_struct *p);
 void cet_disable_free_shstk(struct task_struct *p);
 int cet_restore_signal(bool ia32, struct sc_ext *sc);
 int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
 #else
+static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
 static inline void cet_disable_free_shstk(struct task_struct *p) {}
 static inline int cet_restore_signal(bool ia32, struct sc_ext *sc) { return -EINVAL; }
 static inline int cet_setup_signal(bool ia32, unsigned long rstor,
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 5f33924e200f..6a8189308823 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -13,6 +13,7 @@
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
 #include <asm/mpx.h>
+#include <asm/cet.h>
 #include <asm/debugreg.h>
 
 extern atomic64_t last_mm_ctx_id;
@@ -230,6 +231,8 @@ do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		cet_disable_free_shstk(tsk);	\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index cba5c7656aab..5b45abda80a1 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -170,6 +170,47 @@ int cet_setup_shstk(void)
 	return 0;
 }
 
+int cet_setup_thread_shstk(struct task_struct *tsk)
+{
+	unsigned long addr, size;
+	struct cet_user_state *state;
+	struct cet_status *cet = &tsk->thread.cet;
+
+	if (!cet->shstk_enabled)
+		return 0;
+
+	state = get_xsave_addr(&tsk->thread.fpu.state.xsave,
+			       XFEATURE_CET_USER);
+
+	if (!state)
+		return -EINVAL;
+
+	size = rlimit(RLIMIT_STACK);
+
+	/*
+	 * Compat-mode pthreads share a limited address space.
+	 * If each function call takes an average of four slots of
+	 * stack space, we need 1/4 of the stack size for the shadow stack.
+	 */
+	if (in_compat_syscall())
+		size /= 4;
+
+	addr = alloc_shstk(size);
+
+	if (IS_ERR((void *)addr)) {
+		cet->shstk_base = 0;
+		cet->shstk_size = 0;
+		cet->shstk_enabled = 0;
+		return PTR_ERR((void *)addr);
+	}
+
+	fpu__prepare_write(&tsk->thread.fpu);
+	state->user_ssp = (u64)(addr + size);
+	cet->shstk_base = addr;
+	cet->shstk_size = size;
+	return 0;
+}
+
 void cet_disable_free_shstk(struct task_struct *tsk)
 {
 	struct cet_status *cet = &tsk->thread.cet;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e102e63de641..7098618142f2 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -110,6 +110,7 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	cet_disable_free_shstk(tsk);
 	fpu__drop(fpu);
 }
 
@@ -180,6 +181,12 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	if (clone_flags & CLONE_SETTLS)
 		ret = set_new_tls(p, tls);
 
+#ifdef CONFIG_X86_64
+	/* Allocate a new shadow stack for pthread */
+	if (!ret && (clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM)
+		ret = cet_setup_thread_shstk(p);
+#endif
+
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
-- 
2.21.0




* [RFC PATCH v9 26/27] mm/mmap: Add Shadow Stack pages to memory accounting
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (24 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 25/27] x86/cet/shstk: Handle thread " Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-05 18:19 ` [RFC PATCH v9 27/27] x86/cet/shstk: Add arch_prctl functions for Shadow Stack Yu-cheng Yu
  2020-02-25 21:31 ` [RFC PATCH v9 00/27] Control-flow Enforcement: " Kees Cook
  27 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

Add Shadow Stack pages to memory accounting.

v8:
- Change Shadow Stack pages from data_vm to stack_vm.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 mm/mmap.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 71e4ffc83bcd..acfa04e2a5dd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1687,6 +1687,9 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	if (file && is_file_hugepages(file))
 		return 0;
 
+	if (arch_copy_pte_mapping(vm_flags))
+		return 1;
+
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
@@ -3302,6 +3305,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 		mm->stack_vm += npages;
 	else if (is_data_mapping(flags))
 		mm->data_vm += npages;
+	else if (arch_copy_pte_mapping(flags))
+		mm->stack_vm += npages;
 }
 
 static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
-- 
2.21.0




* [RFC PATCH v9 27/27] x86/cet/shstk: Add arch_prctl functions for Shadow Stack
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (25 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 26/27] mm/mmap: Add Shadow Stack pages to memory accounting Yu-cheng Yu
@ 2020-02-05 18:19 ` Yu-cheng Yu
  2020-02-25 21:31 ` [RFC PATCH v9 00/27] Control-flow Enforcement: " Kees Cook
  27 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-05 18:19 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review
  Cc: Yu-cheng Yu

arch_prctl(ARCH_X86_CET_STATUS, unsigned long *addr)
    Return CET feature status.

    The parameter 'addr' is a pointer to a user buffer.  Before returning
    to the caller, the kernel fills in the following information:

    *addr = SHSTK/IBT status
    *(addr + 1) = SHSTK base address
    *(addr + 2) = SHSTK size

arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
    Disable CET features specified in 'features'.  Return -EPERM if CET is
    locked.

arch_prctl(ARCH_X86_CET_LOCK)
    Lock in the CET features.

arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, unsigned long *addr)
    Allocate a new SHSTK.

    The parameter 'addr' is a pointer to a user buffer.  On entry, the
    buffer holds the desired SHSTK size to allocate; on return, it
    contains the address of the new SHSTK.

There is no CET enabling arch_prctl function.  By design, CET is enabled
automatically if the binary and the system can support it.

The parameters passed are always unsigned 64-bit.  When an IA32 application
passes pointers, it should use only the lower 32 bits.
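
For illustration, the three-word buffer filled by ARCH_X86_CET_STATUS
could be viewed from user space as sketched below.  The struct and
helper names are hypothetical.  The sketch deliberately does not issue
the syscall, since the request numbers come from this RFC and may not
exist on a given kernel, and it treats the first word as an opaque
value because the commit message does not spell out its bit layout.

```c
#include <assert.h>
#include <stdint.h>

/* Request numbers as proposed in this patch (RFC values). */
#define ARCH_X86_CET_STATUS		0x3001
#define ARCH_X86_CET_DISABLE		0x3002
#define ARCH_X86_CET_LOCK		0x3003
#define ARCH_X86_CET_ALLOC_SHSTK	0x3004

/* Hypothetical view of the buffer; all fields are unsigned 64-bit. */
struct cet_status_view {
	uint64_t features;	/* *addr:       SHSTK/IBT status */
	uint64_t shstk_base;	/* *(addr + 1): SHSTK base address */
	uint64_t shstk_size;	/* *(addr + 2): SHSTK size */
};
static_assert(sizeof(struct cet_status_view) == 3 * sizeof(uint64_t),
	      "view must match the three-word buffer");

/* Copy the raw three-word buffer into the named view. */
static struct cet_status_view cet_status_from_buf(const uint64_t buf[3])
{
	struct cet_status_view v = {
		.features   = buf[0],
		.shstk_base = buf[1],
		.shstk_size = buf[2],
	};
	return v;
}

/* One past the highest SHSTK address, assuming base and size are valid. */
static uint64_t cet_shstk_end(const struct cet_status_view *v)
{
	return v->shstk_base + v->shstk_size;
}
```

A caller on a kernel carrying this series would pass a uint64_t[3]
buffer to arch_prctl(ARCH_X86_CET_STATUS, ...) and then hand it to
cet_status_from_buf().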

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/cet.h        |  4 ++
 arch/x86/include/uapi/asm/prctl.h |  5 ++
 arch/x86/kernel/Makefile          |  2 +-
 arch/x86/kernel/cet.c             | 29 +++++++++++
 arch/x86/kernel/cet_prctl.c       | 84 +++++++++++++++++++++++++++++++
 arch/x86/kernel/process.c         |  4 +-
 6 files changed, 125 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cet_prctl.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 9a3e2da9c1c4..b64f6d810ae0 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -14,16 +14,20 @@ struct sc_ext;
 struct cet_status {
 	unsigned long	shstk_base;
 	unsigned long	shstk_size;
+	unsigned int	locked:1;
 	unsigned int	shstk_enabled:1;
 };
 
 #ifdef CONFIG_X86_INTEL_CET
+int prctl_cet(int option, unsigned long arg2);
 int cet_setup_shstk(void);
 int cet_setup_thread_shstk(struct task_struct *p);
+int cet_alloc_shstk(unsigned long *arg);
 void cet_disable_free_shstk(struct task_struct *p);
 int cet_restore_signal(bool ia32, struct sc_ext *sc);
 int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
 #else
+static inline int prctl_cet(int option, unsigned long arg2) { return -EINVAL; }
 static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
 static inline void cet_disable_free_shstk(struct task_struct *p) {}
 static inline int cet_restore_signal(bool ia32, struct sc_ext *sc) { return -EINVAL; }
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..d962f0ec9ccf 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -14,4 +14,9 @@
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
 
+#define ARCH_X86_CET_STATUS		0x3001
+#define ARCH_X86_CET_DISABLE		0x3002
+#define ARCH_X86_CET_LOCK		0x3003
+#define ARCH_X86_CET_ALLOC_SHSTK	0x3004
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index b8c1ea4ab7eb..69a19957e200 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -142,7 +142,7 @@ obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
 obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
 obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
-obj-$(CONFIG_X86_INTEL_CET)		+= cet.o
+obj-$(CONFIG_X86_INTEL_CET)		+= cet.o cet_prctl.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 5b45abda80a1..01aa24c40a5d 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -145,6 +145,35 @@ static int create_rstor_token(bool ia32, unsigned long ssp,
 	return 0;
 }
 
+int cet_alloc_shstk(unsigned long *arg)
+{
+	unsigned long len = *arg;
+	unsigned long addr;
+	unsigned long token;
+	unsigned long ssp;
+
+	addr = alloc_shstk(len);
+
+	if (IS_ERR((void *)addr))
+		return PTR_ERR((void *)addr);
+
+	/* Restore token is 8 bytes and aligned to 8 bytes */
+	ssp = addr + len;
+	token = ssp;
+
+	if (!in_ia32_syscall())
+		token |= TOKEN_MODE_64;
+	ssp -= 8;
+
+	if (write_user_shstk_64(ssp, token)) {
+		vm_munmap(addr, len);
+		return -EINVAL;
+	}
+
+	*arg = addr;
+	return 0;
+}
+
 int cet_setup_shstk(void)
 {
 	unsigned long addr, size;
diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
new file mode 100644
index 000000000000..6cf8f87e3d98
--- /dev/null
+++ b/arch/x86/kernel/cet_prctl.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/errno.h>
+#include <linux/uaccess.h>
+#include <linux/prctl.h>
+#include <linux/compat.h>
+#include <linux/mman.h>
+#include <linux/elfcore.h>
+#include <asm/processor.h>
+#include <asm/prctl.h>
+#include <asm/cet.h>
+
+/* See Documentation/x86/intel_cet.rst. */
+
+static int handle_get_status(unsigned long arg2)
+{
+	struct cet_status *cet = &current->thread.cet;
+	unsigned int features = 0;
+	unsigned long buf[3];
+
+	if (cet->shstk_enabled)
+		features |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
+
+	buf[0] = (unsigned long)features;
+	buf[1] = cet->shstk_base;
+	buf[2] = cet->shstk_size;
+	return copy_to_user((unsigned long __user *)arg2, buf,
+			    sizeof(buf)) ? -EFAULT : 0;
+}
+
+static int handle_alloc_shstk(unsigned long arg2)
+{
+	int err = 0;
+	unsigned long arg;
+	unsigned long addr = 0;
+	unsigned long size = 0;
+
+	if (get_user(arg, (unsigned long __user *)arg2))
+		return -EFAULT;
+
+	size = arg;
+	err = cet_alloc_shstk(&arg);
+	if (err)
+		return err;
+
+	addr = arg;
+	if (put_user(addr, (unsigned long __user *)arg2)) {
+		vm_munmap(addr, size);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+int prctl_cet(int option, unsigned long arg2)
+{
+	struct cet_status *cet = &current->thread.cet;
+
+	if (!cpu_x86_cet_enabled())
+		return -EINVAL;
+
+	switch (option) {
+	case ARCH_X86_CET_STATUS:
+		return handle_get_status(arg2);
+
+	case ARCH_X86_CET_DISABLE:
+		if (cet->locked)
+			return -EPERM;
+		if (arg2 & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+			cet_disable_free_shstk(current);
+
+		return 0;
+
+	case ARCH_X86_CET_LOCK:
+		cet->locked = 1;
+		return 0;
+
+	case ARCH_X86_CET_ALLOC_SHSTK:
+		return handle_alloc_shstk(arg2);
+
+	default:
+		return -EINVAL;
+	}
+}
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 7098618142f2..63dc88070923 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -998,7 +998,7 @@ long do_arch_prctl_common(struct task_struct *task, int option,
 		return get_cpuid_mode();
 	case ARCH_SET_CPUID:
 		return set_cpuid_mode(task, cpuid_enabled);
+	default:
+		return prctl_cet(option, cpuid_enabled);
 	}
-
-	return -EINVAL;
 }
-- 
2.21.0
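
The token arithmetic in cet_alloc_shstk() above can be sketched in plain
C.  This assumes TOKEN_MODE_64 is bit 0 of the token (the 'Lg' bit in the
CET documentation patch); the helper and struct names are hypothetical,
and the kernel itself writes the token with write_user_shstk_64().

```c
#include <stdint.h>

/* Assumption: 'Lg' (64-bit mode) is bit 0 of a restore token. */
#define EX_TOKEN_MODE_64	1ULL

struct ex_rstor_token {
	uint64_t slot;	/* address where the token is stored */
	uint64_t value;	/* token contents */
};

/*
 * Compute where the restore token goes, and what it contains, for a
 * shadow stack spanning [base, base + len).
 */
static struct ex_rstor_token ex_make_rstor_token(uint64_t base, uint64_t len,
						  int is_64bit)
{
	struct ex_rstor_token t;
	uint64_t top = base + len;

	/* The token holds the address just above its own 8-byte slot. */
	t.value = top;
	if (is_64bit)
		t.value |= EX_TOKEN_MODE_64;
	t.slot = top - 8;
	return t;
}
```

For a 4-KB stack at 0x10000, this places the token at 0x10ff8 with the
value 0x11001 in 64-bit mode.  User code would then pass the slot address
to RSTORSSP to switch to the new shadow stack.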




* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-02-05 18:19 ` [RFC PATCH v9 01/27] Documentation/x86: Add CET description Yu-cheng Yu
@ 2020-02-06  0:16   ` Randy Dunlap
  2020-02-06 20:17     ` Yu-cheng Yu
  2020-02-25 20:02   ` Kees Cook
  2020-02-26 17:57   ` Dave Hansen
  2 siblings, 1 reply; 107+ messages in thread
From: Randy Dunlap @ 2020-02-06  0:16 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

Hi,

I have a few comments and a question (please see inline below).


On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> Explain no_cet_shstk/no_cet_ibt kernel parameters, and introduce a new
> document on Control-flow Enforcement Technology (CET).
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   6 +
>  Documentation/x86/index.rst                   |   1 +
>  Documentation/x86/intel_cet.rst               | 294 ++++++++++++++++++
>  3 files changed, 301 insertions(+)
>  create mode 100644 Documentation/x86/intel_cet.rst
> 

> diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> new file mode 100644
> index 000000000000..71e2462fea5c
> --- /dev/null
> +++ b/Documentation/x86/intel_cet.rst
> @@ -0,0 +1,294 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========================================
> +Control-flow Enforcement Technology (CET)
> +=========================================
> +

...

> +
> +[5] CET system calls
> +====================
> +
> +The following arch_prctl() system calls are added for CET:
> +
> +arch_prctl(ARCH_X86_CET_STATUS, unsigned long *addr)
> +    Return CET feature status.
> +
> +    The parameter 'addr' is a pointer to a user buffer.
> +    On returning to the caller, the kernel fills the following
> +    information::
> +
> +        *addr       = SHSTK/IBT status
> +        *(addr + 1) = SHSTK base address
> +        *(addr + 2) = SHSTK size
> +
> +arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
> +    Disable SHSTK and/or IBT specified in 'features'.  Return -EPERM
> +    if CET is locked.
> +
> +arch_prctl(ARCH_X86_CET_LOCK)
> +    Lock in CET feature.

which feature?

> +
> +arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, unsigned long *addr)
> +    Allocate a new SHSTK and put a restore token at top.
> +
> +    The parameter 'addr' is a pointer to a user buffer and indicates
> +    the desired SHSTK size to allocate.  On returning to the caller,
> +    the kernel fills '*addr' with the base address of the new SHSTK.
> +
> +arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE, unsigned long *addr)
> +    Mark an address range as IBT legacy code.
> +
> +    The parameter 'addr' is a pointer to a user buffer that has the
> +    following information::
> +
> +        *addr       = starting linear address of the legacy code
> +        *(addr + 1) = size of the legacy code
> +        *(addr + 2) = set (1); clear (0)
> +
> +Note:
> +  There is no CET-enabling arch_prctl function.  By design, CET is
> +  enabled automatically if the binary and the system can support it.
> +
> +  The parameters passed are always unsigned 64-bit.  When an IA32
> +  application passing pointers, it should only use the lower 32 bits.
> +
> +[6] The implementation of the SHSTK
> +===================================
> +
> +SHSTK size
> +----------
> +
> +A task's SHSTK is allocated from memory to a fixed size of
> +RLIMIT_STACK.  A compat-mode thread's SHSTK size is 1/4 of
> +RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
> +share a 32-bit address space.
> +
> +Signal
> +------
> +
> +The main program and its signal handlers use the same SHSTK.  Because
> +the SHSTK stores only return addresses, a large SHSTK will cover the
> +condition that both the program stack and the sigaltstack run out.
> +
> +The kernel creates a restore token at the SHSTK restoring address and
> +verifies that token when restoring from the signal handler.
> +
> +IBT for signal delivering and sigreturn is the same as the main
> +program's setup; except for WAIT_ENDBR status, which can be read from

s/;/,/

> +MSR_IA32_U_CET.  In general, a task is in WAIT_ENDBR after an
> +indirect CALL/JMP and before the next instruction starts.
> +
> +A task's WAIT_ENDBR is reset for its signal handler, but preserved on
> +the task's stack; and then restored from sigreturn.

s/;/,/

> +
> +Fork
> +----
> +
> +The SHSTK's vma has VM_SHSTK flag set; its PTEs are required to be
> +read-only and dirty.  When a SHSTK PTE is not present, RO, and dirty,
> +a SHSTK access triggers a page fault with an additional SHSTK bit set
> +in the page fault error code.
> +
> +When a task forks a child, its SHSTK PTEs are copied and both the
> +parent's and the child's SHSTK PTEs are cleared of the dirty bit.
> +Upon the next SHSTK access, the resulting SHSTK page fault is handled
> +by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new SHSTK for
> +the new thread.
> +
> +Setjmp/Longjmp
> +--------------
> +
> +Longjmp unwinds SHSTK until it matches the program stack.
> +
> +Ucontext
> +--------
> +
> +In GLIBC, getcontext/setcontext is implemented in similar way as
> +setjmp/longjmp.
> +
> +When makecontext creates a new ucontext, a new SHSTK is allocated for
> +that context with ARCH_X86_CET_ALLOC_SHSTK syscall.  The kernel
> +creates a restore token at the top of the new SHSTK and the user-mode
> +code switches to the new SHSTK with the RSTORSSP instruction.
> +
> +[7] The management of read-only & dirty PTEs for SHSTK
> +======================================================
> +
> +A RO and dirty PTE exists in the following cases:
> +
> +(a) A page is modified and then shared with a fork()'ed child;
> +(b) A R/O page that has been COW'ed;
> +(c) A SHSTK page.
> +
> +The processor only checks the dirty bit for (c).  To prevent the use
> +of non-SHSTK memory as SHSTK, we use a spare bit of the 64-bit PTE as
> +DIRTY_SW for (a) and (b) above.  This results to the following PTE
> +settings::
> +
> +    Modified PTE:             (R/W + DIRTY_HW)
> +    Modified and shared PTE:  (R/O + DIRTY_SW)
> +    R/O PTE, COW'ed:          (R/O + DIRTY_SW)
> +    SHSTK PTE:                (R/O + DIRTY_HW)
> +    SHSTK PTE, COW'ed:        (R/O + DIRTY_HW)
> +    SHSTK PTE, shared:        (R/O + DIRTY_SW)
> +
> +Note that DIRTY_SW is only used in R/O PTEs but not R/W PTEs.
> +
> +[8] The implementation of IBT legacy bitmap
> +===========================================
> +
> +When IBT is active, a non-IBT-capable legacy library can be executed
> +if its address ranges are specified in the legacy code bitmap.  The
> +bitmap covers the whole user-space address, which is TASK_SIZE_MAX
> +for 64-bit and TASK_SIZE for IA32, and its each bit indicates a 4-KB

confusing:
                                          its each bit

> +legacy code page.  It is read-only from an application, and setup by
> +the kernel as a special mapping when the first time the application

                           drop:   when

> +calls arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE).  The application
> +manages the bitmap through the arch_prctl.

                      through the arch_prctl() interface.
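
For what it's worth, the one-bit-per-4-KB-page mapping the paragraph is
describing could be sketched as below.  The names are hypothetical, and the
bit order within each byte (least significant bit first) is an assumption,
not something the quoted text states.

```c
#include <stdint.h>

#define EX_PAGE_SHIFT	12	/* 4-KB pages, as stated in the text */

/* Index of the bitmap bit that describes the page containing addr. */
static uint64_t ex_legacy_bit_index(uint64_t addr)
{
	return addr >> EX_PAGE_SHIFT;
}

/* Bytes of bitmap needed to cover an address space of task_size bytes. */
static uint64_t ex_legacy_bitmap_bytes(uint64_t task_size)
{
	uint64_t pages = (task_size + (1ULL << EX_PAGE_SHIFT) - 1) >> EX_PAGE_SHIFT;

	return (pages + 7) / 8;
}

/* Test whether the page containing addr is marked as legacy code. */
static int ex_legacy_test(const uint8_t *bitmap, uint64_t addr)
{
	uint64_t bit = ex_legacy_bit_index(addr);

	return (bitmap[bit / 8] >> (bit % 8)) & 1;
}
```

With this layout a 4-GB IA32 address space needs a 128-KB bitmap.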


cheers.
-- 
~Randy



* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-02-06  0:16   ` Randy Dunlap
@ 2020-02-06 20:17     ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-06 20:17 UTC (permalink / raw)
  To: Randy Dunlap, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, 2020-02-05 at 16:16 -0800, Randy Dunlap wrote:
> Hi,
> 
> I have a few comments and a question (please see inline below).
> 
> 
> On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
[...]
> > +arch_prctl(ARCH_X86_CET_LOCK)
> > +    Lock in CET feature.
> 
> which feature?

Both SHSTK and IBT are locked.  They cannot be turned off afterwards.

I will check things you pointed out.

Thanks,
Yu-cheng




* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-02-05 18:19 ` [RFC PATCH v9 01/27] Documentation/x86: Add CET description Yu-cheng Yu
  2020-02-06  0:16   ` Randy Dunlap
@ 2020-02-25 20:02   ` Kees Cook
  2020-02-28 15:55     ` Yu-cheng Yu
  2020-02-26 17:57   ` Dave Hansen
  2 siblings, 1 reply; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:02 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:09AM -0800, Yu-cheng Yu wrote:
> Explain no_cet_shstk/no_cet_ibt kernel parameters, and introduce a new
> document on Control-flow Enforcement Technology (CET).
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

I'm not a huge fan of the boot param names, but I can't suggest anything
better. ;) I love the extensive docs!

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  .../admin-guide/kernel-parameters.txt         |   6 +
>  Documentation/x86/index.rst                   |   1 +
>  Documentation/x86/intel_cet.rst               | 294 ++++++++++++++++++
>  3 files changed, 301 insertions(+)
>  create mode 100644 Documentation/x86/intel_cet.rst
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index ade4e6ec23e0..8b69ebf0baed 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3001,6 +3001,12 @@
>  			noexec=on: enable non-executable mappings (default)
>  			noexec=off: disable non-executable mappings
>  
> +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
> +			applications
> +
> +	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
> +			applications
> +
>  	nosmap		[X86,PPC]
>  			Disable SMAP (Supervisor Mode Access Prevention)
>  			even if it is supported by processor.
> diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
> index a8de2fbc1caa..81f919801765 100644
> --- a/Documentation/x86/index.rst
> +++ b/Documentation/x86/index.rst
> @@ -19,6 +19,7 @@ x86-specific Documentation
>     tlb
>     mtrr
>     pat
> +   intel_cet
>     intel_mpx
>     intel-iommu
>     intel_txt
> diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> new file mode 100644
> index 000000000000..71e2462fea5c
> --- /dev/null
> +++ b/Documentation/x86/intel_cet.rst
> @@ -0,0 +1,294 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========================================
> +Control-flow Enforcement Technology (CET)
> +=========================================
> +
> +[1] Overview
> +============
> +
> +Control-flow Enforcement Technology (CET) provides protection against
> +return/jump-oriented programming (ROP) attacks.  It can be setup to
> +protect both applications and the kernel.  In the first phase, only
> +user-mode protection is implemented in the 64-bit kernel; 32-bit
> +applications are supported in compatibility mode.
> +
> +CET introduces Shadow Stack (SHSTK) and Indirect Branch Tracking
> +(IBT).  SHSTK is a secondary stack allocated from memory and cannot
> +be directly modified by applications.  When executing a CALL, the
> +processor pushes a copy of the return address to SHSTK.  Upon
> +function return, the processor pops the SHSTK copy and compares it
> +to the one from the program stack.  If the two copies differ, the
> +processor raises a control-protection fault.  IBT verifies indirect
> +CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
> +opcodes (see CET instructions below).
> +
> +There are two kernel configuration options:
> +
> +    X86_INTEL_SHADOW_STACK_USER, and
> +    X86_INTEL_BRANCH_TRACKING_USER.
> +
> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
> +are required.  To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.
> +
> +There are two command-line options for disabling CET features::
> +
> +    no_cet_shstk - disables SHSTK, and
> +    no_cet_ibt   - disables IBT.
> +
> +At run time, /proc/cpuinfo shows the availability of SHSTK and IBT.
> +
> +[2] CET assembly instructions
> +=============================
> +
> +RDSSP %r
> +    Read the SHSTK pointer into %r.
> +
> +INCSSP %r
> +    Unwind (increment) the SHSTK pointer (0 ~ 255) steps as indicated
> +    in the operand register.  The GLIBC longjmp uses INCSSP to unwind
> +    the SHSTK until that matches the program stack.  When it is
> +    necessary to unwind beyond 255 steps, longjmp divides and repeats
> +    the process.
> +
> +RSTORSSP (%r)
> +    Switch to the SHSTK indicated in the 'restore token' pointed by
> +    the operand register and replace the 'restore token' with a new
> +    token to be saved (with SAVEPREVSSP) for the outgoing SHSTK.
> +
> +::
> +
> +                                Before RSTORSSP
> +
> +               Incoming SHSTK                 Current/Outgoing SHSTK
> +
> +          |----------------------|           |----------------------|
> +   addr=x |                      |     ssp-> |                      |
> +          |----------------------|           |----------------------|
> +   (%r)-> | rstor_token=(x|Lg)   |  addr=y-8 |                      |
> +          |----------------------|           |----------------------|
> +
> +                                After RSTORSSP
> +
> +          |----------------------|           |----------------------|
> +   addr=x |                      |           |                      |
> +          |----------------------|           |----------------------|
> +    ssp-> | rstor_token=(y|Pv|Lg)|  addr=y-8 |                      |
> +          |----------------------|           |----------------------|
> +
> +    note:
> +        1. Only valid addresses and restore tokens can be on the
> +           user-mode SHSTK.
> +        2. A token is always of type u64 and must align to u64.
> +        3. The incoming SHSTK pointer in a rstor_token must point to
> +           immediately above the token.
> +        4. 'Lg' is bit[0] of a rstor_token indicating a 64-bit SHSTK.
> +        5. 'Pv' is bit[1] of a rstor_token indicating the token is to
> +           be used only for the next SAVEPREVSSP and invalid for
> +           RSTORSSP.
> +
> +SAVEPREVSSP
> +    Pop the SHSTK 'restore token' pointed by current SHSTK pointer
> +    and store it at (previous SHSTK pointer - 8).
> +
> +::
> +
> +                               After SAVEPREVSSP
> +
> +          |----------------------|           |----------------------|
> +    ssp-> |                      |           |                      |
> +          |----------------------|           |----------------------|
> + addr=x-8 | rstor_token=(y|Pv|Lg)|  addr=y-8 | rstor_token(y|Lg)    |
> +          |----------------------|           |----------------------|
> +
> +WRUSS %r0, (%r1)
> +    Write the value in %r0 to the SHSTK address pointed by (%r1).
> +    This is a kernel-mode only instruction.
> +
> +ENDBR and NOTRACK prefix
> +    When IBT is enabled, an indirect CALL/JMP must either::
> +
> +        have a NOTRACK prefix,
> +        reach an ENDBR, or
> +        reach an address within a legacy code page;
> +
> +    or it results in a control-protection fault.
> +
> +    When the target address is derived from information that cannot
> +    be modified, the compiler uses the NOTRACK prefix.  In other
> +    cases, the compiler inserts an ENDBR at the target address.
> +
> +    A legacy code page is designated in the legacy code bitmap, which
> +    is explained below in section [8].
> +
> +[3] Application Enabling
> +========================
> +
> +An application's CET capability is marked in its ELF header and can
> +be verified from the following command output, in the
> +NT_GNU_PROPERTY_TYPE_0 field:
> +
> +    readelf -n <application>
> +
> +If an application supports CET and is statically linked, it will run
> +with CET protection.  If the application needs any shared libraries,
> +the loader checks all dependencies and enables CET only when all
> +requirements are met.
> +
> +[4] Legacy Libraries
> +====================
> +
> +GLIBC provides a few tunables for backward compatibility.
> +
> +GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
> +    Turn off SHSTK/IBT for the current shell.
> +
> +GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
> +    This controls how dlopen() handles SHSTK legacy libraries::
> +
> +        on         - continue with SHSTK enabled;
> +        permissive - continue with SHSTK off.
> +
> +[5] CET system calls
> +====================
> +
> +The following arch_prctl() system calls are added for CET:
> +
> +arch_prctl(ARCH_X86_CET_STATUS, unsigned long *addr)
> +    Return CET feature status.
> +
> +    The parameter 'addr' is a pointer to a user buffer.
> +    On returning to the caller, the kernel fills the following
> +    information::
> +
> +        *addr       = SHSTK/IBT status
> +        *(addr + 1) = SHSTK base address
> +        *(addr + 2) = SHSTK size
> +
> +arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
> +    Disable SHSTK and/or IBT specified in 'features'.  Return -EPERM
> +    if CET is locked.
> +
> +arch_prctl(ARCH_X86_CET_LOCK)
> +    Lock in CET feature.
> +
> +arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, unsigned long *addr)
> +    Allocate a new SHSTK and put a restore token at top.
> +
> +    The parameter 'addr' is a pointer to a user buffer and indicates
> +    the desired SHSTK size to allocate.  On returning to the caller,
> +    the kernel fills '*addr' with the base address of the new SHSTK.
> +
> +arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE, unsigned long *addr)
> +    Mark an address range as IBT legacy code.
> +
> +    The parameter 'addr' is a pointer to a user buffer that has the
> +    following information::
> +
> +        *addr       = starting linear address of the legacy code
> +        *(addr + 1) = size of the legacy code
> +        *(addr + 2) = set (1); clear (0)
> +
> +Note:
> +  There is no CET-enabling arch_prctl function.  By design, CET is
> +  enabled automatically if the binary and the system can support it.
> +
> +  The parameters passed are always unsigned 64-bit.  When an IA32
> +  application passing pointers, it should only use the lower 32 bits.
> +
> +[6] The implementation of the SHSTK
> +===================================
> +
> +SHSTK size
> +----------
> +
> +A task's SHSTK is allocated from memory to a fixed size of
> +RLIMIT_STACK.  A compat-mode thread's SHSTK size is 1/4 of
> +RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
> +share a 32-bit address space.
> +
> +Signal
> +------
> +
> +The main program and its signal handlers use the same SHSTK.  Because
> +the SHSTK stores only return addresses, a large SHSTK will cover the
> +condition that both the program stack and the sigaltstack run out.
> +
> +The kernel creates a restore token at the SHSTK restoring address and
> +verifies that token when restoring from the signal handler.
> +
> +IBT for signal delivering and sigreturn is the same as the main
> +program's setup; except for WAIT_ENDBR status, which can be read from
> +MSR_IA32_U_CET.  In general, a task is in WAIT_ENDBR after an
> +indirect CALL/JMP and before the next instruction starts.
> +
> +A task's WAIT_ENDBR is reset for its signal handler, but preserved on
> +the task's stack; and then restored from sigreturn.
> +
> +Fork
> +----
> +
> +The SHSTK's vma has VM_SHSTK flag set; its PTEs are required to be
> +read-only and dirty.  When a SHSTK PTE is not present, RO, and dirty,
> +a SHSTK access triggers a page fault with an additional SHSTK bit set
> +in the page fault error code.
> +
> +When a task forks a child, its SHSTK PTEs are copied and both the
> +parent's and the child's SHSTK PTEs are cleared of the dirty bit.
> +Upon the next SHSTK access, the resulting SHSTK page fault is handled
> +by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new SHSTK for
> +the new thread.
> +
> +Setjmp/Longjmp
> +--------------
> +
> +Longjmp unwinds SHSTK until it matches the program stack.
> +
> +Ucontext
> +--------
> +
> +In GLIBC, getcontext/setcontext is implemented in similar way as
> +setjmp/longjmp.
> +
> +When makecontext creates a new ucontext, a new SHSTK is allocated for
> +that context with ARCH_X86_CET_ALLOC_SHSTK syscall.  The kernel
> +creates a restore token at the top of the new SHSTK and the user-mode
> +code switches to the new SHSTK with the RSTORSSP instruction.
> +
> +[7] The management of read-only & dirty PTEs for SHSTK
> +======================================================
> +
> +A RO and dirty PTE exists in the following cases:
> +
> +(a) A page is modified and then shared with a fork()'ed child;
> +(b) A R/O page that has been COW'ed;
> +(c) A SHSTK page.
> +
> +The processor only checks the dirty bit for (c).  To prevent the use
> +of non-SHSTK memory as SHSTK, we use a spare bit of the 64-bit PTE as
> +DIRTY_SW for (a) and (b) above.  This results to the following PTE
> +settings::
> +
> +    Modified PTE:             (R/W + DIRTY_HW)
> +    Modified and shared PTE:  (R/O + DIRTY_SW)
> +    R/O PTE, COW'ed:          (R/O + DIRTY_SW)
> +    SHSTK PTE:                (R/O + DIRTY_HW)
> +    SHSTK PTE, COW'ed:        (R/O + DIRTY_HW)
> +    SHSTK PTE, shared:        (R/O + DIRTY_SW)
> +
> +Note that DIRTY_SW is only used in R/O PTEs but not R/W PTEs.
> +
> +[8] The implementation of IBT legacy bitmap
> +===========================================
> +
> +When IBT is active, a non-IBT-capable legacy library can be executed
> +if its address ranges are specified in the legacy code bitmap.  The
> +bitmap covers the whole user-space address, which is TASK_SIZE_MAX
> +for 64-bit and TASK_SIZE for IA32, and its each bit indicates a 4-KB
> +legacy code page.  It is read-only from an application, and setup by
> +the kernel as a special mapping when the first time the application
> +calls arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE).  The application
> +manages the bitmap through the arch_prctl.
> -- 
> 2.21.0
> 
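
The PTE table in section [7] can be read as a small state model.  The
sketch below is only an illustration of that table: the flag values are
made up and are not the real x86 PTE bit positions, and the helper names
are hypothetical.

```c
#include <stdbool.h>

/* Illustrative flag values only; not the real x86 PTE bit positions. */
#define EX_PTE_WRITE	0x1u
#define EX_PTE_DIRTY_HW	0x2u
#define EX_PTE_DIRTY_SW	0x4u

/* Per the table, a shadow stack PTE is R/O + DIRTY_HW. */
static bool ex_pte_is_shstk(unsigned int pte)
{
	return !(pte & EX_PTE_WRITE) && (pte & EX_PTE_DIRTY_HW);
}

/*
 * Share a page at fork(): drop write permission and move DIRTY_HW to
 * DIRTY_SW so the result can never look like a shadow stack PTE.
 */
static unsigned int ex_pte_share(unsigned int pte)
{
	pte &= ~EX_PTE_WRITE;
	if (pte & EX_PTE_DIRTY_HW) {
		pte &= ~EX_PTE_DIRTY_HW;
		pte |= EX_PTE_DIRTY_SW;
	}
	return pte;
}
```

Both a modified PTE (R/W + DIRTY_HW) and a SHSTK PTE (R/O + DIRTY_HW) end
up as R/O + DIRTY_SW once shared, which matches the "Modified and shared"
and "SHSTK PTE, shared" rows.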

-- 
Kees Cook



* Re: [RFC PATCH v9 02/27] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)
  2020-02-05 18:19 ` [RFC PATCH v9 02/27] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) Yu-cheng Yu
@ 2020-02-25 20:02   ` Kees Cook
  0 siblings, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:02 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review, Borislav Petkov

On Wed, Feb 05, 2020 at 10:19:10AM -0800, Yu-cheng Yu wrote:
> Add CPU feature flags for Control-flow Enforcement Technology (CET).
> 
> CPUID.(EAX=7,ECX=0):ECX[bit 7] Shadow stack
> CPUID.(EAX=7,ECX=0):EDX[bit 20] Indirect Branch Tracking
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> Reviewed-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/include/asm/cpufeatures.h | 2 ++
>  arch/x86/kernel/cpu/cpuid-deps.c   | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index e9b62498fe75..a2c6b1b5c026 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -336,6 +336,7 @@
>  #define X86_FEATURE_OSPKE		(16*32+ 4) /* OS Protection Keys Enable */
>  #define X86_FEATURE_WAITPKG		(16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
>  #define X86_FEATURE_AVX512_VBMI2	(16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
> +#define X86_FEATURE_SHSTK		(16*32+ 7) /* Shadow Stack */
>  #define X86_FEATURE_GFNI		(16*32+ 8) /* Galois Field New Instructions */
>  #define X86_FEATURE_VAES		(16*32+ 9) /* Vector AES */
>  #define X86_FEATURE_VPCLMULQDQ		(16*32+10) /* Carry-Less Multiplication Double Quadword */
> @@ -361,6 +362,7 @@
>  #define X86_FEATURE_MD_CLEAR		(18*32+10) /* VERW clears CPU buffers */
>  #define X86_FEATURE_TSX_FORCE_ABORT	(18*32+13) /* "" TSX_FORCE_ABORT */
>  #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
> +#define X86_FEATURE_IBT			(18*32+20) /* Indirect Branch Tracking */
>  #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
>  #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
>  #define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */
> diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
> index 3cbe24ca80ab..fec83cc74b9e 100644
> --- a/arch/x86/kernel/cpu/cpuid-deps.c
> +++ b/arch/x86/kernel/cpu/cpuid-deps.c
> @@ -69,6 +69,8 @@ static const struct cpuid_dep cpuid_deps[] = {
>  	{ X86_FEATURE_CQM_MBM_TOTAL,		X86_FEATURE_CQM_LLC   },
>  	{ X86_FEATURE_CQM_MBM_LOCAL,		X86_FEATURE_CQM_LLC   },
>  	{ X86_FEATURE_AVX512_BF16,		X86_FEATURE_AVX512VL  },
> +	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
> +	{ X86_FEATURE_IBT,			X86_FEATURE_XSAVES    },
>  	{}
>  };
>  
> -- 
> 2.21.0
> 

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 107+ messages in thread
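
As an illustration of the encoding used by the two new defines in patch
02/27, the sketch below splits a kernel-style feature number into its
capability word and bit position and tests that bit against a raw CPUID
output register.  The helper names (feature_word, feature_bit,
feature_in_reg) are hypothetical and not part of the series; only the two
X86_FEATURE_* values come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature numbers as defined in the patch: word * 32 + bit. */
#define X86_FEATURE_SHSTK	(16 * 32 + 7)	/* CPUID.(EAX=7,ECX=0):ECX[bit 7]  */
#define X86_FEATURE_IBT		(18 * 32 + 20)	/* CPUID.(EAX=7,ECX=0):EDX[bit 20] */

/* Compile-time check that the bit positions match the commit message. */
static_assert(X86_FEATURE_SHSTK % 32 == 7, "SHSTK is bit 7 of its word");
static_assert(X86_FEATURE_IBT % 32 == 20, "IBT is bit 20 of its word");

/* Hypothetical helpers: split a feature number into word and bit. */
static inline unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

static inline unsigned int feature_bit(unsigned int feature)
{
	return feature % 32;
}

/* Test a feature against the raw 32-bit register its word maps to. */
static inline bool feature_in_reg(unsigned int feature, uint32_t reg)
{
	return (reg >> feature_bit(feature)) & 1u;
}
```

A caller would pass the ECX output of CPUID leaf 7, subleaf 0 for
X86_FEATURE_SHSTK and the EDX output for X86_FEATURE_IBT.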

* Re: [RFC PATCH v9 03/27] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states
  2020-02-05 18:19 ` [RFC PATCH v9 03/27] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states Yu-cheng Yu
@ 2020-02-25 20:04   ` Kees Cook
  0 siblings, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:04 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:11AM -0800, Yu-cheng Yu wrote:
> Control-flow Enforcement Technology (CET) adds five MSRs.  Introduce them
> and their XSAVES supervisor states:
> 
>     MSR_IA32_U_CET (user-mode CET settings),
>     MSR_IA32_PL3_SSP (user-mode Shadow Stack pointer),
>     MSR_IA32_PL0_SSP (kernel-mode Shadow Stack pointer),
>     MSR_IA32_PL1_SSP (Privilege Level 1 Shadow Stack pointer),
>     MSR_IA32_PL2_SSP (Privilege Level 2 Shadow Stack pointer).
> 
> v6:
> - Remove __packed from struct cet_user_state, struct cet_kernel_state.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/x86/include/asm/fpu/types.h            | 22 ++++++++++++++++++
>  arch/x86/include/asm/fpu/xstate.h           |  5 +++--
>  arch/x86/include/asm/msr-index.h            | 18 +++++++++++++++
>  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
>  arch/x86/kernel/fpu/xstate.c                | 25 +++++++++++++++++++--
>  5 files changed, 68 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> index f098f6cab94b..d7ef4d9c7ad5 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -114,6 +114,9 @@ enum xfeature {
>  	XFEATURE_Hi16_ZMM,
>  	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
>  	XFEATURE_PKRU,
> +	XFEATURE_RESERVED,
> +	XFEATURE_CET_USER,
> +	XFEATURE_CET_KERNEL,
>  
>  	XFEATURE_MAX,
>  };
> @@ -128,6 +131,8 @@ enum xfeature {
>  #define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
>  #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
>  #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
> +#define XFEATURE_MASK_CET_USER		(1 << XFEATURE_CET_USER)
> +#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL)
>  
>  #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
>  #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
> @@ -229,6 +234,23 @@ struct pkru_state {
>  	u32				pad;
>  } __packed;
>  
> +/*
> + * State component 11 is Control-flow Enforcement user states
> + */
> +struct cet_user_state {
> +	u64 user_cet;			/* user control-flow settings */
> +	u64 user_ssp;			/* user shadow stack pointer */
> +};
> +
> +/*
> + * State component 12 is Control-flow Enforcement kernel states
> + */
> +struct cet_kernel_state {
> +	u64 kernel_ssp;			/* kernel shadow stack */
> +	u64 pl1_ssp;			/* privilege level 1 shadow stack */
> +	u64 pl2_ssp;			/* privilege level 2 shadow stack */
> +};
> +
>  struct xstate_header {
>  	u64				xfeatures;
>  	u64				xcomp_bv;
> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> index 9ebfdd543576..952d2515dae4 100644
> --- a/arch/x86/include/asm/fpu/xstate.h
> +++ b/arch/x86/include/asm/fpu/xstate.h
> @@ -33,13 +33,14 @@
>  				       XFEATURE_MASK_BNDCSR)
>  
>  /* All currently supported supervisor features */
> -#define SUPPORTED_XFEATURES_MASK_SUPERVISOR (0)
> +#define SUPPORTED_XFEATURES_MASK_SUPERVISOR (XFEATURE_MASK_CET_USER)
>  
>  /*
>   * Unsupported supervisor features. When a supervisor feature in this mask is
>   * supported in the future, move it to the supported supervisor feature mask.
>   */
> -#define UNSUPPORTED_XFEATURES_MASK_SUPERVISOR (XFEATURE_MASK_PT)
> +#define UNSUPPORTED_XFEATURES_MASK_SUPERVISOR (XFEATURE_MASK_PT | \
> +					       XFEATURE_MASK_CET_KERNEL)
>  
>  /* All supervisor states including supported and unsupported states. */
>  #define ALL_XFEATURES_MASK_SUPERVISOR (SUPPORTED_XFEATURES_MASK_SUPERVISOR | \
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 084e98da04a7..114e77f5bb6b 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -872,4 +872,22 @@
>  #define MSR_VM_IGNNE                    0xc0010115
>  #define MSR_VM_HSAVE_PA                 0xc0010117
>  
> +/* Control-flow Enforcement Technology MSRs */
> +#define MSR_IA32_U_CET		0x6a0 /* user mode cet setting */
> +#define MSR_IA32_S_CET		0x6a2 /* kernel mode cet setting */
> +#define MSR_IA32_PL0_SSP	0x6a4 /* kernel shstk pointer */
> +#define MSR_IA32_PL1_SSP	0x6a5 /* ring-1 shstk pointer */
> +#define MSR_IA32_PL2_SSP	0x6a6 /* ring-2 shstk pointer */
> +#define MSR_IA32_PL3_SSP	0x6a7 /* user shstk pointer */
> +#define MSR_IA32_INT_SSP_TAB	0x6a8 /* exception shstk table */
> +
> +/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
> +#define MSR_IA32_CET_SHSTK_EN		0x0000000000000001ULL
> +#define MSR_IA32_CET_WRSS_EN		0x0000000000000002ULL
> +#define MSR_IA32_CET_ENDBR_EN		0x0000000000000004ULL
> +#define MSR_IA32_CET_LEG_IW_EN		0x0000000000000008ULL
> +#define MSR_IA32_CET_NO_TRACK_EN	0x0000000000000010ULL
> #define MSR_IA32_CET_WAIT_ENDBR	0x0000000000000800ULL
> +#define MSR_IA32_CET_BITMAP_MASK	0xfffffffffffff000ULL
> +
>  #endif /* _ASM_X86_MSR_INDEX_H */
> diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
> index bcba3c643e63..a8df907e8017 100644
> --- a/arch/x86/include/uapi/asm/processor-flags.h
> +++ b/arch/x86/include/uapi/asm/processor-flags.h
> @@ -130,6 +130,8 @@
>  #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
>  #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
>  #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
> +#define X86_CR4_CET_BIT		23 /* enable Control-flow Enforcement */
> +#define X86_CR4_CET		_BITUL(X86_CR4_CET_BIT)
>  
>  /*
>   * x86-64 Task Priority Register, CR8
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 04f7c6b8dbbc..ec08a2b6feca 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -38,6 +38,9 @@ static const char *xfeature_names[] =
>  	"Processor Trace (unused)"	,
>  	"Protection Keys User registers",
>  	"unknown xstate feature"	,
> +	"Control-flow User registers"	,
> +	"Control-flow Kernel registers"	,
> +	"unknown xstate feature"	,
>  };
>  
>  static short xsave_cpuid_features[] __initdata = {
> @@ -51,6 +54,9 @@ static short xsave_cpuid_features[] __initdata = {
>  	X86_FEATURE_AVX512F,
>  	X86_FEATURE_INTEL_PT,
>  	X86_FEATURE_PKU,
> +	-1,		   /* Unused */
> +	X86_FEATURE_SHSTK, /* XFEATURE_CET_USER */
> +	X86_FEATURE_SHSTK, /* XFEATURE_CET_KERNEL */
>  };
>  
>  /*
> @@ -316,6 +322,8 @@ static void __init print_xstate_features(void)
>  	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
>  	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
>  	print_xstate_feature(XFEATURE_MASK_PKRU);
> +	print_xstate_feature(XFEATURE_MASK_CET_USER);
> +	print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
>  }
>  
>  /*
> @@ -563,6 +571,8 @@ static void check_xstate_against_struct(int nr)
>  	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
>  	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
>  	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
> +	XCHECK_SZ(sz, nr, XFEATURE_CET_USER,   struct cet_user_state);
> +	XCHECK_SZ(sz, nr, XFEATURE_CET_KERNEL, struct cet_kernel_state);
>  
>  	/*
>  	 * Make *SURE* to add any feature numbers in below if
> @@ -770,8 +780,19 @@ void __init fpu__init_system_xstate(void)
>  	 * Clear XSAVE features that are disabled in the normal CPUID.
>  	 */
>  	for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
> -		if (!boot_cpu_has(xsave_cpuid_features[i]))
> -			xfeatures_mask_all &= ~BIT_ULL(i);
> +		if (xsave_cpuid_features[i] == X86_FEATURE_SHSTK) {
> +			/*
> +			 * X86_FEATURE_SHSTK and X86_FEATURE_IBT share
> +			 * same states, but can be enabled separately.
> +			 */
> +			if (!boot_cpu_has(X86_FEATURE_SHSTK) &&
> +			    !boot_cpu_has(X86_FEATURE_IBT))
> +				xfeatures_mask_all &= ~BIT_ULL(i);
> +		} else {
> +			if ((xsave_cpuid_features[i] == -1) ||
> +			    !boot_cpu_has(xsave_cpuid_features[i]))
> +				xfeatures_mask_all &= ~BIT_ULL(i);
> +		}
>  	}
>  
>  	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
> -- 
> 2.21.0
> 

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 107+ messages in thread
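
The feature-mask handling in patch 03/27 can be modelled outside the kernel.
The sketch below is a simplified stand-in, not the kernel code: it copies the
component numbers and struct layouts from the patch, and replaces the loop in
fpu__init_system_xstate() with a hypothetical function,
filter_cet_xfeatures(), that takes the SHSTK and IBT availability as plain
booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* XSAVE state-component numbers, matching enum xfeature after the patch. */
enum {
	XFEATURE_PKRU		= 9,
	XFEATURE_RESERVED	= 10,
	XFEATURE_CET_USER	= 11,
	XFEATURE_CET_KERNEL	= 12,
};

#define XFEATURE_MASK_CET_USER		(1ULL << XFEATURE_CET_USER)
#define XFEATURE_MASK_CET_KERNEL	(1ULL << XFEATURE_CET_KERNEL)

/* Layouts copied from the patch. */
struct cet_user_state {
	uint64_t user_cet;	/* user control-flow settings */
	uint64_t user_ssp;	/* user shadow stack pointer */
};

struct cet_kernel_state {
	uint64_t kernel_ssp;	/* kernel shadow stack */
	uint64_t pl1_ssp;	/* privilege level 1 shadow stack */
	uint64_t pl2_ssp;	/* privilege level 2 shadow stack */
};

/* Two and three MSR-sized slots respectively, with no padding. */
static_assert(sizeof(struct cet_user_state) == 16, "two u64 fields");
static_assert(sizeof(struct cet_kernel_state) == 24, "three u64 fields");

/*
 * Hypothetical model of the mask filtering: the reserved component is
 * always dropped, and both CET components are dropped only when neither
 * SHSTK nor IBT is available, because the two features share the states.
 */
static inline uint64_t filter_cet_xfeatures(uint64_t mask, bool has_shstk,
					    bool has_ibt)
{
	mask &= ~(1ULL << XFEATURE_RESERVED);
	if (!has_shstk && !has_ibt)
		mask &= ~(XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL);
	return mask;
}
```

The point of the special case is that a CPU with IBT but without Shadow
Stack still needs component 11 saved and restored, so keying the table
entry on X86_FEATURE_SHSTK alone would be wrong.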

* Re: [RFC PATCH v9 04/27] x86/cet: Add control-protection fault handler
  2020-02-05 18:19 ` [RFC PATCH v9 04/27] x86/cet: Add control-protection fault handler Yu-cheng Yu
@ 2020-02-25 20:06   ` Kees Cook
  2020-02-26 17:10   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:06 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:12AM -0800, Yu-cheng Yu wrote:
> A control-protection fault is triggered when a control-flow transfer
> attempt violates Shadow Stack or Indirect Branch Tracking constraints.
> For example, the return address for a RET instruction differs from the copy
> on the Shadow Stack; or an indirect JMP instruction, without the NOTRACK
> prefix, arrives at a non-ENDBR opcode.
> 
> The control-protection fault handler works similarly to the general
> protection fault handler.  It delivers SIGSEGV with si_code SEGV_CPERR to
> the signal handler.
> 
> v9:
> - Add Shadow Stack pointer to the fault printout.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/x86/entry/entry_64.S          |  2 +-
>  arch/x86/include/asm/traps.h       |  3 ++
>  arch/x86/kernel/idt.c              |  4 ++
>  arch/x86/kernel/signal_compat.c    |  2 +-
>  arch/x86/kernel/traps.c            | 59 ++++++++++++++++++++++++++++++
>  include/uapi/asm-generic/siginfo.h |  3 +-
>  6 files changed, 70 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 76942cbd95a1..6ca77312d008 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1034,7 +1034,7 @@ idtentry spurious_interrupt_bug		do_spurious_interrupt_bug	has_error_code=0
>  idtentry coprocessor_error		do_coprocessor_error		has_error_code=0
>  idtentry alignment_check		do_alignment_check		has_error_code=1
>  idtentry simd_coprocessor_error		do_simd_coprocessor_error	has_error_code=0
> -
> +idtentry control_protection		do_control_protection		has_error_code=1
>  
>  	/*
>  	 * Reload gs selector with exception handling
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index ffa0dc8a535e..7ac26bbd0bef 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -26,6 +26,7 @@ asmlinkage void invalid_TSS(void);
>  asmlinkage void segment_not_present(void);
>  asmlinkage void stack_segment(void);
>  asmlinkage void general_protection(void);
> +asmlinkage void control_protection(void);
>  asmlinkage void page_fault(void);
>  asmlinkage void async_page_fault(void);
>  asmlinkage void spurious_interrupt_bug(void);
> @@ -84,6 +85,7 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
>  void __init trap_init(void);
>  #endif
>  dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code);
> +dotraplinkage void do_control_protection(struct pt_regs *regs, long error_code);
>  dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
>  dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
>  dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code);
> @@ -154,6 +156,7 @@ enum {
>  	X86_TRAP_AC,		/* 17, Alignment Check */
>  	X86_TRAP_MC,		/* 18, Machine Check */
>  	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
> +	X86_TRAP_CP = 21,	/* 21 Control Protection Fault */
>  	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
>  };
>  
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index 87ef69a72c52..8ed406f469e7 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -102,6 +102,10 @@ static const __initconst struct idt_data def_idts[] = {
>  #elif defined(CONFIG_X86_32)
>  	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
>  #endif
> +
> +#ifdef CONFIG_X86_64
> +	INTG(X86_TRAP_CP,		control_protection),
> +#endif
>  };
>  
>  /*
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf0576cd0..c572a3de1037 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
>  	 */
>  	BUILD_BUG_ON(NSIGILL  != 11);
>  	BUILD_BUG_ON(NSIGFPE  != 15);
> -	BUILD_BUG_ON(NSIGSEGV != 7);
> +	BUILD_BUG_ON(NSIGSEGV != 8);
>  	BUILD_BUG_ON(NSIGBUS  != 5);
>  	BUILD_BUG_ON(NSIGTRAP != 5);
>  	BUILD_BUG_ON(NSIGCHLD != 6);
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 05da6b5b167b..99c83ee522ed 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -570,6 +570,65 @@ do_general_protection(struct pt_regs *regs, long error_code)
>  }
>  NOKPROBE_SYMBOL(do_general_protection);
>  
> +static const char * const control_protection_err[] = {
> +	"unknown",
> +	"near-ret",
> +	"far-ret/iret",
> +	"endbranch",
> +	"rstorssp",
> +	"setssbsy",
> +};
> +
> +/*
> + * When a control protection exception occurs, send a signal
> + * to the responsible application.  Currently, control
> + * protection is only enabled for the user mode.  This
> + * exception should not come from the kernel mode.
> + */
> +dotraplinkage void
> +do_control_protection(struct pt_regs *regs, long error_code)
> +{
> +	struct task_struct *tsk;
> +
> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> +	if (notify_die(DIE_TRAP, "control protection fault", regs,
> +		       error_code, X86_TRAP_CP, SIGSEGV) == NOTIFY_STOP)
> +		return;
> +	cond_local_irq_enable(regs);
> +
> +	if (!user_mode(regs))
> +		die("kernel control protection fault", regs, error_code);
> +
> +	if (!static_cpu_has(X86_FEATURE_SHSTK) &&
> +	    !static_cpu_has(X86_FEATURE_IBT))
> +		WARN_ONCE(1, "CET is disabled but got control protection fault\n");
> +
> +	tsk = current;
> +	tsk->thread.error_code = error_code;
> +	tsk->thread.trap_nr = X86_TRAP_CP;
> +
> +	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> +	    printk_ratelimit()) {
> +		unsigned int max_err;
> +		unsigned long ssp;
> +
> +		max_err = ARRAY_SIZE(control_protection_err) - 1;
> +		if ((error_code < 0) || (error_code > max_err))
> +			error_code = 0;
> +		rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +		pr_info("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
> +			tsk->comm, task_pid_nr(tsk),
> +			regs->ip, regs->sp, ssp, error_code,
> +			control_protection_err[error_code]);
> +		print_vma_addr(KERN_CONT " in ", regs->ip);
> +		pr_cont("\n");
> +	}
> +
> +	force_sig_fault(SIGSEGV, SEGV_CPERR,
> +			(void __user *)uprobe_get_trap_addr(regs));
> +}
> +NOKPROBE_SYMBOL(do_control_protection);
> +
>  dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>  {
>  #ifdef CONFIG_DYNAMIC_FTRACE
> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> index cb3d6c267181..693071dbe641 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -229,7 +229,8 @@ typedef struct siginfo {
>  #define SEGV_ACCADI	5	/* ADI not enabled for mapped object */
>  #define SEGV_ADIDERR	6	/* Disrupting MCD error */
>  #define SEGV_ADIPERR	7	/* Precise MCD exception */
> -#define NSIGSEGV	7
> +#define SEGV_CPERR	8
> +#define NSIGSEGV	8
>  
>  /*
>   * SIGBUS si_codes
> -- 
> 2.21.0
> 

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 107+ messages in thread
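
The error-code decoding in the fault printout of patch 04/27 amounts to a
bounds-checked table lookup.  The sketch below reproduces it in user space;
the table contents are taken from the patch, while the function name
cp_err_name() and the CP_ERR_COUNT macro are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Error-code names, in the order given by the patch. */
static const char * const control_protection_err[] = {
	"unknown",
	"near-ret",
	"far-ret/iret",
	"endbranch",
	"rstorssp",
	"setssbsy",
};

#define CP_ERR_COUNT \
	(sizeof(control_protection_err) / sizeof(control_protection_err[0]))

static_assert(CP_ERR_COUNT == 6, "table has six entries");

/*
 * Map a control-protection error code to its name.  Any value outside the
 * table is reported as "unknown" (index 0), as in the patch.
 */
static inline const char *cp_err_name(long error_code)
{
	if (error_code < 0 || (unsigned long)error_code >= CP_ERR_COUNT)
		error_code = 0;
	return control_protection_err[error_code];
}
```

With this, cp_err_name(1) yields "near-ret" and any out-of-range value such
as -1 or 99 yields "unknown".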

* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-05 18:19 ` [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection Yu-cheng Yu
@ 2020-02-25 20:07   ` Kees Cook
  2020-02-26 17:03   ` Dave Hansen
  2020-02-26 18:05   ` Dave Hansen
  2 siblings, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:07 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:13AM -0800, Yu-cheng Yu wrote:
> Introduce Kconfig option: X86_INTEL_SHADOW_STACK_USER.
> 
> Shadow Stack (SHSTK) provides protection against function return address
> corruption.  It is active when the kernel has this feature enabled, and
> both the processor and the application support it.  When this feature is
> enabled, legacy non-SHSTK applications continue to work, but without SHSTK
> protection.
> 
> The user-mode SHSTK protection is only implemented for the 64-bit kernel.
> IA32 applications are supported under the compatibility mode.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/Kconfig  | 22 ++++++++++++++++++++++
>  arch/x86/Makefile |  7 +++++++
>  2 files changed, 29 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5e8949953660..6c34b701c588 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1974,6 +1974,28 @@ config X86_INTEL_TSX_MODE_AUTO
>  	  side channel attacks- equals the tsx=auto command line parameter.
>  endchoice
>  
> +config X86_INTEL_CET
> +	def_bool n
> +
> +config ARCH_HAS_SHSTK
> +	def_bool n
> +
> +config X86_INTEL_SHADOW_STACK_USER
> +	prompt "Intel Shadow Stack for user-mode"
> +	def_bool n
> +	depends on CPU_SUP_INTEL && X86_64
> +	select ARCH_USES_HIGH_VMA_FLAGS
> +	select X86_INTEL_CET
> +	select ARCH_HAS_SHSTK
> +	---help---
> +	  Shadow Stack (SHSTK) provides protection against program
> +	  stack corruption.  It is active when the kernel has this
> +	  feature enabled, and the processor and the application
> +	  support it.  When this feature is enabled, legacy non-SHSTK
> +	  applications continue to work, but without SHSTK protection.
> +
> +	  If unsure, say y.
> +
>  config EFI
>  	bool "EFI runtime service support"
>  	depends on ACPI
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 94df0868804b..c34f5befa4c8 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -149,6 +149,13 @@ ifdef CONFIG_X86_X32
>  endif
>  export CONFIG_X86_X32_ABI
>  
> # Check assembler Shadow Stack support
> +ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +  ifeq ($(call as-instr, saveprevssp, y),)

This test needs to happen in the Kconfig rather than the Makefile; the
CONFIG should be unavailable if AS doesn't support the feature.

-Kees

> +      $(error CONFIG_X86_INTEL_SHADOW_STACK_USER not supported by the assembler)
> +  endif
> +endif
> +
>  #
>  # If the function graph tracer is used with mcount instead of fentry,
>  # '-maccumulate-outgoing-args' is needed to prevent a GCC bug
> -- 
> 2.21.0
> 

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 06/27] mm: Introduce VM_SHSTK for Shadow Stack memory
  2020-02-05 18:19 ` [RFC PATCH v9 06/27] mm: Introduce VM_SHSTK for Shadow Stack memory Yu-cheng Yu
@ 2020-02-25 20:07   ` Kees Cook
  2020-02-26 18:07   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:07 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:14AM -0800, Yu-cheng Yu wrote:
> A Shadow Stack (SHSTK) PTE must be read-only and have _PAGE_DIRTY set.
> However, read-only and Dirty PTEs also exist for copy-on-write (COW) pages.
> These two cases are handled differently for page faults and a new VM flag
> is necessary for tracking SHSTK VMAs.
> 
> v9:
> - Add VM_SHSTK case to arch_vma_name().
> - Revise the commit log to explain why a new VM flag is needed.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/x86/mm/mmap.c | 2 ++
>  fs/proc/task_mmu.c | 3 +++
>  include/linux/mm.h | 8 ++++++++
>  3 files changed, 13 insertions(+)
> 
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index aae9a933dfd4..482813b4c659 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -165,6 +165,8 @@ const char *arch_vma_name(struct vm_area_struct *vma)
>  {
>  	if (vma->vm_flags & VM_MPX)
>  		return "[mpx]";
> +	else if (vma->vm_flags & VM_SHSTK)
> +		return "[shadow stack]";
>  	return NULL;
>  }
>  
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 9442631fd4af..590b58ee008a 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -687,6 +687,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  		[ilog2(VM_PKEY_BIT4)]	= "",
>  #endif
>  #endif /* CONFIG_ARCH_HAS_PKEYS */
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +		[ilog2(VM_SHSTK)]	= "ss",
> +#endif
>  	};
>  	size_t i;
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cfaa8feecfe8..b5145fbe102e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
>  #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
>  #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
>  #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
>  #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>  
>  #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
>  # define VM_MPX		VM_NONE
>  #endif
>  
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +# define VM_SHSTK	VM_HIGH_ARCH_5
> +#else
> +# define VM_SHSTK	VM_NONE
> +#endif
> +
>  #ifndef VM_GROWSUP
>  # define VM_GROWSUP	VM_NONE
>  #endif
> -- 
> 2.21.0
> 

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 107+ messages in thread
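
To make the flag layout in patch 06/27 concrete, the sketch below computes
VM_SHSTK from VM_HIGH_ARCH_BIT_5 and mirrors the arch_vma_name() logic in
user space.  It assumes the CONFIG_X86_INTEL_SHADOW_STACK_USER=y case and
assumes VM_MPX maps to VM_HIGH_ARCH_4, as on x86-64 kernels of that period;
vma_name() is a hypothetical stand-in that takes the flags word directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIT_ULL(n)		(1ULL << (n))

#define VM_HIGH_ARCH_BIT_4	36
#define VM_HIGH_ARCH_BIT_5	37	/* added by the patch */

#define VM_MPX			BIT_ULL(VM_HIGH_ARCH_BIT_4)	/* assumption */
#define VM_SHSTK		BIT_ULL(VM_HIGH_ARCH_BIT_5)

/* Bit 37 does not fit in 32 bits, hence the 64-bit-only restriction. */
static_assert(VM_SHSTK > UINT32_MAX, "VM_SHSTK needs a 64-bit flags word");

/* Hypothetical user-space counterpart of arch_vma_name(). */
static inline const char *vma_name(uint64_t vm_flags)
{
	if (vm_flags & VM_MPX)
		return "[mpx]";
	else if (vm_flags & VM_SHSTK)
		return "[shadow stack]";
	return NULL;
}
```

A mapping with VM_SHSTK set would therefore show up as "[shadow stack]" in
/proc/<pid>/maps, and with the "ss" mnemonic in smaps per the task_mmu.c
hunk.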

* Re: [RFC PATCH v9 07/27] Add guard pages around a Shadow Stack.
  2020-02-05 18:19 ` [RFC PATCH v9 07/27] Add guard pages around a Shadow Stack Yu-cheng Yu
@ 2020-02-25 20:11   ` Kees Cook
  2020-02-26 18:17   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:11 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:15AM -0800, Yu-cheng Yu wrote:
> The INCSSPD/INCSSPQ instruction is used to unwind a Shadow Stack (SHSTK).  It
> performs 'pop and discard' of the first and last element from SHSTK in the
> range specified in the operand.  The maximum value of the operand is 255,
> and the maximum moving distance of the SHSTK pointer is 255 * 4 for
> INCSSPD, 255 * 8 for INCSSPQ.
> 
> Since SHSTK has a fixed size, creating a guard page above prevents
> INCSSP/RET from moving beyond.  Likewise, creating a guard page below
> prevents CALL from underflowing the SHSTK.

This commit log doesn't really explain why the code changes below are
needed. stack_guard_gap is configurable at boot, etc. This appears to be
limiting it? I don't follow this change...

-Kees

> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  include/linux/mm.h | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b5145fbe102e..75de07674649 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2464,9 +2464,15 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
>  static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
>  {
>  	unsigned long vm_start = vma->vm_start;
> +	unsigned long gap = 0;
>  
> -	if (vma->vm_flags & VM_GROWSDOWN) {
> -		vm_start -= stack_guard_gap;
> +	if (vma->vm_flags & VM_GROWSDOWN)
> +		gap = stack_guard_gap;
> +	else if (vma->vm_flags & VM_SHSTK)
> +		gap = PAGE_SIZE;
> +
> +	if (gap != 0) {
> +		vm_start -= gap;
>  		if (vm_start > vma->vm_start)
>  			vm_start = 0;
>  	}
> @@ -2476,9 +2482,15 @@ static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
>  static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
>  {
>  	unsigned long vm_end = vma->vm_end;
> +	unsigned long gap = 0;
> +
> +	if (vma->vm_flags & VM_GROWSUP)
> +		gap = stack_guard_gap;
> +	else if (vma->vm_flags & VM_SHSTK)
> +		gap = PAGE_SIZE;
>  
> -	if (vma->vm_flags & VM_GROWSUP) {
> -		vm_end += stack_guard_gap;
> +	if (gap != 0) {
> +		vm_end += gap;
>  		if (vm_end < vma->vm_end)
>  			vm_end = -PAGE_SIZE;
>  	}
> -- 
> 2.21.0
> 

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 107+ messages in thread
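
The arithmetic behind patch 07/27 can be checked in user space.  The sketch
below is a model under stated assumptions, not the kernel code: it assumes
4 KiB pages, the default stack_guard_gap of 256 pages, VM_GROWSUP defined as
0 (as on x86), and a hypothetical struct vma that holds only the three
fields the helpers read.  The static assertion records why one page is
enough: the largest INCSSPQ step, 255 * 8 bytes, is smaller than a page.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096ULL			/* assumption: 4 KiB pages */
#define VM_GROWSDOWN	0x00000100ULL
#define VM_GROWSUP	0ULL			/* VM_NONE on x86 */
#define VM_SHSTK	(1ULL << 37)		/* VM_HIGH_ARCH_5 */

/* A single INCSSPQ moves SSP by at most 255 * 8 bytes. */
static_assert(255 * 8 < PAGE_SIZE, "one guard page stops a maximal INCSSPQ");

/* Hypothetical, minimal stand-in for struct vm_area_struct. */
struct vma {
	uint64_t vm_start;
	uint64_t vm_end;
	uint64_t vm_flags;
};

/* Default guard gap: 256 pages (boot-time configurable in the kernel). */
static uint64_t stack_guard_gap = 256 * PAGE_SIZE;

static inline uint64_t vm_start_gap(const struct vma *vma)
{
	uint64_t vm_start = vma->vm_start;
	uint64_t gap = 0;

	if (vma->vm_flags & VM_GROWSDOWN)
		gap = stack_guard_gap;
	else if (vma->vm_flags & VM_SHSTK)
		gap = PAGE_SIZE;

	if (gap != 0) {
		vm_start -= gap;
		if (vm_start > vma->vm_start)	/* wrapped below zero */
			vm_start = 0;
	}
	return vm_start;
}

static inline uint64_t vm_end_gap(const struct vma *vma)
{
	uint64_t vm_end = vma->vm_end;
	uint64_t gap = 0;

	if (vma->vm_flags & VM_GROWSUP)
		gap = stack_guard_gap;
	else if (vma->vm_flags & VM_SHSTK)
		gap = PAGE_SIZE;

	if (gap != 0) {
		vm_end += gap;
		if (vm_end < vma->vm_end)	/* wrapped past the top */
			vm_end = (uint64_t)0 - PAGE_SIZE;
	}
	return vm_end;
}
```

In this model a Shadow Stack VMA always gets a fixed one-page gap on both
sides, independent of stack_guard_gap, while a VM_GROWSDOWN stack keeps the
configurable gap below it.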

* Re: [RFC PATCH v9 08/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW
  2020-02-05 18:19 ` [RFC PATCH v9 08/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW Yu-cheng Yu
@ 2020-02-25 20:12   ` Kees Cook
  2020-02-26 18:20   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:12 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:16AM -0800, Yu-cheng Yu wrote:
> Before introducing _PAGE_DIRTY_SW for non-hardware memory management
> purposes in the next patch, rename _PAGE_DIRTY to _PAGE_DIRTY_HW and
> _PAGE_BIT_DIRTY to _PAGE_BIT_DIRTY_HW to make the meaning of these PTE
> dirty bits clearer.  This patch makes no functional changes.
> 
> v9:
> - In some places _PAGE_DIRTY was not changed to _PAGE_DIRTY_HW, because
>   it would be changed again in the next patch to _PAGE_DIRTY_BITS.
>   However, this caused compile issues if the next patch was not yet applied.
>   Fix it by changing all _PAGE_DIRTY to _PAGE_DIRTY_HW.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/x86/include/asm/pgtable.h       | 18 +++++++++---------
>  arch/x86/include/asm/pgtable_types.h | 17 +++++++++--------
>  arch/x86/kernel/relocate_kernel_64.S |  2 +-
>  arch/x86/kvm/vmx/vmx.c               |  2 +-
>  4 files changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index ad97dc155195..ab50d25f9afc 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -122,7 +122,7 @@ extern pmdval_t early_pmd_flags;
>   */
>  static inline int pte_dirty(pte_t pte)
>  {
> -	return pte_flags(pte) & _PAGE_DIRTY;
> +	return pte_flags(pte) & _PAGE_DIRTY_HW;
>  }
>  
>  
> @@ -161,7 +161,7 @@ static inline int pte_young(pte_t pte)
>  
>  static inline int pmd_dirty(pmd_t pmd)
>  {
> -	return pmd_flags(pmd) & _PAGE_DIRTY;
> +	return pmd_flags(pmd) & _PAGE_DIRTY_HW;
>  }
>  
>  static inline int pmd_young(pmd_t pmd)
> @@ -171,7 +171,7 @@ static inline int pmd_young(pmd_t pmd)
>  
>  static inline int pud_dirty(pud_t pud)
>  {
> -	return pud_flags(pud) & _PAGE_DIRTY;
> +	return pud_flags(pud) & _PAGE_DIRTY_HW;
>  }
>  
>  static inline int pud_young(pud_t pud)
> @@ -312,7 +312,7 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>  
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
> -	return pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_DIRTY_HW);
>  }
>  
>  static inline pte_t pte_mkold(pte_t pte)
> @@ -332,7 +332,7 @@ static inline pte_t pte_mkexec(pte_t pte)
>  
>  static inline pte_t pte_mkdirty(pte_t pte)
>  {
> -	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
> +	return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
>  }
>  
>  static inline pte_t pte_mkyoung(pte_t pte)
> @@ -396,7 +396,7 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
>  
>  static inline pmd_t pmd_mkclean(pmd_t pmd)
>  {
> -	return pmd_clear_flags(pmd, _PAGE_DIRTY);
> +	return pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
>  }
>  
>  static inline pmd_t pmd_wrprotect(pmd_t pmd)
> @@ -406,7 +406,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
>  
>  static inline pmd_t pmd_mkdirty(pmd_t pmd)
>  {
> -	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
> +	return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
>  }
>  
>  static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> @@ -450,7 +450,7 @@ static inline pud_t pud_mkold(pud_t pud)
>  
>  static inline pud_t pud_mkclean(pud_t pud)
>  {
> -	return pud_clear_flags(pud, _PAGE_DIRTY);
> +	return pud_clear_flags(pud, _PAGE_DIRTY_HW);
>  }
>  
>  static inline pud_t pud_wrprotect(pud_t pud)
> @@ -460,7 +460,7 @@ static inline pud_t pud_wrprotect(pud_t pud)
>  
>  static inline pud_t pud_mkdirty(pud_t pud)
>  {
> -	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
> +	return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
>  }
>  
>  static inline pud_t pud_mkdevmap(pud_t pud)
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index b5e49e6bac63..e647e3c75578 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -15,7 +15,7 @@
>  #define _PAGE_BIT_PWT		3	/* page write through */
>  #define _PAGE_BIT_PCD		4	/* page cache disabled */
>  #define _PAGE_BIT_ACCESSED	5	/* was accessed (raised by CPU) */
> -#define _PAGE_BIT_DIRTY		6	/* was written to (raised by CPU) */
> +#define _PAGE_BIT_DIRTY_HW	6	/* was written to (raised by CPU) */
>  #define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
>  #define _PAGE_BIT_PAT		7	/* on 4KB pages */
>  #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
> @@ -45,7 +45,7 @@
>  #define _PAGE_PWT	(_AT(pteval_t, 1) << _PAGE_BIT_PWT)
>  #define _PAGE_PCD	(_AT(pteval_t, 1) << _PAGE_BIT_PCD)
>  #define _PAGE_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)
> -#define _PAGE_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
> +#define _PAGE_DIRTY_HW	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY_HW)
>  #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
>  #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
>  #define _PAGE_SOFTW1	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1)
> @@ -73,7 +73,7 @@
>  			 _PAGE_PKEY_BIT3)
>  
>  #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> -#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
> +#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY_HW | _PAGE_ACCESSED)
>  #else
>  #define _PAGE_KNL_ERRATUM_MASK 0
>  #endif
> @@ -111,9 +111,9 @@
>  #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
>  
>  #define _PAGE_TABLE_NOENC	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |\
> -				 _PAGE_ACCESSED | _PAGE_DIRTY)
> +				 _PAGE_ACCESSED | _PAGE_DIRTY_HW)
>  #define _KERNPG_TABLE_NOENC	(_PAGE_PRESENT | _PAGE_RW |		\
> -				 _PAGE_ACCESSED | _PAGE_DIRTY)
> +				 _PAGE_ACCESSED | _PAGE_DIRTY_HW)
>  
>  /*
>   * Set of bits not changed in pte_modify.  The pte's
> @@ -122,7 +122,7 @@
>   * pte_modify() does modify it.
>   */
>  #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
> -			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
> +			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_HW |	\
>  			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
>  #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
>  
> @@ -167,7 +167,8 @@ enum page_cache_mode {
>  					 _PAGE_ACCESSED)
>  
>  #define __PAGE_KERNEL_EXEC						\
> -	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL)
> +	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY_HW | _PAGE_ACCESSED | \
> +	 _PAGE_GLOBAL)
>  #define __PAGE_KERNEL		(__PAGE_KERNEL_EXEC | _PAGE_NX)
>  
>  #define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
> @@ -186,7 +187,7 @@ enum page_cache_mode {
>  #define _PAGE_ENC	(_AT(pteval_t, sme_me_mask))
>  
>  #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
> -			 _PAGE_DIRTY | _PAGE_ENC)
> +			 _PAGE_DIRTY_HW | _PAGE_ENC)
>  #define _PAGE_TABLE	(_KERNPG_TABLE | _PAGE_USER)
>  
>  #define __PAGE_KERNEL_ENC	(__PAGE_KERNEL | _PAGE_ENC)
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index ef3ba99068d3..3acd75f97b61 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -15,7 +15,7 @@
>   */
>  
>  #define PTR(x) (x << 3)
> -#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
> +#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY_HW)
>  
>  /*
>   * control_page + KEXEC_CONTROL_CODE_MAX_SIZE
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index e3394c839dea..fbbbf621b0d9 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3503,7 +3503,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
>  	/* Set up identity-mapping pagetable for EPT in real mode */
>  	for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
>  		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> -			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> +			_PAGE_ACCESSED | _PAGE_DIRTY_HW | _PAGE_PSE);
>  		r = kvm_write_guest_page(kvm, identity_map_pfn,
>  				&tmp, i * sizeof(tmp), sizeof(tmp));
>  		if (r < 0)
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW
  2020-02-05 18:19 ` [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW Yu-cheng Yu
@ 2020-02-25 20:12   ` Kees Cook
  2020-02-26 21:35   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:12 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:17AM -0800, Yu-cheng Yu wrote:
> When Shadow Stack (SHSTK) is introduced, a R/O and Dirty PTE exists in the
> following cases:
> 
> (a) A modified, copy-on-write (COW) page;
> (b) A R/O page that has been COW'ed;
> (c) A SHSTK page.
> 
> To separate non-SHSTK memory from SHSTK, introduce a spare bit of the
> 64-bit PTE as _PAGE_BIT_DIRTY_SW and use that for case (a) and (b).
> This results in the following possible settings:
> 
> Modified PTE:         (R/W + DIRTY_HW)
> Modified and COW PTE: (R/O + DIRTY_SW)
> R/O PTE COW'ed:       (R/O + DIRTY_SW)
> SHSTK PTE:            (R/O + DIRTY_HW)
> SHSTK shared PTE[1]:  (R/O + DIRTY_SW)
> SHSTK PTE COW'ed:     (R/O + DIRTY_HW)
> 
> [1] When a SHSTK page is being shared among threads, its PTE is cleared of
>     _PAGE_DIRTY_HW, so the next SHSTK access causes a fault, and the page
>     is duplicated and _PAGE_DIRTY_HW is set again.
> 
> With this, in pte_wrprotect(), if SHSTK is active, use _PAGE_DIRTY_SW for
> the Dirty bit, and in pte_mkwrite() use _PAGE_DIRTY_HW.  The same changes
> apply to pmd and pud.
> 
> When this patch is applied, there are six free bits left in the 64-bit PTE.
> There are no more free bits in the 32-bit PTE (except for PAE) and SHSTK is
> not implemented for the 32-bit kernel.
> 
> v9:
> - Remove pte_move_flags() etc. and put the logic directly in
>   pte_wrprotect()/pte_mkwrite() etc.
> - Change compile-time conditionals to run-time checks.
> - Split out pte_modify()/pmd_modify() to a new patch.
> - Update comments.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees
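
To make the table in the commit message concrete, here is a minimal
user-space sketch of how the Dirty bit moves between the hardware and
software positions.  It is not the kernel code: the sketch_* names and the
shstk_enabled flag (standing in for static_cpu_has(X86_FEATURE_SHSTK)) are
assumptions for illustration.  The DIRTY_HW and DIRTY_SW bit positions
follow the patch, and RW is bit 1 as on x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions: RW = 1 (x86), DIRTY_HW = 6 and DIRTY_SW = 58 (per the patch). */
#define SKETCH_PAGE_RW         (UINT64_C(1) << 1)
#define SKETCH_PAGE_DIRTY_HW   (UINT64_C(1) << 6)
#define SKETCH_PAGE_DIRTY_SW   (UINT64_C(1) << 58)
#define SKETCH_PAGE_DIRTY_BITS (SKETCH_PAGE_DIRTY_HW | SKETCH_PAGE_DIRTY_SW)

/* Hypothetical stand-in for static_cpu_has(X86_FEATURE_SHSTK). */
static bool shstk_enabled = true;

/* Model of pte_wrprotect(): move DIRTY_HW to DIRTY_SW, then clear RW. */
static inline uint64_t sketch_wrprotect(uint64_t pte)
{
	if (shstk_enabled && (pte & SKETCH_PAGE_DIRTY_HW)) {
		pte &= ~SKETCH_PAGE_DIRTY_HW;
		pte |= SKETCH_PAGE_DIRTY_SW;
	}
	return pte & ~SKETCH_PAGE_RW;
}

/* Model of pte_mkwrite(): move DIRTY_SW back to DIRTY_HW, then set RW. */
static inline uint64_t sketch_mkwrite(uint64_t pte)
{
	if (shstk_enabled && (pte & SKETCH_PAGE_DIRTY_SW)) {
		pte &= ~SKETCH_PAGE_DIRTY_SW;
		pte |= SKETCH_PAGE_DIRTY_HW;
	}
	return pte | SKETCH_PAGE_RW;
}

/* Model of pte_dirty(): either Dirty bit counts. */
static inline bool sketch_dirty(uint64_t pte)
{
	return (pte & SKETCH_PAGE_DIRTY_BITS) != 0;
}

/* A Shadow Stack encoding is R/O + DIRTY_HW. */
static inline bool sketch_is_shstk(uint64_t pte)
{
	return !(pte & SKETCH_PAGE_RW) && (pte & SKETCH_PAGE_DIRTY_HW);
}
```

With these helpers, write-protecting a (R/W + DIRTY_HW) value yields
(R/O + DIRTY_SW), which still counts as dirty but no longer matches the
Shadow Stack encoding, and sketch_mkwrite() reverses the move.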

> ---
>  arch/x86/include/asm/pgtable.h       | 111 ++++++++++++++++++++++++---
>  arch/x86/include/asm/pgtable_types.h |  31 +++++++-
>  2 files changed, 131 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index ab50d25f9afc..62aeb118bc36 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -120,9 +120,9 @@ extern pmdval_t early_pmd_flags;
>   * The following only work if pte_present() is true.
>   * Undefined behaviour if not..
>   */
> -static inline int pte_dirty(pte_t pte)
> +static inline bool pte_dirty(pte_t pte)
>  {
> -	return pte_flags(pte) & _PAGE_DIRTY_HW;
> +	return pte_flags(pte) & _PAGE_DIRTY_BITS;
>  }
>  
>  
> @@ -159,9 +159,9 @@ static inline int pte_young(pte_t pte)
>  	return pte_flags(pte) & _PAGE_ACCESSED;
>  }
>  
> -static inline int pmd_dirty(pmd_t pmd)
> +static inline bool pmd_dirty(pmd_t pmd)
>  {
> -	return pmd_flags(pmd) & _PAGE_DIRTY_HW;
> +	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
>  }
>  
>  static inline int pmd_young(pmd_t pmd)
> @@ -169,9 +169,9 @@ static inline int pmd_young(pmd_t pmd)
>  	return pmd_flags(pmd) & _PAGE_ACCESSED;
>  }
>  
> -static inline int pud_dirty(pud_t pud)
> +static inline bool pud_dirty(pud_t pud)
>  {
> -	return pud_flags(pud) & _PAGE_DIRTY_HW;
> +	return pud_flags(pud) & _PAGE_DIRTY_BITS;
>  }
>  
>  static inline int pud_young(pud_t pud)
> @@ -312,7 +312,7 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>  
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
> -	return pte_clear_flags(pte, _PAGE_DIRTY_HW);
> +	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
>  }
>  
>  static inline pte_t pte_mkold(pte_t pte)
> @@ -322,6 +322,17 @@ static inline pte_t pte_mkold(pte_t pte)
>  
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
> +	/*
> +	 * Use _PAGE_DIRTY_SW on a R/O PTE to set it apart from
> +	 * a Shadow Stack PTE, which is R/O + _PAGE_DIRTY_HW.
> +	 */
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {
> +		if (pte_flags(pte) & _PAGE_DIRTY_HW) {
> +			pte = pte_clear_flags(pte, _PAGE_DIRTY_HW);
> +			pte = pte_set_flags(pte, _PAGE_DIRTY_SW);
> +		}
> +	}
> +
>  	return pte_clear_flags(pte, _PAGE_RW);
>  }
>  
> @@ -332,9 +343,25 @@ static inline pte_t pte_mkexec(pte_t pte)
>  
>  static inline pte_t pte_mkdirty(pte_t pte)
>  {
> +	pteval_t dirty = _PAGE_DIRTY_HW;
> +
> +	if (static_cpu_has(X86_FEATURE_SHSTK) && !pte_write(pte))
> +		dirty = _PAGE_DIRTY_SW;
> +
> +	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
> +}
> +
> +static inline pte_t pte_mkdirty_shstk(pte_t pte)
> +{
> +	pte = pte_clear_flags(pte, _PAGE_DIRTY_SW);
>  	return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
>  }
>  
> +static inline bool pte_dirty_hw(pte_t pte)
> +{
> +	return pte_flags(pte) & _PAGE_DIRTY_HW;
> +}
> +
>  static inline pte_t pte_mkyoung(pte_t pte)
>  {
>  	return pte_set_flags(pte, _PAGE_ACCESSED);
> @@ -342,6 +369,13 @@ static inline pte_t pte_mkyoung(pte_t pte)
>  
>  static inline pte_t pte_mkwrite(pte_t pte)
>  {
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {
> +		if (pte_flags(pte) & _PAGE_DIRTY_SW) {
> +			pte = pte_clear_flags(pte, _PAGE_DIRTY_SW);
> +			pte = pte_set_flags(pte, _PAGE_DIRTY_HW);
> +		}
> +	}
> +
>  	return pte_set_flags(pte, _PAGE_RW);
>  }
>  
> @@ -396,19 +430,46 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
>  
>  static inline pmd_t pmd_mkclean(pmd_t pmd)
>  {
> -	return pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
> +	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
>  }
>  
>  static inline pmd_t pmd_wrprotect(pmd_t pmd)
>  {
> +	/*
> +	 * Use _PAGE_DIRTY_SW on a R/O PMD to set it apart from
> +	 * a Shadow Stack PTE, which is R/O + _PAGE_DIRTY_HW.
> +	 */
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {
> +		if (pmd_flags(pmd) & _PAGE_DIRTY_HW) {
> +			pmd = pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
> +			pmd = pmd_set_flags(pmd, _PAGE_DIRTY_SW);
> +		}
> +	}
> +
>  	return pmd_clear_flags(pmd, _PAGE_RW);
>  }
>  
>  static inline pmd_t pmd_mkdirty(pmd_t pmd)
>  {
> +	pmdval_t dirty = _PAGE_DIRTY_HW;
> +
> +	if (static_cpu_has(X86_FEATURE_SHSTK) && !(pmd_flags(pmd) & _PAGE_RW))
> +		dirty = _PAGE_DIRTY_SW;
> +
> +	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
> +}
> +
> +static inline pmd_t pmd_mkdirty_shstk(pmd_t pmd)
> +{
> +	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY_SW);
>  	return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
>  }
>  
> +static inline bool pmd_dirty_hw(pmd_t pmd)
> +{
> +	return  pmd_flags(pmd) & _PAGE_DIRTY_HW;
> +}
> +
>  static inline pmd_t pmd_mkdevmap(pmd_t pmd)
>  {
>  	return pmd_set_flags(pmd, _PAGE_DEVMAP);
> @@ -426,6 +487,13 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
>  
>  static inline pmd_t pmd_mkwrite(pmd_t pmd)
>  {
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {
> +		if (pmd_flags(pmd) & _PAGE_DIRTY_SW) {
> +			pmd = pmd_clear_flags(pmd, _PAGE_DIRTY_SW);
> +			pmd = pmd_set_flags(pmd, _PAGE_DIRTY_HW);
> +		}
> +	}
> +
>  	return pmd_set_flags(pmd, _PAGE_RW);
>  }
>  
> @@ -450,17 +518,33 @@ static inline pud_t pud_mkold(pud_t pud)
>  
>  static inline pud_t pud_mkclean(pud_t pud)
>  {
> -	return pud_clear_flags(pud, _PAGE_DIRTY_HW);
> +	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
>  }
>  
>  static inline pud_t pud_wrprotect(pud_t pud)
>  {
> +	/*
> +	 * Use _PAGE_DIRTY_SW on a R/O PUD to set it apart from
> +	 * a Shadow Stack PTE, which is R/O + _PAGE_DIRTY_HW.
> +	 */
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {
> +		if (pud_flags(pud) & _PAGE_DIRTY_HW) {
> +			pud = pud_clear_flags(pud, _PAGE_DIRTY_HW);
> +			pud = pud_set_flags(pud, _PAGE_DIRTY_SW);
> +		}
> +	}
> +
>  	return pud_clear_flags(pud, _PAGE_RW);
>  }
>  
>  static inline pud_t pud_mkdirty(pud_t pud)
>  {
> -	return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
> +	pudval_t dirty = _PAGE_DIRTY_HW;
> +
> +	if (static_cpu_has(X86_FEATURE_SHSTK) && !(pud_flags(pud) & _PAGE_RW))
> +		dirty = _PAGE_DIRTY_SW;
> +
> +	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
>  }
>  
>  static inline pud_t pud_mkdevmap(pud_t pud)
> @@ -480,6 +564,13 @@ static inline pud_t pud_mkyoung(pud_t pud)
>  
>  static inline pud_t pud_mkwrite(pud_t pud)
>  {
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {
> +		if (pud_flags(pud) & _PAGE_DIRTY_SW) {
> +			pud = pud_clear_flags(pud, _PAGE_DIRTY_SW);
> +			pud = pud_set_flags(pud, _PAGE_DIRTY_HW);
> +		}
> +	}
> +
>  	return pud_set_flags(pud, _PAGE_RW);
>  }
>  
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index e647e3c75578..826823df917f 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -23,7 +23,8 @@
>  #define _PAGE_BIT_SOFTW2	10	/* " */
>  #define _PAGE_BIT_SOFTW3	11	/* " */
>  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
> -#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
> +#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
> +#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
>  #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
>  #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
>  #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
> @@ -35,6 +36,12 @@
>  #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
>  #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
>  
> +/*
> + * This bit indicates a copy-on-write page, and is different from
> + * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
> + */
> +#define _PAGE_BIT_DIRTY_SW	_PAGE_BIT_SOFTW5 /* was written to */
> +
>  /* If _PAGE_BIT_PRESENT is clear, we use these: */
>  /* - if the user mapped it with PROT_NONE; pte_present gives true */
>  #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
> @@ -108,6 +115,28 @@
>  #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
>  #endif
>  
> +/* A R/O and dirty PTE exists in the following cases:
> + *	(a) A modified, copy-on-write (COW) page;
> + *	(b) A R/O page that has been COW'ed;
> + *	(c) A SHSTK page.
> + * _PAGE_DIRTY_SW is used to separate case (c) from others.
> + * This results in the following settings:
> + *
> + *	Modified PTE:         (R/W + DIRTY_HW)
> + *	Modified and COW PTE: (R/O + DIRTY_SW)
> + *	R/O PTE COW'ed:       (R/O + DIRTY_SW)
> + *	SHSTK PTE:            (R/O + DIRTY_HW)
> + *	SHSTK PTE COW'ed:     (R/O + DIRTY_HW)
> + *	SHSTK PTE being shared among threads: (R/O + DIRTY_SW)
> + */
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +#define _PAGE_DIRTY_SW	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY_SW)
> +#else
> +#define _PAGE_DIRTY_SW	(_AT(pteval_t, 0))
> +#endif
> +
> +#define _PAGE_DIRTY_BITS (_PAGE_DIRTY_HW | _PAGE_DIRTY_SW)
> +
>  #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
>  
>  #define _PAGE_TABLE_NOENC	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |\
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 11/27] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  2020-02-05 18:19 ` [RFC PATCH v9 11/27] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
@ 2020-02-25 20:13   ` Kees Cook
  2020-02-26 22:04   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:13 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:19AM -0800, Yu-cheng Yu wrote:
> After the introduction of _PAGE_DIRTY_SW, a dirty PTE can have either
> _PAGE_DIRTY_HW or _PAGE_DIRTY_SW.  Change _PAGE_DIRTY to _PAGE_DIRTY_BITS.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 4b04af569c05..e467ca182633 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1201,7 +1201,7 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
>  	}
>  
>  	/* Clear dirty field. */
> -	se->val64 &= ~_PAGE_DIRTY;
> +	se->val64 &= ~_PAGE_DIRTY_BITS;
>  
>  	ops->clear_pse(se);
>  	ops->clear_ips(se);
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 12/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW
  2020-02-05 18:19 ` [RFC PATCH v9 12/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW Yu-cheng Yu
@ 2020-02-25 20:14   ` Kees Cook
  2020-02-26 22:20   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:14 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:20AM -0800, Yu-cheng Yu wrote:
> When Shadow Stack (SHSTK) is enabled, the [R/O + PAGE_DIRTY_HW] setting is
> reserved only for SHSTK.  Non-Shadow Stack R/O PTEs are
> [R/O + PAGE_DIRTY_SW].
> 
> When a PTE goes from [R/W + PAGE_DIRTY_HW] to [R/O + PAGE_DIRTY_SW], it
> could become a transient SHSTK PTE in two cases.
> 
> The first case is that some processors can start a write but end up seeing
> a read-only PTE by the time they get to the Dirty bit, creating a transient
> SHSTK PTE.  However, this will not occur on processors supporting SHSTK,
> so we don't need a TLB flush here.
> 
> The second case is that when software tests and replaces PAGE_DIRTY_HW with
> PAGE_DIRTY_SW without an atomic operation, a transient SHSTK PTE can exist.
> This is prevented with cmpxchg.
> 
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights into the issue.  Jann Horn provided the cmpxchg solution.
> 
> v9:
> - Change compile-time conditionals to runtime checks.
> - Fix parameters of try_cmpxchg(): change pte_t/pmd_t to
>   pte_t.pte/pmd_t.pmd.
> 
> v4:
> - Implement try_cmpxchg().
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees
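
The second case in the commit message is the classic read-modify-write
race, so a small user-space sketch of the compare-and-exchange loop may
help.  It uses C11 atomics in place of the kernel's try_cmpxchg(); the
sketch_* names are hypothetical, and the Shadow Stack feature check is
assumed to have passed already.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Bit positions: RW = 1 (x86), DIRTY_HW = 6 and DIRTY_SW = 58 (per the series). */
#define SKETCH_PAGE_RW        (UINT64_C(1) << 1)
#define SKETCH_PAGE_DIRTY_HW  (UINT64_C(1) << 6)
#define SKETCH_PAGE_DIRTY_SW  (UINT64_C(1) << 58)

/*
 * Clear RW and move DIRTY_HW to DIRTY_SW in a single atomic update, so no
 * observer can see the intermediate (R/O + DIRTY_HW) value.
 */
static inline void sketch_set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t next;

	do {
		next = old & ~SKETCH_PAGE_RW;
		if (next & SKETCH_PAGE_DIRTY_HW) {
			next &= ~SKETCH_PAGE_DIRTY_HW;
			next |= SKETCH_PAGE_DIRTY_SW;
		}
		/* On failure, 'old' is reloaded with the current value and we retry. */
	} while (!atomic_compare_exchange_weak(ptep, &old, next));
}

/* Convenience wrapper for single-threaded experiments. */
static inline uint64_t sketch_wrprotect_value(uint64_t initial)
{
	_Atomic uint64_t pte;

	atomic_init(&pte, initial);
	sketch_set_wrprotect(&pte);
	return atomic_load(&pte);
}
```

The point of the loop is that the new value is computed from a snapshot and
only stored if the PTE has not changed in the meantime; a separate
test-then-clear sequence would leave a window in which the entry is
read-only with DIRTY_HW still set.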

> ---
>  arch/x86/include/asm/pgtable.h | 66 ++++++++++++++++++++++++++++++++++
>  1 file changed, 66 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2733e7ec16b3..43cb27379208 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1253,6 +1253,39 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pte_t *ptep)
>  {
> +	/*
> +	 * Some processors can start a write, but end up seeing a read-only
> +	 * PTE by the time they get to the Dirty bit.  In this case, they
> +	 * will set the Dirty bit, leaving a read-only, Dirty PTE which
> +	 * looks like a Shadow Stack PTE.
> +	 *
> +	 * However, this behavior has been improved and will not occur on
> +	 * processors supporting Shadow Stack.  Without this guarantee, a
> +	 * transition to a non-present PTE and flush the TLB would be
> +	 * needed.
> +	 *
> +	 * When changing a writable PTE to read-only and if the PTE has
> +	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so that
> +	 * the PTE is not a valid Shadow Stack PTE.
> +	 */
> +#ifdef CONFIG_X86_64
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {
> +		pte_t new_pte, pte = READ_ONCE(*ptep);
> +
> +		do {
> +			/*
> +			 * This is the same as moving _PAGE_DIRTY_HW
> +			 * to _PAGE_DIRTY_SW.
> +			 */
> +			new_pte = pte_wrprotect(pte);
> +			new_pte.pte |= (new_pte.pte & _PAGE_DIRTY_HW) >>
> +					_PAGE_BIT_DIRTY_HW << _PAGE_BIT_DIRTY_SW;
> +			new_pte.pte &= ~_PAGE_DIRTY_HW;
> +		} while (!try_cmpxchg(&ptep->pte, &pte.pte, new_pte.pte));
> +
> +		return;
> +	}
> +#endif
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>  }
>  
> @@ -1303,6 +1336,39 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pmd_t *pmdp)
>  {
> +	/*
> +	 * Some processors can start a write, but end up seeing a read-only
> +	 * PMD by the time they get to the Dirty bit.  In this case, they
> +	 * will set the Dirty bit, leaving a read-only, Dirty PMD which
> +	 * looks like a Shadow Stack PMD.
> +	 *
> +	 * However, this behavior has been improved and will not occur on
> +	 * processors supporting Shadow Stack.  Without this guarantee, a
> +	 * transition to a non-present PMD and flush the TLB would be
> +	 * needed.
> +	 *
> +	 * When changing a writable PMD to read-only and if the PMD has
> +	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so that
> +	 * the PMD is not a valid Shadow Stack PMD.
> +	 */
> +#ifdef CONFIG_X86_64
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {
> +		pmd_t new_pmd, pmd = READ_ONCE(*pmdp);
> +
> +		do {
> +			/*
> +			 * This is the same as moving _PAGE_DIRTY_HW
> +			 * to _PAGE_DIRTY_SW.
> +			 */
> +			new_pmd = pmd_wrprotect(pmd);
> +			new_pmd.pmd |= (new_pmd.pmd & _PAGE_DIRTY_HW) >>
> +					_PAGE_BIT_DIRTY_HW << _PAGE_BIT_DIRTY_SW;
> +			new_pmd.pmd &= ~_PAGE_DIRTY_HW;
> +		} while (!try_cmpxchg(&pmdp->pmd, &pmd.pmd, new_pmd.pmd));
> +
> +		return;
> +	}
> +#endif
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
>  }
>  
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 13/27] x86/mm: Shadow Stack page fault error checking
  2020-02-05 18:19 ` [RFC PATCH v9 13/27] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
@ 2020-02-25 20:16   ` Kees Cook
  2020-02-26 22:47   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:16 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:21AM -0800, Yu-cheng Yu wrote:
> If a page fault is triggered by a Shadow Stack (SHSTK) access
> (e.g. CALL/RET) or SHSTK management instructions (e.g. WRUSSQ), then bit[6]
> of the page fault error code is set.
> 
> In access_error(), verify a SHSTK page fault is within a SHSTK memory area.
> It is always an error otherwise.
> 
> For a valid SHSTK access, set FAULT_FLAG_WRITE to effect copy-on-write.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees
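
For readers following along, the check described in the commit message can
be reduced to a few lines.  This is a simplified user-space sketch, not the
real access_error(): it keeps only the Shadow Stack and write branches, the
SKETCH_VM_* flag values are made up for the example, and the error-code
bits follow the enum in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Page fault error-code bits, as in enum x86_pf_error_code. */
#define SKETCH_PF_WRITE  (1UL << 1)
#define SKETCH_PF_SHSTK  (1UL << 6)

/* Hypothetical VMA flag values, chosen only for this sketch. */
#define SKETCH_VM_WRITE  (1UL << 0)
#define SKETCH_VM_SHSTK  (1UL << 1)

/* Returns true if the access must be treated as an error. */
static inline bool sketch_access_error(unsigned long error_code,
				       unsigned long vm_flags)
{
	/* A Shadow Stack fault is only valid inside a Shadow Stack VMA. */
	if (error_code & SKETCH_PF_SHSTK)
		return !(vm_flags & SKETCH_VM_SHSTK);

	/* An ordinary write needs a writable VMA. */
	if (error_code & SKETCH_PF_WRITE)
		return !(vm_flags & SKETCH_VM_WRITE);

	return false;
}

/* Both ordinary writes and Shadow Stack accesses take the write-fault path. */
static inline bool sketch_fault_is_write(unsigned long error_code)
{
	return (error_code & (SKETCH_PF_SHSTK | SKETCH_PF_WRITE)) != 0;
}
```

Treating a Shadow Stack access as a write fault is what lets the existing
copy-on-write machinery handle it.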

> ---
>  arch/x86/include/asm/traps.h |  2 ++
>  arch/x86/mm/fault.c          | 18 ++++++++++++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index 7ac26bbd0bef..8023d177fcd8 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -169,6 +169,7 @@ enum {
>   *   bit 3 ==				1: use of reserved bit detected
>   *   bit 4 ==				1: fault was an instruction fetch
>   *   bit 5 ==				1: protection keys block access
> + *   bit 6 ==				1: shadow stack access fault
>   */
>  enum x86_pf_error_code {
>  	X86_PF_PROT	=		1 << 0,
> @@ -177,5 +178,6 @@ enum x86_pf_error_code {
>  	X86_PF_RSVD	=		1 << 3,
>  	X86_PF_INSTR	=		1 << 4,
>  	X86_PF_PK	=		1 << 5,
> +	X86_PF_SHSTK	=		1 << 6,
>  };
>  #endif /* _ASM_X86_TRAPS_H */
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 304d31d8cbbc..9c1243302663 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1187,6 +1187,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>  				       (error_code & X86_PF_INSTR), foreign))
>  		return 1;
>  
> +	/*
> +	 * Verify X86_PF_SHSTK is within a Shadow Stack VMA.
> +	 * It is always an error if there is a Shadow Stack
> +	 * fault outside a Shadow Stack VMA.
> +	 */
> +	if (error_code & X86_PF_SHSTK) {
> +		if (!(vma->vm_flags & VM_SHSTK))
> +			return 1;
> +		return 0;
> +	}
> +
>  	if (error_code & X86_PF_WRITE) {
>  		/* write, present and write, not present: */
>  		if (unlikely(!(vma->vm_flags & VM_WRITE)))
> @@ -1344,6 +1355,13 @@ void do_user_addr_fault(struct pt_regs *regs,
>  
>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  
> +	/*
> +	 * If the fault is caused by a Shadow Stack access,
> +	 * i.e. CALL/RET/SAVEPREVSSP/RSTORSSP, then set
> +	 * FAULT_FLAG_WRITE to effect copy-on-write.
> +	 */
> +	if (hw_error_code & X86_PF_SHSTK)
> +		flags |= FAULT_FLAG_WRITE;
>  	if (hw_error_code & X86_PF_WRITE)
>  		flags |= FAULT_FLAG_WRITE;
>  	if (hw_error_code & X86_PF_INSTR)
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault
  2020-02-05 18:19 ` [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault Yu-cheng Yu
@ 2020-02-25 20:20   ` Kees Cook
  2020-03-05 18:30     ` Yu-cheng Yu
  2020-02-27  0:08   ` Dave Hansen
  1 sibling, 1 reply; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:20 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:22AM -0800, Yu-cheng Yu wrote:
> When a task does fork(), its Shadow Stack (SHSTK) must be duplicated for
> the child.  This patch implements a flow similar to copy-on-write of an
> anonymous page, but for SHSTK.
> 
> A SHSTK PTE must be RO and Dirty.  This Dirty bit requirement is used to
> effect the copying.  In copy_one_pte(), clear the Dirty bit from a SHSTK
> PTE to cause a page fault upon the next SHSTK access.  At that time, fix
> the PTE and copy/re-use the page.

Just to confirm, during the fork, it's really not a SHSTK for a moment
(it's still RO, but not dirty). Can other racing threads muck this up,
or is this bit removed only on the copied side?

-Kees
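
As an aid to reasoning about the question above, the following user-space
sketch traces the bit transitions the quoted description implies.  In the
diff quoted in this message, copy_one_pte() applies ptep_set_wrprotect() to
the source PTE and pte_wrprotect() to the copy, so the sketch applies the
same transition to both values.  The sketch_* names are hypothetical,
_PAGE_SOFT_DIRTY is left out, Shadow Stack is assumed to be enabled, and
the sketch says nothing about locking or racing threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions: RW = 1 (x86), DIRTY_HW = 6 and DIRTY_SW = 58 (per the series). */
#define SKETCH_PAGE_RW        (UINT64_C(1) << 1)
#define SKETCH_PAGE_DIRTY_HW  (UINT64_C(1) << 6)
#define SKETCH_PAGE_DIRTY_SW  (UINT64_C(1) << 58)

struct sketch_fork {
	uint64_t parent;
	uint64_t child;
};

/* Write-protect with the DIRTY_HW to DIRTY_SW move. */
static inline uint64_t sketch_wrprotect(uint64_t pte)
{
	if (pte & SKETCH_PAGE_DIRTY_HW) {
		pte &= ~SKETCH_PAGE_DIRTY_HW;
		pte |= SKETCH_PAGE_DIRTY_SW;
	}
	return pte & ~SKETCH_PAGE_RW;
}

/* Modelled on pte_mkdirty_shstk(): restore the Shadow Stack encoding. */
static inline uint64_t sketch_mkdirty_shstk(uint64_t pte)
{
	pte &= ~SKETCH_PAGE_DIRTY_SW;
	return pte | SKETCH_PAGE_DIRTY_HW;
}

static inline bool sketch_is_shstk(uint64_t pte)
{
	return !(pte & SKETCH_PAGE_RW) && (pte & SKETCH_PAGE_DIRTY_HW);
}

/* Fork-time step: both the source PTE and the copy are write-protected. */
static inline struct sketch_fork sketch_copy_one_pte(uint64_t src)
{
	struct sketch_fork r;

	r.parent = sketch_wrprotect(src);	/* ptep_set_wrprotect(src_mm, ...) */
	r.child = sketch_wrprotect(src);	/* pte = pte_wrprotect(pte) */
	return r;
}
```

In this model, neither value is a Shadow Stack encoding right after the
copy, and each one returns to (R/O + DIRTY_HW) only when its own fault is
handled and sketch_mkdirty_shstk() is applied.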

> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/mm/pgtable.c         | 15 +++++++++++++++
>  include/asm-generic/pgtable.h | 17 +++++++++++++++++
>  mm/memory.c                   |  7 ++++++-
>  3 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 7bd2c3a52297..2eb33794c08d 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -872,3 +872,18 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
>  
>  #endif /* CONFIG_X86_64 */
>  #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
> +
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +inline bool arch_copy_pte_mapping(vm_flags_t vm_flags)
> +{
> +	return (vm_flags & VM_SHSTK);
> +}
> +
> +inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
> +{
> +	if (vma->vm_flags & VM_SHSTK)
> +		return pte_mkdirty_shstk(pte);
> +	else
> +		return pte;
> +}
> +#endif /* CONFIG_X86_INTEL_SHADOW_STACK_USER */
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 798ea36a0549..9cb2f9ba5895 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -1190,6 +1190,23 @@ static inline bool arch_has_pfn_modify_check(void)
>  }
>  #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
>  
> +#ifdef CONFIG_MMU
> +#ifndef CONFIG_ARCH_HAS_SHSTK
> +static inline bool arch_copy_pte_mapping(vm_flags_t vm_flags)
> +{
> +	return false;
> +}
> +
> +static inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
> +{
> +	return pte;
> +}
> +#else
> +bool arch_copy_pte_mapping(vm_flags_t vm_flags);
> +pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma);
> +#endif
> +#endif /* CONFIG_MMU */
> +
>  /*
>   * Architecture PAGE_KERNEL_* fallbacks
>   *
> diff --git a/mm/memory.c b/mm/memory.c
> index 45442d9a4f52..6daa28614327 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -772,7 +772,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	 * If it's a COW mapping, write protect it both
>  	 * in the parent and the child
>  	 */
> -	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> +	if ((is_cow_mapping(vm_flags) && pte_write(pte)) ||
> +	    arch_copy_pte_mapping(vm_flags)) {
>  		ptep_set_wrprotect(src_mm, addr, src_pte);
>  		pte = pte_wrprotect(pte);
>  	}
> @@ -2417,6 +2418,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  	entry = pte_mkyoung(vmf->orig_pte);
>  	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	entry = pte_set_vma_features(entry, vma);
>  	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
>  		update_mmu_cache(vma, vmf->address, vmf->pte);
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -2504,6 +2506,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		entry = pte_set_vma_features(entry, vma);
>  		/*
>  		 * Clear the pte entry and flush it first, before updating the
>  		 * pte with the new entry. This will avoid a race condition
> @@ -3023,6 +3026,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	pte = mk_pte(page, vma->vm_page_prot);
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +		pte = pte_set_vma_features(pte, vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
>  		exclusive = RMAP_EXCLUSIVE;
> @@ -3165,6 +3169,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	entry = mk_pte(page, vma->vm_page_prot);
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry));
> +	entry = pte_set_vma_features(entry, vma);
>  
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>  			&vmf->ptl);
> -- 
> 2.21.0
> 
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 15/27] mm: Handle THP/HugeTLB Shadow Stack page fault
  2020-02-05 18:19 ` [RFC PATCH v9 15/27] mm: Handle THP/HugeTLB " Yu-cheng Yu
@ 2020-02-25 20:59   ` Kees Cook
  2020-03-13 22:00     ` Yu-cheng Yu
  0 siblings, 1 reply; 107+ messages in thread
From: Kees Cook @ 2020-02-25 20:59 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:23AM -0800, Yu-cheng Yu wrote:
> This patch implements THP Shadow Stack (SHSTK) copying in the same way as
> in the previous patch for regular PTE.
> 
> In copy_huge_pmd(), clear the dirty bit from the PMD to cause a page fault
> upon the next SHSTK access to the PMD.  At that time, fix the PMD and
> copy/re-use the page.

Now is as good a time as any to ask: do you have selftests for all this?
It seems like it would be really nice to have a way to verify SHSTK is
working correctly.

-Kees
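
One possible starting point for such a selftest, sketched here as an
assumption rather than anything taken from this series: skip early when the
CPU does not advertise CET.  CPUID leaf 7, subleaf 0 reports Shadow Stack in
ECX bit 7 and IBT in EDX bit 20.  The helper and macro names below are made
up for illustration; the register values would come from something like
__get_cpuid_count() in the compiler's <cpuid.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0) feature bits for CET (macro names are illustrative). */
#define DEMO_CPUID7_ECX_CET_SS   (UINT32_C(1) << 7)   /* Shadow Stack */
#define DEMO_CPUID7_EDX_CET_IBT  (UINT32_C(1) << 20)  /* Indirect Branch Tracking */

/* Hypothetical helper: true if the leaf-7 ECX value advertises Shadow Stack. */
static bool cpu_has_shstk(uint32_t leaf7_ecx)
{
	return (leaf7_ecx & DEMO_CPUID7_ECX_CET_SS) != 0;
}

/* Hypothetical helper: true if the leaf-7 EDX value advertises IBT. */
static bool cpu_has_ibt(uint32_t leaf7_edx)
{
	return (leaf7_edx & DEMO_CPUID7_EDX_CET_IBT) != 0;
}
```

CPU support alone does not show that the kernel enabled SHSTK for a task, so
a real test would still have to provoke and observe a shadow stack violation.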

> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/mm/pgtable.c         |  8 ++++++++
>  include/asm-generic/pgtable.h | 11 +++++++++++
>  mm/huge_memory.c              |  4 ++++
>  3 files changed, 23 insertions(+)
> 
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 2eb33794c08d..3340b1d4e9da 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -886,4 +886,12 @@ inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
>  	else
>  		return pte;
>  }
> +
> +inline pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma)
> +{
> +	if (vma->vm_flags & VM_SHSTK)
> +		return pmd_mkdirty_shstk(pmd);
> +	else
> +		return pmd;
> +}
>  #endif /* CONFIG_X86_INTEL_SHADOW_STACK_USER */
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 9cb2f9ba5895..a9df093fdf45 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -1201,9 +1201,20 @@ static inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
>  {
>  	return pte;
>  }
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma)
> +{
> +	return pmd;
> +}
> +#endif
>  #else
>  bool arch_copy_pte_mapping(vm_flags_t vm_flags);
>  pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +pmd_t pmd_set_vma_features(pmd_t pmd, struct vm_area_struct *vma);
> +#endif
>  #endif
>  #endif /* CONFIG_MMU */
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a88093213674..93ef368df2dd 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -636,6 +636,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>  
>  		entry = mk_huge_pmd(page, vma->vm_page_prot);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_set_vma_features(entry, vma);
>  		page_add_new_anon_rmap(page, vma, haddr, true);
>  		mem_cgroup_commit_charge(page, memcg, false, true);
>  		lru_cache_add_active_or_unevictable(page, vma);
> @@ -1278,6 +1279,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
>  		pte_t entry;
>  		entry = mk_pte(pages[i], vma->vm_page_prot);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		entry = pte_set_vma_features(entry, vma);
>  		memcg = (void *)page_private(pages[i]);
>  		set_page_private(pages[i], 0);
>  		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
> @@ -1360,6 +1362,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  		pmd_t entry;
>  		entry = pmd_mkyoung(orig_pmd);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_set_vma_features(entry, vma);
>  		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry,  1))
>  			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
>  		ret |= VM_FAULT_WRITE;
> @@ -1432,6 +1435,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  		pmd_t entry;
>  		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_set_vma_features(entry, vma);
>  		pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
>  		page_add_new_anon_rmap(new_page, vma, haddr, true);
>  		mem_cgroup_commit_charge(new_page, memcg, false, true);
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 17/27] x86/cet/shstk: User-mode Shadow Stack support
  2020-02-05 18:19 ` [RFC PATCH v9 17/27] x86/cet/shstk: User-mode Shadow Stack support Yu-cheng Yu
@ 2020-02-25 21:07   ` Kees Cook
  2020-02-27  0:55   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 21:07 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:25AM -0800, Yu-cheng Yu wrote:
> This patch adds basic Shadow Stack (SHSTK) enabling/disabling routines.
> A task's SHSTK is allocated from memory with VM_SHSTK flag and read-only
> protection.  It has a fixed size of RLIMIT_STACK.
> 
> v9:
> - Change cpu_feature_enabled() to static_cpu_has().
> - Merge cet_disable_shstk to cet_disable_free_shstk.
> - Remove the empty slot at the top of the SHSTK, as it is not needed.
> - Move do_mmap_locked() to alloc_shstk(), which is a static function.
> 
> v6:
> - Create a function do_mmap_locked() for SHSTK allocation.
> 
> v2:
> - Change noshstk to no_cet_shstk.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/cet.h                    |  31 +++++
>  arch/x86/include/asm/disabled-features.h      |   8 +-
>  arch/x86/include/asm/processor.h              |   5 +
>  arch/x86/kernel/Makefile                      |   2 +
>  arch/x86/kernel/cet.c                         | 121 ++++++++++++++++++
>  arch/x86/kernel/cpu/common.c                  |  25 ++++
>  arch/x86/kernel/process.c                     |   1 +
>  .../arch/x86/include/asm/disabled-features.h  |   8 +-
>  8 files changed, 199 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/include/asm/cet.h
>  create mode 100644 arch/x86/kernel/cet.c
> 
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> new file mode 100644
> index 000000000000..c44c991ca91f
> --- /dev/null
> +++ b/arch/x86/include/asm/cet.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_CET_H
> +#define _ASM_X86_CET_H
> +
> +#ifndef __ASSEMBLY__
> +#include <linux/types.h>
> +
> +struct task_struct;
> +/*
> + * Per-thread CET status
> + */
> +struct cet_status {
> +	unsigned long	shstk_base;
> +	unsigned long	shstk_size;
> +	unsigned int	shstk_enabled:1;
> +};
> +
> +#ifdef CONFIG_X86_INTEL_CET
> +int cet_setup_shstk(void);
> +void cet_disable_free_shstk(struct task_struct *p);
> +#else
> +static inline void cet_disable_free_shstk(struct task_struct *p) {}
> +#endif
> +
> +#define cpu_x86_cet_enabled() \
> +	(static_cpu_has(X86_FEATURE_SHSTK) || \
> +	 static_cpu_has(X86_FEATURE_IBT))
> +
> +#endif /* __ASSEMBLY__ */
> +
> +#endif /* _ASM_X86_CET_H */
> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
> index 8e1d0bb46361..e1454509ad83 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -62,6 +62,12 @@
>  # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
>  #endif
>  
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +#define DISABLE_SHSTK	0
> +#else
> +#define DISABLE_SHSTK	(1<<(X86_FEATURE_SHSTK & 31))
> +#endif
> +
>  /*
>   * Make sure to add features to the correct mask
>   */
> @@ -81,7 +87,7 @@
>  #define DISABLED_MASK13	0
>  #define DISABLED_MASK14	0
>  #define DISABLED_MASK15	0
> -#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
> +#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
>  #define DISABLED_MASK17	0
>  #define DISABLED_MASK18	0
>  #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
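
For readers wondering why DISABLE_SHSTK lands in DISABLED_MASK16: feature
numbers encode a capability word and a bit within it, and the mask macro
keeps only the bit.  A minimal userspace sketch follows; the value 16*32+7
for X86_FEATURE_SHSTK is assumed here from the cpufeatures patch of this
series, and every DEMO_ name and helper is hypothetical.

```c
#include <stdint.h>

/* Feature number = word * 32 + bit, mirroring the cpufeatures layout. */
#define DEMO_FEATURE(word, bit)   ((word) * 32 + (bit))
#define DEMO_X86_FEATURE_SHSTK    DEMO_FEATURE(16, 7)  /* assumed value */

/* Which capability word (and so which DISABLED_MASKn) the feature belongs to. */
static unsigned int feature_word(unsigned int feature)
{
	return feature >> 5;
}

/* The bit mask inside that word, same arithmetic as (1 << (feature & 31)). */
static uint32_t feature_mask(unsigned int feature)
{
	return UINT32_C(1) << (feature & 31);
}
```

With the assumed value, the word is 16 and the mask is 0x80, which is why the
new term is OR-ed into DISABLED_MASK16 and not into another mask.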
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 0340aad3f2fc..793d210e64da 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -25,6 +25,7 @@ struct vm86;
>  #include <asm/special_insns.h>
>  #include <asm/fpu/types.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/cet.h>
>  
>  #include <linux/personality.h>
>  #include <linux/cache.h>
> @@ -539,6 +540,10 @@ struct thread_struct {
>  	unsigned int		sig_on_uaccess_err:1;
>  	unsigned int		uaccess_err:1;	/* uaccess failed */
>  
> +#ifdef CONFIG_X86_INTEL_CET
> +	struct cet_status	cet;
> +#endif
> +
>  	/* Floating point and extended processor state */
>  	struct fpu		fpu;
>  	/*
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 6175e370ee4a..b8c1ea4ab7eb 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -142,6 +142,8 @@ obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
>  obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
>  obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
>  
> +obj-$(CONFIG_X86_INTEL_CET)		+= cet.o
> +
>  ###
>  # 64 bit specific files
>  ifeq ($(CONFIG_X86_64),y)
> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> new file mode 100644
> index 000000000000..b4c7d88e9a8f
> --- /dev/null
> +++ b/arch/x86/kernel/cet.c
> @@ -0,0 +1,121 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * cet.c - Control-flow Enforcement (CET)
> + *
> + * Copyright (c) 2019, Intel Corporation.
> + * Yu-cheng Yu <yu-cheng.yu@intel.com>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched/signal.h>
> +#include <linux/compat.h>
> +#include <asm/msr.h>
> +#include <asm/user.h>
> +#include <asm/fpu/internal.h>
> +#include <asm/fpu/xstate.h>
> +#include <asm/fpu/types.h>
> +#include <asm/cet.h>
> +
> +static void start_update_msrs(void)
> +{
> +	fpregs_lock();
> +	if (test_thread_flag(TIF_NEED_FPU_LOAD))
> +		__fpregs_load_activate();
> +}
> +
> +static void end_update_msrs(void)
> +{
> +	fpregs_unlock();
> +}
> +
> +static unsigned long cet_get_shstk_addr(void)
> +{
> +	struct fpu *fpu = &current->thread.fpu;
> +	unsigned long ssp = 0;
> +
> +	fpregs_lock();
> +
> +	if (fpregs_state_valid(fpu, smp_processor_id())) {
> +		rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +	} else {
> +		struct cet_user_state *p;
> +
> +		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
> +		if (p)
> +			ssp = p->user_ssp;
> +	}
> +
> +	fpregs_unlock();
> +	return ssp;
> +}
> +
> +static unsigned long alloc_shstk(unsigned long size)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long addr, populate;
> +
> +	down_write(&mm->mmap_sem);
> +	addr = do_mmap(NULL, 0, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
> +		       VM_SHSTK, 0, &populate, NULL);
> +	up_write(&mm->mmap_sem);
> +
> +	if (populate)
> +		mm_populate(addr, populate);
> +
> +	return addr;
> +}
> +
> +int cet_setup_shstk(void)
> +{
> +	unsigned long addr, size;
> +	struct cet_status *cet = &current->thread.cet;
> +
> +	if (!static_cpu_has(X86_FEATURE_SHSTK))
> +		return -EOPNOTSUPP;
> +
> +	size = rlimit(RLIMIT_STACK);
> +	addr = alloc_shstk(size);
> +
> +	if (IS_ERR((void *)addr))
> +		return PTR_ERR((void *)addr);
> +
> +	cet->shstk_base = addr;
> +	cet->shstk_size = size;
> +	cet->shstk_enabled = 1;
> +
> +	start_update_msrs();
> +	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
> +	wrmsrl(MSR_IA32_U_CET, MSR_IA32_CET_SHSTK_EN);
> +	end_update_msrs();
> +	return 0;
> +}
> +
> +void cet_disable_free_shstk(struct task_struct *tsk)
> +{
> +	struct cet_status *cet = &tsk->thread.cet;
> +
> +	if (!static_cpu_has(X86_FEATURE_SHSTK) ||
> +	    !cet->shstk_enabled || !cet->shstk_base)
> +		return;
> +
> +	if (!tsk->mm || (tsk->mm != current->mm))
> +		return;
> +
> +	if (tsk == current) {
> +		u64 msr_val;
> +
> +		start_update_msrs();
> +		rdmsrl(MSR_IA32_U_CET, msr_val);
> +		wrmsrl(MSR_IA32_U_CET, msr_val & ~MSR_IA32_CET_SHSTK_EN);
> +		end_update_msrs();
> +	}
> +
> +	vm_munmap(cet->shstk_base, cet->shstk_size);
> +	cet->shstk_base = 0;
> +	cet->shstk_size = 0;
> +	cet->shstk_enabled = 0;
> +}
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 2e4d90294fe6..40498ec72fda 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -54,6 +54,7 @@
>  #include <asm/microcode_intel.h>
>  #include <asm/intel-family.h>
>  #include <asm/cpu_device_id.h>
> +#include <asm/cet.h>
>  #include <asm/uv/uv.h>
>  
>  #include "cpu.h"
> @@ -486,6 +487,29 @@ static __init int setup_disable_pku(char *arg)
>  __setup("nopku", setup_disable_pku);
>  #endif /* CONFIG_X86_64 */
>  
> +static __always_inline void setup_cet(struct cpuinfo_x86 *c)
> +{
> +	if (cpu_x86_cet_enabled())
> +		cr4_set_bits(X86_CR4_CET);
> +}
> +
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +static __init int setup_disable_shstk(char *s)
> +{
> +	/* require an exact match without trailing characters */
> +	if (s[0] != '\0')
> +		return 0;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SHSTK))
> +		return 1;
> +
> +	setup_clear_cpu_cap(X86_FEATURE_SHSTK);
> +	pr_info("x86: 'no_cet_shstk' specified, disabling Shadow Stack\n");
> +	return 1;
> +}
> +__setup("no_cet_shstk", setup_disable_shstk);
> +#endif

I wonder if this should be "cet_shstk=..." instead? Will it always be a
giant knob like this? Will we want to disable it for userspace but keep
it for kernel space, etc?
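
To make the question concrete, here is a rough userspace sketch of what
parsing a "cet_shstk=" value could look like.  The accepted strings ("off",
"user", "on") and all names are hypothetical; nothing in this series defines
them.  In the kernel the handler registered with __setup("cet_shstk=", ...)
would receive the text after the '='.

```c
#include <string.h>

/* Hypothetical modes for a finer-grained knob. */
enum demo_shstk_mode {
	DEMO_SHSTK_INVALID = -1,
	DEMO_SHSTK_OFF,		/* disable everywhere */
	DEMO_SHSTK_USER,	/* userspace only */
	DEMO_SHSTK_ON,		/* everything the kernel supports */
};

/* Parse the text that follows "cet_shstk=" on the command line. */
static enum demo_shstk_mode demo_parse_cet_shstk(const char *arg)
{
	if (!arg)
		return DEMO_SHSTK_INVALID;
	if (strcmp(arg, "off") == 0)
		return DEMO_SHSTK_OFF;
	if (strcmp(arg, "user") == 0)
		return DEMO_SHSTK_USER;
	if (strcmp(arg, "on") == 0)
		return DEMO_SHSTK_ON;
	/* Reject anything else, including trailing characters. */
	return DEMO_SHSTK_INVALID;
}
```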

> +
>  /*
>   * Some CPU features depend on higher CPUID levels, which may not always
>   * be available due to CPUID level capping or broken virtualization
> @@ -1510,6 +1534,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  	x86_init_rdrand(c);
>  	x86_init_cache_qos(c);
>  	setup_pku(c);
> +	setup_cet(c);
>  
>  	/*
>  	 * Clear/Set all flags overridden by options, need do it
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 8d0b9442202e..e102e63de641 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -43,6 +43,7 @@
>  #include <asm/spec-ctrl.h>
>  #include <asm/io_bitmap.h>
>  #include <asm/proto.h>
> +#include <asm/cet.h>
>  
>  #include "process.h"
>  
> diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h
> index 8e1d0bb46361..e1454509ad83 100644
> --- a/tools/arch/x86/include/asm/disabled-features.h
> +++ b/tools/arch/x86/include/asm/disabled-features.h
> @@ -62,6 +62,12 @@
>  # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
>  #endif
>  
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +#define DISABLE_SHSTK	0
> +#else
> +#define DISABLE_SHSTK	(1<<(X86_FEATURE_SHSTK & 31))
> +#endif
> +
>  /*
>   * Make sure to add features to the correct mask
>   */
> @@ -81,7 +87,7 @@
>  #define DISABLED_MASK13	0
>  #define DISABLED_MASK14	0
>  #define DISABLED_MASK15	0
> -#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
> +#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
>  #define DISABLED_MASK17	0
>  #define DISABLED_MASK18	0
>  #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
> -- 
> 2.21.0
> 
> 

Otherwise, looks good to me. :)

-- 
Kees Cook



* Re: [RFC PATCH v9 18/27] x86/cet/shstk: Introduce WRUSS instruction
  2020-02-05 18:19 ` [RFC PATCH v9 18/27] x86/cet/shstk: Introduce WRUSS instruction Yu-cheng Yu
@ 2020-02-25 21:10   ` Kees Cook
  2020-03-05 18:39     ` Yu-cheng Yu
  0 siblings, 1 reply; 107+ messages in thread
From: Kees Cook @ 2020-02-25 21:10 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:26AM -0800, Yu-cheng Yu wrote:
> WRUSS is a new kernel-mode instruction but writes directly to user Shadow
> Stack (SHSTK) memory.  This is used to construct a return address on SHSTK
> for the signal handler.
> 
> This instruction can fault if the user SHSTK is not valid SHSTK memory.
> In that case, the kernel does a fixup.

Since these functions aren't used in this patch, should this get merged
with patch 19?

-Kees

> 
> v4:
> - Change to asm goto.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/special_insns.h | 32 ++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index 6d37b8fcfc77..1b9b2e79c353 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -222,6 +222,38 @@ static inline void clwb(volatile void *__p)
>  		: [pax] "a" (p));
>  }
>  
> +#ifdef CONFIG_X86_INTEL_CET
> +#if defined(CONFIG_IA32_EMULATION) || defined(CONFIG_X86_X32)
> +static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
> +{
> +	asm_volatile_goto("1: wrussd %1, (%0)\n"
> +			  _ASM_EXTABLE(1b, %l[fail])
> +			  :: "r" (addr), "r" (val)
> +			  :: fail);
> +	return 0;
> +fail:
> +	return -EPERM;
> +}
> +#else
> +static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
> +{
> +	WARN_ONCE(1, "%s used but not supported.\n", __func__);
> +	return -EFAULT;
> +}
> +#endif
> +
> +static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
> +{
> +	asm_volatile_goto("1: wrussq %1, (%0)\n"
> +			  _ASM_EXTABLE(1b, %l[fail])
> +			  :: "r" (addr), "r" (val)
> +			  :: fail);
> +	return 0;
> +fail:
> +	return -EPERM;
> +}
> +#endif /* CONFIG_X86_INTEL_CET */
> +
>  #define nop() asm volatile ("nop")
>  
>  
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 19/27] x86/cet/shstk: Handle signals for Shadow Stack
  2020-02-05 18:19 ` [RFC PATCH v9 19/27] x86/cet/shstk: Handle signals for Shadow Stack Yu-cheng Yu
@ 2020-02-25 21:17   ` Kees Cook
  0 siblings, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 21:17 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:27AM -0800, Yu-cheng Yu wrote:
> To deliver a signal, create a Shadow Stack (SHSTK) restore token and put
> the token and the signal restorer address on the SHSTK.  For sigreturn,
> verify the token and restore the SHSTK pointer.
> 
> Introduce a signal context extension struct 'sc_ext', which is used to save
> SHSTK restore token address and WAIT_ENDBR status.  WAIT_ENDBR will be
> introduced later in the Indirect Branch Tracking (IBT) series, but add that
> into sc_ext now to keep the struct stable in case the IBT series is applied
> later.
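
The token arithmetic described above can be modelled in plain userspace C.
This is only a sketch of the checks in create_rstor_token() and
verify_rstor_token() from the diff below: it does no user memory access and
no WRUSS, and the model_ function names are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOKEN_MODE_MASK	UINT64_C(3)
#define TOKEN_MODE_64	UINT64_C(1)

/* Compute where the token is stored and which value it holds. */
static int model_create_token(bool ia32, uint64_t ssp,
			      uint64_t *token_addr, uint64_t *token)
{
	/* 64-bit SSP must be 8-byte aligned, 32-bit SSP 4-byte aligned. */
	if ((!ia32 && (ssp & 7)) || (ssp & 3))
		return -1;

	/* The token is always 8 bytes, placed just below the aligned SSP. */
	*token_addr = (ssp & ~UINT64_C(7)) - 8;
	*token = ia32 ? ssp : (ssp | TOKEN_MODE_64);
	return 0;
}

/* Check a token read from token_addr and recover the SSP it encodes. */
static int model_verify_token(bool ia32, uint64_t token_addr, uint64_t token,
			      uint64_t *new_ssp)
{
	if (token_addr & 7)
		return -1;

	/* The mode bits must match the mode of the returning task. */
	if (!ia32 && (token & TOKEN_MODE_MASK) != TOKEN_MODE_64)
		return -1;
	if (ia32 && (token & TOKEN_MODE_MASK) != 0)
		return -1;

	token &= ~TOKEN_MODE_MASK;

	if ((!ia32 && (token & 7)) || (token & 3))
		return -1;

	/* The token must sit exactly where create would have put it. */
	if ((token & ~UINT64_C(7)) - 8 != token_addr)
		return -1;

	*new_ssp = token;
	return 0;
}
```

For a 64-bit task with SSP 0x7000, the model places the token at 0x6ff8 with
value 0x7001, and verification hands back 0x7000.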
> 
> v9:
> - Update CET MSR access according to XSAVES supervisor state changes.
> - Add 'wait_endbr' to struct 'sc_ext'.
> - Update and simplify signal frame allocation, setup, and restoration.
> - Update commit log text.
> 
> v2:
> - Move CET status from sigcontext to a separate struct sc_ext, which is
>   located above the fpstate on the signal frame.
> - Add a restore token for sigreturn address.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/ia32/ia32_signal.c            |  17 +++
>  arch/x86/include/asm/cet.h             |   7 ++
>  arch/x86/include/asm/fpu/internal.h    |   2 +
>  arch/x86/include/uapi/asm/sigcontext.h |   9 ++
>  arch/x86/kernel/cet.c                  | 153 +++++++++++++++++++++++++
>  arch/x86/kernel/fpu/signal.c           |  89 ++++++++++++++
>  arch/x86/kernel/signal.c               |  10 ++
>  7 files changed, 287 insertions(+)
> 
> diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
> index 30416d7f19d4..c0bb350a3d2d 100644
> --- a/arch/x86/ia32/ia32_signal.c
> +++ b/arch/x86/ia32/ia32_signal.c
> @@ -35,6 +35,7 @@
>  #include <asm/sigframe.h>
>  #include <asm/sighandling.h>
>  #include <asm/smap.h>
> +#include <asm/cet.h>
>  
>  /*
>   * Do a signal return; undo the signal stack.
> @@ -223,6 +224,7 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
>  				 void __user **fpstate)
>  {
>  	unsigned long sp, fx_aligned, math_size;
> +	void __user *restorer = NULL;
>  
>  	/* Default to using normal stack */
>  	sp = regs->sp;
> @@ -236,8 +238,23 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
>  		 ksig->ka.sa.sa_restorer)
>  		sp = (unsigned long) ksig->ka.sa.sa_restorer;
>  
> +	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
> +		restorer = ksig->ka.sa.sa_restorer;
> +	} else if (current->mm->context.vdso) {
> +		if (ksig->ka.sa.sa_flags & SA_SIGINFO)
> +			restorer = current->mm->context.vdso +
> +				vdso_image_32.sym___kernel_rt_sigreturn;
> +		else
> +			restorer = current->mm->context.vdso +
> +				vdso_image_32.sym___kernel_sigreturn;
> +	}
> +
>  	sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size);
>  	*fpstate = (struct _fpstate_32 __user *) sp;
> +
> +	if (save_cet_to_sigframe(*fpstate, (unsigned long)restorer, 1))
> +		return (void __user *) -1L;
> +
>  	if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned,
>  				     math_size) < 0)
>  		return (void __user *) -1L;
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index c44c991ca91f..409d4f91a0dc 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -6,6 +6,8 @@
>  #include <linux/types.h>
>  
>  struct task_struct;
> +struct sc_ext;
> +
>  /*
>   * Per-thread CET status
>   */
> @@ -18,8 +20,13 @@ struct cet_status {
>  #ifdef CONFIG_X86_INTEL_CET
>  int cet_setup_shstk(void);
>  void cet_disable_free_shstk(struct task_struct *p);
> +int cet_restore_signal(bool ia32, struct sc_ext *sc);
> +int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
>  #else
>  static inline void cet_disable_free_shstk(struct task_struct *p) {}
> +static inline int cet_restore_signal(bool ia32, struct sc_ext *sc) { return -EINVAL; }
> +static inline int cet_setup_signal(bool ia32, unsigned long rstor,
> +				   struct sc_ext *sc) { return -EINVAL; }
>  #endif
>  
>  #define cpu_x86_cet_enabled() \
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 42159f45bf9c..241521c0ed02 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -476,6 +476,8 @@ static inline void copy_kernel_to_fpregs(union fpregs_state *fpstate)
>  	__copy_kernel_to_fpregs(fpstate, -1);
>  }
>  
> +extern int save_cet_to_sigframe(void __user *fp, unsigned long restorer,
> +				int is_ia32);
>  extern int copy_fpstate_to_sigframe(void __user *buf, void __user *fp, int size);
>  
>  /*
> diff --git a/arch/x86/include/uapi/asm/sigcontext.h b/arch/x86/include/uapi/asm/sigcontext.h
> index 844d60eb1882..cf2d55db3be4 100644
> --- a/arch/x86/include/uapi/asm/sigcontext.h
> +++ b/arch/x86/include/uapi/asm/sigcontext.h
> @@ -196,6 +196,15 @@ struct _xstate {
>  	/* New processor state extensions go here: */
>  };
>  
> +/*
> + * Located at the end of sigcontext->fpstate, aligned to 8.
> + */
> +struct sc_ext {
> +	unsigned long total_size;
> +	unsigned long ssp;
> +	unsigned long wait_endbr;
> +};
> +
>  /*
>   * The 32-bit signal frame:
>   */
> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> index b4c7d88e9a8f..cba5c7656aab 100644
> --- a/arch/x86/kernel/cet.c
> +++ b/arch/x86/kernel/cet.c
> @@ -19,6 +19,8 @@
>  #include <asm/fpu/xstate.h>
>  #include <asm/fpu/types.h>
>  #include <asm/cet.h>
> +#include <asm/special_insns.h>
> +#include <uapi/asm/sigcontext.h>
>  
>  static void start_update_msrs(void)
>  {
> @@ -69,6 +71,80 @@ static unsigned long alloc_shstk(unsigned long size)
>  	return addr;
>  }
>  
> +#define TOKEN_MODE_MASK	3UL
> +#define TOKEN_MODE_64	1UL
> +#define IS_TOKEN_64(token) ((token & TOKEN_MODE_MASK) == TOKEN_MODE_64)
> +#define IS_TOKEN_32(token) ((token & TOKEN_MODE_MASK) == 0)
> +
> +/*
> + * Verify the restore token at the address of 'ssp' is
> + * valid and then set shadow stack pointer according to the
> + * token.
> + */
> +static int verify_rstor_token(bool ia32, unsigned long ssp,
> +			      unsigned long *new_ssp)
> +{
> +	unsigned long token;
> +
> +	*new_ssp = 0;
> +
> +	if (!IS_ALIGNED(ssp, 8))
> +		return -EINVAL;
> +
> +	if (get_user(token, (unsigned long __user *)ssp))
> +		return -EFAULT;
> +
> +	/* Is 64-bit mode flag correct? */
> +	if (!ia32 && !IS_TOKEN_64(token))
> +		return -EINVAL;
> +	else if (ia32 && !IS_TOKEN_32(token))
> +		return -EINVAL;
> +
> +	token &= ~TOKEN_MODE_MASK;
> +
> +	/*
> +	 * Restore address properly aligned?
> +	 */
> +	if ((!ia32 && !IS_ALIGNED(token, 8)) || !IS_ALIGNED(token, 4))
> +		return -EINVAL;
> +
> +	/*
> +	 * Token was placed properly?
> +	 */
> +	if ((ALIGN_DOWN(token, 8) - 8) != ssp)
> +		return -EINVAL;
> +
> +	*new_ssp = token;
> +	return 0;
> +}
> +
> +/*
> + * Create a restore token on the shadow stack.
> + * A token is always 8-byte and aligned to 8.
> + */
> +static int create_rstor_token(bool ia32, unsigned long ssp,
> +			      unsigned long *new_ssp)
> +{
> +	unsigned long addr;
> +
> +	*new_ssp = 0;
> +
> +	if ((!ia32 && !IS_ALIGNED(ssp, 8)) || !IS_ALIGNED(ssp, 4))
> +		return -EINVAL;
> +
> +	addr = ALIGN_DOWN(ssp, 8) - 8;
> +
> +	/* Is the token for 64-bit? */
> +	if (!ia32)
> +		ssp |= TOKEN_MODE_64;
> +
> +	if (write_user_shstk_64(addr, ssp))
> +		return -EFAULT;
> +
> +	*new_ssp = addr;
> +	return 0;
> +}
> +
>  int cet_setup_shstk(void)
>  {
>  	unsigned long addr, size;
> @@ -119,3 +195,80 @@ void cet_disable_free_shstk(struct task_struct *tsk)
>  	cet->shstk_size = 0;
>  	cet->shstk_enabled = 0;
>  }
> +
> +/*
> + * Called from __fpu__restore_sig() and XSAVES buffer is protected by
> + * set_thread_flag(TIF_NEED_FPU_LOAD).
> + */
> +int cet_restore_signal(bool ia32, struct sc_ext *sc_ext)
> +{
> +	struct cet_user_state *cet_user_state;
> +	struct cet_status *cet = &current->thread.cet;
> +	unsigned long new_ssp = 0;
> +	u64 msr_val = 0;
> +	int err;
> +
> +	if (!cet->shstk_enabled)
> +		return 0;
> +
> +	cet_user_state = get_xsave_addr(&current->thread.fpu.state.xsave,
> +					XFEATURE_CET_USER);
> +	if (!cet_user_state)
> +		return -1;
> +
> +	if (cet->shstk_enabled) {
> +		err = verify_rstor_token(ia32, sc_ext->ssp, &new_ssp);
> +		if (err)
> +			return err;
> +
> +		cet_user_state->user_ssp = new_ssp;
> +		msr_val |= MSR_IA32_CET_SHSTK_EN;
> +	}
> +
> +	cet_user_state->user_cet = msr_val;
> +	return 0;
> +}
> +
> +/*
> + * Setup the shadow stack for the signal handler: first,
> + * create a restore token to keep track of the current ssp,
> + * and then the return address of the signal handler.
> + */
> +int cet_setup_signal(bool ia32, unsigned long rstor_addr, struct sc_ext *sc_ext)
> +{
> +	struct cet_status *cet = &current->thread.cet;
> +	unsigned long ssp = 0, new_ssp = 0;
> +	int err;
> +
> +	if (!cet->shstk_enabled)
> +		return 0;
> +
> +	if (cet->shstk_enabled) {

This if isn't needed any more: the early return just above already covers
the !cet->shstk_enabled case.

> +		if (!rstor_addr)
> +			return -EINVAL;
> +
> +		ssp = cet_get_shstk_addr();
> +		err = create_rstor_token(ia32, ssp, &new_ssp);
> +		if (err)
> +			return err;
> +
> +		if (ia32) {
> +			ssp = new_ssp - sizeof(u32);
> +			err = write_user_shstk_32(ssp, (unsigned int)rstor_addr);
> +		} else {
> +			ssp = new_ssp - sizeof(u64);
> +			err = write_user_shstk_64(ssp, rstor_addr);
> +		}
> +
> +		if (err)
> +			return err;
> +
> +		sc_ext->ssp = new_ssp;
> +	}
> +
> +	start_update_msrs();
> +	if (cet->shstk_enabled)
> +		wrmsrl(MSR_IA32_PL3_SSP, ssp);
> +	end_update_msrs();
> +
> +	return 0;
> +}
> diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
> index 0d3e06a772b0..875cc0fadce3 100644
> --- a/arch/x86/kernel/fpu/signal.c
> +++ b/arch/x86/kernel/fpu/signal.c
> @@ -52,6 +52,69 @@ static inline int check_for_xstate(struct fxregs_state __user *buf,
>  	return 0;
>  }
>  
> +int save_cet_to_sigframe(void __user *fp, unsigned long restorer, int is_ia32)
> +{
> +	int err = 0;
> +
> +#ifdef CONFIG_X86_INTEL_CET
> +	if (!current->thread.cet.shstk_enabled)
> +		return 0;

The general guideline for #ifdef in code is to use IS_ENABLED() instead,
which helps with readability. e.g.:

	if (!IS_ENABLED(CONFIG_X86_INTEL_CET))
		return 0;

But since you're using parts of the structure that are only visible with
that CONFIG, maybe just shorten the ifdef?

#ifndef CONFIG_X86_INTEL_CET
	return 0;
#else
	...whole function...
#endif

I've also seen people prefer to have the entire function definition
wrapped:

#ifndef CONFIG_X86_INTEL_CET
int save_cet_to_sigframe(void __user *fp, unsigned long restorer, int is_ia32)
{
	return 0;
}
#else
int save_cet_to_sigframe(void __user *fp, unsigned long restorer, int is_ia32)
{
...
}
#endif
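
To illustrate the trade-off, here is a self-contained userspace sketch.  The
IS_ENABLED() below is a simplified imitation of the kernel macro (it only
handles options defined to 1, not =m), and CONFIG_DEMO_CET, struct
thread_demo and save_cet_demo() are hypothetical stand-ins.  Code guarded by
IS_ENABLED() is still compiled, so it cannot touch a member that exists only
under the option; that is the case where the wrapped stub fits better.

```c
/* Simplified imitation of the kernel's IS_ENABLED() preprocessor trick. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND_ARG(ignored, val, ...) val
#define DEMO_IS_DEFINED_3(arg1_or_junk) \
	DEMO_TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define DEMO_IS_DEFINED_2(val) DEMO_IS_DEFINED_3(DEMO_ARG_PLACEHOLDER_##val)
#define IS_ENABLED(option) DEMO_IS_DEFINED_2(option)

#define CONFIG_DEMO_CET 1	/* hypothetical option; remove to disable */

struct thread_demo {
#ifdef CONFIG_DEMO_CET
	int shstk_enabled;	/* only exists when the option is set */
#endif
	int other;
};

#ifndef CONFIG_DEMO_CET
/* Stub: nothing to save when the feature is compiled out. */
static inline int save_cet_demo(const struct thread_demo *t)
{
	(void)t;
	return 0;
}
#else
/* Real version: may use shstk_enabled because the member exists here. */
static inline int save_cet_demo(const struct thread_demo *t)
{
	if (!t->shstk_enabled)
		return 0;
	return 1;	/* placeholder for the actual save work */
}
#endif
```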

> +
> +	if (fp) {
> +		struct sc_ext ext = {0, 0, 0};
> +
> +		err = cet_setup_signal(is_ia32, restorer, &ext);
> +		if (!err) {
> +			void __user *p = fp;
> +
> +			ext.total_size = sizeof(ext);
> +
> +			if (is_ia32)
> +				p += sizeof(struct fregs_state);
> +
> +			p += fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
> +			p = (void __user *)ALIGN((unsigned long)p, 8);
> +
> +			if (copy_to_user(p, &ext, sizeof(ext)))
> +				return -EFAULT;
> +		}
> +	}
> +#endif
> +
> +	return err;
> +}
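
The pointer arithmetic above locates sc_ext right after the xstate area and
its trailing magic word, rounded up to 8 bytes.  A small userspace sketch of
that calculation follows; the sizes passed in are parameters here because
fpu_user_xstate_size depends on the CPU, and the function names are made up
for illustration.

```c
#include <stdint.h>

/* sizeof(FP_XSTATE_MAGIC2): the magic is a single 32-bit word. */
#define DEMO_FP_XSTATE_MAGIC2_SIZE 4u

/* Round x up to the next multiple of a (a must be a power of two). */
static uint64_t demo_align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Address of sc_ext on the signal frame.
 * fp:          start of the fpstate area
 * xstate_size: stand-in for fpu_user_xstate_size
 * legacy_size: sizeof(struct fregs_state) for ia32 frames, else 0
 */
static uint64_t demo_sc_ext_addr(uint64_t fp, uint64_t xstate_size,
				 uint64_t legacy_size)
{
	uint64_t p = fp + legacy_size + xstate_size +
		     DEMO_FP_XSTATE_MAGIC2_SIZE;

	return demo_align_up(p, 8);
}
```

With fp = 4096 and an illustrative xstate size of 832, the 64-bit case gives
4936; adding a 112-byte legacy area for ia32 gives 5048.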
> +
> +static int restore_cet_from_sigframe(int is_ia32, void __user *fp)
> +{
> +	int err = 0;
> +
> +#ifdef CONFIG_X86_INTEL_CET
> +	if (!current->thread.cet.shstk_enabled)
> +		return 0;
> +
> +	if (fp) {
> +		struct sc_ext ext = {0, 0, 0};
> +		void __user *p = fp;
> +
> +		if (is_ia32)
> +			p += sizeof(struct fregs_state);
> +
> +		p += fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
> +		p = (void __user *)ALIGN((unsigned long)p, 8);
> +
> +		if (copy_from_user(&ext, p, sizeof(ext)))
> +			return -EFAULT;
> +
> +		if (ext.total_size != sizeof(ext))
> +			return -EFAULT;
> +
> +		err = cet_restore_signal(is_ia32, &ext);
> +	}
> +#endif
> +
> +	return err;
> +}
> +
>  /*
>   * Signal frame handlers.
>   */
> @@ -367,6 +430,10 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
>  		pagefault_disable();
>  		ret = copy_user_to_fpregs_zeroing(buf_fx, xfeatures_user, fx_only);
>  		pagefault_enable();
> +
> +		if (!ret)
> +			ret = restore_cet_from_sigframe(0, buf);
> +
>  		if (!ret) {
>  			if (xfeatures_mask_supervisor())
>  				copy_kernel_to_xregs(&fpu->state.xsave,
> @@ -397,6 +464,10 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
>  		sanitize_restored_user_xstate(&fpu->state, envp, xfeatures_user,
>  					      fx_only);
>  
> +		ret = restore_cet_from_sigframe((int)ia32_fxstate, buf);
> +		if (ret)
> +			goto err_out;
> +
>  		fpregs_lock();
>  		if (unlikely(init_bv))
>  			copy_kernel_to_xregs(&init_fpstate.xsave, init_bv);
> @@ -468,12 +539,30 @@ int fpu__restore_sig(void __user *buf, int ia32_frame)
>  	return __fpu__restore_sig(buf, buf_fx, size);
>  }
>  
> +static unsigned long fpu__alloc_sigcontext_ext(unsigned long sp)
> +{
> +	/*
> +	 * sigcontext_ext is at: fpu + fpu_user_xstate_size +
> +	 * FP_XSTATE_MAGIC2_SIZE, then aligned to 8.
> +	 */
> +	if (cpu_x86_cet_enabled()) {
> +		struct cet_status *cet = &current->thread.cet;
> +
> +		if (cet->shstk_enabled)
> +			sp -= (sizeof(struct sc_ext) + 8);
> +	}
> +
> +	return sp;
> +}
> +
>  unsigned long
>  fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
>  		     unsigned long *buf_fx, unsigned long *size)
>  {
>  	unsigned long frame_size = xstate_sigframe_size();
>  
> +	sp = fpu__alloc_sigcontext_ext(sp);
> +
>  	*buf_fx = sp = round_down(sp - frame_size, 64);
>  	if (ia32_frame && use_fxsr()) {
>  		frame_size += sizeof(struct fregs_state);
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index ce9421ec285f..b26f5084a8a1 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -46,6 +46,7 @@
>  
>  #include <asm/sigframe.h>
>  #include <asm/signal.h>
> +#include <asm/cet.h>
>  
>  #define COPY(x)			do {			\
>  	get_user_ex(regs->x, &sc->x);			\
> @@ -246,6 +247,9 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
>  	unsigned long buf_fx = 0;
>  	int onsigstack = on_sig_stack(sp);
>  	int ret;
> +#ifdef CONFIG_X86_64
> +	void __user *restorer = NULL;
> +#endif
>  
>  	/* redzone */
>  	if (IS_ENABLED(CONFIG_X86_64))
> @@ -277,6 +281,12 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
>  	if (onsigstack && !likely(on_sig_stack(sp)))
>  		return (void __user *)-1L;
>  
> +#ifdef CONFIG_X86_64
> +	if (ka->sa.sa_flags & SA_RESTORER)
> +		restorer = ka->sa.sa_restorer;
> +	ret = save_cet_to_sigframe(*fpstate, (unsigned long)restorer, 0);
> +#endif
> +
>  	/* save i387 and extended state */
>  	ret = copy_fpstate_to_sigframe(*fpstate, (void __user *)buf_fx, math_size);
>  	if (ret < 0)
> -- 
> 2.21.0
> 
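
For readers following the layout, the address arithmetic shared by
save_cet_to_sigframe() and restore_cet_from_sigframe() in the quoted
patch can be sketched in userspace as below.  This is only an
illustration: sc_ext_addr_sketch() and ALIGN_UP() are hypothetical
names, and the three size parameters stand in for
sizeof(struct fregs_state), fpu_user_xstate_size and
FP_XSTATE_MAGIC2_SIZE, whose real values come from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + ((uintptr_t)(a) - 1)) & ~((uintptr_t)(a) - 1))

/*
 * Hypothetical helper: compute where the sc_ext record would sit,
 * following the steps in the quoted patch:
 *   fp (+ legacy fregs area for ia32) + xstate + magic2, aligned to 8.
 */
static uintptr_t sc_ext_addr_sketch(uintptr_t fp, bool is_ia32,
				    uintptr_t fregs_size,
				    uintptr_t xstate_size,
				    uintptr_t magic2_size)
{
	uintptr_t p = fp;

	if (is_ia32)
		p += fregs_size;

	p += xstate_size + magic2_size;
	return ALIGN_UP(p, 8);
}
```

Both the save and the restore path have to agree on this computation,
so keeping it in a single helper would remove the duplicated pointer
math.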

-- 
Kees Cook



* Re: [RFC PATCH v9 21/27] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND
  2020-02-05 18:19 ` [RFC PATCH v9 21/27] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND Yu-cheng Yu
@ 2020-02-25 21:18   ` Kees Cook
  0 siblings, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 21:18 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:29AM -0800, Yu-cheng Yu wrote:
> An ELF file's .note.gnu.property indicates architecture features of
> the file.  Introduce feature definitions for Control-flow Enforcement
> Technology (CET): Shadow Stack and Indirect Branch Tracking.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Why not merge with patch 20?

-Kees

> ---
>  include/uapi/linux/elf.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index c37731407074..61251ecabdd7 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -444,4 +444,11 @@ typedef struct elf64_note {
>    Elf64_Word n_type;	/* Content type */
>  } Elf64_Nhdr;
>  
> +/* .note.gnu.property types */
> +#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002
> +
> +/* Bits of GNU_PROPERTY_X86_FEATURE_1_AND */
> +#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001
> +#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002
> +
>  #endif /* _UAPI_LINUX_ELF_H */
> -- 
> 2.21.0
> 
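
As an illustration of how the bits defined in the quoted hunk would be
consumed, here is a minimal userspace sketch.  The three constants are
copied from the patch; the helper names feature_1_has_ibt() and
feature_1_has_shstk() are hypothetical and do not appear in the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the quoted include/uapi/linux/elf.h hunk. */
#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002u
#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001u
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002u

/* Hypothetical helper: does the feature word request IBT? */
static bool feature_1_has_ibt(uint32_t word)
{
	return (word & GNU_PROPERTY_X86_FEATURE_1_IBT) != 0;
}

/* Hypothetical helper: does the feature word request Shadow Stack? */
static bool feature_1_has_shstk(uint32_t word)
{
	return (word & GNU_PROPERTY_X86_FEATURE_1_SHSTK) != 0;
}
```

The two bits are independent, so a binary may be marked for one
feature, both, or neither.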

-- 
Kees Cook



* Re: [RFC PATCH v9 22/27] ELF: Add ELF program property parsing support
  2020-02-05 18:19 ` [RFC PATCH v9 22/27] ELF: Add ELF program property parsing support Yu-cheng Yu
@ 2020-02-25 21:20   ` Kees Cook
  0 siblings, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 21:20 UTC (permalink / raw)
  To: Yu-cheng Yu, Dave Martin
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, x86-patch-review

On Wed, Feb 05, 2020 at 10:19:30AM -0800, Yu-cheng Yu wrote:
> From: Dave Martin <Dave.Martin@arm.com>
> 
> ELF program properties will be needed for detecting whether to
> enable optional architecture or ABI features for a new ELF process.

I think this is an earlier version of this feature? (e.g. I had suggested
WARN_ON_ONCE()) Perhaps we should try to merge this ahead of the CET
and BTI patches, just so there isn't a forked discussion?

-Kees
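
For anyone comparing versions of the parser, the checks that
parse_elf_property() applies to each entry can be sketched in userspace
roughly as follows.  This is not the kernel code: walk_gnu_properties()
is a hypothetical name, it takes the note descriptor directly rather
than reading it from a file, and it only records the x86 feature word
instead of calling into arch code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct gnu_property {
	uint32_t pr_type;
	uint32_t pr_datasz;
};

#define X86_FEATURE_1_AND 0xc0000002u

/*
 * Walk the property array in a note descriptor of descsz bytes.
 * align is 4 for ELF32 and 8 for ELF64, as in the quoted patch.
 * Returns 0 on success and -1 on malformed input; *feature_1 is
 * written only if the x86 feature property is present.
 */
static int walk_gnu_properties(const unsigned char *desc, size_t descsz,
			       size_t align, uint32_t *feature_1)
{
	size_t off = 0;
	int have_prev = 0;
	uint32_t prev_type = 0;

	while (off < descsz) {
		struct gnu_property pr;
		size_t left = descsz - off;
		size_t step;

		/* The fixed-size header must fit. */
		if (left < sizeof(pr))
			return -1;
		memcpy(&pr, desc + off, sizeof(pr));
		off += sizeof(pr);
		left -= sizeof(pr);

		/* The payload, padded to the alignment, must fit too. */
		if (pr.pr_datasz > left)
			return -1;
		step = (pr.pr_datasz + align - 1) / align * align;
		if (step > left)
			return -1;

		/* Properties must be unique and sorted on pr_type. */
		if (have_prev && pr.pr_type <= prev_type)
			return -1;
		prev_type = pr.pr_type;
		have_prev = 1;

		if (pr.pr_type == X86_FEATURE_1_AND) {
			if (pr.pr_datasz != sizeof(uint32_t))
				return -1;
			memcpy(feature_1, desc + off, sizeof(uint32_t));
		}

		off += step;
	}

	return 0;
}
```

A descriptor that ends right after an unpadded 4-byte payload is
rejected when align is 8, which matches the "step > datasz" check in
the quoted code.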

> 
> For now, there are no generic properties that we care about, so do
> nothing unless CONFIG_ARCH_USE_GNU_PROPERTY=y.
> 
> Otherwise, check for the presence of properties using the
> PT_PROGRAM_PROPERTY phdrs entry (if any), and notify each property
> to the arch code.
> 
> For now, the added code is not used.
> 
> Signed-off-by: Dave Martin <Dave.Martin@arm.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  fs/binfmt_elf.c          | 127 +++++++++++++++++++++++++++++++++++++++
>  fs/compat_binfmt_elf.c   |   4 ++
>  include/linux/elf.h      |  19 ++++++
>  include/uapi/linux/elf.h |   4 ++
>  4 files changed, 154 insertions(+)
> 
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index ecd8d2698515..054446f93442 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -39,12 +39,18 @@
>  #include <linux/sched/coredump.h>
>  #include <linux/sched/task_stack.h>
>  #include <linux/sched/cputime.h>
> +#include <linux/sizes.h>
> +#include <linux/types.h>
>  #include <linux/cred.h>
>  #include <linux/dax.h>
>  #include <linux/uaccess.h>
>  #include <asm/param.h>
>  #include <asm/page.h>
>  
> +#ifndef ELF_COMPAT
> +#define ELF_COMPAT 0
> +#endif
> +
>  #ifndef user_long_t
>  #define user_long_t long
>  #endif
> @@ -678,6 +684,111 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
>   * libraries.  There is no binary dependent code anywhere else.
>   */
>  
> +static int parse_elf_property(const char *data, size_t *off, size_t datasz,
> +			      struct arch_elf_state *arch,
> +			      bool have_prev_type, u32 *prev_type)
> +{
> +	size_t o, step;
> +	const struct gnu_property *pr;
> +	int ret;
> +
> +	if (*off == datasz)
> +		return -ENOENT;
> +
> +	if (WARN_ON(*off > datasz || *off % ELF_GNU_PROPERTY_ALIGN))
> +		return -EIO;
> +	o = *off;
> +	datasz -= *off;
> +
> +	if (datasz < sizeof(*pr))
> +		return -EIO;
> +	pr = (const struct gnu_property *)(data + o);
> +	o += sizeof(*pr);
> +	datasz -= sizeof(*pr);
> +
> +	if (pr->pr_datasz > datasz)
> +		return -EIO;
> +
> +	WARN_ON(o % ELF_GNU_PROPERTY_ALIGN);
> +	step = round_up(pr->pr_datasz, ELF_GNU_PROPERTY_ALIGN);
> +	if (step > datasz)
> +		return -EIO;
> +
> +	/* Properties are supposed to be unique and sorted on pr_type: */
> +	if (have_prev_type && pr->pr_type <= *prev_type)
> +		return -EIO;
> +	*prev_type = pr->pr_type;
> +
> +	ret = arch_parse_elf_property(pr->pr_type, data + o,
> +				      pr->pr_datasz, ELF_COMPAT, arch);
> +	if (ret)
> +		return ret;
> +
> +	*off = o + step;
> +	return 0;
> +}
> +
> +#define NOTE_DATA_SZ SZ_1K
> +#define GNU_PROPERTY_TYPE_0_NAME "GNU"
> +#define NOTE_NAME_SZ (sizeof(GNU_PROPERTY_TYPE_0_NAME))
> +
> +static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
> +				struct arch_elf_state *arch)
> +{
> +	union {
> +		struct elf_note nhdr;
> +		char data[NOTE_DATA_SZ];
> +	} note;
> +	loff_t pos;
> +	ssize_t n;
> +	size_t off, datasz;
> +	int ret;
> +	bool have_prev_type;
> +	u32 prev_type;
> +
> +	if (!IS_ENABLED(CONFIG_ARCH_USE_GNU_PROPERTY) || !phdr)
> +		return 0;
> +
> +	/* load_elf_binary() shouldn't call us unless this is true... */
> +	if (WARN_ON(phdr->p_type != PT_GNU_PROPERTY))
> +		return -EIO;
> +
> +	/* If the properties are crazy large, that's too bad (for now): */
> +	if (phdr->p_filesz > sizeof(note))
> +		return -ENOEXEC;
> +
> +	pos = phdr->p_offset;
> +	n = kernel_read(f, &note, phdr->p_filesz, &pos);
> +
> +	BUILD_BUG_ON(sizeof(note) < sizeof(note.nhdr) + NOTE_NAME_SZ);
> +	if (n < 0 || n < sizeof(note.nhdr) + NOTE_NAME_SZ)
> +		return -EIO;
> +
> +	if (note.nhdr.n_type != NT_GNU_PROPERTY_TYPE_0 ||
> +	    note.nhdr.n_namesz != NOTE_NAME_SZ ||
> +	    strncmp(note.data + sizeof(note.nhdr),
> +		    GNU_PROPERTY_TYPE_0_NAME, n - sizeof(note.nhdr)))
> +		return -EIO;
> +
> +	off = round_up(sizeof(note.nhdr) + NOTE_NAME_SZ,
> +		       ELF_GNU_PROPERTY_ALIGN);
> +	if (off > n)
> +		return -EIO;
> +
> +	if (note.nhdr.n_descsz > n - off)
> +		return -EIO;
> +	datasz = off + note.nhdr.n_descsz;
> +
> +	have_prev_type = false;
> +	do {
> +		ret = parse_elf_property(note.data, &off, datasz, arch,
> +					 have_prev_type, &prev_type);
> +		have_prev_type = true;
> +	} while (!ret);
> +
> +	return ret == -ENOENT ? 0 : ret;
> +}
> +
>  static int load_elf_binary(struct linux_binprm *bprm)
>  {
>  	struct file *interpreter = NULL; /* to shut gcc up */
> @@ -685,6 +796,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  	int load_addr_set = 0;
>  	unsigned long error;
>  	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
> +	struct elf_phdr *elf_property_phdata = NULL;
>  	unsigned long elf_bss, elf_brk;
>  	int bss_prot = 0;
>  	int retval, i;
> @@ -731,6 +843,11 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  	for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
>  		char *elf_interpreter;
>  
> +		if (elf_ppnt->p_type == PT_GNU_PROPERTY) {
> +			elf_property_phdata = elf_ppnt;
> +			continue;
> +		}
> +
>  		if (elf_ppnt->p_type != PT_INTERP)
>  			continue;
>  
> @@ -818,9 +935,14 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  			goto out_free_dentry;
>  
>  		/* Pass PT_LOPROC..PT_HIPROC headers to arch code */
> +		elf_property_phdata = NULL;
>  		elf_ppnt = interp_elf_phdata;
>  		for (i = 0; i < loc->interp_elf_ex.e_phnum; i++, elf_ppnt++)
>  			switch (elf_ppnt->p_type) {
> +			case PT_GNU_PROPERTY:
> +				elf_property_phdata = elf_ppnt;
> +				break;
> +
>  			case PT_LOPROC ... PT_HIPROC:
>  				retval = arch_elf_pt_proc(&loc->interp_elf_ex,
>  							  elf_ppnt, interpreter,
> @@ -831,6 +953,11 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  			}
>  	}
>  
> +	retval = parse_elf_properties(interpreter ?: bprm->file,
> +				      elf_property_phdata, &arch_state);
> +	if (retval)
> +		goto out_free_dentry;
> +
>  	/*
>  	 * Allow arch code to reject the ELF at this point, whilst it's
>  	 * still possible to return an error to the code that invoked
> diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c
> index aaad4ca1217e..13a087bc816b 100644
> --- a/fs/compat_binfmt_elf.c
> +++ b/fs/compat_binfmt_elf.c
> @@ -17,6 +17,8 @@
>  #include <linux/elfcore-compat.h>
>  #include <linux/time.h>
>  
> +#define ELF_COMPAT	1
> +
>  /*
>   * Rename the basic ELF layout types to refer to the 32-bit class of files.
>   */
> @@ -28,11 +30,13 @@
>  #undef	elf_shdr
>  #undef	elf_note
>  #undef	elf_addr_t
> +#undef	ELF_GNU_PROPERTY_ALIGN
>  #define elfhdr		elf32_hdr
>  #define elf_phdr	elf32_phdr
>  #define elf_shdr	elf32_shdr
>  #define elf_note	elf32_note
>  #define elf_addr_t	Elf32_Addr
> +#define ELF_GNU_PROPERTY_ALIGN	ELF32_GNU_PROPERTY_ALIGN
>  
>  /*
>   * Some data types as stored in coredump.
> diff --git a/include/linux/elf.h b/include/linux/elf.h
> index 459cddcceaac..7bdc6da160c7 100644
> --- a/include/linux/elf.h
> +++ b/include/linux/elf.h
> @@ -22,6 +22,9 @@
>  	SET_PERSONALITY(ex)
>  #endif
>  
> +#define ELF32_GNU_PROPERTY_ALIGN	4
> +#define ELF64_GNU_PROPERTY_ALIGN	8
> +
>  #if ELF_CLASS == ELFCLASS32
>  
>  extern Elf32_Dyn _DYNAMIC [];
> @@ -32,6 +35,7 @@ extern Elf32_Dyn _DYNAMIC [];
>  #define elf_addr_t	Elf32_Off
>  #define Elf_Half	Elf32_Half
>  #define Elf_Word	Elf32_Word
> +#define ELF_GNU_PROPERTY_ALIGN	ELF32_GNU_PROPERTY_ALIGN
>  
>  #else
>  
> @@ -43,6 +47,7 @@ extern Elf64_Dyn _DYNAMIC [];
>  #define elf_addr_t	Elf64_Off
>  #define Elf_Half	Elf64_Half
>  #define Elf_Word	Elf64_Word
> +#define ELF_GNU_PROPERTY_ALIGN	ELF64_GNU_PROPERTY_ALIGN
>  
>  #endif
>  
> @@ -64,4 +69,18 @@ struct gnu_property {
>  	u32 pr_datasz;
>  };
>  
> +struct arch_elf_state;
> +
> +#ifndef CONFIG_ARCH_USE_GNU_PROPERTY
> +static inline int arch_parse_elf_property(u32 type, const void *data,
> +					  size_t datasz, bool compat,
> +					  struct arch_elf_state *arch)
> +{
> +	return 0;
> +}
> +#else
> +extern int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
> +				   bool compat, struct arch_elf_state *arch);
> +#endif
> +
>  #endif /* _LINUX_ELF_H */
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index 61251ecabdd7..518651708d8f 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -368,6 +368,7 @@ typedef struct elf64_shdr {
>   * Notes used in ET_CORE. Architectures export some of the arch register sets
>   * using the corresponding note types via the PTRACE_GETREGSET and
>   * PTRACE_SETREGSET requests.
> + * The note name for all these is "LINUX".
>   */
>  #define NT_PRSTATUS	1
>  #define NT_PRFPREG	2
> @@ -430,6 +431,9 @@ typedef struct elf64_shdr {
>  #define NT_MIPS_FP_MODE	0x801		/* MIPS floating-point mode */
>  #define NT_MIPS_MSA	0x802		/* MIPS SIMD registers */
>  
> +/* Note types with note name "GNU" */
> +#define NT_GNU_PROPERTY_TYPE_0	5
> +
>  /* Note header in a PT_NOTE section */
>  typedef struct elf32_note {
>    Elf32_Word	n_namesz;	/* Name size */
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 24/27] x86/cet/shstk: ELF header parsing for Shadow Stack
  2020-02-05 18:19 ` [RFC PATCH v9 24/27] x86/cet/shstk: ELF header parsing for Shadow Stack Yu-cheng Yu
@ 2020-02-25 21:22   ` Kees Cook
  0 siblings, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 21:22 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:32AM -0800, Yu-cheng Yu wrote:
> Check an ELF file's .note.gnu.property, and set up Shadow Stack if the
> application supports it.
> 
> v9:
> - Change cpu_feature_enabled() to static_cpu_has().
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/Kconfig             |  2 ++
>  arch/x86/include/asm/elf.h   | 13 +++++++++++++
>  arch/x86/kernel/process_64.c | 31 +++++++++++++++++++++++++++++++
>  3 files changed, 46 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 6c34b701c588..d1447380e02e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1987,6 +1987,8 @@ config X86_INTEL_SHADOW_STACK_USER
>  	select ARCH_USES_HIGH_VMA_FLAGS
>  	select X86_INTEL_CET
>  	select ARCH_HAS_SHSTK
> +	select ARCH_USE_GNU_PROPERTY
> +	select ARCH_BINFMT_ELF_STATE
>  	---help---
>  	  Shadow Stack (SHSTK) provides protection against program
>  	  stack corruption.  It is active when the kernel has this
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index 69c0f892e310..fac79b621e0a 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -367,6 +367,19 @@ extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
>  					      int uses_interp);
>  #define compat_arch_setup_additional_pages compat_arch_setup_additional_pages
>  
> +#ifdef CONFIG_ARCH_BINFMT_ELF_STATE
> +struct arch_elf_state {
> +	unsigned int gnu_property;
> +};
> +
> +#define INIT_ARCH_ELF_STATE {	\
> +	.gnu_property = 0,	\
> +}
> +
> +#define arch_elf_pt_proc(ehdr, phdr, elf, interp, state) (0)
> +#define arch_check_elf(ehdr, interp, interp_ehdr, state) (0)
> +#endif
> +
>  /* Do not change the values. See get_align_mask() */
>  enum align_flags {
>  	ALIGN_VA_32	= BIT(0),
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 506d66830d4d..99548cde0cc6 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -732,3 +732,34 @@ unsigned long KSTK_ESP(struct task_struct *task)
>  {
>  	return task_pt_regs(task)->sp;
>  }
> +
> +#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
> +int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
> +			     bool compat, struct arch_elf_state *state)
> +{
> +	if (type != GNU_PROPERTY_X86_FEATURE_1_AND)
> +		return 0;
> +
> +	if (datasz != sizeof(unsigned int))
> +		return -ENOEXEC;
> +
> +	state->gnu_property = *(unsigned int *)data;
> +	return 0;
> +}
> +
> +int arch_setup_elf_property(struct arch_elf_state *state)
> +{
> +	int r = 0;
> +
> +	memset(&current->thread.cet, 0, sizeof(struct cet_status));
> +
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {
> +		if (state->gnu_property & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
> +			r = cet_setup_shstk();
> +		if (r < 0)
> +			return r;

This test is redundant; there's no loop. This can just fall through to
the final return.

-Kees
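
A sketch of the fall-through shape suggested above, written as a
standalone userspace mock.  The stub_* names are hypothetical stand-ins
for static_cpu_has(X86_FEATURE_SHSTK) and cet_setup_shstk(), and the
memset() of the thread's cet status from the quoted function is left
out to keep the example self-contained.

```c
#include <stdbool.h>

#define GNU_PROPERTY_X86_FEATURE_1_SHSTK 0x00000002u

struct arch_elf_state {
	unsigned int gnu_property;
};

/* Hypothetical stand-ins for the kernel helpers used in the patch. */
static bool stub_cpu_has_shstk = true;
static int stub_setup_result;

static int stub_cet_setup_shstk(void)
{
	return stub_setup_result;
}

/*
 * Same decision as the quoted arch_setup_elf_property(), without the
 * intermediate "if (r < 0) return r;": any error from the setup call
 * reaches the caller through the single return at the end.
 */
static int setup_elf_property_sketch(const struct arch_elf_state *state)
{
	int r = 0;

	if (stub_cpu_has_shstk &&
	    (state->gnu_property & GNU_PROPERTY_X86_FEATURE_1_SHSTK))
		r = stub_cet_setup_shstk();

	return r;
}
```

If a later patch adds a second feature after the Shadow Stack block,
the early return would become meaningful again, so this shape only
holds while there is nothing after the setup call.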

> +	}
> +
> +	return r;
> +}
> +#endif
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 25/27] x86/cet/shstk: Handle thread Shadow Stack
  2020-02-05 18:19 ` [RFC PATCH v9 25/27] x86/cet/shstk: Handle thread " Yu-cheng Yu
@ 2020-02-25 21:29   ` Kees Cook
  2020-03-25 21:51     ` Yu-cheng Yu
  0 siblings, 1 reply; 107+ messages in thread
From: Kees Cook @ 2020-02-25 21:29 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:33AM -0800, Yu-cheng Yu wrote:
> The Shadow Stack (SHSTK) for clone/fork is handled as follows:
> 
> (1) If ((clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM),
>     the kernel allocates (and frees on thread exit) a new SHSTK for the
>     child.
> 
>     It is possible for the kernel to complete the clone syscall and set the
>     child's SHSTK pointer to NULL and let the child thread allocate a SHSTK
>     for itself.  There are two issues in this approach: It is not
>     compatible with existing code that does inline syscall and it cannot
>     handle signals before the child can successfully allocate a SHSTK.
> 
> (2) For (clone_flags & CLONE_VFORK), the child uses the existing SHSTK.
> 
> (3) For all other cases, the SHSTK is copied/reused whenever the parent or
>     the child does a call/ret.
> 
> This patch handles cases (1) & (2).  Case (3) is handled in the SHSTK page
> fault patches.
> 
> A 64-bit SHSTK has a fixed size of RLIMIT_STACK. A compat-mode thread SHSTK
> has a fixed size of 1/4 RLIMIT_STACK.  This allows more threads to share a
> 32-bit address space.

I am not understanding this part. :) Entries are sizeof(unsigned long),
yes? A 1/2 RLIMIT_STACK would cover 32-bit, but 1/4 is less, so why does
that provide for more threads?
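
To make the numbers concrete, the sizing described in the commit
message can be written out as plain arithmetic.  This is only a sketch
of one reading of it: thread_shstk_size() and shstk_call_depth() are
hypothetical names, and the entry size is left as a parameter because
the commit message does not state it.

```c
#include <stdbool.h>

/*
 * Sizing as described in the quoted commit message: the full stack
 * limit for 64-bit threads, one quarter of it for compat-mode threads.
 */
static unsigned long thread_shstk_size(unsigned long stack_limit, bool compat)
{
	return compat ? stack_limit / 4 : stack_limit;
}

/*
 * Number of return addresses that fit in a shadow stack of the given
 * size, assuming one entry_size-byte slot per call (an assumption).
 */
static unsigned long shstk_call_depth(unsigned long shstk_size,
				      unsigned long entry_size)
{
	return shstk_size / entry_size;
}
```

With an 8 MiB stack limit, a compat thread would get a 2 MiB shadow
stack under this reading, so each thread reserves 10 MiB of address
space instead of 16 MiB.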

> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/cet.h         |  2 ++
>  arch/x86/include/asm/mmu_context.h |  3 +++
>  arch/x86/kernel/cet.c              | 41 ++++++++++++++++++++++++++++++
>  arch/x86/kernel/process.c          |  7 +++++
>  4 files changed, 53 insertions(+)
> 
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index 409d4f91a0dc..9a3e2da9c1c4 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -19,10 +19,12 @@ struct cet_status {
>  
>  #ifdef CONFIG_X86_INTEL_CET
>  int cet_setup_shstk(void);
> +int cet_setup_thread_shstk(struct task_struct *p);
>  void cet_disable_free_shstk(struct task_struct *p);
>  int cet_restore_signal(bool ia32, struct sc_ext *sc);
>  int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
>  #else
> +static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
>  static inline void cet_disable_free_shstk(struct task_struct *p) {}
>  static inline int cet_restore_signal(bool ia32, struct sc_ext *sc) { return -EINVAL; }
>  static inline int cet_setup_signal(bool ia32, unsigned long rstor,
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 5f33924e200f..6a8189308823 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -13,6 +13,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/paravirt.h>
>  #include <asm/mpx.h>
> +#include <asm/cet.h>
>  #include <asm/debugreg.h>
>  
>  extern atomic64_t last_mm_ctx_id;
> @@ -230,6 +231,8 @@ do {						\
>  #else
>  #define deactivate_mm(tsk, mm)			\
>  do {						\
> +	if (!tsk->vfork_done)			\
> +		cet_disable_free_shstk(tsk);	\
>  	load_gs_index(0);			\
>  	loadsegment(fs, 0);			\
>  } while (0)
> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> index cba5c7656aab..5b45abda80a1 100644
> --- a/arch/x86/kernel/cet.c
> +++ b/arch/x86/kernel/cet.c
> @@ -170,6 +170,47 @@ int cet_setup_shstk(void)
>  	return 0;
>  }
>  
> +int cet_setup_thread_shstk(struct task_struct *tsk)
> +{
> +	unsigned long addr, size;
> +	struct cet_user_state *state;
> +	struct cet_status *cet = &tsk->thread.cet;
> +
> +	if (!cet->shstk_enabled)
> +		return 0;
> +
> +	state = get_xsave_addr(&tsk->thread.fpu.state.xsave,
> +			       XFEATURE_CET_USER);
> +
> +	if (!state)
> +		return -EINVAL;
> +
> +	size = rlimit(RLIMIT_STACK);

Is SHSTK incompatible with RLIM_INFINITY stack rlimits?

> +
> +	/*
> +	 * Compat-mode pthreads share a limited address space.
> +	 * If each function call takes an average of four slots
> +	 * stack space, we need 1/4 of stack size for shadow stack.
> +	 */
> +	if (in_compat_syscall())
> +		size /= 4;
> +
> +	addr = alloc_shstk(size);

I assume it'd fail here, but I worry about Stack Clash style attacks.
I'd like to see test cases that make sure the SHSTK gap is working
correctly.

-Kees
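
One possible way to handle the RLIM_INFINITY case raised above is to
clamp the size before allocating.  The sketch below is a userspace
illustration only: the cap value and the names SHSTK_CAP_SKETCH,
clamp_shstk_size() and current_shstk_size_sketch() are hypothetical and
are not part of the quoted patch, which passes rlimit(RLIMIT_STACK)
through unchanged.

```c
#include <sys/resource.h>

/* Hypothetical upper bound for a shadow stack; not taken from the patch. */
#define SHSTK_CAP_SKETCH (1ULL << 32)

/* Clamp a stack limit to the hypothetical cap, treating infinity as the cap. */
static unsigned long long clamp_shstk_size(rlim_t limit)
{
	if (limit == RLIM_INFINITY ||
	    (unsigned long long)limit > SHSTK_CAP_SKETCH)
		return SHSTK_CAP_SKETCH;

	return (unsigned long long)limit;
}

/* Size this process would get under the sketch; 0 if the limit is unreadable. */
static unsigned long long current_shstk_size_sketch(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_STACK, &rl) != 0)
		return 0;

	return clamp_shstk_size(rl.rlim_cur);
}
```

A self test could run this under "ulimit -s unlimited" and check that
the resulting allocation neither fails outright nor lands adjacent to
another mapping without a gap.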

> +
> +	if (IS_ERR((void *)addr)) {
> +		cet->shstk_base = 0;
> +		cet->shstk_size = 0;
> +		cet->shstk_enabled = 0;
> +		return PTR_ERR((void *)addr);
> +	}
> +
> +	fpu__prepare_write(&tsk->thread.fpu);
> +	state->user_ssp = (u64)(addr + size);
> +	cet->shstk_base = addr;
> +	cet->shstk_size = size;
> +	return 0;
> +}
> +
>  void cet_disable_free_shstk(struct task_struct *tsk)
>  {
>  	struct cet_status *cet = &tsk->thread.cet;
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index e102e63de641..7098618142f2 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -110,6 +110,7 @@ void exit_thread(struct task_struct *tsk)
>  
>  	free_vm86(t);
>  
> +	cet_disable_free_shstk(tsk);
>  	fpu__drop(fpu);
>  }
>  
> @@ -180,6 +181,12 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
>  	if (clone_flags & CLONE_SETTLS)
>  		ret = set_new_tls(p, tls);
>  
> +#ifdef CONFIG_X86_64
> +	/* Allocate a new shadow stack for pthread */
> +	if (!ret && (clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM)
> +		ret = cet_setup_thread_shstk(p);
> +#endif
> +
>  	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
>  		io_bitmap_share(p);
>  
> -- 
> 2.21.0
> 

-- 
Kees Cook



* Re: [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack
  2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (26 preceding siblings ...)
  2020-02-05 18:19 ` [RFC PATCH v9 27/27] x86/cet/shstk: Add arch_prctl functions for Shadow Stack Yu-cheng Yu
@ 2020-02-25 21:31 ` Kees Cook
  27 siblings, 0 replies; 107+ messages in thread
From: Kees Cook @ 2020-02-25 21:31 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Wed, Feb 05, 2020 at 10:19:08AM -0800, Yu-cheng Yu wrote:
> Control-flow Enforcement (CET) is a new Intel processor feature that blocks
> return/jump-oriented programming attacks.  Details can be found in "Intel
> 64 and IA-32 Architectures Software Developer's Manual" [1].

At v9, this probably isn't RFC any more. :)

As mentioned in another patch, I'd really like to see some self tests
for this feature. It's relatively complex...

-- 
Kees Cook



* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-05 18:19 ` [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection Yu-cheng Yu
  2020-02-25 20:07   ` Kees Cook
@ 2020-02-26 17:03   ` Dave Hansen
  2020-02-26 19:57     ` Pavel Machek
  2020-03-05 20:38     ` Yu-cheng Yu
  2020-02-26 18:05   ` Dave Hansen
  2 siblings, 2 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 17:03 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> Introduce Kconfig option: X86_INTEL_SHADOW_STACK_USER.
> 
> Shadow Stack (SHSTK) provides protection against function return address
> corruption.  It is active when the kernel has this feature enabled, and
> both the processor and the application support it.  When this feature is
> enabled, legacy non-SHSTK applications continue to work, but without SHSTK
> protection.
> 
> The user-mode SHSTK protection is only implemented for the 64-bit kernel.
> IA32 applications are supported under the compatibility mode.

I think what you're trying to say here is that the hardware supports
shadow stacks with 32-bit kernels.  However, this series does not
include that support and we have no plans to add it.

Right?

I'll let others weigh in, but I rather dislike the use of acronyms here.
I'd much rather see the English "shadow stack" everywhere than SHSTK.

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5e8949953660..6c34b701c588 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1974,6 +1974,28 @@ config X86_INTEL_TSX_MODE_AUTO
>  	  side channel attacks- equals the tsx=auto command line parameter.
>  endchoice
>  
> +config X86_INTEL_CET
> +	def_bool n
> +
> +config ARCH_HAS_SHSTK
> +	def_bool n
> +
> +config X86_INTEL_SHADOW_STACK_USER
> +	prompt "Intel Shadow Stack for user-mode"

Nit: this whole thing is to support more than a single stack.  I'd make
this plural at least in the text: "shadow stacks".

> +	def_bool n
> +	depends on CPU_SUP_INTEL && X86_64
> +	select ARCH_USES_HIGH_VMA_FLAGS
> +	select X86_INTEL_CET
> +	select ARCH_HAS_SHSTK
> +	---help---
> +	  Shadow Stack (SHSTK) provides protection against program
> +	  stack corruption.  It is active when the kernel has this
> +	  feature enabled, and the processor and the application
> +	  support it.  When this feature is enabled, legacy non-SHSTK
> +	  applications continue to work, but without SHSTK protection.
> +
> +	  If unsure, say y.

This is missing a *lot* of information.

What matters to someone turning this on?

1. It's a hardware feature.  This only matters if you have the right
   hardware.
2. It's a security hardening feature.  You dance around this, but need
   to come out and say it.
3. Apps must be enabled to use it.  You get no protection "for free" on
   old userspace.
4. The hardware supports user and kernel, but this option is for
   userspace only.



* Re: [RFC PATCH v9 04/27] x86/cet: Add control-protection fault handler
  2020-02-05 18:19 ` [RFC PATCH v9 04/27] x86/cet: Add control-protection fault handler Yu-cheng Yu
  2020-02-25 20:06   ` Kees Cook
@ 2020-02-26 17:10   ` Dave Hansen
  2020-03-05 20:44     ` Yu-cheng Yu
  1 sibling, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 17:10 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index 87ef69a72c52..8ed406f469e7 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -102,6 +102,10 @@ static const __initconst struct idt_data def_idts[] = {
>  #elif defined(CONFIG_X86_32)
>  	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
>  #endif
> +
> +#ifdef CONFIG_X86_64
> +	INTG(X86_TRAP_CP,		control_protection),
> +#endif
>  };

This patch in particular appears to have all of its code unconditionally
compiled in.  That's in contrast to things that have Kconfig options, like:

#ifdef CONFIG_X86_MCE
        INTG(X86_TRAP_MC,               &machine_check),
#endif

or:

#ifdef CONFIG_X86_THERMAL_VECTOR
        INTG(THERMAL_APIC_VECTOR,       thermal_interrupt),
#endif

Is there a reason this code is always compiled in on 64-bit even when
the config option is off?
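
For comparison, a guarded entry could look like the standalone mock
below.  It is a sketch only: SKETCH_X86_INTEL_CET stands in for
CONFIG_X86_INTEL_CET, the struct and table names are hypothetical, and
whether the series should gate the handler this way is for the author
to answer.

```c
#include <stddef.h>

/* Define to mimic CONFIG_X86_INTEL_CET=y; leave undefined for =n. */
/* #define SKETCH_X86_INTEL_CET */

struct idt_entry_sketch {
	int vector;
	const char *handler;
};

static const struct idt_entry_sketch def_idts_sketch[] = {
	{ 18, "machine_check" },	/* #MC */
#ifdef SKETCH_X86_INTEL_CET
	{ 21, "control_protection" },	/* #CP, only when the option is on */
#endif
};

/* Number of entries actually compiled into the table. */
static size_t idt_entry_count(void)
{
	return sizeof(def_idts_sketch) / sizeof(def_idts_sketch[0]);
}
```

With the define left commented out the table holds only the
machine-check entry, which is the behaviour the MCE and thermal
examples above get from their own Kconfig options.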



* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-02-05 18:19 ` [RFC PATCH v9 01/27] Documentation/x86: Add CET description Yu-cheng Yu
  2020-02-06  0:16   ` Randy Dunlap
  2020-02-25 20:02   ` Kees Cook
@ 2020-02-26 17:57   ` Dave Hansen
  2020-03-09 17:00     ` Yu-cheng Yu
  2 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 17:57 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

> index ade4e6ec23e0..8b69ebf0baed 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3001,6 +3001,12 @@
>  			noexec=on: enable non-executable mappings (default)
>  			noexec=off: disable non-executable mappings
>  
> +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
> +			applications

If we ever add kernel support, "no_cet_shstk" will mean "no cet shstk
for userspace"?

> +	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
> +			applications
> +
>  	nosmap		[X86,PPC]
>  			Disable SMAP (Supervisor Mode Access Prevention)
>  			even if it is supported by processor.

BTW, this documentation is misplaced.  It needs to go to the spot where
you introduce the code for these options.

> diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
> index a8de2fbc1caa..81f919801765 100644
> --- a/Documentation/x86/index.rst
> +++ b/Documentation/x86/index.rst
> @@ -19,6 +19,7 @@ x86-specific Documentation
>     tlb
>     mtrr
>     pat
> +   intel_cet
>     intel_mpx
>     intel-iommu
>     intel_txt
> diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> new file mode 100644
> index 000000000000..71e2462fea5c
> --- /dev/null
> +++ b/Documentation/x86/intel_cet.rst
> @@ -0,0 +1,294 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========================================
> +Control-flow Enforcement Technology (CET)
> +=========================================
> +
> +[1] Overview
> +============
> +
> +Control-flow Enforcement Technology (CET) provides protection against
> +return/jump-oriented programming (ROP) attacks.  It can be setup to

							      ^ set up

> +protect both applications and the kernel.  In the first phase, only
> +user-mode protection is implemented in the 64-bit kernel; 32-bit
> +applications are supported in compatibility mode.

Please just say what *is* at the time of the writing.  We don't need to
talk about "phases".

Also, you haven't mentioned that this is a *hardware* feature and that
it's only on Intel CPUs at the moment.  That's kinda essential.  If I've
got an AMD CPU, I can just stop reading. :)

The hardware supports shadow stacks for both userspace and the kernel in
both 32 and 64-bit modes.  32-bit kernel support is not implemented.
Both 32-bit and 64-bit user applications can run on 64-bit kernels.

This is also missing the same key points about enabling as the Kconfig
text: apps don't get this for free and must be specifically enabled.

> +CET introduces Shadow Stack (SHSTK) and Indirect Branch Tracking
> +(IBT).  SHSTK is a secondary stack allocated from memory and cannot
> +be directly modified by applications.  When executing a CALL, the
> +processor pushes a copy of the return address to SHSTK. 

... and to the normal stack

>  Upon
> +function return, the processor pops the SHSTK copy and compares it
> +to the one from the program stack.  If the two copies differ, the
> +processor raises a control-protection fault.  IBT verifies indirect
> +CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
> +opcodes (see CET instructions below).
> +
> +There are two kernel configuration options:
> +
> +    X86_INTEL_SHADOW_STACK_USER, and
> +    X86_INTEL_BRANCH_TRACKING_USER.
> +
> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
> +are required.

Why are these needed to build a CET-enabled kernel?

>  To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.
> +
> +There are two command-line options for disabling CET features::
> +
> +    no_cet_shstk - disables SHSTK, and
> +    no_cet_ibt   - disables IBT.
> +
> +At run time, /proc/cpuinfo shows the availability of SHSTK and IBT.

Availability of what?

If I set X86_INTEL_SHADOW_STACK_USER=n, I'll still see the cpuinfo
flags, but I won't have runtime support.

Probably best to say that cpuinfo tells you about processor support only.

> +[2] CET assembly instructions
> +=============================

Why do we need this in the kernel?  What is specific to Linux or the
kernel?  Why wouldn't I just go read the SDM if I want to know how the
instructions work?

> +[3] Application Enabling
> +========================
> +
> +An application's CET capability is marked in its ELF header and can
> +be verified from the following command output, in the
> +NT_GNU_PROPERTY_TYPE_0 field:
> +
> +    readelf -n <application>
> +
> +If an application supports CET and is statically linked, it will run
> +with CET protection.  If the application needs any shared libraries,
> +the loader checks all dependencies and enables CET only when all
> +requirements are met.

What about shared libraries loaded after the program starts?

> +[4] Legacy Libraries
> +====================
> +
> +GLIBC provides a few tunables for backward compatibility.
> +
> +GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
> +    Turn off SHSTK/IBT for the current shell.
> +
> +GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
> +    This controls how dlopen() handles SHSTK legacy libraries::
> +
> +        on         - continue with SHSTK enabled;
> +        permissive - continue with SHSTK off.

This seems like manpage fodder more than kernel documentation to me.

> +[5] CET system calls
> +====================
> +
> +The following arch_prctl() system calls are added for CET:

FWIW, I wouldn't call each of these a "system call".

"Several arch_prctl()'s have been added for CET:"

> +arch_prctl(ARCH_X86_CET_STATUS, unsigned long *addr)
> +    Return CET feature status.
> +
> +    The parameter 'addr' is a pointer to a user buffer.
> +    On returning to the caller, the kernel fills the following
> +    information::
> +
> +        *addr       = SHSTK/IBT status
> +        *(addr + 1) = SHSTK base address
> +        *(addr + 2) = SHSTK size
> +
> +arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
> +    Disable SHSTK and/or IBT specified in 'features'.  Return -EPERM
> +    if CET is locked.
> +
> +arch_prctl(ARCH_X86_CET_LOCK)
> +    Lock in CET feature.

Shouldn't this say what "locking" means?

> +arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, unsigned long *addr)
> +    Allocate a new SHSTK and put a restore token at top.
> +
> +    The parameter 'addr' is a pointer to a user buffer and indicates
> +    the desired SHSTK size to allocate.  On returning to the caller,
> +    the kernel fills '*addr' with the base address of the new SHSTK.



> +arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE, unsigned long *addr)
> +    Mark an address range as IBT legacy code.
> +
> +    The parameter 'addr' is a pointer to a user buffer that has the
> +    following information::
> +
> +        *addr       = starting linear address of the legacy code
> +        *(addr + 1) = size of the legacy code
> +        *(addr + 2) = set (1); clear (0)
> +
> +Note:
> +  There is no CET-enabling arch_prctl function.  By design, CET is
> +  enabled automatically if the binary and the system can support it.

This is kinda interesting.  It means that a JIT couldn't choose to
protect the code it generates and have different rules from itself?

> +  The parameters passed are always unsigned 64-bit.  When an IA32
> +  application passing pointers, it should only use the lower 32 bits.

Won't a 32-bit app calling prctl() use the 32-bit ABI?  How would it
even know it's running on a 64-bit kernel?

> +[6] The implementation of the SHSTK
> +===================================
> +
> +SHSTK size
> +----------
> +
> +A task's SHSTK is allocated from memory to a fixed size of
> +RLIMIT_STACK.

I can't really parse that sentence.  Is this saying that shadow stacks
are limited by and share space with normal stacks via RLIMIT_STACK?

>  A compat-mode thread's SHSTK size is 1/4 of
> +RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
> +share a 32-bit address space.

I thought the size was passed in from userspace?  Where does this sizing
take place?  Is this a convention or is it being enforced?

> +Signal
> +------
> +
> +The main program and its signal handlers use the same SHSTK.  Because
> +the SHSTK stores only return addresses, a large SHSTK will cover the
> +condition that both the program stack and the sigaltstack run out.

						 ^ typo?

I'm not sure what this is trying to say.

> +The kernel creates a restore token at the SHSTK restoring address and
> +verifies that token when restoring from the signal handler.

I think there's a sentence or two of background missing here.  I'm
really lost as to what this is trying to tell me.

> +IBT for signal delivering and sigreturn is the same as the main
> +program's setup; except for WAIT_ENDBR status, which can be read from
> +MSR_IA32_U_CET.  In general, a task is in WAIT_ENDBR after an
> +indirect CALL/JMP and before the next instruction starts.

I'm 100% lost.  I have no idea what this is trying to tell me or why it
is relevant to the kernel.

> +Fork
> +----
> +
> +The SHSTK's vma has VM_SHSTK flag set; its PTEs are required to be
> +read-only and dirty.  When a SHSTK PTE is not present, RO, and dirty,
> +a SHSTK access triggers a page fault with an additional SHSTK bit set
> +in the page fault error code.
> +
> +When a task forks a child, its SHSTK PTEs are copied and both the
> +parent's and the child's SHSTK PTEs are cleared of the dirty bit.
> +Upon the next SHSTK access, the resulting SHSTK page fault is handled
> +by page copy/re-use.

What's the most important thing about shadow stacks and fork()?  Does
this documentation tell that to the reader?

> +When a pthread child is created, the kernel allocates a new SHSTK for
> +the new thread.

Why is this here?  Are pthread children created with fork()?

> +Setjmp/Longjmp
> +--------------
> +
> +Longjmp unwinds SHSTK until it matches the program stack.

I'm missing what this has to do with the kernel.

> +Ucontext
> +--------
> +
> +In GLIBC, getcontext/setcontext is implemented in similar way as
> +setjmp/longjmp.
> +
> +When makecontext creates a new ucontext, a new SHSTK is allocated for
> +that context with ARCH_X86_CET_ALLOC_SHSTK syscall.  The kernel

Nit: ARCH_X86_CET_ALLOC_SHSTK is not a syscall.

> +creates a restore token at the top of the new SHSTK and the user-mode
> +code switches to the new SHSTK with the RSTORSSP instruction.

This seems like a howto for doing user-level threading.  It seems like
it could be replaced by a single sentence in the
ARCH_X86_CET_ALLOC_SHSTK documentation explaining that new shadow stacks
are generally (always??) allocated along with new stacks.  Since new
clone() threads need a new stack, they also need a new shadow stack.
User-level threads that need a new stack are also expected to allocate a
new shadow stack.

Right?

> +[7] The management of read-only & dirty PTEs for SHSTK
> +======================================================
> +
> +A RO and dirty PTE exists in the following cases:
> +
> +(a) A page is modified and then shared with a fork()'ed child;
> +(b) A R/O page that has been COW'ed;
> +(c) A SHSTK page.
> +
> +The processor only checks the dirty bit for (c).  To prevent the use
> +of non-SHSTK memory as SHSTK, we use a spare bit of the 64-bit PTE as
> +DIRTY_SW for (a) and (b) above.  This results to the following PTE
> +settings::
> +
> +    Modified PTE:             (R/W + DIRTY_HW)
> +    Modified and shared PTE:  (R/O + DIRTY_SW)
> +    R/O PTE, COW'ed:          (R/O + DIRTY_SW)
> +    SHSTK PTE:                (R/O + DIRTY_HW)
> +    SHSTK PTE, COW'ed:        (R/O + DIRTY_HW)
> +    SHSTK PTE, shared:        (R/O + DIRTY_SW)
> +
> +Note that DIRTY_SW is only used in R/O PTEs but not R/W PTEs.

I really don't think this belongs in the documentation, especially since
it's duplicated almost verbatim in code comments.

> +[8] The implementation of IBT legacy bitmap
> +===========================================
> +
> +When IBT is active, a non-IBT-capable legacy library can be executed
> +if its address ranges are specified in the legacy code bitmap.  The
> +bitmap covers the whole user-space address, which is TASK_SIZE_MAX
> +for 64-bit and TASK_SIZE for IA32, and its each bit indicates a 4-KB
> +legacy code page.  It is read-only from an application, and setup by

								^ set up

> +the kernel as a special mapping when the first time the application
> +calls arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE).  The application
> +manages the bitmap through the arch_prctl.






* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-05 18:19 ` [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection Yu-cheng Yu
  2020-02-25 20:07   ` Kees Cook
  2020-02-26 17:03   ` Dave Hansen
@ 2020-02-26 18:05   ` Dave Hansen
  2020-02-27  1:02     ` H.J. Lu
  2020-03-06 18:37     ` Yu-cheng Yu
  2 siblings, 2 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 18:05 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> +# Check assembler Shadow Stack suppot

				  ^ support

> +ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +  ifeq ($(call as-instr, saveprevssp, y),)
> +      $(error CONFIG_X86_INTEL_SHADOW_STACK_USER not supported by the assembler)
> +  endif
> +endif

Is this *just* looking for instruction support in the assembler?

We usually just .byte them, like this for pkeys:

        asm volatile(".byte 0x0f,0x01,0xee\n\t"
                     : "=a" (pkru), "=d" (edx)
                     : "c" (ecx));

That way everybody with old toolchains can still build the kernel (and
run/test code with your config option on, btw...).



* Re: [RFC PATCH v9 06/27] mm: Introduce VM_SHSTK for Shadow Stack memory
  2020-02-05 18:19 ` [RFC PATCH v9 06/27] mm: Introduce VM_SHSTK for Shadow Stack memory Yu-cheng Yu
  2020-02-25 20:07   ` Kees Cook
@ 2020-02-26 18:07   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 18:07 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> A Shadow Stack (SHSTK) PTE must be read-only and have _PAGE_DIRTY set.
> However, read-only and Dirty PTEs also exist for copy-on-write (COW) pages.
> These two cases are handled differently for page faults and a new VM flag
> is necessary for tracking SHSTK VMAs.
> 
> v9:
> - Add VM_SHSTK case to arch_vma_name().
> - Revise the commit log to explain why a new VM flag is needed.

To be honest, a flag is not strictly *needed*.  It is certainly
convenient and straightforward, but it's far from being truly necessary.



* Re: [RFC PATCH v9 07/27] Add guard pages around a Shadow Stack.
  2020-02-05 18:19 ` [RFC PATCH v9 07/27] Add guard pages around a Shadow Stack Yu-cheng Yu
  2020-02-25 20:11   ` Kees Cook
@ 2020-02-26 18:17   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 18:17 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> INCSSPD/INCSSPQ instruction is used to unwind a Shadow Stack (SHSTK).  It
> performs 'pop and discard' of the first and last element from SHSTK in the
> range specified in the operand.

This implies, but does not directly hit on an important detail: these
instructions *touch* memory.  They don't just mess with the shadow stack
pointer, they actually dereference memory.  This makes them very
different from just manipulating %rsp and is what actually makes this
guard page thing work in the first place.

> The maximum value of the operand is 255,
> and the maximum moving distance of the SHSTK pointer is 255 * 4 for
> INCSSPD, 255 * 8 for INCSSPQ.

You could also be kind and do the math for us, reminding us that ~1k and
~2k are both very far away from the 4k guard page size.
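
Spelling the math out, as a small user-space sketch.  The macro and
function names below are invented for illustration; only the 255 limit
and the 4k guard size come from the changelog and the patch:

```c
#include <assert.h>

/* Illustrative names only; none of these exist in the kernel. */
#define INCSSP_MAX_COUNT	255UL	/* largest operand, per the changelog */
#define SHSTK_GUARD_SIZE	4096UL	/* one 4k page, as used by the patch */

/* Farthest the shadow stack pointer can move in a single INCSSP. */
static inline unsigned long incssp_max_distance(unsigned long elem_size)
{
	return INCSSP_MAX_COUNT * elem_size;
}

/* INCSSPD pops 4-byte elements, INCSSPQ pops 8-byte elements. */
static_assert(INCSSP_MAX_COUNT * 4 == 1020, "INCSSPD: just under 1k");
static_assert(INCSSP_MAX_COUNT * 8 == 2040, "INCSSPQ: just under 2k");
static_assert(INCSSP_MAX_COUNT * 8 < SHSTK_GUARD_SIZE,
	      "a single INCSSP cannot step over the guard page");
```

So the worst case is 2040 bytes, roughly half of the 4k guard page.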

> Since SHSTK has a fixed size, creating a guard page above prevents
> INCSSP/RET from moving beyond.

What does this have to do with being a fixed size?  Also, this seems
incongruous with an API that takes a size as an argument.  It sounds
like shadow stacks are fixed in size *after* allocation, which is really
different from being truly fixed in size.

> Likewise, creating a guard page below
> prevents CALL from underflowing the SHSTK.

The language here is goofy.  I think of any "stack overflow" as the
condition where a stack grows too large.  I don't call too-large
grows-down stacks underflows, even though they are going down in their
addressing.

> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  include/linux/mm.h | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b5145fbe102e..75de07674649 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2464,9 +2464,15 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
>  static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
>  {
>  	unsigned long vm_start = vma->vm_start;
> +	unsigned long gap = 0;
>  
> -	if (vma->vm_flags & VM_GROWSDOWN) {
> -		vm_start -= stack_guard_gap;
> +	if (vma->vm_flags & VM_GROWSDOWN)
> +		gap = stack_guard_gap;
> +	else if (vma->vm_flags & VM_SHSTK)
> +		gap = PAGE_SIZE;

Comments, please.  There is also a *lot* of stuff that has to go right
to make PAGE_SIZE OK here, including the rather funky architecture of a
single instruction.

It seems cruel and unusual punishment to future generations to make them
chase git logs for the logic rather than look at a nice code comment.

I think it's probably also best to have this be

	gap = ARCH_SHADOW_STACK_GUARD_GAP;

and then you can give the full rundown about the sizing logic inside the
arch/x86/include definition.
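
Roughly what I have in mind, as a user-space sketch rather than a real
patch.  The struct, the flag values and PAGE_SIZE_4K are stand-ins so
this builds outside the kernel, and ARCH_SHADOW_STACK_GUARD_GAP is only
the name proposed above:

```c
#include <assert.h>

/* User-space stand-ins; the values are illustrative, not the kernel's. */
#define PAGE_SIZE_4K	4096UL
#define VM_GROWSDOWN	0x1UL
#define VM_SHSTK	0x2UL

/*
 * INCSSP touches the first and last shadow stack element it pops and
 * can move the pointer by at most 255 * 8 = 2040 bytes.  That is less
 * than one page, so a one-page gap cannot be stepped over without
 * touching it.
 */
#define ARCH_SHADOW_STACK_GUARD_GAP	PAGE_SIZE_4K

static_assert(255UL * 8 < ARCH_SHADOW_STACK_GUARD_GAP,
	      "guard gap must exceed the largest INCSSP step");

struct vm_area_struct {
	unsigned long vm_start;
	unsigned long vm_flags;
};

/* Stand-in for the kernel tunable (256 pages in this sketch). */
static unsigned long stack_guard_gap = 256UL * PAGE_SIZE_4K;

static inline unsigned long vm_start_gap(const struct vm_area_struct *vma)
{
	unsigned long vm_start = vma->vm_start;
	unsigned long gap = 0;

	if (vma->vm_flags & VM_GROWSDOWN)
		gap = stack_guard_gap;
	else if (vma->vm_flags & VM_SHSTK)
		gap = ARCH_SHADOW_STACK_GUARD_GAP;

	if (gap) {
		vm_start -= gap;
		/* Clamp to 0 if the subtraction wrapped around. */
		if (vm_start > vma->vm_start)
			vm_start = 0;
	}
	return vm_start;
}
```

The point is that the "why is one page enough" reasoning lives next to
the constant instead of in a git log.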




* Re: [RFC PATCH v9 08/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW
  2020-02-05 18:19 ` [RFC PATCH v9 08/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW Yu-cheng Yu
  2020-02-25 20:12   ` Kees Cook
@ 2020-02-26 18:20   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 18:20 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> Before introducing _PAGE_DIRTY_SW for non-hardware memory management
> purposes in the next patch, rename _PAGE_DIRTY to _PAGE_DIRTY_HW and
> _PAGE_BIT_DIRTY to _PAGE_BIT_DIRTY_HW to make these PTE dirty bits
> more clear.  There are no functional changes from this patch.

Reviewed-by: Dave Hansen <dave.hansen@intel.com>



* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-26 17:03   ` Dave Hansen
@ 2020-02-26 19:57     ` Pavel Machek
  2020-03-05 20:38     ` Yu-cheng Yu
  1 sibling, 0 replies; 107+ messages in thread
From: Pavel Machek @ 2020-02-26 19:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Peter Zijlstra,
	Randy Dunlap, Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

> On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> > Introduce Kconfig option: X86_INTEL_SHADOW_STACK_USER.
> > 
> > Shadow Stack (SHSTK) provides protection against function return address
> > corruption.  It is active when the kernel has this feature enabled, and
> > both the processor and the application support it.  When this feature is
> > enabled, legacy non-SHSTK applications continue to work, but without SHSTK
> > protection.
> > 
> > The user-mode SHSTK protection is only implemented for the 64-bit kernel.
> > IA32 applications are supported under the compatibility mode.
> 
> I think what you're trying to say here is that the hardware supports
> shadow stacks with 32-bit kernels.  However, this series does not
> include that support and we have no plans to add it.
> 
> Right?
> 
> I'll let others weigh in, but I rather dislike the use of acronyms here.
>  I'd much rather see the english "shadow stack" everywhere than SHSTK.

For the record, I like "shadow stack" better, too.



* Re: [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW
  2020-02-05 18:19 ` [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW Yu-cheng Yu
  2020-02-25 20:12   ` Kees Cook
@ 2020-02-26 21:35   ` Dave Hansen
  2020-04-01 19:08     ` Yu-cheng Yu
  1 sibling, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 21:35 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> When Shadow Stack (SHSTK) is introduced, a R/O and Dirty PTE exists in the
> following cases:
> 
> (a) A modified, copy-on-write (COW) page;
> (b) A R/O page that has been COW'ed;
> (c) A SHSTK page.

I really like to begin these patches with a problem statement:

	There is essentially no room left in the x86 hardware PTEs on
	some OSes (not Linux).  That left the hardware architects
	looking for a way to represent a new memory type (shadow stack)
	within the existing bits.  They chose to repurpose a lightly-
	used state: Write=0,Dirty=1.

	The reason it's lightly used is that Dirty=1 is normally set by
	hardware and can not normally be set by hardware on a Write=0
	PTE.  Software must normally be involved to create one of these
	PTEs, so software can simply opt to not create them.

But that leaves us with a Linux problem: we need to ensure we never
create Write=0,Dirty=1 PTEs.  In places where we do create them, we need
to find an alternative way to represent them _without_ using the same
hardware bit combination.  Thus, enter _PAGE_DIRTY_SW.
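
To make the bit combinations concrete, here is a user-space sketch of
the two predicates this implies.  The bit positions follow this series
(RW is bit 1, the hardware Dirty bit is bit 6, and the patch puts
_PAGE_DIRTY_SW at bit 58); the helper names are made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define _PAGE_RW	(1ULL << 1)	/* Write */
#define _PAGE_DIRTY_HW	(1ULL << 6)	/* Dirty, as seen by hardware */
#define _PAGE_DIRTY_SW	(1ULL << 58)	/* software-only Dirty (this patch) */

static_assert((_PAGE_DIRTY_HW & _PAGE_DIRTY_SW) == 0,
	      "the two dirty bits must be distinct");

/* Hardware treats Write=0,Dirty=1 as shadow stack memory. */
static inline bool pteval_is_shstk(pteval_t val)
{
	return !(val & _PAGE_RW) && (val & _PAGE_DIRTY_HW);
}

/* For Linux's purposes, either dirty bit means "dirty". */
static inline bool pteval_is_dirty(pteval_t val)
{
	return (val & (_PAGE_DIRTY_HW | _PAGE_DIRTY_SW)) != 0;
}
```

A read-only, dirty, non-shadow-stack page must then satisfy
pteval_is_dirty() without satisfying pteval_is_shstk().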

... back to the list:
> (a) A modified, copy-on-write (COW) page;
> (b) A R/O page that has been COW'ed;

(a) is pretty clear to me.  We had a Write=1,Dirty=1 PTE and fork()'d.
The fork() code set Write=0, but left Dirty=1.  In this case, we have a
read-only PTE underneath a VM_WRITE VMA.

(b) is not clear to me.  Could you please differentiate between the
permissions of the PTE and the permissions of the VMA, and also include
the steps needed to create it?

I think you also forgot a state:

(d) a page where the processor observed a Write=1 PTE, started a write,
    set Dirty=1, but then observed a Write=0 PTE.

That's possible today.

> To separate non-SHSTK memory from SHSTK, introduce a spare bit of the
> 64-bit PTE as _PAGE_BIT_DIRTY_SW and use that for case (a) and (b).
> This results in the following possible settings:
> 
> Modified PTE:         (R/W + DIRTY_HW)
> Modified and COW PTE: (R/O + DIRTY_SW)
> R/O PTE COW'ed:       (R/O + DIRTY_SW)
> SHSTK PTE:            (R/O + DIRTY_HW)
> SHSTK shared PTE[1]:  (R/O + DIRTY_SW)
> SHSTK PTE COW'ed:     (R/O + DIRTY_HW)
> 
> [1] When a SHSTK page is being shared among threads,

I think you mean processes.  You can probably even mention here that
this happens at fork().

>     its PTE is cleared of
>     _PAGE_DIRTY_HW, so the next SHSTK access causes a fault, and the page
>     is duplicated and _PAGE_DIRTY_HW is set again.

It's worth noting here that this is the COW equivalent for shadow stack
pages, even though it's copy-on-any-access rather than copy-on-write.


>  static inline pte_t pte_mkold(pte_t pte)
> @@ -322,6 +322,17 @@ static inline pte_t pte_mkold(pte_t pte)
>  
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
> +	/*
> +	 * Use _PAGE_DIRTY_SW on a R/O PTE to set it apart from
> +	 * a Shadow Stack PTE, which is R/O + _PAGE_DIRTY_HW.
> +	 */

I think we can do better here than this comment.  Maybe:

	/*
	 * Blindly clearing _PAGE_RW might accidentally create
	 * a shadow stack PTE (RW=0,Dirty=1).  Move the hardware
	 * dirty value to the software bit.
	 */
	

> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {

Do we need to check cpuid, or do we need to check whether shadow stacks
are enabled?  What if X86_FEATURE_SHSTK is set, but cr4.X86_CR4_CET=0?

I think you've gone and tried to clear X86_FEATURE_SHSTK whenever the
feature is not enabled.  That's a _bit_ funky, but I guess it works.  I
think I'd rather have some common helper like: shadow_stacks_enabled()
that gets called so that you at least have a single place in the code to
point out this convention.

> +		if (pte_flags(pte) & _PAGE_DIRTY_HW) {
> +			pte = pte_clear_flags(pte, _PAGE_DIRTY_HW);
> +			pte = pte_set_flags(pte, _PAGE_DIRTY_SW);
> +		}
> +	}
> +
>  	return pte_clear_flags(pte, _PAGE_RW);
>  }

Just curious, but how clean does the assembly look after this change?
Does this really blow up the code?

This code is used in fork() which we care deeply about.  Did you go
looking for any performance impact from this?
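
For reference, this is the shape I would expect with a common helper,
written as a user-space sketch on raw PTE values.  shadow_stacks_enabled()
is only the name suggested above, and the flag behind it is a stand-in
for whatever the real enabling check turns out to be:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define _PAGE_RW	(1ULL << 1)
#define _PAGE_DIRTY_HW	(1ULL << 6)
#define _PAGE_DIRTY_SW	(1ULL << 58)

static_assert((_PAGE_DIRTY_HW & _PAGE_DIRTY_SW) == 0,
	      "the two dirty bits must be distinct");

/* Stand-in for the real check; a plain variable so it can be toggled. */
static bool shstk_enabled_for_sketch = true;

/* Single place that documents the "is the feature really on" convention. */
static inline bool shadow_stacks_enabled(void)
{
	return shstk_enabled_for_sketch;
}

static inline pteval_t pteval_wrprotect(pteval_t val)
{
	/*
	 * Blindly clearing _PAGE_RW might accidentally create
	 * a shadow stack PTE (RW=0,Dirty=1).  Move the hardware
	 * dirty value to the software bit.
	 */
	if (shadow_stacks_enabled() && (val & _PAGE_DIRTY_HW)) {
		val &= ~_PAGE_DIRTY_HW;
		val |= _PAGE_DIRTY_SW;
	}
	return val & ~_PAGE_RW;
}
```

This says nothing about the generated assembly or the fork() cost, which
still need to be looked at with the real static_cpu_has() check.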

> @@ -332,9 +343,25 @@ static inline pte_t pte_mkexec(pte_t pte)
>  
>  static inline pte_t pte_mkdirty(pte_t pte)
>  {
> +	pteval_t dirty = _PAGE_DIRTY_HW;
> +
> +	if (static_cpu_has(X86_FEATURE_SHSTK) && !pte_write(pte))
> +		dirty = _PAGE_DIRTY_SW;
> +
> +	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
> +}

Comment, please.

	/* Avoid creating (HW)Dirty=1,Write=0 PTEs */

> +static inline pte_t pte_mkdirty_shstk(pte_t pte)
> +{
> +	pte = pte_clear_flags(pte, _PAGE_DIRTY_SW);
>  	return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
>  }

I've already forgotten what the right thing here is and why you _need_
_PAGE_DIRTY_SW clear.  That's a bad sign. :)

Could you please enlighten us by adding a comment?

> +static inline bool pte_dirty_hw(pte_t pte)
> +{
> +	return pte_flags(pte) & _PAGE_DIRTY_HW;
> +}

There's at least one open-coded instance of this above.  Why not just
move this up so you can use it?
...

All of those comments pretty much go for the pmd and pud variants too,
of course.

> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index e647e3c75578..826823df917f 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -23,7 +23,8 @@
>  #define _PAGE_BIT_SOFTW2	10	/* " */
>  #define _PAGE_BIT_SOFTW3	11	/* " */
>  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
> -#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
> +#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
> +#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
>  #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
>  #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
>  #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
> @@ -35,6 +36,12 @@
>  #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
>  #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
>  
> +/*
> + * This bit indicates a copy-on-write page, and is different from
> + * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
> + */
> +#define _PAGE_BIT_DIRTY_SW	_PAGE_BIT_SOFTW5 /* was written to */

Does it *only* indicate a copy-on-write (or copy-on-access) page?  If
so, haven't we misnamed it?

>  /* If _PAGE_BIT_PRESENT is clear, we use these: */
>  /* - if the user mapped it with PROT_NONE; pte_present gives true */
>  #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
> @@ -108,6 +115,28 @@
>  #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
>  #endif
>  
> +/* A R/O and dirty PTE exists in the following cases:

Which dirty is this talking about?  DIRTY_HW?  DIRTY_SW?

> + *	(a) A modified, copy-on-write (COW) page;
> + *	(b) A R/O page that has been COW'ed;
> + *	(c) A SHSTK page.

Don't forget (d).

> + * _PAGE_DIRTY_SW is used to separate case (c) from others.
> + * This results in the following settings:
> + *
> + *	Modified PTE:         (R/W + DIRTY_HW)
> + *	Modified and COW PTE: (R/O + DIRTY_SW)
> + *	R/O PTE COW'ed:       (R/O + DIRTY_SW)
> + *	SHSTK PTE:            (R/O + DIRTY_HW)
> + *	SHSTK PTE COW'ed:     (R/O + DIRTY_HW)
> + *	SHSTK PTE being shared among threads: (R/O + DIRTY_SW)
> + */
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +#define _PAGE_DIRTY_SW	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY_SW)
> +#else
> +#define _PAGE_DIRTY_SW	(_AT(pteval_t, 0))
> +#endif
> +
> +#define _PAGE_DIRTY_BITS (_PAGE_DIRTY_HW | _PAGE_DIRTY_SW)
> +
>  #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
>  
>  #define _PAGE_TABLE_NOENC	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |\
> 





* Re: [RFC PATCH v9 10/27] x86/mm: Update pte_modify, pmd_modify, and _PAGE_CHG_MASK for _PAGE_DIRTY_SW
  2020-02-05 18:19 ` [RFC PATCH v9 10/27] x86/mm: Update pte_modify, pmd_modify, and _PAGE_CHG_MASK for _PAGE_DIRTY_SW Yu-cheng Yu
@ 2020-02-26 22:02   ` Dave Hansen
  0 siblings, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 22:02 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

The subject really needs work.  Could you think of a way to summarize
the changes here in English as opposed to just listing the symbols you
modified?

I think we could probably just auto-generate subjects for patches if the
existing one were sufficient.

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> After the introduction of _PAGE_DIRTY_SW, pte_modify and pmd_modify need to
> set the Dirty bit accordingly: if Shadow Stack is enabled and _PAGE_RW is
> cleared, use _PAGE_DIRTY_SW; otherwise _PAGE_DIRTY_HW.

You've basically gone and written the code's if() statement in English
here.  That doesn't really help me understand the patch.

> Since the Dirty bit is modify by pte_modify(), remove _PAGE_DIRTY_HW from
> PAGE_CHG_MASK.

			 ^ modified

This is a great example of a changelog that adds very little value.
It's following the comments and doing what they say, but it's pretty
obvious that the analysis stopped there.

What *kinds* of bits are in _PAGE_CHG_MASK or not?  What changed about
_PAGE_DIRTY_HW?  By this definition, shouldn't _PAGE_DIRTY_SW have
technically been in this mask before this patch?

> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 62aeb118bc36..2733e7ec16b3 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -702,6 +702,14 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>  	val &= _PAGE_CHG_MASK;
>  	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
>  	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
> +
> +	if (pte_dirty(pte)) {
> +		if (static_cpu_has(X86_FEATURE_SHSTK) && !(val & _PAGE_RW))
> +			val |= _PAGE_DIRTY_SW;
> +		else
> +			val |= _PAGE_DIRTY_HW;
> +	}
> +
>  	return __pte(val);
>  }

OK, so this is a path we use for changing bunches of PTEs to 'newprot'.
It doesn't use the pte_*() helpers that the previous patch fixed up, so
we need a new site.

Right?

Maybe that would make good changelog text.

Also, couldn't we just have a pte_fixup() function or something that did
this logic and could be shared?
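
For illustration only, here is a minimal user-space sketch of what such a
shared helper could look like.  The name dirty_fixup() and the macro names
are hypothetical, the bit position used for the software dirty bit is an
assumption made for the example, and the "Shadow Stack enabled" condition
is passed in as a plain flag instead of static_cpu_has():

```c
#include <stdbool.h>
#include <stdint.h>

/* On x86, RW is PTE bit 1 and the hardware Dirty bit is PTE bit 6. */
#define PAGE_RW        (UINT64_C(1) << 1)
#define PAGE_DIRTY_HW  (UINT64_C(1) << 6)
/* Assumption: the software dirty bit sits in a software-available bit. */
#define PAGE_DIRTY_SW  (UINT64_C(1) << 57)

/*
 * Hypothetical shared helper: re-apply the dirty state to 'val' after the
 * protection bits have been replaced.  A dirty, read-only entry gets the
 * software bit when Shadow Stack is enabled, so that it cannot be mistaken
 * for a Shadow Stack entry; everything else keeps the hardware bit.
 */
static uint64_t dirty_fixup(uint64_t val, bool was_dirty, bool shstk_enabled)
{
	if (!was_dirty)
		return val;
	if (shstk_enabled && !(val & PAGE_RW))
		return val | PAGE_DIRTY_SW;
	return val | PAGE_DIRTY_HW;
}
```

With something along these lines, both pte_modify() and pmd_modify() would
reduce to a single call on the raw value instead of two copies of the same
if/else.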

> @@ -712,6 +720,14 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>  	val &= _HPAGE_CHG_MASK;
>  	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
>  	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
> +
> +	if (pmd_dirty(pmd)) {
> +		if (static_cpu_has(X86_FEATURE_SHSTK) && !(val & _PAGE_RW))
> +			val |= _PAGE_DIRTY_SW;
> +		else
> +			val |= _PAGE_DIRTY_HW;
> +	}
> +
>  	return __pmd(val);
>  }
>  
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 826823df917f..e7e28bf7e919 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -150,8 +150,8 @@
>   * instance, and is *not* included in this mask since
>   * pte_modify() does modify it.
>   */
> -#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
> -			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_HW |	\
> +#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |	\
> +			 _PAGE_SPECIAL | _PAGE_ACCESSED |	\
>  			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
>  #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 11/27] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  2020-02-05 18:19 ` [RFC PATCH v9 11/27] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
  2020-02-25 20:13   ` Kees Cook
@ 2020-02-26 22:04   ` Dave Hansen
  2020-04-03 15:42     ` Yu-cheng Yu
  1 sibling, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 22:04 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 4b04af569c05..e467ca182633 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1201,7 +1201,7 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
>  	}
>  
>  	/* Clear dirty field. */
> -	se->val64 &= ~_PAGE_DIRTY;
> +	se->val64 &= ~_PAGE_DIRTY_BITS;
>  
>  	ops->clear_pse(se);
>  	ops->clear_ips(se);

Are the i915 maintainers on cc?

Shouldn't this use pte_mkclean() instead of open-coding?


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 12/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW
  2020-02-05 18:19 ` [RFC PATCH v9 12/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW Yu-cheng Yu
  2020-02-25 20:14   ` Kees Cook
@ 2020-02-26 22:20   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 22:20 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> When Shadow Stack (SHSTK) is enabled, the [R/O + PAGE_DIRTY_HW] setting is
> reserved only for SHSTK.

Got it.

> Non-Shadow Stack R/O PTEs are [R/O + PAGE_DIRTY_SW].

This is only true for *dirty* PTEs, right?

> When a PTE goes from [R/W + PAGE_DIRTY_HW] to [R/O + PAGE_DIRTY_SW], it
> could become a transient SHSTK PTE in two cases.
> 
> The first case is that some processors can start a write but end up seeing
> a read-only PTE by the time they get to the Dirty bit, creating a transient
> SHSTK PTE.  However, this will not occur on processors supporting SHSTK
> therefore we don't need a TLB flush here.
> 
> The second case is that when the software, without atomic, tests & replaces
> PAGE_DIRTY_HW with PAGE_DIRTY_SW, a transient SHSTK PTE can exist.  This is
> prevented with cmpxchg.
> 
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue.  Jann Horn provided the cmpxchg solution.
> 
> v9:
> - Change compile-time conditionals to runtime checks.
> - Fix parameters of try_cmpxchg(): change pte_t/pmd_t to
>   pte_t.pte/pmd_t.pmd.
> 
> v4:
> - Implement try_cmpxchg().
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/pgtable.h | 66 ++++++++++++++++++++++++++++++++++
>  1 file changed, 66 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2733e7ec16b3..43cb27379208 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1253,6 +1253,39 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pte_t *ptep)
>  {
> +	/*
> +	 * Some processors can start a write, but end up seeing a read-only
> +	 * PTE by the time they get to the Dirty bit.  In this case, they
> +	 * will set the Dirty bit, leaving a read-only, Dirty PTE which
> +	 * looks like a Shadow Stack PTE.
> +	 *
> +	 * However, this behavior has been improved and will not occur on
> +	 * processors supporting Shadow Stack.  Without this guarantee, a
> +	 * transition to a non-present PTE and flush the TLB would be
> +	 * needed.
> +	 *
> +	 * When changing a writable PTE to read-only and if the PTE has
> +	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so that
> +	 * the PTE is not a valid Shadow Stack PTE.
> +	 */
> +#ifdef CONFIG_X86_64
> +	if (static_cpu_has(X86_FEATURE_SHSTK)) {

Judicious application of arch/x86/include/asm/disabled-features.h should
be able to get rid of the #ifdef.  See pkeys in there for another example.

> +		pte_t new_pte, pte = READ_ONCE(*ptep);
> +
> +		do {
> +			/*
> +			 * This is the same as moving _PAGE_DIRTY_HW
> +			 * to _PAGE_DIRTY_SW.
> +			 */
> +			new_pte = pte_wrprotect(pte);
> +			new_pte.pte |= (new_pte.pte & _PAGE_DIRTY_HW) >>
> +					_PAGE_BIT_DIRTY_HW << _PAGE_BIT_DIRTY_SW;
> +			new_pte.pte &= ~_PAGE_DIRTY_HW;
> +		} while (!try_cmpxchg(&ptep->pte, &pte.pte, new_pte.pte));

Have you tried to test this code?

This is trying to transition the value at '&ptep->pte' from the
'pte.pte' value to 'new_pte.pte'.  If the value at '&ptep->pte' does not
match 'pte.pte', the cmpxchg will fail and we'll run through the loop again.

What terminates that loop?

The "old" value (pte.pte) never gets updated since it is read outside
the loop.  There's no guarantee that the contents (&ptep->pte) will ever
match pte.pte.

Doesn't the READ_ONCE() need to be inside the loop?
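
Whether the loop terminates depends on what try_cmpxchg() does with the
"old" argument on failure.  As a hedged user-space sketch, the version
below uses C11 atomic_compare_exchange_weak() as a stand-in: it writes the
value it actually observed back into 'old' when the exchange fails, so the
next pass recomputes the new value from fresh data.  If the patch's
try_cmpxchg() does not have that write-back behavior, the reload has to be
done explicitly inside the loop.  The function name and the software dirty
bit position are assumptions for the example:

```c
#include <stdatomic.h>
#include <stdint.h>

#define PAGE_RW        (UINT64_C(1) << 1)
#define PAGE_DIRTY_HW  (UINT64_C(1) << 6)
#define PAGE_DIRTY_SW  (UINT64_C(1) << 57)	/* illustrative position */

/*
 * Write-protect *ptep and move DIRTY_HW to DIRTY_SW in one atomic step,
 * so no read-only + DIRTY_HW value is ever visible in memory.
 * Returns the value that was replaced.
 */
static uint64_t wrprotect_move_dirty(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t next;

	do {
		next = old & ~PAGE_RW;
		if (next & PAGE_DIRTY_HW)
			next = (next & ~PAGE_DIRTY_HW) | PAGE_DIRTY_SW;
		/*
		 * On failure the compare-exchange stores the current
		 * contents of *ptep into 'old', which is what lets the
		 * loop make progress.
		 */
	} while (!atomic_compare_exchange_weak(ptep, &old, next));

	return old;
}
```

Either way, the changelog or a comment should say which of the two
behaviors the loop relies on.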


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 13/27] x86/mm: Shadow Stack page fault error checking
  2020-02-05 18:19 ` [RFC PATCH v9 13/27] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
  2020-02-25 20:16   ` Kees Cook
@ 2020-02-26 22:47   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-26 22:47 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> If a page fault is triggered by a Shadow Stack (SHSTK) access
> (e.g. CALL/RET) or SHSTK management instructions (e.g. WRUSSQ), then bit[6]
> of the page fault error code is set.

How about starting with a definition:

	Shadow stack accesses are those that are performed by the CPU
	where it expects to encounter a shadow stack mapping.  These
	accesses are performed implicitly by CALL/RET at the site of the
	shadow stack pointer.  These accesses are made explicitly by
	shadow stack management instructions like WRUSSQ.

> In access_error(), verify a SHSTK page fault is within a SHSTK memory area.
> It is always an error otherwise.

How about: Shadow stack accesses to shadow stack mappings can see faults
in normal, valid operation just like regular accesses to regular
mappings.  Shadow stacks need some of the same features like delayed
allocation, swap and copy-on-write.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a
non-shadow-stack mapping.

> For a valid SHSTK access, set FAULT_FLAG_WRITE to effect copy-on-write.

It seems rather odd to want copy-on-write behavior for read faults.
Could you elaborate on why, please?

> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index 7ac26bbd0bef..8023d177fcd8 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -169,6 +169,7 @@ enum {
>   *   bit 3 ==				1: use of reserved bit detected
>   *   bit 4 ==				1: fault was an instruction fetch
>   *   bit 5 ==				1: protection keys block access
> + *   bit 6 ==				1: shadow stack access fault
>   */
>  enum x86_pf_error_code {
>  	X86_PF_PROT	=		1 << 0,
> @@ -177,5 +178,6 @@ enum x86_pf_error_code {
>  	X86_PF_RSVD	=		1 << 3,
>  	X86_PF_INSTR	=		1 << 4,
>  	X86_PF_PK	=		1 << 5,
> +	X86_PF_SHSTK	=		1 << 6,
>  };
>  #endif /* _ASM_X86_TRAPS_H */
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 304d31d8cbbc..9c1243302663 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1187,6 +1187,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>  				       (error_code & X86_PF_INSTR), foreign))
>  		return 1;
>  
> +	/*
> +	 * Verify X86_PF_SHSTK is within a Shadow Stack VMA.
> +	 * It is always an error if there is a Shadow Stack
> +	 * fault outside a Shadow Stack VMA.
> +	 */

Nit: there was an access that caused the fault.  We can be a bit more
broad in the implications from the comment if we say "access" instead of
"fault".

> +	if (error_code & X86_PF_SHSTK) {
> +		if (!(vma->vm_flags & VM_SHSTK))
> +			return 1;
> +		return 0;
> +	}
> +
>  	if (error_code & X86_PF_WRITE) {
>  		/* write, present and write, not present: */
>  		if (unlikely(!(vma->vm_flags & VM_WRITE)))

Is there an analogous check for !X86_PF_SHSTK faults to VM_SHSTK VMAs?
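
To make the question concrete, here is a small user-space model of the
check with one possible counterpart added.  It is a sketch, not the
patch's code: the function name and the VMA flag values are made up for
the example, and treating an ordinary write to a shadow stack VMA as an
error is an assumption about the desired policy (ordinary reads are left
alone here).  Only the error code bit positions come from
enum x86_pf_error_code:

```c
#include <stdint.h>

/* Page fault error code bits, as in enum x86_pf_error_code. */
#define PF_WRITE  (1u << 1)
#define PF_SHSTK  (1u << 6)

/* VMA flags for the model; the values are illustrative only. */
#define VMF_WRITE  (1u << 0)
#define VMF_SHSTK  (1u << 1)

/* Returns 1 if the access is an error, 0 if the fault can be handled. */
static int shstk_access_error(uint32_t error_code, uint32_t vm_flags)
{
	/* Shadow stack access: only valid inside a shadow stack VMA. */
	if (error_code & PF_SHSTK)
		return !(vm_flags & VMF_SHSTK);

	/* Assumed counterpart: ordinary write into a shadow stack VMA. */
	if ((error_code & PF_WRITE) && (vm_flags & VMF_SHSTK))
		return 1;

	/* Ordinary write into a VMA that is not writable. */
	if ((error_code & PF_WRITE) && !(vm_flags & VMF_WRITE))
		return 1;

	return 0;
}
```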

> @@ -1344,6 +1355,13 @@ void do_user_addr_fault(struct pt_regs *regs,
>  
>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  
> +	/*
> +	 * If the fault is caused by a Shadow Stack access,
> +	 * i.e. CALL/RET/SAVEPREVSSP/RSTORSSP, then set
> +	 * FAULT_FLAG_WRITE to effect copy-on-write.
> +	 */
> +	if (hw_error_code & X86_PF_SHSTK)
> +		flags |= FAULT_FLAG_WRITE;
>  	if (hw_error_code & X86_PF_WRITE)
>  		flags |= FAULT_FLAG_WRITE;
>  	if (hw_error_code & X86_PF_INSTR)

It would be great if you could also include the *why*.  *Why* do read
faults need copy-on-write semantics?


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault
  2020-02-05 18:19 ` [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault Yu-cheng Yu
  2020-02-25 20:20   ` Kees Cook
@ 2020-02-27  0:08   ` Dave Hansen
  2020-04-07 18:14     ` Yu-cheng Yu
  1 sibling, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-02-27  0:08 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

> diff --git a/mm/memory.c b/mm/memory.c
> index 45442d9a4f52..6daa28614327 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -772,7 +772,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	 * If it's a COW mapping, write protect it both
>  	 * in the parent and the child
>  	 */
> -	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> +	if ((is_cow_mapping(vm_flags) && pte_write(pte)) ||
> +	    arch_copy_pte_mapping(vm_flags)) {
>  		ptep_set_wrprotect(src_mm, addr, src_pte);
>  		pte = pte_wrprotect(pte);
>  	}

You have to modify this because pte_write()==0 for shadow stack PTEs, right?

Aren't shadow stack ptes *logically* writable, even if they don't have
the write bit set?  What would happen if we made pte_write()==1 for them?

> @@ -2417,6 +2418,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  	entry = pte_mkyoung(vmf->orig_pte);
>  	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	entry = pte_set_vma_features(entry, vma);
>  	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
>  		update_mmu_cache(vma, vmf->address, vmf->pte);
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -2504,6 +2506,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		entry = pte_set_vma_features(entry, vma);
>  		/*
>  		 * Clear the pte entry and flush it first, before updating the
>  		 * pte with the new entry. This will avoid a race condition
> @@ -3023,6 +3026,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	pte = mk_pte(page, vma->vm_page_prot);
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +		pte = pte_set_vma_features(pte, vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
>  		exclusive = RMAP_EXCLUSIVE;
> @@ -3165,6 +3169,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	entry = mk_pte(page, vma->vm_page_prot);
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry));
> +	entry = pte_set_vma_features(entry, vma);
>  
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>  			&vmf->ptl);
> 

These seem wrong, or at best inconsistent with what's already done.

We don't need anything like pte_set_vma_features() today because we have
vma->vm_page_prot.  We could easily have done what you suggest here for
things like protection keys: ignore the pkey PTE bits until we create
the final PTE then shove them in there.

What are the bit patterns of the shadow stack bits that come out of
these sites?  Can they be represented in ->vm_page_prot?


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 16/27] mm: Update can_follow_write_pte() for Shadow Stack
  2020-02-05 18:19 ` [RFC PATCH v9 16/27] mm: Update can_follow_write_pte() for Shadow Stack Yu-cheng Yu
@ 2020-02-27  0:34   ` Dave Hansen
  0 siblings, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-27  0:34 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

> +inline bool pte_exclusive(pte_t pte, struct vm_area_struct *vma)
> +{
> +	if (vma->vm_flags & VM_SHSTK)
> +		return pte_dirty_hw(pte);
> +	else
> +		return pte_dirty(pte);
> +}

I'm not really getting the naming.  What is exclusive?

> diff --git a/mm/gup.c b/mm/gup.c
> index 7646bf993b25..d1dbfbde8443 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -164,10 +164,12 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>   * FOLL_FORCE can write to even unwritable pte's, but only
>   * after we've gone through a COW cycle and they are dirty.
>   */
> -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> +static inline bool can_follow_write(pte_t pte, unsigned int flags,
> +				    struct vm_area_struct *vma)

Having two identically named functions in two files in the same
subsystem seems like a recipe for confusion when I grep or cscope for
things.  It hardly seems worth the 4 characters of space savings IMNHO.

>  {
>  	return pte_write(pte) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +		((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
> +		 pte_exclusive(pte, vma));
>  }

FWIW, this is the hunk that fixed DirtyCOW.

The least this deserves is acknowledgement of that in the changelog and
a missive about how you're sure you didn't just introduce
ShadowDirtyCOW.  Don't bother.  I already registered the domain. ;)


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 17/27] x86/cet/shstk: User-mode Shadow Stack support
  2020-02-05 18:19 ` [RFC PATCH v9 17/27] x86/cet/shstk: User-mode Shadow Stack support Yu-cheng Yu
  2020-02-25 21:07   ` Kees Cook
@ 2020-02-27  0:55   ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-27  0:55 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> This patch adds basic Shadow Stack (SHSTK) enabling/disabling routines.
> A task's SHSTK is allocated from memory with VM_SHSTK flag and read-only
> protection.  It has a fixed size of RLIMIT_STACK.
> 
> v9:
> - Change cpu_feature_enabled() to static_cpu_has().
> - Merge cet_disable_shstk to cet_disable_free_shstk.
> - Remove the empty slot at the top of the SHSTK, as it is not needed.
> - Move do_mmap_locked() to alloc_shstk(), which is a static function.
> 
> v6:
> - Create a function do_mmap_locked() for SHSTK allocation.
> 
> v2:
> - Change noshstk to no_cet_shstk.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/cet.h                    |  31 +++++
>  arch/x86/include/asm/disabled-features.h      |   8 +-
>  arch/x86/include/asm/processor.h              |   5 +
>  arch/x86/kernel/Makefile                      |   2 +
>  arch/x86/kernel/cet.c                         | 121 ++++++++++++++++++
>  arch/x86/kernel/cpu/common.c                  |  25 ++++
>  arch/x86/kernel/process.c                     |   1 +
>  .../arch/x86/include/asm/disabled-features.h  |   8 +-
>  8 files changed, 199 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/include/asm/cet.h
>  create mode 100644 arch/x86/kernel/cet.c
> 
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> new file mode 100644
> index 000000000000..c44c991ca91f
> --- /dev/null
> +++ b/arch/x86/include/asm/cet.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_CET_H
> +#define _ASM_X86_CET_H
> +
> +#ifndef __ASSEMBLY__
> +#include <linux/types.h>
> +
> +struct task_struct;
> +/*
> + * Per-thread CET status
> + */
> +struct cet_status {
> +	unsigned long	shstk_base;
> +	unsigned long	shstk_size;
> +	unsigned int	shstk_enabled:1;
> +};

Just out of curiosity, what's the theoretical size limit of shadow
stacks?  Is there one?

Also, not to nitpick too much, but you could pretty easily save the
storage of shstk_enabled by using 0 for the size.
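
A minimal sketch of that idea, assuming a size of 0 is never valid for an
allocated shadow stack.  The struct and helper names are hypothetical and
only exist to show the shape:

```c
#include <stdbool.h>

/* Per-thread status without a separate 'enabled' flag. */
struct cet_status_sketch {
	unsigned long shstk_base;
	unsigned long shstk_size;	/* 0 means: no shadow stack */
};

/* The enabled state is derived from the size instead of stored. */
static bool shstk_enabled(const struct cet_status_sketch *cet)
{
	return cet->shstk_size != 0;
}

/* Disabling clears both fields so the two can never disagree. */
static void shstk_mark_disabled(struct cet_status_sketch *cet)
{
	cet->shstk_base = 0;
	cet->shstk_size = 0;
}
```

This also removes the possibility of a set base with the feature marked
as disabled, since there is only one piece of state to check.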

> +#ifdef CONFIG_X86_INTEL_CET
> +int cet_setup_shstk(void);
> +void cet_disable_free_shstk(struct task_struct *p);
> +#else
> +static inline void cet_disable_free_shstk(struct task_struct *p) {}
> +#endif
> +
> +#define cpu_x86_cet_enabled() \
> +	(static_cpu_has(X86_FEATURE_SHSTK) || \
> +	 static_cpu_has(X86_FEATURE_IBT))
> +
> +#endif /* __ASSEMBLY__ */

You don't need the #ifdef if you stick the X86_FEATUREs in
disabled-features.h properly.

> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> new file mode 100644
> index 000000000000..b4c7d88e9a8f
> --- /dev/null
> +++ b/arch/x86/kernel/cet.c
> @@ -0,0 +1,121 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * cet.c - Control-flow Enforcement (CET)
> + *
> + * Copyright (c) 2019, Intel Corporation.
> + * Yu-cheng Yu <yu-cheng.yu@intel.com>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched/signal.h>
> +#include <linux/compat.h>
> +#include <asm/msr.h>
> +#include <asm/user.h>
> +#include <asm/fpu/internal.h>
> +#include <asm/fpu/xstate.h>
> +#include <asm/fpu/types.h>
> +#include <asm/cet.h>
> +
> +static void start_update_msrs(void)
> +{
> +	fpregs_lock();
> +	if (test_thread_flag(TIF_NEED_FPU_LOAD))
> +		__fpregs_load_activate();
> +}
> +
> +static void end_update_msrs(void)
> +{
> +	fpregs_unlock();
> +}
> +
> +static unsigned long cet_get_shstk_addr(void)
> +{
> +	struct fpu *fpu = &current->thread.fpu;
> +	unsigned long ssp = 0;
> +
> +	fpregs_lock();
> +
> +	if (fpregs_state_valid(fpu, smp_processor_id())) {
> +		rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +	} else {
> +		struct cet_user_state *p;
> +
> +		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
> +		if (p)
> +			ssp = p->user_ssp;
> +	}
> +
> +	fpregs_unlock();
> +	return ssp;
> +}
> +
> +static unsigned long alloc_shstk(unsigned long size)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long addr, populate;
> +
> +	down_write(&mm->mmap_sem);
> +	addr = do_mmap(NULL, 0, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
> +		       VM_SHSTK, 0, &populate, NULL);
> +	up_write(&mm->mmap_sem);
> +
> +	if (populate)
> +		mm_populate(addr, populate);
> +
> +	return addr;
> +}
> +
> +int cet_setup_shstk(void)
> +{
> +	unsigned long addr, size;
> +	struct cet_status *cet = &current->thread.cet;
> +
> +	if (!static_cpu_has(X86_FEATURE_SHSTK))
> +		return -EOPNOTSUPP;
> +
> +	size = rlimit(RLIMIT_STACK);

This doesn't seem right.  In general, I thought you could have disabled
rlimits, which would mean a size of -1:
	
	#define RLIM64_INFINITY         (~0ULL)

Or is there something special about stacks that I'm missing?

Also, does size need to be page aligned?
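
One way to handle both points, sketched in user-space C.  The 4 GiB cap is
a hypothetical choice for the example, as are the names and the 4 KiB page
size; the point is only that an unlimited rlimit has to be clamped to
something before it is used as an allocation size, and that the result is
rounded up to a page boundary:

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE  UINT64_C(4096)
#define SKETCH_SHSTK_CAP  (UINT64_C(1) << 32)	/* hypothetical 4 GiB cap */

/* Turn an RLIMIT_STACK value into a usable shadow stack size. */
static uint64_t shstk_size_from_rlimit(uint64_t rlim)
{
	uint64_t size = rlim;

	/* Also catches an unlimited rlimit, which reads as ~0ULL. */
	if (size > SKETCH_SHSTK_CAP)
		size = SKETCH_SHSTK_CAP;

	/* Round up to a page; cannot overflow since size <= cap. */
	return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}
```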

> +	addr = alloc_shstk(size);
> +
> +	if (IS_ERR((void *)addr))
> +		return PTR_ERR((void *)addr);
> +
> +	cet->shstk_base = addr;
> +	cet->shstk_size = size;
> +	cet->shstk_enabled = 1;
> +
> +	start_update_msrs();
> +	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
> +	wrmsrl(MSR_IA32_U_CET, MSR_IA32_CET_SHSTK_EN);

Doesn't MSR_IA32_U_CET have lots of bits?  Won't this blow away other bits?
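
A read-modify-write would avoid that.  The sketch below models the MSR as
a plain variable so it can run in user space; in the kernel the load and
store would be rdmsrl()/wrmsrl() between start_update_msrs() and
end_update_msrs().  The helper names are hypothetical, and the bit names
are shortened versions of the IA32_U_CET control bits:

```c
#include <stdint.h>

/* Low control bits of IA32_U_CET. */
#define CET_SHSTK_EN     (UINT64_C(1) << 0)
#define CET_WR_SHSTK_EN  (UINT64_C(1) << 1)
#define CET_ENDBR_EN     (UINT64_C(1) << 2)

/* Set 'bits' in the modeled MSR while preserving everything else. */
static void msr_set_bits(uint64_t *msr, uint64_t bits)
{
	uint64_t val = *msr;	/* stands in for rdmsrl() */

	val |= bits;
	*msr = val;		/* stands in for wrmsrl() */
}

/* Clear 'bits' in the modeled MSR while preserving everything else. */
static void msr_clear_bits(uint64_t *msr, uint64_t bits)
{
	uint64_t val = *msr;

	val &= ~bits;
	*msr = val;
}
```

The disable path later in this file already follows this pattern, so the
enable path would just mirror it.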

> +	end_update_msrs();
> +	return 0;
> +}
> +
> +void cet_disable_free_shstk(struct task_struct *tsk)
> +{
> +	struct cet_status *cet = &tsk->thread.cet;
> +
> +	if (!static_cpu_has(X86_FEATURE_SHSTK) ||
> +	    !cet->shstk_enabled || !cet->shstk_base)
> +		return;

This seems to indicate that you can have ->shstk_base set without it
being enabled.  But I don't see any spots in the code that do that.
Confused.

> +	if (!tsk->mm || (tsk->mm != current->mm))
> +		return;
> +
> +	if (tsk == current) {
> +		u64 msr_val;
> +
> +		start_update_msrs();
> +		rdmsrl(MSR_IA32_U_CET, msr_val);
> +		wrmsrl(MSR_IA32_U_CET, msr_val & ~MSR_IA32_CET_SHSTK_EN);
> +		end_update_msrs();
> +	}
> +
> +	vm_munmap(cet->shstk_base, cet->shstk_size);

What about vm_munmap() failure?


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-26 18:05   ` Dave Hansen
@ 2020-02-27  1:02     ` H.J. Lu
  2020-02-27  1:16       ` Dave Hansen
  2020-03-06 18:37     ` Yu-cheng Yu
  1 sibling, 1 reply; 107+ messages in thread
From: H.J. Lu @ 2020-02-27  1:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, linux-doc, Linux-MM,
	linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Wed, Feb 26, 2020 at 10:05 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> > +# Check assembler Shadow Stack suppot
>
>                                   ^ support
>
> > +ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> > +  ifeq ($(call as-instr, saveprevssp, y),)
> > +      $(error CONFIG_X86_INTEL_SHADOW_STACK_USER not supported by the assembler)
> > +  endif
> > +endif
>
> Is this *just* looking for instruction support in the assembler?
>
> We usually just .byte them, like this for pkeys:
>
>         asm volatile(".byte 0x0f,0x01,0xee\n\t"
>                      : "=a" (pkru), "=d" (edx)
>                      : "c" (ecx));
>
> That way everybody with old toolchains can still build the kernel (and
> run/test code with your config option on, btw...).

CET requires a complete new OS image from kernel, toolchain, run-time.
CET enabled kernel without the rest of updated OS won't give you CET
at all.

-- 
H.J.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-27  1:02     ` H.J. Lu
@ 2020-02-27  1:16       ` Dave Hansen
  2020-02-27  2:11         ` H.J. Lu
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-02-27  1:16 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Yu-cheng Yu, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, linux-doc, Linux-MM,
	linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/26/20 5:02 PM, H.J. Lu wrote:
>> That way everybody with old toolchains can still build the kernel (and
>> run/test code with your config option on, btw...).
> CET requires a complete new OS image from kernel, toolchain, run-time.
> CET enabled kernel without the rest of updated OS won't give you CET
> at all.

If you require a new toolchain, nobody even builds your fancy feature.
Probably including 0day and all of the lazy maintainers with crufty old
distros.

The point isn't to actually run CET at all.  The point is to get as many
people as possible testing as much of it as possible.  Testing includes
compile testing, static analysis and bloat watching.  It also includes
functional and performance testing when you've got the feature compiled
in but unavailable at runtime.  Did this hurt anything even when I'm not
using it?



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-27  1:16       ` Dave Hansen
@ 2020-02-27  2:11         ` H.J. Lu
  2020-02-27  3:57           ` Andy Lutomirski
  0 siblings, 1 reply; 107+ messages in thread
From: H.J. Lu @ 2020-02-27  2:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, linux-doc, Linux-MM,
	linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Wed, Feb 26, 2020 at 5:16 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/26/20 5:02 PM, H.J. Lu wrote:
> >> That way everybody with old toolchains can still build the kernel (and
> >> run/test code with your config option on, btw...).
> > CET requires a complete new OS image from kernel, toolchain, run-time.
> > CET enabled kernel without the rest of updated OS won't give you CET
> > at all.
>
> If you require a new toolchain, nobody even builds your fancy feature.
> Probably including 0day and all of the lazy maintainers with crufty old
> distros.

GCC 8 or above is needed since vDSO must be compiled with
-fcf-protection=branch.

> The point isn't to actually run CET at all.  The point is to get as many
> people as possible testing as much of it as possible.  Testing includes
> compile testing, static analysis and bloat watching.  It also includes
> functional and performance testing when you've got the feature compiled
> in but unavailable at runtime.  Did this hurt anything even when I'm not
> using it?
>

I will leave the CET toolchain issue to Yu-cheng.

-- 
H.J.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-27  2:11         ` H.J. Lu
@ 2020-02-27  3:57           ` Andy Lutomirski
  2020-02-27 18:03             ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Andy Lutomirski @ 2020-02-27  3:57 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Dave Hansen, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML, linux-doc,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review


> On Feb 26, 2020, at 6:11 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> 
> On Wed, Feb 26, 2020 at 5:16 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> 
>> On 2/26/20 5:02 PM, H.J. Lu wrote:
>>>> That way everybody with old toolchains can still build the kernel (and
>>>> run/test code with your config option on, btw...).
>>> CET requires a complete new OS image from kernel, toolchain, run-time.
>>> CET enabled kernel without the rest of updated OS won't give you CET
>>> at all.
>> 
>> If you require a new toolchain, nobody even builds your fancy feature.
>> Probably including 0day and all of the lazy maintainers with crufty old
>> distros.
> 
> GCC 8 or above is needed since vDSO must be compiled with
> -fcf-protection=branch.

Fair enough. I don’t particularly want to carry a gross hack to add the ENDBRANCHes without compiler support.





* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-27  3:57           ` Andy Lutomirski
@ 2020-02-27 18:03             ` Dave Hansen
  0 siblings, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-02-27 18:03 UTC (permalink / raw)
  To: Andy Lutomirski, H.J. Lu
  Cc: Yu-cheng Yu, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, linux-doc, Linux-MM,
	linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 2/26/20 7:57 PM, Andy Lutomirski wrote:
>> GCC 8 or above is needed since vDSO must be compiled with 
>> -fcf-protection=branch.
> Fair enough. I don’t particularly want to carry a gross hack to add
> the ENDBRANCHes without compiler support.

Yeah, that's not worth it.

But my main issue is the shadow stack instructions:

>> +ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
>> +  ifeq ($(call as-instr, saveprevssp, y),)
>> +      $(error CONFIG_X86_INTEL_SHADOW_STACK_USER not supported by the assembler)
>> +  endif
>> +endif

Which are unrelated to ENDBRANCH.

But, in any case, let's say Kconfig says we should try to use IBT, but
we get to building the vDSO and don't have the right toolchain.  Do we
just stop the build?  Or do we let the build go on and then decline to
let folks enable IBT at runtime?
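
The second option above (let the build continue, then refuse IBT at
runtime) could look roughly like the sketch below.  It is only an
illustration, not code from this series: it assumes GCC's __CET__
predefined macro, which is set when -fcf-protection is in effect with
bit 0 meaning branch protection, and the SK_BUILT_WITH_IBT and
sk_ibt_usable() names are hypothetical.

```c
#include <stdbool.h>

/*
 * GCC defines __CET__ when -fcf-protection is in effect; bit 0 is set
 * when indirect-branch tracking (ENDBR insertion) is enabled.
 * SK_BUILT_WITH_IBT is a hypothetical name used only in this sketch.
 */
#if defined(__CET__) && (__CET__ & 1)
#define SK_BUILT_WITH_IBT 1
#else
#define SK_BUILT_WITH_IBT 0
#endif

/*
 * Runtime gate: IBT is offered only when this object was compiled with
 * branch protection AND the CPU reports the feature.  With an old
 * toolchain the build still succeeds, but this always returns false.
 */
static bool sk_ibt_usable(bool cpu_has_ibt)
{
	return SK_BUILT_WITH_IBT && cpu_has_ibt;
}
```

With this shape, an old compiler costs the feature but not the build,
which is the trade-off being asked about.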



* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-02-25 20:02   ` Kees Cook
@ 2020-02-28 15:55     ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-02-28 15:55 UTC (permalink / raw)
  To: Kees Cook
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Tue, 2020-02-25 at 12:02 -0800, Kees Cook wrote:
> On Wed, Feb 05, 2020 at 10:19:09AM -0800, Yu-cheng Yu wrote:
> > Explain no_cet_shstk/no_cet_ibt kernel parameters, and introduce a new
> > document on Control-flow Enforcement Technology (CET).
> > 
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> I'm not a huge fan of the boot param names, but I can't suggest anything
> better. ;) I love the extensive docs!
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>

Thanks for reviewing!

Yu-cheng




* Re: [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault
  2020-02-25 20:20   ` Kees Cook
@ 2020-03-05 18:30     ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-05 18:30 UTC (permalink / raw)
  To: Kees Cook
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Tue, 2020-02-25 at 12:20 -0800, Kees Cook wrote:
> On Wed, Feb 05, 2020 at 10:19:22AM -0800, Yu-cheng Yu wrote:
> > When a task does fork(), its Shadow Stack (SHSTK) must be duplicated for
> > the child.  This patch implements a flow similar to copy-on-write of an
> > anonymous page, but for SHSTK.
> > 
> > A SHSTK PTE must be RO and Dirty.  This Dirty bit requirement is used to
> > effect the copying.  In copy_one_pte(), clear the Dirty bit from a SHSTK
> > PTE to cause a page fault upon the next SHSTK access.  At that time, fix
> > the PTE and copy/re-use the page.
> 
> Just to confirm, during the fork, it's really not a SHSTK for a moment
> (it's still RO, but not dirty). Can other racing threads muck this up,
> or is this bit removed only on the copied side?

In [RFC PATCH v9 12/27] x86/mm: Modify ptep_set_wrprotect and
pmdp_set_wrprotect for _PAGE_DIRTY_SW, _PAGE_DIRTY_HW is changed to
_PAGE_DIRTY_SW with cmpxchg.  That prevents racing.

The hw dirty bit is removed from the original copy first.  The next shadow
stack access to the page causes copying.  The copied page gets the hw dirty
bit again.

Yu-cheng
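
A user-space sketch of the Dirty-bit handoff described above, using C11
atomics in place of the kernel's cmpxchg.  This is a simplified model
and not the code from patch 12/27: bit 1 (RW) and bit 6 (Dirty) follow
the x86 PTE layout, but the position chosen for the software Dirty bit
and all SK_*/sk_* names are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Bit 1 (writable) and bit 6 (hardware Dirty) match the x86 PTE layout. */
#define SK_PTE_RW        (UINT64_C(1) << 1)
#define SK_PTE_DIRTY_HW  (UINT64_C(1) << 6)
/* Assumption: the position of the software Dirty bit is illustrative only. */
#define SK_PTE_DIRTY_SW  (UINT64_C(1) << 58)

/*
 * Write-protect a PTE for fork(): clear RW and, if the hardware Dirty
 * bit is set, move it to the software Dirty bit in the same atomic
 * update.  Returns the value that was installed.
 */
static uint64_t sk_pte_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t val;

	do {
		val = old & ~SK_PTE_RW;
		if (val & SK_PTE_DIRTY_HW)
			val = (val & ~SK_PTE_DIRTY_HW) | SK_PTE_DIRTY_SW;
		/* On failure, 'old' is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(ptep, &old, val));

	return val;
}

/* A shadow stack PTE is read-only with the hardware Dirty bit set. */
static int sk_pte_is_shstk(uint64_t pte)
{
	return !(pte & SK_PTE_RW) && (pte & SK_PTE_DIRTY_HW);
}
```

Because the old value is compared before the new one is installed, a
concurrent change to the entry forces a retry rather than being
overwritten, which is presumably the race the cmpxchg is meant to close.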




* Re: [RFC PATCH v9 18/27] x86/cet/shstk: Introduce WRUSS instruction
  2020-02-25 21:10   ` Kees Cook
@ 2020-03-05 18:39     ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-05 18:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Tue, 2020-02-25 at 13:10 -0800, Kees Cook wrote:
> On Wed, Feb 05, 2020 at 10:19:26AM -0800, Yu-cheng Yu wrote:
> > WRUSS is a new kernel-mode instruction but writes directly to user Shadow
> > Stack (SHSTK) memory.  This is used to construct a return address on SHSTK
> > for the signal handler.
> > 
> > This instruction can fault if the user SHSTK is not valid SHSTK memory.
> > In that case, the kernel does a fixup.
> 
> Since these functions aren't used in this patch, should this get merged
> with patch 19?

Yes, I can do that.

Yu-cheng




* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-26 17:03   ` Dave Hansen
  2020-02-26 19:57     ` Pavel Machek
@ 2020-03-05 20:38     ` Yu-cheng Yu
  1 sibling, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-05 20:38 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Wed, 2020-02-26 at 09:03 -0800, Dave Hansen wrote:
> On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> > Introduce Kconfig option: X86_INTEL_SHADOW_STACK_USER.
> > 
> > Shadow Stack (SHSTK) provides protection against function return address
> > corruption.  It is active when the kernel has this feature enabled, and
> > both the processor and the application support it.  When this feature is
> > enabled, legacy non-SHSTK applications continue to work, but without SHSTK
> > protection.
> > 
> > The user-mode SHSTK protection is only implemented for the 64-bit kernel.
> > IA32 applications are supported under the compatibility mode.
> 
> I think what you're trying to say here is that the hardware supports
> shadow stacks with 32-bit kernels.  However, this series does not
> include that support and we have no plans to add it.
> 
> Right?

Yes.

> 
> I'll let others weigh in, but I rather dislike the use of acronyms here.
>  I'd much rather see the english "shadow stack" everywhere than SHSTK.

I will change to shadow stack.

> 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 5e8949953660..6c34b701c588 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1974,6 +1974,28 @@ config X86_INTEL_TSX_MODE_AUTO
> >  	  side channel attacks- equals the tsx=auto command line parameter.
> >  endchoice
> >  
> > +config X86_INTEL_CET
> > +	def_bool n
> > +
> > +config ARCH_HAS_SHSTK
> > +	def_bool n
> > +
> > +config X86_INTEL_SHADOW_STACK_USER
> > +	prompt "Intel Shadow Stack for user-mode"
> 
> Nit: this whole thing is to support more than a single stack.  I'd make
> this plural at least in the text: "shadow stacks".

OK.

> 
> > +	def_bool n
> > +	depends on CPU_SUP_INTEL && X86_64
> > +	select ARCH_USES_HIGH_VMA_FLAGS
> > +	select X86_INTEL_CET
> > +	select ARCH_HAS_SHSTK
> > +	---help---
> > +	  Shadow Stack (SHSTK) provides protection against program
> > +	  stack corruption.  It is active when the kernel has this
> > +	  feature enabled, and the processor and the application
> > +	  support it.  When this feature is enabled, legacy non-SHSTK
> > +	  applications continue to work, but without SHSTK protection.
> > +
> > +	  If unsure, say y.
> 
> This is missing a *lot* of information.
> 
> What matters to someone turning this on?
> 
> 1. It's a hardware feature.  This only matters if you have the right
>    hardware
> 2. It's a security hardening feature.  You dance around this, but need
>    to come out and say it.
> 3. Apps must be enabled to use it.  You get no protection "for free" on
>    old userspace.
> 4. The hardware supports user and kernel, but this option is for
>    userspace only.

I will update the help text.

Yu-cheng




* Re: [RFC PATCH v9 04/27] x86/cet: Add control-protection fault handler
  2020-02-26 17:10   ` Dave Hansen
@ 2020-03-05 20:44     ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-05 20:44 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Wed, 2020-02-26 at 09:10 -0800, Dave Hansen wrote:
> On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> > diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> > index 87ef69a72c52..8ed406f469e7 100644
> > --- a/arch/x86/kernel/idt.c
> > +++ b/arch/x86/kernel/idt.c
> > @@ -102,6 +102,10 @@ static const __initconst struct idt_data def_idts[] = {
> >  #elif defined(CONFIG_X86_32)
> >  	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
> >  #endif
> > +
> > +#ifdef CONFIG_X86_64
> > +	INTG(X86_TRAP_CP,		control_protection),
> > +#endif
> >  };
> 
> This patch in particular appears to have all of its code unconditionally
> compiled in.  That's in contrast to things that have Kconfig options, like:
> 
> #ifdef CONFIG_X86_MCE
>         INTG(X86_TRAP_MC,               &machine_check),
> #endif
> 
> or:
> 
> #ifdef CONFIG_X86_THERMAL_VECTOR
>         INTG(THERMAL_APIC_VECTOR,       thermal_interrupt),
> #endif
> 
> Is there a reason this code is always compiled in on 64-bit even when
> the config option is off?

I will change it to CONFIG_X86_INTEL_CET.

Yu-cheng




* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-02-26 18:05   ` Dave Hansen
  2020-02-27  1:02     ` H.J. Lu
@ 2020-03-06 18:37     ` Yu-cheng Yu
  2020-03-06 19:02       ` Dave Hansen
  1 sibling, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-06 18:37 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Wed, 2020-02-26 at 10:05 -0800, Dave Hansen wrote:
> On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> > +# Check assembler Shadow Stack suppot
> 
> 				  ^ support
> 
> > +ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> > +  ifeq ($(call as-instr, saveprevssp, y),)
> > +      $(error CONFIG_X86_INTEL_SHADOW_STACK_USER not supported by the assembler)
> > +  endif
> > +endif
> 
> Is this *just* looking for instruction support in the assembler?
> 
> We usually just .byte them, like this for pkeys:
> 
>         asm volatile(".byte 0x0f,0x01,0xee\n\t"
>                      : "=a" (pkru), "=d" (edx)
>                      : "c" (ecx));
> 
> That way everybody with old toolchains can still build the kernel (and
> run/test code with your config option on, btw...).

We used to do this for CET instructions, but after adding kernel-mode
instructions and inserting ENDBR's, the code becomes cluttered.  I also
found an earlier discussion on the ENDBR:

https://lore.kernel.org/lkml/CALCETrVRH8LeYoo7V1VBPqg4WS0Enxtizt=T7dPvgoeWfJrdzA@mail.gmail.com/

It makes sense to let the user know early on that the system cannot support
CET and cannot build a CET-enabled kernel.

One thing we can do is to disable CET in Kconfig and not in kernel
build, which I will do in the next version.

Yu-cheng
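
For reference, the .byte approach discussed above might look like the
following for one CET instruction.  This is a sketch, not code from the
series: sk_read_ssp() is a hypothetical helper, and the bytes
f3 48 0f 1e c8 are the encoding of "rdsspq %rax", which falls in the
hint-NOP space and so leaves the register untouched when shadow stack
is not active.

```c
#include <stdint.h>

/*
 * Read the shadow stack pointer using raw opcode bytes, so that an
 * assembler without CET mnemonics can still build this file.
 * f3 48 0f 1e c8 encodes "rdsspq %rax".  On CPUs without CET, or when
 * shadow stack is not enabled for the task, the instruction executes
 * as a NOP and the zero loaded beforehand is returned unchanged.
 */
static inline uint64_t sk_read_ssp(void)
{
#if defined(__x86_64__)
	uint64_t ssp = 0;

	__asm__ __volatile__(".byte 0xf3,0x48,0x0f,0x1e,0xc8"
			     : "+a" (ssp));
	return ssp;
#else
	return 0;	/* not an x86-64 build: nothing to read */
#endif
}
```

A zero return value therefore means "no shadow stack for this task".
The cost of this style is the one raised in the reply: every such
instruction needs its own hand-encoded helper.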




* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-03-06 18:37     ` Yu-cheng Yu
@ 2020-03-06 19:02       ` Dave Hansen
  2020-03-06 21:16         ` Yu-cheng Yu
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-03-06 19:02 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 3/6/20 10:37 AM, Yu-cheng Yu wrote:
> We used to do this for CET instructions, but after adding kernel-mode
> instructions and inserting ENDBR's, the code becomes cluttered.  I also
> found an earlier discussion on the ENDBR:
> 
> https://lore.kernel.org/lkml/CALCETrVRH8LeYoo7V1VBPqg4WS0Enxtizt=T7dPvgoeWfJrdzA@mail.gmail.com/
> 
> It makes sense to let the user know early on that the system cannot support
> CET and cannot build a CET-enabled kernel.
> 
> One thing we can do is to disable CET in Kconfig and not in kernel
> build, which I will do in the next version.

I'll go on the record and say I think we should allow building
CET-enabled kernels on old toolchains.  We need it for build test
coverage.  We can spit out a warning, but we need to allow building it.

Andy L, do you have any heartburn with that?



* Re: [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection
  2020-03-06 19:02       ` Dave Hansen
@ 2020-03-06 21:16         ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-06 21:16 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Fri, 2020-03-06 at 11:02 -0800, Dave Hansen wrote:
> On 3/6/20 10:37 AM, Yu-cheng Yu wrote:
> > We used to do this for CET instructions, but after adding kernel-mode
> > instructions and inserting ENDBR's, the code becomes cluttered.  I also
> > found an earlier discussion on the ENDBR:
> > 
> > https://lore.kernel.org/lkml/CALCETrVRH8LeYoo7V1VBPqg4WS0Enxtizt=T7dPvgoeWfJrdzA@mail.gmail.com/
> > 
> > It makes sense to let the user know early on that the system cannot support
> > CET and cannot build a CET-enabled kernel.
> > 
> > One thing we can do is to disable CET in Kconfig and not in kernel
> > build, which I will do in the next version.
> 
> I'll go on the record and say I think we should allow building
> CET-enabled kernels on old toolchains.  We need it for build test
> coverage.  We can spit out a warning, but we need to allow building it.

The build test will go through (assembler or .byte), once the opcode patch
is applied [1].  Also, when we enable kernel-mode CET, it is difficult to
build IBT code without the right tool chain.

Yu-cheng

[1] opcode patch: 
https://lore.kernel.org/lkml/20200204171425.28073-1-yu-cheng.yu@intel.com/





* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-02-26 17:57   ` Dave Hansen
@ 2020-03-09 17:00     ` Yu-cheng Yu
  2020-03-09 17:21       ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-09 17:00 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
> > index ade4e6ec23e0..8b69ebf0baed 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3001,6 +3001,12 @@
> >  			noexec=on: enable non-executable mappings (default)
> >  			noexec=off: disable non-executable mappings
> >  
> > +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
> > +			applications
> 
> If we ever add kernel support, "no_cet_shstk" will mean "no cet shstk
> for userspace"?

What about no_user_shstk, no_kernel_shstk?

> 
> > +	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
> > +			applications
> > +
> >  	nosmap		[X86,PPC]
> >  			Disable SMAP (Supervisor Mode Access Prevention)
> >  			even if it is supported by processor.
> 
> BTW, this documentation is misplaced.  It needs to go to the spot where
> you introduce the code for these options.

We used to introduce the document later in the series.  The feedback was to
introduce it first so that readers know what to expect.

[...]

> > diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> > new file mode 100644
> > index 000000000000..71e2462fea5c
> > --- /dev/null
> > +++ b/Documentation/x86/intel_cet.rst
> > @@ -0,0 +1,294 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=========================================
> > +Control-flow Enforcement Technology (CET)
> > +=========================================
> > +
> > +[1] Overview
> > +============
> > +
> > +Control-flow Enforcement Technology (CET) provides protection against
> > +return/jump-oriented programming (ROP) attacks.  It can be setup to

[...]

> > +
> > +There are two kernel configuration options:
> > +
> > +    X86_INTEL_SHADOW_STACK_USER, and
> > +    X86_INTEL_BRANCH_TRACKING_USER.
> > +
> > +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
> > +are required.
> 
> Why are these needed to build a CET-enabled kernel?

We could (and used to) allow legacy toolchains, but after considering
practical purposes, dropped the support.  We can continue the discussion,
and if those are desired, bring them back.

[...]

> > +[2] CET assembly instructions
> > +=============================
> 
> Why do we need this in the kernel?  What is specific to Linux or the
> kernel?  Why wouldn't I just go read the SDM if I want to know how the
> instructions work?

Now the SDM has this.  I will drop this section.

> > +[3] Application Enabling
> > +========================
> > +
> > +An application's CET capability is marked in its ELF header and can
> > +be verified from the following command output, in the
> > +NT_GNU_PROPERTY_TYPE_0 field:
> > +
> > +    readelf -n <application>
> > +
> > +If an application supports CET and is statically linked, it will run
> > +with CET protection.  If the application needs any shared libraries,
> > +the loader checks all dependencies and enables CET only when all
> > +requirements are met.
> 
> What about shared libraries loaded after the program starts?

The loader does the check for dlopen().
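
What readelf -n reports here is the GNU_PROPERTY_X86_FEATURE_1_AND
entry inside the NT_GNU_PROPERTY_TYPE_0 note.  The sketch below walks
such a note descriptor and returns the feature bits.  It is not the
ELF parser used by this series: the constants follow the x86-64 psABI
but carry a hypothetical SK_ prefix, the function name is made up, and
the buffer is assumed to come from a 64-bit ELF file in host byte
order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values from the x86-64 psABI; named SK_* to avoid clashing with <elf.h>. */
#define SK_GNU_PROPERTY_X86_FEATURE_1_AND  0xc0000002u
#define SK_X86_FEATURE_1_IBT    0x1u
#define SK_X86_FEATURE_1_SHSTK  0x2u

/*
 * Walk the descriptor of an NT_GNU_PROPERTY_TYPE_0 note and return the
 * X86_FEATURE_1_AND bits, or 0 if the property is absent.
 * Each entry is: u32 type, u32 datasz, data[datasz], padded to 8 bytes.
 */
static uint32_t sk_x86_feature_1(const uint8_t *desc, size_t len)
{
	size_t off = 0;

	while (len - off >= 8) {
		uint32_t type, datasz, bits;
		size_t padded;

		memcpy(&type, desc + off, 4);
		memcpy(&datasz, desc + off + 4, 4);
		off += 8;
		if (datasz > len - off)
			break;		/* truncated entry */
		if (type == SK_GNU_PROPERTY_X86_FEATURE_1_AND && datasz == 4) {
			memcpy(&bits, desc + off, 4);
			return bits;
		}
		padded = ((size_t)datasz + 7) & ~(size_t)7;
		if (padded > len - off)
			break;		/* padding runs past the buffer */
		off += padded;
	}
	return 0;
}
```

A result with SK_X86_FEATURE_1_SHSTK set corresponds to readelf
listing SHSTK among the x86 features; the same test with
SK_X86_FEATURE_1_IBT applies to IBT.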


> > +[4] Legacy Libraries
> > +====================
> > +
> > +GLIBC provides a few tunables for backward compatibility.
> > +
> > +GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
> > +    Turn off SHSTK/IBT for the current shell.
> > +
> > +GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
> > +    This controls how dlopen() handles SHSTK legacy libraries::
> > +
> > +        on         - continue with SHSTK enabled;
> > +        permissive - continue with SHSTK off.
> 
> This seems like manpage fodder more than kernel documentation to me.

Yes, we can drop this as well.

[...]

> > +Note:
> > +  There is no CET-enabling arch_prctl function.  By design, CET is
> > +  enabled automatically if the binary and the system can support it.
> 
> This is kinda interesting.  It means that a JIT couldn't choose to
> protect the code it generates and have different rules from itself?

JIT needs to be updated for CET first.  Once that is done, it runs with CET
enabled.  It can use the NOTRACK prefix, for example.

> > +  The parameters passed are always unsigned 64-bit.  When an IA32
> > +  application passing pointers, it should only use the lower 32 bits.
> 
> Won't a 32-bit app calling prctl() use the 32-bit ABI?  How would it
> even know it's running on a 64-bit kernel?

The 32-bit app is passing only a pointer to an array of 64-bit numbers.

> 
> > +[6] The implementation of the SHSTK
> > +===================================
> > +
> > +SHSTK size
> > +----------
> > +
> > +A task's SHSTK is allocated from memory to a fixed size of
> > +RLIMIT_STACK.
> 
> I can't really parse that sentence.  Is this saying that shadow stacks
> are limited by and share space with normal stacks via RLIMIT_STACK?
> 
> >  A compat-mode thread's SHSTK size is 1/4 of
> > +RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
> > +share a 32-bit address space.
> 
> I thought the size was passed in from userspace?  Where does this sizing
> take place?  Is this a convention or is it being enforced?

I will make this (and other things you pointed out) clear in the next
version.

Yu-cheng




* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 17:00     ` Yu-cheng Yu
@ 2020-03-09 17:21       ` Dave Hansen
  2020-03-09 19:27         ` Yu-cheng Yu
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-03-09 17:21 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
>>> index ade4e6ec23e0..8b69ebf0baed 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -3001,6 +3001,12 @@
>>>  			noexec=on: enable non-executable mappings (default)
>>>  			noexec=off: disable non-executable mappings
>>>  
>>> +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
>>> +			applications
>>
>> If we ever add kernel support, "no_cet_shstk" will mean "no cet shstk
>> for userspace"?
> 
> What about no_user_shstk, no_kernel_shstk?

Those are better.

>>> +	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
>>> +			applications
>>> +
>>>  	nosmap		[X86,PPC]
>>>  			Disable SMAP (Supervisor Mode Access Prevention)
>>>  			even if it is supported by processor.
>>
>> BTW, this documentation is misplaced.  It needs to go to the spot where
>> you introduce the code for these options.
> 
> We used to introduce the document later in the series.  The feedback was to
> introduce it first so that readers know what to expect.

To me, that doesn't apply for things that are implemented in this
specific of a spot in the code and *ALSO* might not even make the final
series.


>>> +Note:
>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
>>> +  enabled automatically if the binary and the system can support it.
>>
>> This is kinda interesting.  It means that a JIT couldn't choose to
>> protect the code it generates and have different rules from itself?
> 
> JIT needs to be updated for CET first.  Once that is done, it runs with CET
> enabled.  It can use the NOTRACK prefix, for example.

Am I missing something?

What's the direct connection between shadow stacks and Indirect Branch
Tracking other than Intel marketing umbrellas?

>>> +  The parameters passed are always unsigned 64-bit.  When an IA32
>>> +  application passing pointers, it should only use the lower 32 bits.
>>
>> Won't a 32-bit app calling prctl() use the 32-bit ABI?  How would it
>> even know it's running on a 64-bit kernel?
> 
> The 32-bit app is passing only a pointer to an array of 64-bit numbers.

Well, the documentation just talked about pointers and I naively assume
it means the "unsigned long *" you had in there.

Rather than make suggestions, just say that the ABI is universally
64-bit.  Saying that the pointers must be valid is just kinda silly.
It's also not 100% clear what an "IA32 application" *MEANS* given fun
things like x32.

Also, I went to go find this implementation in your series.  I couldn't
find it.  Did I miss a patch?  Or are you documenting things you didn't
even implement?



* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 17:21       ` Dave Hansen
@ 2020-03-09 19:27         ` Yu-cheng Yu
  2020-03-09 19:35           ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-09 19:27 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
> > On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
> > > > index ade4e6ec23e0..8b69ebf0baed 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -3001,6 +3001,12 @@
> > > >  			noexec=on: enable non-executable mappings (default)
> > > >  			noexec=off: disable non-executable mappings
> > > >  
> > > > +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
> > > > +			applications
> > > 
> > > If we ever add kernel support, "no_cet_shstk" will mean "no cet shstk
> > > for userspace"?
> > 
> > What about no_user_shstk, no_kernel_shstk?

[...]

> > > > +Note:
> > > > +  There is no CET-enabling arch_prctl function.  By design, CET is
> > > > +  enabled automatically if the binary and the system can support it.
> > > 
> > > This is kinda interesting.  It means that a JIT couldn't choose to
> > > protect the code it generates and have different rules from itself?
> > 
> > JIT needs to be updated for CET first.  Once that is done, it runs with CET
> > enabled.  It can use the NOTRACK prefix, for example.
> 
> Am I missing something?
> 
> What's the direct connection between shadow stacks and Indirect Branch
> Tracking other than Intel marketing umbrellas?

What I meant is that JIT code needs to be updated first; if it skips RETs,
it needs to unwind the stack, and if it does indirect JMPs somewhere it
needs to fix up the branch target or use NOTRACK.
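
The "skips RETs" case can be pictured with a toy model of a shadow
stack.  This is plain C that only mimics the hardware behaviour, with
hypothetical sk_* names; on a real CPU the push and compare happen
inside CALL and RET, and the unwind step corresponds to INCSSP.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_DEPTH 16

struct sk_shstk {
	uint64_t slot[SK_DEPTH];
	size_t top;		/* number of saved return addresses */
};

/* CALL: record the return address on the shadow stack. */
static int sk_call(struct sk_shstk *s, uint64_t ret_addr)
{
	if (s->top == SK_DEPTH)
		return -1;
	s->slot[s->top++] = ret_addr;
	return 0;
}

/* RET: succeeds only if the address matches the shadow stack top. */
static int sk_ret(struct sk_shstk *s, uint64_t ret_addr)
{
	if (s->top == 0 || s->slot[s->top - 1] != ret_addr)
		return -1;	/* would raise a control-protection fault */
	s->top--;
	return 0;
}

/* Unwind: discard n entries, as INCSSP does on real hardware. */
static int sk_unwind(struct sk_shstk *s, size_t n)
{
	if (n > s->top)
		return -1;
	s->top -= n;
	return 0;
}
```

Code that returns past an intermediate frame without first calling
something like sk_unwind() fails the sk_ret() check, which is why
generated code that skips RETs has to adjust the shadow stack as well.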

> > > > +  The parameters passed are always unsigned 64-bit.  When an IA32
> > > > +  application passing pointers, it should only use the lower 32 bits.
> > > 
> > > Won't a 32-bit app calling prctl() use the 32-bit ABI?  How would it
> > > even know it's running on a 64-bit kernel?
> > 
> > The 32-bit app is passing only a pointer to an array of 64-bit numbers.
> 
> Well, the documentation just talked about pointers and I naively assume
> it means the "unsigned long *" you had in there.
> 
> Rather than make suggestions, just say that the ABI is universally
> 64-bit.  Saying that the pointers must be valid is just kinda silly.
> It's also not 100% clear what an "IA32 application" *MEANS* given fun
> things like x32.

Ok, I will update the text.

> 
> Also, I went to go find this implementation in your series.  I couldn't
> find it.  Did I miss a patch?  Or are you documenting things you didn't
> even implement?

In patch #27: Add arch_prctl functions for Shadow Stack.

Yu-cheng




* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 19:27         ` Yu-cheng Yu
@ 2020-03-09 19:35           ` Dave Hansen
  2020-03-09 19:50             ` H.J. Lu
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-03-09 19:35 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 3/9/20 12:27 PM, Yu-cheng Yu wrote:
> On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
>> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
>>> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
>>>>> +Note:
>>>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
>>>>> +  enabled automatically if the binary and the system can support it.
>>>>
>>>> This is kinda interesting.  It means that a JIT couldn't choose to
>>>> protect the code it generates and have different rules from itself?
>>>
>>> JIT needs to be updated for CET first.  Once that is done, it runs with CET
>>> enabled.  It can use the NOTRACK prefix, for example.
>>
>> Am I missing something?
>>
>> What's the direct connection between shadow stacks and Indirect Branch
>> Tracking other than Intel marketing umbrellas?
> 
> What I meant is that JIT code needs to be updated first; if it skips RETs,
> it needs to unwind the stack, and if it does indirect JMPs somewhere it
> needs to fix up the branch target or use NOTRACK.

I'm totally lost.  I think we have very different models of how a JIT
might generate and run code.

I can totally see a scenario where a JIT goes and generates a bunch of
code, then forks a new thread to go run that code.  The control flow of
the JIT thread itself *NEVER* interacts with the control flow of the
program it writes.  They never share a stack and nothing ever jumps or
rets between the two worlds.

Does anything actually do that?  I've got no idea.  But, I can clearly
see a world where the entirety of Chrome and Firefox and the entire rust
runtime might not be fully recompiled and CET-enabled for a while.  But,
we still want the JIT-generated code to be CET-protected since it has
the most exposed attack surface.

I don't think that's too far-fetched.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 19:35           ` Dave Hansen
@ 2020-03-09 19:50             ` H.J. Lu
  2020-03-09 20:16               ` Andy Lutomirski
  0 siblings, 1 reply; 107+ messages in thread
From: H.J. Lu @ 2020-03-09 19:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, linux-doc, Linux-MM,
	linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Mon, Mar 9, 2020 at 12:35 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/9/20 12:27 PM, Yu-cheng Yu wrote:
> > On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
> >> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
> >>> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
> >>>>> +Note:
> >>>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
> >>>>> +  enabled automatically if the binary and the system can support it.
> >>>>
> >>>> This is kinda interesting.  It means that a JIT couldn't choose to
> >>>> protect the code it generates and have different rules from itself?
> >>>
> >>> JIT needs to be updated for CET first.  Once that is done, it runs with CET
> >>> enabled.  It can use the NOTRACK prefix, for example.
> >>
> >> Am I missing something?
> >>
> >> What's the direct connection between shadow stacks and Indirect Branch
> >> Tracking other than Intel marketing umbrellas?
> >
> > What I meant is that JIT code needs to be updated first; if it skips RETs,
> > it needs to unwind the stack, and if it does indirect JMPs somewhere it
> > needs to fix up the branch target or use NOTRACK.
>
> I'm totally lost.  I think we have very different models of how a JIT
> might generate and run code.
>
> I can totally see a scenario where a JIT goes and generates a bunch of
> code, then forks a new thread to go run that code.  The control flow of
> the JIT thread itself *NEVER* interacts with the control flow of the
> program it writes.  They never share a stack and nothing ever jumps or
> rets between the two worlds.
>
> Does anything actually do that?  I've got no idea.  But, I can clearly
> see a world where the entirety of Chrome and Firefox and the entire rust
> runtime might not be fully recompiled and CET-enabled for a while.  But,
> we still want the JIT-generated code to be CET-protected since it has
> the most exposed attack surface.
>
> I don't think that's too far-fetched.

CET support is all or nothing.  You can mix and match, but then you will get
no CET protection, similar to the NX feature.

-- 
H.J.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 19:50             ` H.J. Lu
@ 2020-03-09 20:16               ` Andy Lutomirski
  2020-03-09 20:54                 ` H.J. Lu
  0 siblings, 1 reply; 107+ messages in thread
From: Andy Lutomirski @ 2020-03-09 20:16 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Dave Hansen, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML, linux-doc,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review



> On Mar 9, 2020, at 12:50 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> 
> On Mon, Mar 9, 2020 at 12:35 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> 
>>> On 3/9/20 12:27 PM, Yu-cheng Yu wrote:
>>> On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
>>>> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
>>>>> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
>>>>>>> +Note:
>>>>>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
>>>>>>> +  enabled automatically if the binary and the system can support it.
>>>>>> 
>>>>>> This is kinda interesting.  It means that a JIT couldn't choose to
>>>>>> protect the code it generates and have different rules from itself?
>>>>> 
>>>>> JIT needs to be updated for CET first.  Once that is done, it runs with CET
>>>>> enabled.  It can use the NOTRACK prefix, for example.
>>>> 
>>>> Am I missing something?
>>>> 
>>>> What's the direct connection between shadow stacks and Indirect Branch
>>>> Tracking other than Intel marketing umbrellas?
>>> 
>>> What I meant is that JIT code needs to be updated first; if it skips RETs,
>>> it needs to unwind the stack, and if it does indirect JMPs somewhere it
>>> needs to fix up the branch target or use NOTRACK.
>> 
>> I'm totally lost.  I think we have very different models of how a JIT
>> might generate and run code.
>> 
>> I can totally see a scenario where a JIT goes and generates a bunch of
>> code, then forks a new thread to go run that code.  The control flow of
>> the JIT thread itself *NEVER* interacts with the control flow of the
>> program it writes.  They never share a stack and nothing ever jumps or
>> rets between the two worlds.
>> 
>> Does anything actually do that?  I've got no idea.  But, I can clearly
>> see a world where the entirety of Chrome and Firefox and the entire rust
>> runtime might not be fully recompiled and CET-enabled for a while.  But,
>> we still want the JIT-generated code to be CET-protected since it has
>> the most exposed attack surface.
>> 
>> I don't think that's too far-fetched.
> 
> CET support is all or nothing.   You can mix and match, but you will get
> no CET protection, similar to NX feature.
> 

Can you explain?

If a program with the magic ELF CET flags missing can’t make a thread with IBT and/or SHSTK enabled, then I think we’ve made an error and should fix it.

> -- 
> H.J.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 20:16               ` Andy Lutomirski
@ 2020-03-09 20:54                 ` H.J. Lu
  2020-03-09 20:59                   ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: H.J. Lu @ 2020-03-09 20:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML, linux-doc,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Mon, Mar 9, 2020 at 1:16 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Mar 9, 2020, at 12:50 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Mar 9, 2020 at 12:35 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >>> On 3/9/20 12:27 PM, Yu-cheng Yu wrote:
> >>> On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
> >>>> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
> >>>>> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
> >>>>>>> +Note:
> >>>>>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
> >>>>>>> +  enabled automatically if the binary and the system can support it.
> >>>>>>
> >>>>>> This is kinda interesting.  It means that a JIT couldn't choose to
> >>>>>> protect the code it generates and have different rules from itself?
> >>>>>
> >>>>> JIT needs to be updated for CET first.  Once that is done, it runs with CET
> >>>>> enabled.  It can use the NOTRACK prefix, for example.
> >>>>
> >>>> Am I missing something?
> >>>>
> >>>> What's the direct connection between shadow stacks and Indirect Branch
> >>>> Tracking other than Intel marketing umbrellas?
> >>>
> >>> What I meant is that JIT code needs to be updated first; if it skips RETs,
> >>> it needs to unwind the stack, and if it does indirect JMPs somewhere it
> >>> needs to fix up the branch target or use NOTRACK.
> >>
> >> I'm totally lost.  I think we have very different models of how a JIT
> >> might generate and run code.
> >>
> >> I can totally see a scenario where a JIT goes and generates a bunch of
> >> code, then forks a new thread to go run that code.  The control flow of
> >> the JIT thread itself *NEVER* interacts with the control flow of the
> >> program it writes.  They never share a stack and nothing ever jumps or
> >> rets between the two worlds.
> >>
> >> Does anything actually do that?  I've got no idea.  But, I can clearly
> >> see a world where the entirety of Chrome and Firefox and the entire rust
> >> runtime might not be fully recompiled and CET-enabled for a while.  But,
> >> we still want the JIT-generated code to be CET-protected since it has
> >> the most exposed attack surface.
> >>
> >> I don't think that's too far-fetched.
> >
> > CET support is all or nothing.   You can mix and match, but you will get
> > no CET protection, similar to NX feature.
> >
>
> Can you explain?

I was talking about creating a program from mixed object files with and without
CET marker.
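
As a rough sketch of why mixing marked and unmarked object files loses
protection: the CET marker is a bit field in the ELF
GNU_PROPERTY_X86_FEATURE_1_AND program property, and an "AND" property
keeps a bit only if every input object sets it.  The bit values below
follow the x86-64 psABI as I understand it; merge_feature_1_and() is a
hypothetical helper for illustration, not code from the linker or from
this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits of the GNU_PROPERTY_X86_FEATURE_1_AND program property. */
#define GNU_PROPERTY_X86_FEATURE_1_IBT   (1U << 0)
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK (1U << 1)

/*
 * Combine per-object feature words the way an "AND" property is merged:
 * a bit survives only if every input object sets it.  An object file
 * with no CET marker is passed as 0.
 */
uint32_t merge_feature_1_and(const uint32_t *objs, size_t nobjs)
{
	uint32_t merged = GNU_PROPERTY_X86_FEATURE_1_IBT |
			  GNU_PROPERTY_X86_FEATURE_1_SHSTK;
	size_t i;

	assert(objs != NULL || nobjs == 0);
	if (nobjs == 0)
		return 0;

	for (i = 0; i < nobjs; i++)
		merged &= objs[i];

	return merged;
}
```

With this model, a single unmarked object among the inputs clears both
bits in the output, which is the "all or nothing" behaviour described
above.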

> If a program with the magic ELF CET flags missing can’t make a thread with IBT and/or SHSTK enabled, then I think we’ve made an error and should fix it.
>

A non-CET program can start a CET program and vice versa.

-- 
H.J.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 20:54                 ` H.J. Lu
@ 2020-03-09 20:59                   ` Dave Hansen
  2020-03-09 21:12                     ` H.J. Lu
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-03-09 20:59 UTC (permalink / raw)
  To: H.J. Lu, Andy Lutomirski
  Cc: Yu-cheng Yu, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, linux-doc, Linux-MM,
	linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 3/9/20 1:54 PM, H.J. Lu wrote:
>> If a program with the magic ELF CET flags missing can’t make a
>> thread with IBT and/or SHSTK enabled, then I think we’ve made an
>> error and should fix it.
>> 
> A non-CET program can start a CET program and vice versa.

Could we be specific here, please?

HJ are you saying that:
* CET program can execve() a non-CET program, and
* a non-CET program can execve() a CET program

?

That's obvious.

But what are the rules for clone()?  Should there be rules for
mismatches for CET enabling between threads of a process (not child
processes)?


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 20:59                   ` Dave Hansen
@ 2020-03-09 21:12                     ` H.J. Lu
  2020-03-09 22:02                       ` Andy Lutomirski
  2020-03-09 22:19                       ` Dave Hansen
  0 siblings, 2 replies; 107+ messages in thread
From: H.J. Lu @ 2020-03-09 21:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML, linux-doc,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Mon, Mar 9, 2020 at 1:59 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/9/20 1:54 PM, H.J. Lu wrote:
> >> If a program with the magic ELF CET flags missing can’t make a
> >> thread with IBT and/or SHSTK enabled, then I think we’ve made an
> >> error and should fix it.
> >>
> > A non-CET program can start a CET program and vice versa.
>
> Could we be specific here, please?
>
> HJ are you saying that:
> * CET program can execve() a non-CET program, and
> * a non-CET program can execve() a CET program
>
> ?

Yes.

> That's obvious.
>
> But what are the rules for clone()?  Should there be rules for
> mismatches for CET enabling between threads of a process (not child
> processes)?

What did you mean? A threaded application is either CET enabled or not
CET enabled.   A new thread from clone makes no difference.

-- 
H.J.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 21:12                     ` H.J. Lu
@ 2020-03-09 22:02                       ` Andy Lutomirski
  2020-03-09 22:19                       ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Andy Lutomirski @ 2020-03-09 22:02 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Dave Hansen, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML, linux-doc,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review


> On Mar 9, 2020, at 2:13 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> 
> On Mon, Mar 9, 2020 at 1:59 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> 
>> On 3/9/20 1:54 PM, H.J. Lu wrote:
>>>> If a program with the magic ELF CET flags missing can’t make a
>>>> thread with IBT and/or SHSTK enabled, then I think we’ve made an
>>>> error and should fix it.
>>>> 
>>> A non-CET program can start a CET program and vice versa.
>> 
>> Could we be specific here, please?
>> 
>> HJ are you saying that:
>> * CET program can execve() a non-CET program, and
>> * a non-CET program can execve() a CET program
>> 
>> ?
> 
> Yes.
> 
>> That's obvious.
>> 
>> But what are the rules for clone()?  Should there be rules for
>> mismatches for CET enabling between threads of a process (not child
>> processes)?
> 
> What did you mean? A threaded application is either CET enabled or not
> CET enabled.   A new thread from clone makes no difference.

Why?  Dave’s example seems like a good reason to allow per-thread control.



> 
> -- 
> H.J.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 21:12                     ` H.J. Lu
  2020-03-09 22:02                       ` Andy Lutomirski
@ 2020-03-09 22:19                       ` Dave Hansen
  2020-03-09 23:11                         ` H.J. Lu
  1 sibling, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-03-09 22:19 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Andy Lutomirski, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML, linux-doc,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 3/9/20 2:12 PM, H.J. Lu wrote:
>> But what are the rules for clone()?  Should there be rules for
>> mismatches for CET enabling between threads of a process (not child
>> processes)?
> What did you mean? A threaded application is either CET enabled or not
> CET enabled.   A new thread from clone makes no difference.

Stacks are fundamentally thread-local resources.  The registers that
point to them and MSRs that manage shadow stacks are all CPU-thread
local.  Nothing is fundamentally tied to the address space shared across
the process.

A thread might also share *no* control flow with its child.  It might
ask the thread to start in code that the parent can never even reach.

It sounds like you've picked a Linux implementation that has
restrictions on top of the fundamentals.  That's not wrong per se, but
it does deserve explanation and deliberate, not experimental design.

Could you go back to the folks at Intel and try to figure out what this
was designed to *do*?  Yes, I'm probably one of those folks.  You know
where to find me. :)


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 22:19                       ` Dave Hansen
@ 2020-03-09 23:11                         ` H.J. Lu
  2020-03-09 23:20                           ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: H.J. Lu @ 2020-03-09 23:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML, linux-doc,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Mon, Mar 9, 2020 at 3:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/9/20 2:12 PM, H.J. Lu wrote:
> >> But what are the rules for clone()?  Should there be rules for
> >> mismatches for CET enabling between threads of a process (not child
> >> processes)?
> > What did you mean? A threaded application is either CET enabled or not
> > CET enabled.   A new thread from clone makes no difference.
>
> Stacks are fundamentally thread-local resources.  The registers that
> point to them and MSRs that manage shadow stacks are all CPU-thread
> local.  Nothing is fundamentally tied to the address space shared across
> the process.
>
> A thread might also share *no* control flow with its child.  It might
> ask the thread to start in code that the parent can never even reach.
>
> It sounds like you've picked a Linux implementation that has
> restrictions on top of the fundamentals.  That's not wrong per se, but
> it does deserve explanation and deliberate, not experimental design.
>
> Could you go back to the folks at Intel and try to figure out what this
> was designed to *do*?  Yes, I'm probably one of those folks.  You know
> where to find me. :)

A threaded application is loaded from disk.  The object file on disk is
either CET enabled or not CET enabled.

-- 
H.J.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 23:11                         ` H.J. Lu
@ 2020-03-09 23:20                           ` Dave Hansen
  2020-03-09 23:51                             ` H.J. Lu
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-03-09 23:20 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Andy Lutomirski, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML, linux-doc,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 3/9/20 4:11 PM, H.J. Lu wrote:
> A threaded application is loaded from disk.  The object file on disk is
> either CET enabled or not CET enabled.

Huh.  Are you saying that all instructions executed in userspace on
Linux come off of object files on the disk?  That's an interesting
assertion.  You might want to go take a look at the processes on your
systems.  Here's my browser for example:

# for p in $(ps aux | grep chromium | awk '{print $2}' ); do cat \
/proc/$p/maps; done | grep ' r-xp 00000000 00:00 0'
...
202f00082000-202f000bf000 r-xp 00000000 00:00 0
202f000c2000-202f000c3000 r-xp 00000000 00:00 0
202f00102000-202f00103000 r-xp 00000000 00:00 0
202f00142000-202f00143000 r-xp 00000000 00:00 0
202f00182000-202f001bf000 r-xp 00000000 00:00 0

Lots of funny looking memory areas which are anonymous and executable!
Those didn't come off the disk.  Same thing in firefox.  Weird.  Any
idea what those are?

One guess: https://en.wikipedia.org/wiki/Just-in-time_compilation
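
The same check can be sketched in C for a single line of
/proc/<pid>/maps.  This is only an illustration of the filter used in
the shell loop above: is_anon_exec_mapping() is a hypothetical helper,
and it assumes the usual "start-end perms offset dev inode [path]" line
layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Return true if one line of /proc/<pid>/maps describes an executable
 * mapping with no backing file (inode 0 and an empty pathname field).
 * Special mappings such as [vdso] carry a name and are not counted.
 */
bool is_anon_exec_mapping(const char *line)
{
	unsigned long long start, end, offset, inode;
	unsigned int dev_major, dev_minor;
	char perms[5];
	char path[256] = "";
	int n;

	assert(line != NULL);

	n = sscanf(line, "%llx-%llx %4s %llx %x:%x %llu %255s",
		   &start, &end, perms, &offset,
		   &dev_major, &dev_minor, &inode, path);
	if (n < 7)
		return false;

	return perms[2] == 'x' && inode == 0 && path[0] == '\0';
}
```

Feeding it the lines shown above returns true for each of them, while a
file-backed text mapping or the [vdso] entry returns false.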


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 23:20                           ` Dave Hansen
@ 2020-03-09 23:51                             ` H.J. Lu
  2020-03-09 23:59                               ` Andy Lutomirski
  0 siblings, 1 reply; 107+ messages in thread
From: H.J. Lu @ 2020-03-09 23:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML, linux-doc,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Mon, Mar 9, 2020 at 4:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/9/20 4:11 PM, H.J. Lu wrote:
> > A threaded application is loaded from disk.  The object file on disk is
> > either CET enabled or not CET enabled.
>
> Huh.  Are you saying that all instructions executed on userspace on
> Linux come off of object files on the disk?  That's an interesting
> assertion.  You might want to go take a look at the processes on your
> systems.  Here's my browser for example:
>
> # for p in $(ps aux | grep chromium | awk '{print $2}' ); do cat
> /proc/$p/maps; done | grep ' r-xp 00000000 00:00 0'
> ...
> 202f00082000-202f000bf000 r-xp 00000000 00:00 0
> 202f000c2000-202f000c3000 r-xp 00000000 00:00 0
> 202f00102000-202f00103000 r-xp 00000000 00:00 0
> 202f00142000-202f00143000 r-xp 00000000 00:00 0
> 202f00182000-202f001bf000 r-xp 00000000 00:00 0
>
> Lots of funny looking memory areas which are anonymous and executable!
> Those didn't come off the disk.  Same thing in firefox.  Weird.  Any
> idea what those are?
>
> One guess: https://en.wikipedia.org/wiki/Just-in-time_compilation

JITted code belongs to a process loaded from disk.  Enabling CET in
an application which uses a JIT engine means also enabling CET in the
JIT engine.  Take git as an example: "git grep" crashed for me on Tiger
Lake.  It turned out that git itself was compiled with -fcf-protection and
was linked against libpcre2-8.so.0, also compiled with -fcf-protection,
which has a JIT, sljit, which was not CET enabled.  git crashed in the
JITted code due to a missing ENDBR.  I had to enable CET in sljit to make
git work on the CET-enabled Tiger Lake.  So we need to enable CET in the
JIT engine before enabling CET in applications which use that JIT engine.
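
The sljit change itself is not shown in this thread.  As a minimal
sketch of what the IBT side of "enabling CET in a JIT" involves, the
emitter has to place an ENDBR64 landing pad (bytes F3 0F 1E FA) at every
address that may be reached by an indirect CALL or JMP.  The function
below is a hypothetical emitter for illustration only; it writes
"endbr64; mov $value, %eax; ret" into a buffer and does not execute it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ENDBR64: the landing pad IBT expects at indirect CALL/JMP targets. */
static const uint8_t endbr64[] = { 0xf3, 0x0f, 0x1e, 0xfa };

/*
 * Emit "endbr64; mov $value, %eax; ret" into buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t emit_const_fn(uint8_t *buf, size_t cap, uint32_t value)
{
	const size_t len = sizeof(endbr64) + 1 + 4 + 1;
	size_t off = 0;
	int i;

	assert(buf != NULL);
	if (cap < len)
		return 0;

	memcpy(buf + off, endbr64, sizeof(endbr64));
	off += sizeof(endbr64);

	buf[off++] = 0xb8;			/* mov $imm32, %eax */
	for (i = 0; i < 4; i++)			/* imm32, little-endian */
		buf[off++] = (uint8_t)(value >> (8 * i));

	buf[off++] = 0xc3;			/* ret */

	return off;
}
```

ENDBR64 is encoded in the hint-NOP space, so code emitted this way still
runs on processors without CET; a JIT that omits it will fault under IBT
when its output is entered through an indirect branch.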


-- 
H.J.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 23:51                             ` H.J. Lu
@ 2020-03-09 23:59                               ` Andy Lutomirski
  2020-03-10  0:08                                 ` H.J. Lu
  0 siblings, 1 reply; 107+ messages in thread
From: Andy Lutomirski @ 2020-03-09 23:59 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Dave Hansen, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Mon, Mar 9, 2020 at 4:52 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Mar 9, 2020 at 4:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 3/9/20 4:11 PM, H.J. Lu wrote:
> > > A threaded application is loaded from disk.  The object file on disk is
> > > either CET enabled or not CET enabled.
> >
> > Huh.  Are you saying that all instructions executed on userspace on
> > Linux come off of object files on the disk?  That's an interesting
> > assertion.  You might want to go take a look at the processes on your
> > systems.  Here's my browser for example:
> >
> > # for p in $(ps aux | grep chromium | awk '{print $2}' ); do cat
> > /proc/$p/maps; done | grep ' r-xp 00000000 00:00 0'
> > ...
> > 202f00082000-202f000bf000 r-xp 00000000 00:00 0
> > 202f000c2000-202f000c3000 r-xp 00000000 00:00 0
> > 202f00102000-202f00103000 r-xp 00000000 00:00 0
> > 202f00142000-202f00143000 r-xp 00000000 00:00 0
> > 202f00182000-202f001bf000 r-xp 00000000 00:00 0
> >
> > Lots of funny looking memory areas which are anonymous and executable!
> > Those didn't come off the disk.  Same thing in firefox.  Weird.  Any
> > idea what those are?
> >
> > One guess: https://en.wikipedia.org/wiki/Just-in-time_compilation
>
> jitted code belongs to a process loaded from disk.  Enable CET in
> an application which uses JIT engine means to also enable CET in
> JIT engine.  Take git as an example, "git grep" crashed for me on Tiger
> Lake.   It turned out that git itself was compiled with -fcf-protection and
> git was linked against libpcre2-8.so.0 also compiled with -fcf-protection,
> which has a JIT, sljit, which was not CET enabled.  git crashed in the
> jitted codes due to missing ENDBR.  I had to enable CET in sljit to make
> git working on CET enabled Tiger Lake.  So we need to enable CET in
> JIT engine before enabling CET in applications which use JIT engine.

This could presumably have been fixed by having libpcre or sljit
disable IBT before calling into JIT code or by running the JIT code in
another thread.  In the other direction, a non-CET libpcre build could
build IBT-capable JITted code and enable IBT (by syscall if we allow
that or by creating a thread?) when calling it.  And IBT has this
fancy legacy bitmap to allow non-instrumented code to run with IBT on,
although SHSTK doesn't have hardware support for a similar feature.

So, sure, the glibc-linked ELF ecosystem needs some degree of CET
coordination, but it is absolutely not the case that a process MUST
have all CET or no CET.  Let's please support the complicated cases in
the kernel and the ABI too.  If glibc wants to make it annoying to do
complicated things, so be it.  People work behind glibc's back all the
time.

--Andy


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-09 23:59                               ` Andy Lutomirski
@ 2020-03-10  0:08                                 ` H.J. Lu
  2020-03-10  1:21                                   ` Andy Lutomirski
  0 siblings, 1 reply; 107+ messages in thread
From: H.J. Lu @ 2020-03-10  0:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Mon, Mar 9, 2020 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> On Mon, Mar 9, 2020 at 4:52 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Mar 9, 2020 at 4:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 3/9/20 4:11 PM, H.J. Lu wrote:
> > > > A threaded application is loaded from disk.  The object file on disk is
> > > > either CET enabled or not CET enabled.
> > >
> > > Huh.  Are you saying that all instructions executed on userspace on
> > > Linux come off of object files on the disk?  That's an interesting
> > > assertion.  You might want to go take a look at the processes on your
> > > systems.  Here's my browser for example:
> > >
> > > # for p in $(ps aux | grep chromium | awk '{print $2}' ); do cat
> > > /proc/$p/maps; done | grep ' r-xp 00000000 00:00 0'
> > > ...
> > > 202f00082000-202f000bf000 r-xp 00000000 00:00 0
> > > 202f000c2000-202f000c3000 r-xp 00000000 00:00 0
> > > 202f00102000-202f00103000 r-xp 00000000 00:00 0
> > > 202f00142000-202f00143000 r-xp 00000000 00:00 0
> > > 202f00182000-202f001bf000 r-xp 00000000 00:00 0
> > >
> > > Lots of funny looking memory areas which are anonymous and executable!
> > > Those didn't come off the disk.  Same thing in firefox.  Weird.  Any
> > > idea what those are?
> > >
> > > One guess: https://en.wikipedia.org/wiki/Just-in-time_compilation
> >
> > JITted code belongs to a process loaded from disk.  Enabling CET in
> > an application which uses a JIT engine means also enabling CET in the
> > JIT engine.  Take git as an example: "git grep" crashed for me on Tiger
> > Lake.  It turned out that git itself was compiled with -fcf-protection and
> > was linked against libpcre2-8.so.0, also compiled with -fcf-protection,
> > which has a JIT, sljit, which was not CET enabled.  git crashed in the
> > JITted code due to a missing ENDBR.  I had to enable CET in sljit to make
> > git work on CET-enabled Tiger Lake.  So we need to enable CET in the
> > JIT engine before enabling CET in applications which use the JIT engine.
>
> This could presumably have been fixed by having libpcre or sljit
> disable IBT before calling into JIT code or by running the JIT code in
> another thread.  In the other direction, a non-CET libpcre build could
> build IBT-capable JITted code and enable JIT (by syscall if we allow
> that or by creating a thread?) when calling it.  And IBT has this

This is not how threads in user space work.

> fancy legacy bitmap to allow non-instrumented code to run with IBT on,
> although SHSTK doesn't have hardware support for a similar feature.

All these changes are called CET enabling.

> So, sure, the glibc-linked ELF ecosystem needs some degree of CET
> coordination, but it is absolutely not the case that a process MUST
> have all CET or no CET.  Let's please support the complicated cases in
> the kernel and the ABI too.  If glibc wants to make it annoying to do
> complicated things, so be it.  People work behind glibc's back all the
> time.

CET is no different from NX in this regard.


-- 
H.J.



* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-10  0:08                                 ` H.J. Lu
@ 2020-03-10  1:21                                   ` Andy Lutomirski
  2020-03-10  2:13                                     ` H.J. Lu
  0 siblings, 1 reply; 107+ messages in thread
From: Andy Lutomirski @ 2020-03-10  1:21 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Dave Hansen, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

I am baffled by this discussion.

>> On Mar 9, 2020, at 5:09 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> 
>> On Mon, Mar 9, 2020 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
> 
>>>> .
>> This could presumably have been fixed by having libpcre or sljit
>> disable IBT before calling into JIT code or by running the JIT code in
>> another thread.  In the other direction, a non-CET libpcre build could
>> build IBT-capable JITted code and enable JIT (by syscall if we allow
>> that or by creating a thread?) when calling it.  And IBT has this
> 
> This is not how threads in user space work.

void create_cet_thread(void (*func)(), unsigned int cet_flags);

I could implement this using clone() if the kernel provides the requisite support. Sure, creating threads behind libc’s back like this is perilous, but it can be done.
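
A rough user-space sketch of such a helper built on plain clone() might look like the following. It is only an illustration under stated assumptions: nothing in this series lets the kernel consume cet_flags, so the value is carried along and otherwise ignored, and CHILD_STACK_SIZE, struct cet_thread_args, cet_thread_entry() and demo_func() are hypothetical names that do not appear in the patches.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (256 * 1024)	/* hypothetical stack size */

/* Arguments handed to the new task; stored at the low end of its stack. */
struct cet_thread_args {
	void (*func)(void);
	unsigned int cet_flags;	/* hypothetical; nothing consumes it today */
};

static int cet_thread_entry(void *p)
{
	struct cet_thread_args *args = p;

	/* With kernel support, args->cet_flags would already be in effect. */
	args->func();
	return 0;
}

/*
 * Start func() in a new task that shares this address space.
 * Returns the new task's pid, or -1 on failure.  The stack mapping is
 * deliberately leaked to keep the sketch short.
 */
static pid_t create_cet_thread(void (*func)(void), unsigned int cet_flags)
{
	struct cet_thread_args *args;
	char *stack;

	assert(func != NULL);
	stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	if (stack == MAP_FAILED)
		return -1;

	args = (struct cet_thread_args *)stack;
	args->func = func;
	args->cet_flags = cet_flags;

	/* The stack grows down on x86, so pass the top of the mapping. */
	return clone(cet_thread_entry, stack + CHILD_STACK_SIZE,
		     CLONE_VM | SIGCHLD, args);
}

/* Trivial payload used to check that the new task really ran. */
static volatile int demo_ran;
static void demo_func(void) { demo_ran = 1; }
```

The caller reaps the task with waitpid() on the returned pid.  The new task shares the parent's TLS and libc state, which is exactly the "perilous" part.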

> 
>> fancy legacy bitmap to allow non-instrumented code to run with IBT on,
>> although SHSTK doesn't have hardware support for a similar feature.
> 
> All these changes are called CET enabling.

What does that mean?  If program A loads library B, and library B very carefully loads CET-mismatched code, program A may be blissfully unaware.

> 
>> So, sure, the glibc-linked ELF ecosystem needs some degree of CET
>> coordination, but it is absolutely not the case that a process MUST
>> have all CET or no CET.  Let's please support the complicated cases in
>> the kernel and the ABI too.  If glibc wants to make it annoying to do
>> complicated things, so be it.  People work behind glibc's back all the
>> time.
> 
> CET is no different from NX in this regard.

NX is in the page tables, and CET, mostly, is not.  Also, we seriously flubbed READ_IMPLIES_EXEC and made it affect far more mappings than ever should have been affected.

If a legacy program (non-NX-aware) loads a newer library, and the library opens a device node and mmaps it PROT_READ, it gets RX.  This is not a good design. In fact, it’s actively problematic.

Let us please not take Linux’s NX legacy support as an example of good design.


* Re: [RFC PATCH v9 01/27] Documentation/x86: Add CET description
  2020-03-10  1:21                                   ` Andy Lutomirski
@ 2020-03-10  2:13                                     ` H.J. Lu
  0 siblings, 0 replies; 107+ messages in thread
From: H.J. Lu @ 2020-03-10  2:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Yu-cheng Yu, the arch/x86 maintainers,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Mon, Mar 9, 2020 at 6:21 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> I am baffled by this discussion.
>
> >> On Mar 9, 2020, at 5:09 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Mon, Mar 9, 2020 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >>>> .
> >> This could presumably have been fixed by having libpcre or sljit
> >> disable IBT before calling into JIT code or by running the JIT code in
> >> another thread.  In the other direction, a non-CET libpcre build could
> >> build IBT-capable JITted code and enable JIT (by syscall if we allow
> >> that or by creating a thread?) when calling it.  And IBT has this
> >
> > This is not how threads in user space work.
>
> void create_cet_thread(void (*func)(), unsigned int cet_flags);
>
> I could implement this using clone() if the kernel provides the requisite support. Sure, creating threads behind libc’s back like this is perilous, but it can be done.

Sure, this can live outside of libc with kernel support.

> >
> >> fancy legacy bitmap to allow non-instrumented code to run with IBT on,
> >> although SHSTK doesn't have hardware support for a similar feature.
> >
> > All these changes are called CET enabling.
>
> What does that mean?  If program A loads library B, and library B very carefully loads CET-mismatched code, program A may be blissfully unaware.

Any source change that makes code CET compatible counts as enabling CET.

Shadow stack can't be turned on or off arbitrarily.  ld.so checks it and
makes sure that everything is consistent.  But this is entirely done in
user space.  In the first phase, we want to make CET simple, not too
complicated.


H.J.



* Re: [RFC PATCH v9 15/27] mm: Handle THP/HugeTLB Shadow Stack page fault
  2020-02-25 20:59   ` Kees Cook
@ 2020-03-13 22:00     ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-13 22:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Tue, 2020-02-25 at 12:59 -0800, Kees Cook wrote:
> On Wed, Feb 05, 2020 at 10:19:23AM -0800, Yu-cheng Yu wrote:
> > This patch implements THP Shadow Stack (SHSTK) copying in the same way as
> > in the previous patch for regular PTE.
> > 
> > In copy_huge_pmd(), clear the dirty bit from the PMD to cause a page fault
> > upon the next SHSTK access to the PMD.  At that time, fix the PMD and
> > copy/re-use the page.
> 
> Now is as good a time as any to ask: do you have selftests for all this?
> It seems like it would be really nice to have a way to verify SHSTK is
> working correctly.

Yes, I have some simple tests at https://github.com/yyu168/cet-smoke-test.
I also run Linux/tools/testing/selftests/x86 and GLIBC tests with CET and THP
combinations.

Yu-cheng





* Re: [RFC PATCH v9 25/27] x86/cet/shstk: Handle thread Shadow Stack
  2020-02-25 21:29   ` Kees Cook
@ 2020-03-25 21:51     ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-03-25 21:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin,
	x86-patch-review

On Tue, 2020-02-25 at 13:29 -0800, Kees Cook wrote:
> On Wed, Feb 05, 2020 at 10:19:33AM -0800, Yu-cheng Yu wrote:
> > [...]
> > A 64-bit SHSTK has a fixed size of RLIMIT_STACK. A compat-mode thread SHSTK
> > has a fixed size of 1/4 RLIMIT_STACK.  This allows more threads to share a
> > 32-bit address space.
> 
> I am not understanding this part. :) Entries are sizeof(unsigned long),
> yes? A 1/2 RLIMIT_STACK would cover 32-bit, but 1/4 is less, so why does
> that provide for more threads?

Each thread has a separate shadow stack.  If each shadow stack is smaller, the
address space can accommodate more shadow stack allocations.
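
As a back-of-the-envelope illustration of that point, the sketch below computes an upper bound on how many shadow stacks fit in a given address space.  It ignores guard pages, normal stacks and every other mapping, and max_shadow_stacks() is a hypothetical helper, not something from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on the number of shadow stacks of shstk_size bytes that
 * fit in addr_space bytes.  Guard pages and all other mappings are
 * ignored, so real numbers will be lower.
 */
static uint64_t max_shadow_stacks(uint64_t addr_space, uint64_t shstk_size)
{
	assert(shstk_size > 0);
	return addr_space / shstk_size;
}
```

Assuming a 3 GiB user address space and an 8 MiB RLIMIT_STACK, full-size shadow stacks allow at most 384 threads, while quarter-size (2 MiB) shadow stacks allow at most 1536.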

> >[...]
> > diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> > index cba5c7656aab..5b45abda80a1 100644
> > --- a/arch/x86/kernel/cet.c
> > +++ b/arch/x86/kernel/cet.c
> > @@ -170,6 +170,47 @@ int cet_setup_shstk(void)
> >  	return 0;
> >  }
> >  
> > +int cet_setup_thread_shstk(struct task_struct *tsk)
> > +{
> > +	unsigned long addr, size;
> > +	struct cet_user_state *state;
> > +	struct cet_status *cet = &tsk->thread.cet;
> > +
> > +	if (!cet->shstk_enabled)
> > +		return 0;
> > +
> > +	state = get_xsave_addr(&tsk->thread.fpu.state.xsave,
> > +			       XFEATURE_CET_USER);
> > +
> > +	if (!state)
> > +		return -EINVAL;
> > +
> > +	size = rlimit(RLIMIT_STACK);
> 
> Is SHSTK incompatible with RLIM_INFINITY stack rlimits?

I will change it to:

	size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
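
A user-space model of the resulting size calculation, as a sketch only: SHSTK_MAX_SIZE, STACK_RLIM_INFINITY and thread_shstk_size() are hypothetical names, and the rlimit is passed in as a plain 64-bit value instead of being read with rlimit().

```c
#include <assert.h>
#include <stdint.h>

#define SHSTK_MAX_SIZE		(4ULL << 30)	/* the 4 GB cap proposed above */
#define STACK_RLIM_INFINITY	UINT64_MAX	/* stand-in for RLIM_INFINITY */

/* Shadow stack size for a new thread, given the stack rlimit in bytes. */
static uint64_t thread_shstk_size(uint64_t stack_rlimit, int compat)
{
	uint64_t size = stack_rlimit < SHSTK_MAX_SIZE ?
			stack_rlimit : SHSTK_MAX_SIZE;

	/* Compat-mode threads get a quarter, as in the patch. */
	if (compat)
		size /= 4;
	return size;
}
```

With the cap in place, an unlimited stack rlimit yields a 4 GiB shadow stack for a 64-bit thread and 1 GiB for a compat-mode thread, instead of an unbounded request.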

> 
> > +
> > +	/*
> > +	 * Compat-mode pthreads share a limited address space.
> > +	 * If each function call takes an average of four slots
> > +	 * stack space, we need 1/4 of stack size for shadow stack.
> > +	 */
> > +	if (in_compat_syscall())
> > +		size /= 4;
> > +
> > +	addr = alloc_shstk(size);
> 
> I assume it'd fail here, but I worry about Stack Clash style attacks.
> I'd like to see test cases that make sure the SHSTK gap is working
> correctly.

I will work on some tests.

Thanks,
Yu-cheng





* Re: [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW
  2020-02-26 21:35   ` Dave Hansen
@ 2020-04-01 19:08     ` Yu-cheng Yu
  2020-04-01 19:22       ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-04-01 19:08 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Wed, 2020-02-26 at 13:35 -0800, Dave Hansen wrote:
> On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> > When Shadow Stack (SHSTK) is introduced, a R/O and Dirty PTE exists in the
> > following cases:
> > 
> > (a) A modified, copy-on-write (COW) page;
> > (b) A R/O page that has been COW'ed;
> > (c) A SHSTK page.
[...]

> > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> > index e647e3c75578..826823df917f 100644
> > --- a/arch/x86/include/asm/pgtable_types.h
> > +++ b/arch/x86/include/asm/pgtable_types.h
> > @@ -23,7 +23,8 @@
> >  #define _PAGE_BIT_SOFTW2	10	/* " */
> >  #define _PAGE_BIT_SOFTW3	11	/* " */
> >  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
> > -#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
> > +#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
> > +#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
> >  #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
> >  #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
> >  #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
> > @@ -35,6 +36,12 @@
> >  #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
> >  #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
> >  
> > +/*
> > + * This bit indicates a copy-on-write page, and is different from
> > + * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
> > + */
> > +#define _PAGE_BIT_DIRTY_SW	_PAGE_BIT_SOFTW5 /* was written to */
> 
> Does it *only* indicate a copy-on-write (or copy-on-access) page?  If
> so, haven't we misnamed it?

It indicates either a copy-on-write page or a read-only page that has been
cow'ed.  What about _PAGE_BIT_COW?
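
To make the encoding concrete, here is a small user-space model.  It is a sketch under assumptions: PAGE_RW and PAGE_DIRTY_HW mirror the x86 hardware bits 1 and 6, PAGE_COW uses software bit 58 from the diff above under the proposed new name, and pte_wrprotect_cow() reflects my reading of how write-protecting would move the dirty state.  None of these helpers are the kernel's own.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_RW		(1ULL << 1)	/* models _PAGE_RW */
#define PAGE_DIRTY_HW	(1ULL << 6)	/* models _PAGE_DIRTY_HW */
#define PAGE_COW	(1ULL << 58)	/* models the proposed _PAGE_COW */

/* Read-only + hardware dirty is the encoding reserved for shadow stack. */
static int pte_is_shadow_stack(pteval_t pte)
{
	return (pte & (PAGE_RW | PAGE_DIRTY_HW)) == PAGE_DIRTY_HW;
}

/*
 * Write-protect a PTE for copy-on-write.  Dirtiness moves to the
 * software bit so the result is not mistaken for a shadow stack PTE.
 */
static pteval_t pte_wrprotect_cow(pteval_t pte)
{
	pte &= ~PAGE_RW;
	if (pte & PAGE_DIRTY_HW) {
		pte &= ~PAGE_DIRTY_HW;
		pte |= PAGE_COW;
	}
	return pte;
}
```

A dirty, writable PTE that goes through pte_wrprotect_cow() ends up read-only with only the software bit set, so cases (a) and (b) stay distinguishable from case (c).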

Yu-cheng





* Re: [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW
  2020-04-01 19:08     ` Yu-cheng Yu
@ 2020-04-01 19:22       ` Dave Hansen
  0 siblings, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2020-04-01 19:22 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 4/1/20 12:08 PM, Yu-cheng Yu wrote:
>>> +/*
>>> + * This bit indicates a copy-on-write page, and is different from
>>> + * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
>>> + */
>>> +#define _PAGE_BIT_DIRTY_SW	_PAGE_BIT_SOFTW5 /* was written to */
>> Does it *only* indicate a copy-on-write (or copy-on-access) page?  If
>> so, haven't we misnamed it?
> It indicates either a copy-on-write page or a read-only page that has been
> cow'ed.  What about _PAGE_BIT_COW?

Sounds sane to me.



* Re: [RFC PATCH v9 11/27] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  2020-02-26 22:04   ` Dave Hansen
@ 2020-04-03 15:42     ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-04-03 15:42 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review
  Cc: Zhenyu Wang, Zhi Wang, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, David Airlie, Daniel Vetter

On Wed, 2020-02-26 at 14:04 -0800, Dave Hansen wrote:
> On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> > diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> > index 4b04af569c05..e467ca182633 100644
> > --- a/drivers/gpu/drm/i915/gvt/gtt.c
> > +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> > @@ -1201,7 +1201,7 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
> >  	}
> >  
> >  	/* Clear dirty field. */
> > -	se->val64 &= ~_PAGE_DIRTY;
> > +	se->val64 &= ~_PAGE_DIRTY_BITS;
> >  
> >  	ops->clear_pse(se);
> >  	ops->clear_ips(se);
> 
> Are the i915 maintainers on cc?
> 
> Shouldn't this use pte_mkclean() instead of open-coding?

These functions look like a set of pte_* equivalents for the driver.  They all
use the bits directly.  I have added the i915 maintainers to cc.

Yu-cheng




* Re: [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault
  2020-02-27  0:08   ` Dave Hansen
@ 2020-04-07 18:14     ` Yu-cheng Yu
  2020-04-07 22:21       ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Yu-cheng Yu @ 2020-04-07 18:14 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Wed, 2020-02-26 at 16:08 -0800, Dave Hansen wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 45442d9a4f52..6daa28614327 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -772,7 +772,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	 * If it's a COW mapping, write protect it both
> >  	 * in the parent and the child
> >  	 */
> > -	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> > +	if ((is_cow_mapping(vm_flags) && pte_write(pte)) ||
> > +	    arch_copy_pte_mapping(vm_flags)) {
> >  		ptep_set_wrprotect(src_mm, addr, src_pte);
> >  		pte = pte_wrprotect(pte);
> >  	}
> 
> You have to modify this because pte_write()==0 for shadow stack PTEs, right?
> 
> Aren't shadow stack ptes *logically* writable, even if they don't have
> the write bit set?  What would happen if we made pte_write()==1 for them?

Here the vm_flags needs to have VM_MAYWRITE, and the PTE needs to have
_PAGE_WRITE.  A shadow stack does not have either.

To fix checking vm_flags, what about adding an "arch_is_cow_mapping()" to the
generic is_cow_mapping()?

For the PTE, the check actually tries to determine if the PTE is not already
being copy-on-write, which is:

	(!_PAGE_RW && !_PAGE_DIRTY_HW)

So what about making it pte_cow()?

	/*
	 * The PTE is in copy-on-write status.
	 */
	static inline int pte_cow(pte_t pte)
	{
		return !(pte_flags(pte) & (_PAGE_WRITE | _PAGE_DIRTY_HW));
	}
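
For reference, a user-space truth-table model of that predicate.  This is only a sketch: it assumes _PAGE_WRITE above refers to the x86 _PAGE_RW bit, and PAGE_RW and PAGE_DIRTY_HW are local stand-ins for the kernel macros.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_RW		(1ULL << 1)	/* models _PAGE_RW */
#define PAGE_DIRTY_HW	(1ULL << 6)	/* models _PAGE_DIRTY_HW */

/* Neither writable nor hardware dirty: already in copy-on-write status. */
static int pte_cow(pteval_t pte)
{
	return !(pte & (PAGE_RW | PAGE_DIRTY_HW));
}
```

A shadow stack PTE (read-only, hardware dirty) is therefore not pte_cow(), and neither is an ordinary writable PTE; both still need write-protecting on fork.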
> 
> > @@ -2417,6 +2418,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
> >  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> >  	entry = pte_mkyoung(vmf->orig_pte);
> >  	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > +	entry = pte_set_vma_features(entry, vma);
> >  	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
> >  		update_mmu_cache(vma, vmf->address, vmf->pte);
> >  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > @@ -2504,6 +2506,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> >  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> >  		entry = mk_pte(new_page, vma->vm_page_prot);
> >  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > +		entry = pte_set_vma_features(entry, vma);
> >  		/*
> >  		 * Clear the pte entry and flush it first, before updating the
> >  		 * pte with the new entry. This will avoid a race condition
> > @@ -3023,6 +3026,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	pte = mk_pte(page, vma->vm_page_prot);
> >  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
> >  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > +		pte = pte_set_vma_features(pte, vma);
> >  		vmf->flags &= ~FAULT_FLAG_WRITE;
> >  		ret |= VM_FAULT_WRITE;
> >  		exclusive = RMAP_EXCLUSIVE;
> > @@ -3165,6 +3169,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >  	entry = mk_pte(page, vma->vm_page_prot);
> >  	if (vma->vm_flags & VM_WRITE)
> >  		entry = pte_mkwrite(pte_mkdirty(entry));
> > +	entry = pte_set_vma_features(entry, vma);
> >  
> >  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >  			&vmf->ptl);
> > 
> 
> These seem wrong, or at best inconsistent with what's already done.
> 
> We don't need anything like pte_set_vma_features() today because we have
> vma->vm_page_prot.  We could easily have done what you suggest here for
> things like protection keys: ignore the pkey PTE bits until we create
> the final PTE then shove them in there.
> 
> What are the bit patterns of the shadow stack bits that come out of
> these sites?  Can they be represented in ->vm_page_prot?

Yes, we can put _PAGE_DIRTY_HW in vm_page_prot.  Also set the bit in
ptep_set_access_flags() for shadow stack PTEs.




* Re: [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault
  2020-04-07 18:14     ` Yu-cheng Yu
@ 2020-04-07 22:21       ` Dave Hansen
  2020-04-08 18:18         ` Yu-cheng Yu
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2020-04-07 22:21 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On 4/7/20 11:14 AM, Yu-cheng Yu wrote:
> On Wed, 2020-02-26 at 16:08 -0800, Dave Hansen wrote:
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 45442d9a4f52..6daa28614327 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -772,7 +772,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>  	 * If it's a COW mapping, write protect it both
>>>  	 * in the parent and the child
>>>  	 */
>>> -	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>> +	if ((is_cow_mapping(vm_flags) && pte_write(pte)) ||
>>> +	    arch_copy_pte_mapping(vm_flags)) {
>>>  		ptep_set_wrprotect(src_mm, addr, src_pte);
>>>  		pte = pte_wrprotect(pte);
>>>  	}
>>
>> You have to modify this because pte_write()==0 for shadow stack PTEs, right?
>>
>> Aren't shadow stack ptes *logically* writable, even if they don't have
>> the write bit set?  What would happen if we made pte_write()==1 for them?
> 
> Here the vm_flags needs to have VM_MAYWRITE, and the PTE needs to have
> _PAGE_WRITE.  A shadow stack does not have either.

I literally mean taking pte_write(), and doing something like:

static inline int pte_write(pte_t pte)
{
	if (pte_present(pte) && pte_is_shadow_stack(pte))
		return 1;

	return pte_flags(pte) & _PAGE_RW;
}

Then if is_cow_mapping() returns true for shadow stack VMAs, the above
code doesn't need to change.
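
A user-space model of that idea, as a sketch under assumptions: pte_is_shadow_stack() is defined here as "read-only and hardware dirty", which is my guess at what such a helper would test, and needs_wrprotect() is a hypothetical name for the unmodified copy_one_pte() condition.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_PRESENT	(1ULL << 0)	/* models _PAGE_PRESENT */
#define PAGE_RW		(1ULL << 1)	/* models _PAGE_RW */
#define PAGE_DIRTY_HW	(1ULL << 6)	/* models _PAGE_DIRTY_HW */

/* Assumed definition: read-only + hardware dirty. */
static int pte_is_shadow_stack(pteval_t pte)
{
	return (pte & (PAGE_RW | PAGE_DIRTY_HW)) == PAGE_DIRTY_HW;
}

/* pte_write() as suggested above: shadow stack PTEs count as writable. */
static int pte_write(pteval_t pte)
{
	if ((pte & PAGE_PRESENT) && pte_is_shadow_stack(pte))
		return 1;
	return (pte & PAGE_RW) != 0;
}

/* The unmodified copy_one_pte() test: write-protect this PTE on fork? */
static int needs_wrprotect(int is_cow_mapping, pteval_t pte)
{
	return is_cow_mapping && pte_write(pte);
}
```

With this pte_write(), a present shadow stack PTE in a mapping that is_cow_mapping() accepts takes the existing write-protect path, with no arch_copy_pte_mapping() special case.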

> To fix checking vm_flags, what about adding an "arch_is_cow_mapping()" to the
> generic is_cow_mapping()?

That makes good sense to me.

> For the PTE, the check actually tries to determine if the PTE is not already
> being copy-on-write, which is:
> 
> 	(!_PAGE_RW && !_PAGE_DIRTY_HW)
> 
> So what about making it pte_cow()?
> 
> 	/*
> 	 * The PTE is in copy-on-write status.
> 	 */
> 	static inline int pte_cow(pte_t pte)
> 	{
> 		return !(pte_flags(pte) & (_PAGE_WRITE | _PAGE_DIRTY_HW));
> 	}

... with appropriate comments that seems fine to me.



* Re: [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault
  2020-04-07 22:21       ` Dave Hansen
@ 2020-04-08 18:18         ` Yu-cheng Yu
  0 siblings, 0 replies; 107+ messages in thread
From: Yu-cheng Yu @ 2020-04-08 18:18 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, x86-patch-review

On Tue, 2020-04-07 at 15:21 -0700, Dave Hansen wrote:
> On 4/7/20 11:14 AM, Yu-cheng Yu wrote:
> > On Wed, 2020-02-26 at 16:08 -0800, Dave Hansen wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 45442d9a4f52..6daa28614327 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -772,7 +772,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > >  	 * If it's a COW mapping, write protect it both
> > > >  	 * in the parent and the child
> > > >  	 */
> > > > -	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> > > > +	if ((is_cow_mapping(vm_flags) && pte_write(pte)) ||
> > > > +	    arch_copy_pte_mapping(vm_flags)) {
> > > >  		ptep_set_wrprotect(src_mm, addr, src_pte);
> > > >  		pte = pte_wrprotect(pte);
> > > >  	}
> > > 
> > > You have to modify this because pte_write()==0 for shadow stack PTEs, right?
> > > 
> > > Aren't shadow stack ptes *logically* writable, even if they don't have
> > > the write bit set?  What would happen if we made pte_write()==1 for them?
> > 
> > Here the vm_flags needs to have VM_MAYWRITE, and the PTE needs to have
> > _PAGE_WRITE.  A shadow stack does not have either.
> 
> I literally mean taking pte_write(), and doing something like:
> 
> static inline int pte_write(pte_t pte)
> {
> 	if (pte_present(pte) && pte_is_shadow_stack(pte))
> 		return 1;
> 
> 	return pte_flags(pte) & _PAGE_RW;
> }
> 
> Then if is_cow_mapping() returns true for shadow stack VMAs, the above
> code doesn't need to change.

One benefit of this change is that can_follow_write_pte() does not need any
changes.  A shadow stack PTE not in copy-on-write status satisfies pte_write().

However, there are places that use pte_write() to determine if the PTE can be
made _PAGE_RW.  One such case is in change_pte_range(), where

	preserve_write = prot_numa && pte_write(oldpte);

and later,

	if (preserve_write)
		ptent = pte_mk_savedwrite(ptent);

Currently, other checks keep shadow stack PTEs from becoming _PAGE_RW, but I am
concerned that this could be overlooked later when the code is modified.

Another potential issue is that, because pte_write()==1, a shadow stack PTE is
made a write migration entry and can later accidentally become _PAGE_RW.  I
think the page fault handler would catch that, but I am calling it out in case
I missed anything.

Yu-cheng




Thread overview: 107+ messages
2020-02-05 18:19 [RFC PATCH v9 00/27] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 01/27] Documentation/x86: Add CET description Yu-cheng Yu
2020-02-06  0:16   ` Randy Dunlap
2020-02-06 20:17     ` Yu-cheng Yu
2020-02-25 20:02   ` Kees Cook
2020-02-28 15:55     ` Yu-cheng Yu
2020-02-26 17:57   ` Dave Hansen
2020-03-09 17:00     ` Yu-cheng Yu
2020-03-09 17:21       ` Dave Hansen
2020-03-09 19:27         ` Yu-cheng Yu
2020-03-09 19:35           ` Dave Hansen
2020-03-09 19:50             ` H.J. Lu
2020-03-09 20:16               ` Andy Lutomirski
2020-03-09 20:54                 ` H.J. Lu
2020-03-09 20:59                   ` Dave Hansen
2020-03-09 21:12                     ` H.J. Lu
2020-03-09 22:02                       ` Andy Lutomirski
2020-03-09 22:19                       ` Dave Hansen
2020-03-09 23:11                         ` H.J. Lu
2020-03-09 23:20                           ` Dave Hansen
2020-03-09 23:51                             ` H.J. Lu
2020-03-09 23:59                               ` Andy Lutomirski
2020-03-10  0:08                                 ` H.J. Lu
2020-03-10  1:21                                   ` Andy Lutomirski
2020-03-10  2:13                                     ` H.J. Lu
2020-02-05 18:19 ` [RFC PATCH v9 02/27] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) Yu-cheng Yu
2020-02-25 20:02   ` Kees Cook
2020-02-05 18:19 ` [RFC PATCH v9 03/27] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states Yu-cheng Yu
2020-02-25 20:04   ` Kees Cook
2020-02-05 18:19 ` [RFC PATCH v9 04/27] x86/cet: Add control-protection fault handler Yu-cheng Yu
2020-02-25 20:06   ` Kees Cook
2020-02-26 17:10   ` Dave Hansen
2020-03-05 20:44     ` Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 05/27] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack protection Yu-cheng Yu
2020-02-25 20:07   ` Kees Cook
2020-02-26 17:03   ` Dave Hansen
2020-02-26 19:57     ` Pavel Machek
2020-03-05 20:38     ` Yu-cheng Yu
2020-02-26 18:05   ` Dave Hansen
2020-02-27  1:02     ` H.J. Lu
2020-02-27  1:16       ` Dave Hansen
2020-02-27  2:11         ` H.J. Lu
2020-02-27  3:57           ` Andy Lutomirski
2020-02-27 18:03             ` Dave Hansen
2020-03-06 18:37     ` Yu-cheng Yu
2020-03-06 19:02       ` Dave Hansen
2020-03-06 21:16         ` Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 06/27] mm: Introduce VM_SHSTK for Shadow Stack memory Yu-cheng Yu
2020-02-25 20:07   ` Kees Cook
2020-02-26 18:07   ` Dave Hansen
2020-02-05 18:19 ` [RFC PATCH v9 07/27] Add guard pages around a Shadow Stack Yu-cheng Yu
2020-02-25 20:11   ` Kees Cook
2020-02-26 18:17   ` Dave Hansen
2020-02-05 18:19 ` [RFC PATCH v9 08/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW Yu-cheng Yu
2020-02-25 20:12   ` Kees Cook
2020-02-26 18:20   ` Dave Hansen
2020-02-05 18:19 ` [RFC PATCH v9 09/27] x86/mm: Introduce _PAGE_DIRTY_SW Yu-cheng Yu
2020-02-25 20:12   ` Kees Cook
2020-02-26 21:35   ` Dave Hansen
2020-04-01 19:08     ` Yu-cheng Yu
2020-04-01 19:22       ` Dave Hansen
2020-02-05 18:19 ` [RFC PATCH v9 10/27] x86/mm: Update pte_modify, pmd_modify, and _PAGE_CHG_MASK for _PAGE_DIRTY_SW Yu-cheng Yu
2020-02-26 22:02   ` Dave Hansen
2020-02-05 18:19 ` [RFC PATCH v9 11/27] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
2020-02-25 20:13   ` Kees Cook
2020-02-26 22:04   ` Dave Hansen
2020-04-03 15:42     ` Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 12/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW Yu-cheng Yu
2020-02-25 20:14   ` Kees Cook
2020-02-26 22:20   ` Dave Hansen
2020-02-05 18:19 ` [RFC PATCH v9 13/27] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
2020-02-25 20:16   ` Kees Cook
2020-02-26 22:47   ` Dave Hansen
2020-02-05 18:19 ` [RFC PATCH v9 14/27] mm: Handle Shadow Stack page fault Yu-cheng Yu
2020-02-25 20:20   ` Kees Cook
2020-03-05 18:30     ` Yu-cheng Yu
2020-02-27  0:08   ` Dave Hansen
2020-04-07 18:14     ` Yu-cheng Yu
2020-04-07 22:21       ` Dave Hansen
2020-04-08 18:18         ` Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 15/27] mm: Handle THP/HugeTLB " Yu-cheng Yu
2020-02-25 20:59   ` Kees Cook
2020-03-13 22:00     ` Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 16/27] mm: Update can_follow_write_pte() for Shadow Stack Yu-cheng Yu
2020-02-27  0:34   ` Dave Hansen
2020-02-05 18:19 ` [RFC PATCH v9 17/27] x86/cet/shstk: User-mode Shadow Stack support Yu-cheng Yu
2020-02-25 21:07   ` Kees Cook
2020-02-27  0:55   ` Dave Hansen
2020-02-05 18:19 ` [RFC PATCH v9 18/27] x86/cet/shstk: Introduce WRUSS instruction Yu-cheng Yu
2020-02-25 21:10   ` Kees Cook
2020-03-05 18:39     ` Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 19/27] x86/cet/shstk: Handle signals for Shadow Stack Yu-cheng Yu
2020-02-25 21:17   ` Kees Cook
2020-02-05 18:19 ` [RFC PATCH v9 20/27] ELF: UAPI and Kconfig additions for ELF program properties Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 21/27] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND Yu-cheng Yu
2020-02-25 21:18   ` Kees Cook
2020-02-05 18:19 ` [RFC PATCH v9 22/27] ELF: Add ELF program property parsing support Yu-cheng Yu
2020-02-25 21:20   ` Kees Cook
2020-02-05 18:19 ` [RFC PATCH v9 23/27] ELF: Introduce arch_setup_elf_property() Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 24/27] x86/cet/shstk: ELF header parsing for Shadow Stack Yu-cheng Yu
2020-02-25 21:22   ` Kees Cook
2020-02-05 18:19 ` [RFC PATCH v9 25/27] x86/cet/shstk: Handle thread " Yu-cheng Yu
2020-02-25 21:29   ` Kees Cook
2020-03-25 21:51     ` Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 26/27] mm/mmap: Add Shadow Stack pages to memory accounting Yu-cheng Yu
2020-02-05 18:19 ` [RFC PATCH v9 27/27] x86/cet/shstk: Add arch_prctl functions for Shadow Stack Yu-cheng Yu
2020-02-25 21:31 ` [RFC PATCH v9 00/27] Control-flow Enforcement: " Kees Cook
