* [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack
@ 2021-07-22 20:51 Yu-cheng Yu
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu

Control-flow Enforcement Technology (CET) is a new Intel processor feature
that blocks return/jump-oriented programming attacks.  Details are in the
"Intel 64 and IA-32 Architectures Software Developer's Manual" [1].

CET can protect applications and the kernel.  This series enables only
application-level protection, and has three parts:

  - Shadow stack [2],
  - Indirect branch tracking, and
  - Selftests [3].

Linux distributions with CET are available now, and Intel processors with CET
are already on the market.  It would be nice if CET support could be accepted
into the kernel.

Changes in v28:
- Rebase to Linus tree v5.14-rc2.
- Patch #1: Update the document to indicate that no_user_shstk also disables
  IBT.
- Patch #23: Update shstk_setup() with wrmsrl_safe().  Update return value.
- Patch #25: Split out copy_thread() changes.  Add support for old clone().
  Add comments.
- Add comments for get_xsave_addr() (Patch #25, #26).

Changes in v27:
- Eliminate signal context extension structure.  Simplify signal handling.
- Add a new patch to move VM_UFFD_MINOR_BIT to 38.
- Smaller changes are in each patch's log.
- Rebase to Linus tree v5.13-rc2.

[1] Intel 64 and IA-32 Architectures Software Developer's Manual:

    https://software.intel.com/en-us/download/intel-64-and-ia-32-
    architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4

[2] Shadow Stack patches v27:

    https://lore.kernel.org/r/20210521221211.29077-1-yu-cheng.yu@intel.com/

[3] I am holding off on the selftests changes and working to get Reviewed-by's.
    The earlier version of the selftests patches:

    https://lkml.kernel.org/r/20200521211720.20236-1-yu-cheng.yu@intel.com/

[4] The kernel ptrace patch is tested with an Intel-internal updated GDB.
    I am holding off on the kernel ptrace patch to re-test it with my earlier
    patch for fixing regset holes.

Yu-cheng Yu (32):
  Documentation/x86: Add CET description
  x86/cet/shstk: Add Kconfig option for Shadow Stack
  x86/cpufeatures: Add CET CPU feature flags for Control-flow
    Enforcement Technology (CET)
  x86/cpufeatures: Introduce CPU setup and option parsing for CET
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86/cet: Add control-protection fault handler
  x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  x86/mm: Move pmd_write(), pud_write() up in the file
  x86/mm: Introduce _PAGE_COW
  drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  x86/mm: Update pte_modify for _PAGE_COW
  x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
    transition from _PAGE_DIRTY to _PAGE_COW
  mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  mm: Introduce VM_SHADOW_STACK for shadow stack memory
  x86/mm: Shadow Stack page fault error checking
  x86/mm: Update maybe_mkwrite() for shadow stack
  mm: Fixup places that call pte_mkwrite() directly
  mm: Add guard pages around a shadow stack.
  mm/mmap: Add shadow stack pages to memory accounting
  mm: Update can_follow_write_pte() for shadow stack
  mm/mprotect: Exclude shadow stack from preserve_write
  mm: Re-introduce vm_flags to do_mmap()
  x86/cet/shstk: Add user-mode shadow stack support
  x86/process: Change copy_thread() argument 'arg' to 'stack_size'
  x86/cet/shstk: Handle thread shadow stack
  x86/cet/shstk: Introduce shadow stack token setup/verify routines
  x86/cet/shstk: Handle signals for shadow stack
  ELF: Introduce arch_setup_elf_property()
  x86/cet/shstk: Add arch_prctl functions for shadow stack
  mm: Move arch_calc_vm_prot_bits() to arch/x86/include/asm/mman.h
  mm: Update arch_validate_flags() to test vma anonymous
  mm: Introduce PROT_SHADOW_STACK for shadow stack

 .../admin-guide/kernel-parameters.txt         |   7 +
 Documentation/filesystems/proc.rst            |   1 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/intel_cet.rst               | 139 +++++++
 arch/arm64/include/asm/elf.h                  |   5 +
 arch/arm64/include/asm/mman.h                 |   4 +-
 arch/sparc/include/asm/mman.h                 |   4 +-
 arch/x86/Kconfig                              |  24 ++
 arch/x86/Kconfig.assembler                    |   5 +
 arch/x86/ia32/ia32_signal.c                   |  25 +-
 arch/x86/include/asm/cet.h                    |  53 +++
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/elf.h                    |  11 +
 arch/x86/include/asm/fpu/types.h              |  23 +-
 arch/x86/include/asm/fpu/xstate.h             |   6 +-
 arch/x86/include/asm/idtentry.h               |   4 +
 arch/x86/include/asm/mman.h                   |  88 ++++
 arch/x86/include/asm/mmu_context.h            |   3 +
 arch/x86/include/asm/msr-index.h              |  19 +
 arch/x86/include/asm/page_types.h             |   7 +
 arch/x86/include/asm/pgtable.h                | 300 ++++++++++++--
 arch/x86/include/asm/pgtable_types.h          |  48 ++-
 arch/x86/include/asm/processor.h              |   5 +
 arch/x86/include/asm/special_insns.h          |  30 ++
 arch/x86/include/asm/trap_pf.h                |   2 +
 arch/x86/include/uapi/asm/mman.h              |  28 +-
 arch/x86/include/uapi/asm/prctl.h             |   4 +
 arch/x86/include/uapi/asm/processor-flags.h   |   2 +
 arch/x86/kernel/Makefile                      |   2 +
 arch/x86/kernel/cet_prctl.c                   |  60 +++
 arch/x86/kernel/cpu/common.c                  |  14 +
 arch/x86/kernel/cpu/cpuid-deps.c              |   2 +
 arch/x86/kernel/fpu/xstate.c                  |  11 +-
 arch/x86/kernel/idt.c                         |   4 +
 arch/x86/kernel/process.c                     |  21 +-
 arch/x86/kernel/process_64.c                  |  27 ++
 arch/x86/kernel/shstk.c                       | 375 ++++++++++++++++++
 arch/x86/kernel/signal.c                      |  13 +
 arch/x86/kernel/signal_compat.c               |   2 +-
 arch/x86/kernel/traps.c                       |  63 +++
 arch/x86/mm/fault.c                           |  19 +
 arch/x86/mm/mmap.c                            |  48 +++
 arch/x86/mm/pat/set_memory.c                  |   2 +-
 arch/x86/mm/pgtable.c                         |  25 ++
 drivers/gpu/drm/i915/gvt/gtt.c                |   2 +-
 fs/aio.c                                      |   2 +-
 fs/binfmt_elf.c                               |   4 +
 fs/proc/task_mmu.c                            |   3 +
 include/linux/elf.h                           |   6 +
 include/linux/mm.h                            |  20 +-
 include/linux/mman.h                          |   2 +-
 include/linux/pgtable.h                       |   7 +
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/linux/elf.h                      |  14 +
 ipc/shm.c                                     |   2 +-
 mm/gup.c                                      |  16 +-
 mm/huge_memory.c                              |  27 +-
 mm/memory.c                                   |   5 +-
 mm/migrate.c                                  |   3 +-
 mm/mmap.c                                     |  17 +-
 mm/mprotect.c                                 |  11 +-
 mm/nommu.c                                    |   4 +-
 mm/util.c                                     |   2 +-
 64 files changed, 1581 insertions(+), 115 deletions(-)
 create mode 100644 Documentation/x86/intel_cet.rst
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/include/asm/mman.h
 create mode 100644 arch/x86/kernel/cet_prctl.c
 create mode 100644 arch/x86/kernel/shstk.c

-- 
2.21.0



* [PATCH v28 01/32] Documentation/x86: Add CET description
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)

Explain no_user_shstk/no_user_ibt kernel parameters, and introduce a new
document on Control-flow Enforcement Technology (CET).
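
The overview in the new document describes how the processor pushes a
return address to both stacks on CALL and compares the two copies on
return.  As a rough illustration only, that check can be modeled in plain
C.  This is a toy software model, not how the hardware or this series
implements it; struct cet_model, model_call(), model_ret() and MODEL_DEPTH
are hypothetical names introduced for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DEPTH 16	/* hypothetical capacity of this toy model */

/*
 * Toy model: 'stack' stands for the normal stack, which software can
 * overwrite; 'shadow' is only touched by model_call()/model_ret().
 */
struct cet_model {
	uint64_t stack[MODEL_DEPTH];
	uint64_t shadow[MODEL_DEPTH];
	size_t depth;
};

/* CALL: push the return address to both stacks. */
static void model_call(struct cet_model *m, uint64_t ret_addr)
{
	assert(m->depth < MODEL_DEPTH);
	m->stack[m->depth] = ret_addr;
	m->shadow[m->depth] = ret_addr;
	m->depth++;
}

/*
 * RET: pop both copies and compare them.  Returns false in the case
 * where real hardware would raise a control-protection fault.
 */
static bool model_ret(struct cet_model *m, uint64_t *ret_addr)
{
	assert(m->depth > 0);
	m->depth--;
	*ret_addr = m->stack[m->depth];
	return m->stack[m->depth] == m->shadow[m->depth];
}
```

A matched call/return pair passes the check; overwriting the saved return
address on the normal stack between the two makes model_ret() report a
mismatch.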

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v28:
- Add a note to indicate disabling shadow stack also disables IBT.
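
For reviewers of the arch_prctl() section in the new document: a minimal
user-space sketch of reading the three-slot status buffer, assuming the
interface exactly as proposed in this series.  The field names and both
helpers are illustrative.  The numeric value of ARCH_X86_CET_STATUS is not
shown in this patch, so the sketch takes it as a parameter instead of
guessing it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field names are illustrative; the document only defines three u64 slots. */
struct cet_status {
	uint64_t features;	/* *addr:       SHSTK/IBT status word */
	uint64_t shstk_base;	/* *(addr + 1): shadow stack base address */
	uint64_t shstk_size;	/* *(addr + 2): shadow stack size */
};

/* Copy the three slots the kernel is documented to fill. */
static void cet_status_unpack(const uint64_t buf[3], struct cet_status *st)
{
	assert(buf != NULL && st != NULL);
	st->features = buf[0];
	st->shstk_base = buf[1];
	st->shstk_size = buf[2];
}

#if defined(__linux__) && defined(__x86_64__)
#include <sys/syscall.h>
#include <unistd.h>

/*
 * 'code' must be the ARCH_X86_CET_STATUS value from the series' uapi
 * <asm/prctl.h> (an assumption of this sketch: the caller supplies it).
 * Returns 0 on success, -1 with errno set otherwise.
 */
static inline int cet_status_query(int code, struct cet_status *st)
{
	uint64_t buf[3] = { 0, 0, 0 };

	if (syscall(SYS_arch_prctl, code, buf) != 0)
		return -1;
	cet_status_unpack(buf, st);
	return 0;
}
#endif
```

On a kernel without this series the call is expected to fail, so callers
should treat an error as "status unknown" rather than "CET off".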

 .../admin-guide/kernel-parameters.txt         |   7 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/intel_cet.rst               | 139 ++++++++++++++++++
 3 files changed, 147 insertions(+)
 create mode 100644 Documentation/x86/intel_cet.rst

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdb22006f713..3bc1a917dfef 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3319,6 +3319,13 @@
 			noexec=on: enable non-executable mappings (default)
 			noexec=off: disable non-executable mappings
 
+	no_user_shstk	[X86-64] Disable Shadow Stack for user-mode
+			applications.  Disabling shadow stack also disables
+			IBT.
+
+	no_user_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
+			applications.
+
 	nosmap		[X86,PPC]
 			Disable SMAP (Supervisor Mode Access Prevention)
 			even if it is supported by processor.
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index 383048396336..c863c5ceb923 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -21,6 +21,7 @@ x86-specific Documentation
    tlb
    mtrr
    pat
+   intel_cet
    intel-iommu
    intel_txt
    amd-memory-encryption
diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
new file mode 100644
index 000000000000..104583353fb9
--- /dev/null
+++ b/Documentation/x86/intel_cet.rst
@@ -0,0 +1,139 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Control-flow Enforcement Technology (CET)
+=========================================
+
+[1] Overview
+============
+
+Control-flow Enforcement Technology (CET) is an Intel processor feature
+that provides protection against return/jump-oriented programming (ROP/JOP)
+attacks.  It can be set up to protect both applications and the kernel.
+Only user-mode protection is implemented in the 64-bit kernel, including
+shadow stack support for running legacy 32-bit applications.  IBT is not
+supported for 32-bit applications.
+
+CET introduces Shadow Stack and Indirect Branch Tracking.  Shadow stack is
+a secondary stack allocated from memory and cannot be directly modified by
+applications.  When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack.  Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy.  If the two differ, the processor raises a
+control-protection fault.  Indirect branch tracking verifies indirect
+CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
+opcodes.
+
+There are two Kconfig options:
+
+    X86_SHADOW_STACK, and X86_IBT.
+
+To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or LLVM v10.0.1
+or later are required.  To build a CET-enabled application, GLIBC v2.28 or
+later is also required.
+
+There are two command-line options for disabling CET features::
+
+    no_user_shstk - disables user shadow stack, and
+    no_user_ibt   - disables user indirect branch tracking.
+
+    Note: Disabling shadow stack also disables IBT.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET.
+
+[2] Application Enabling
+========================
+
+An application's CET capability is marked in its ELF header and can be
+verified from readelf/llvm-readelf output::
+
+    readelf -n <application> | grep -a SHSTK
+        properties: x86 feature: IBT, SHSTK
+
+If an application supports CET and is statically linked, it will run with
+CET protection.  If the application needs any shared libraries, the loader
+checks all dependencies and enables CET when all requirements are met.
+
+[3] Backward Compatibility
+==========================
+
+GLIBC provides a few CET tunables via the GLIBC_TUNABLES environment
+variable:
+
+GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
+    Turn off SHSTK/IBT.
+
+GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
+    This controls how dlopen() handles SHSTK legacy libraries::
+
+        on         - continue with SHSTK enabled;
+        permissive - continue with SHSTK off.
+
+Details can be found in the GLIBC manual pages.
+
+[4] CET arch_prctl()'s
+======================
+
+Several arch_prctl()'s have been added for CET:
+
+arch_prctl(ARCH_X86_CET_STATUS, u64 *addr)
+    Return CET feature status.
+
+    The parameter 'addr' is a pointer to a user buffer.
+    On returning to the caller, the kernel fills the following
+    information::
+
+        *addr       = shadow stack/indirect branch tracking status
+        *(addr + 1) = shadow stack base address
+        *(addr + 2) = shadow stack size
+
+arch_prctl(ARCH_X86_CET_DISABLE, unsigned int features)
+    Disable shadow stack and/or indirect branch tracking as specified in
+    'features'.  Return -EPERM if CET is locked.
+
+arch_prctl(ARCH_X86_CET_LOCK)
+    Lock in all CET features.  They cannot be turned off afterwards.
+
+Note:
+  There is no CET-enabling arch_prctl function.  By design, CET is enabled
+  automatically if the binary and the system can support it.
+
+[5] The implementation of the Shadow Stack
+==========================================
+
+Shadow Stack size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB).  In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB.  However,
+because a compat-mode application's address space is smaller, each of
+its threads' shadow stacks is sized MIN(1/4 RLIMIT_STACK, 4 GB).
+
+Signal
+------
+
+The main program and its signal handlers use the same shadow stack.
+Because the shadow stack stores only return addresses, a large shadow
+stack covers the condition that both the program stack and the signal
+alternate stack run out.
+
+The kernel creates a restore token for the shadow stack restoring address
+and verifies that token when restoring from the signal handler.
+
+Fork
+----
+
+The shadow stack's vma has the VM_SHADOW_STACK flag set; its PTEs are
+required to be read-only and dirty.  When a shadow stack PTE is not both
+RO and dirty, a shadow stack access triggers a page fault with the shadow
+stack access bit set in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread.
-- 
2.21.0
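
The "Shadow Stack size" rule in the document above reduces to a small
computation.  The following is a sketch under stated assumptions, not the
kernel's code: shstk_size(), min_u64() and SZ_4G are names made up here,
and any rounding the real implementation applies is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4G (1ULL << 32)

static_assert(SZ_4G == 4294967296ULL, "the cap is 4 GB");

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * rlimit_stack: the task's RLIMIT_STACK in bytes.
 * compat:       true for a compat-mode (32-bit) task.
 *
 * 64-bit task: MIN(RLIMIT_STACK, 4 GB)
 * compat task: MIN(RLIMIT_STACK / 4, 4 GB)
 */
static uint64_t shstk_size(uint64_t rlimit_stack, bool compat)
{
	uint64_t want = compat ? rlimit_stack / 4 : rlimit_stack;

	return min_u64(want, SZ_4G);
}
```

With the common 8 MB stack limit this gives an 8 MB shadow stack for a
64-bit task and 2 MB per thread for a compat task; an unlimited
RLIMIT_STACK hits the 4 GB cap in both cases.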



* [PATCH v28 02/32] x86/cet/shstk: Add Kconfig option for Shadow Stack
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)

Shadow Stack provides protection against function return address
corruption.  It is active when the processor supports it, the kernel has
CONFIG_X86_SHADOW_STACK enabled, and the application is built for the
feature.  This is only implemented for the 64-bit kernel.  When it is
enabled, legacy non-Shadow Stack applications continue to work, but without
protection.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v25:
- Remove X86_CET and use X86_SHADOW_STACK directly.

v24:
- Update for splitting X86_CET into X86_SHADOW_STACK and X86_IBT.

 arch/x86/Kconfig           | 22 ++++++++++++++++++++++
 arch/x86/Kconfig.assembler |  5 +++++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 49270655e827..de992d3408b2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,6 +26,7 @@ config X86_64
 	depends on 64BIT
 	# Options that are inherently 64-bit kernel only:
 	select ARCH_HAS_GIGANTIC_PAGE
+	select ARCH_HAS_SHADOW_STACK
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select HAVE_ARCH_SOFT_DIRTY
@@ -1909,6 +1910,27 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config ARCH_HAS_SHADOW_STACK
+	def_bool n
+
+config X86_SHADOW_STACK
+	prompt "Intel Shadow Stack"
+	def_bool n
+	depends on AS_WRUSS
+	depends on ARCH_HAS_SHADOW_STACK
+	select ARCH_USES_HIGH_VMA_FLAGS
+	help
+	  Shadow Stack protection is a hardware feature that detects function
+	  return address corruption.  This helps mitigate ROP attacks.
+	  Applications must be enabled to use it, and old userspace does not
+	  get protection "for free".
+	  Support for this feature is present on the Tiger Lake family of
+	  processors released in 2020 or later.  Enabling this feature
+	  increases kernel text size by 3.7 KB.
+	  See Documentation/x86/intel_cet.rst for more information.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 26b8c08e2fc4..00c79dd93651 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -19,3 +19,8 @@ config AS_TPAUSE
 	def_bool $(as-instr,tpause %ecx)
 	help
 	  Supported by binutils >= 2.31.1 and LLVM integrated assembler >= V7
+
+config AS_WRUSS
+	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
+	help
+	  Supported by binutils >= 2.31 and LLVM integrated assembler
-- 
2.21.0



* [PATCH v28 03/32] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)

Add CPU feature flags for Control-flow Enforcement Technology (CET).

CPUID.(EAX=7,ECX=0):ECX[bit 7] Shadow stack
CPUID.(EAX=7,ECX=0):EDX[bit 20] Indirect Branch Tracking
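
The two CPUID bits listed above can also be probed from user space.  A
minimal sketch, assuming GCC's or Clang's <cpuid.h>; struct cet_cpuid,
cet_decode_cpuid7() and cet_read_cpuid7() are illustrative names.  Note
that raw CPUID reports what the processor implements, not whether the
kernel has enabled the feature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID7_ECX_SHSTK	(1u << 7)	/* CPUID.(EAX=7,ECX=0):ECX[7] */
#define CPUID7_EDX_IBT		(1u << 20)	/* CPUID.(EAX=7,ECX=0):EDX[20] */

static_assert(CPUID7_ECX_SHSTK == 0x80u, "shadow stack is ECX bit 7");

struct cet_cpuid {
	bool shstk;
	bool ibt;
};

/* Decode the ECX/EDX values returned by CPUID leaf 7, subleaf 0. */
static struct cet_cpuid cet_decode_cpuid7(uint32_t ecx, uint32_t edx)
{
	struct cet_cpuid r = {
		.shstk = (ecx & CPUID7_ECX_SHSTK) != 0,
		.ibt = (edx & CPUID7_EDX_IBT) != 0,
	};

	return r;
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Returns false if the processor does not implement CPUID leaf 7. */
static inline bool cet_read_cpuid7(struct cet_cpuid *out)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	*out = cet_decode_cpuid7(ecx, edx);
	return true;
}
#endif
```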

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v25:
- Make X86_FEATURE_IBT depend on X86_FEATURE_SHSTK.

v24:
- Update for splitting CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK and CONFIG_X86_IBT.
- Move DISABLE_IBT definition to the IBT series.

 arch/x86/include/asm/cpufeatures.h       | 2 ++
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 arch/x86/kernel/cpu/cpuid-deps.c         | 2 ++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..daa47bcd2050 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -350,6 +350,7 @@
 #define X86_FEATURE_OSPKE		(16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_WAITPKG		(16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
 #define X86_FEATURE_AVX512_VBMI2	(16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK		(16*32+ 7) /* Shadow Stack */
 #define X86_FEATURE_GFNI		(16*32+ 8) /* Galois Field New Instructions */
 #define X86_FEATURE_VAES		(16*32+ 9) /* Vector AES */
 #define X86_FEATURE_VPCLMULQDQ		(16*32+10) /* Carry-Less Multiplication Double Quadword */
@@ -385,6 +386,7 @@
 #define X86_FEATURE_TSXLDTRK		(18*32+16) /* TSX Suspend Load Address Tracking */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_ARCH_LBR		(18*32+19) /* Intel ARCH LBR */
+#define X86_FEATURE_IBT			(18*32+20) /* Indirect Branch Tracking */
 #define X86_FEATURE_AVX512_FP16		(18*32+23) /* AVX512 FP16 */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8f28fafa98b3..b7728f7afb2b 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -65,6 +65,12 @@
 # define DISABLE_SGX	(1 << (X86_FEATURE_SGX & 31))
 #endif
 
+#ifdef CONFIG_X86_SHADOW_STACK
+#define DISABLE_SHSTK	0
+#else
+#define DISABLE_SHSTK	(1 << (X86_FEATURE_SHSTK & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -85,7 +91,7 @@
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD)
+			 DISABLE_ENQCMD|DISABLE_SHSTK)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK19	0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index defda61f372d..e21d97cc20e4 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -75,6 +75,8 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_SGX_LC,			X86_FEATURE_SGX	      },
 	{ X86_FEATURE_SGX1,			X86_FEATURE_SGX       },
 	{ X86_FEATURE_SGX2,			X86_FEATURE_SGX1      },
+	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
+	{ X86_FEATURE_IBT,			X86_FEATURE_SHSTK    },
 	{}
 };
 
-- 
2.21.0



* [PATCH v28 04/32] x86/cpufeatures: Introduce CPU setup and option parsing for CET
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)

Introduce CPU setup and boot option parsing for CET features.
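
The effect of the two boot options, combined with the IBT-on-SHSTK
dependency from the cpuid-deps patch, can be sketched as a user-space
model.  This is an illustration only: has_option() is a simplified
stand-in that matches space-separated words, not the kernel's
cmdline_find_option_bool(), and struct cet_caps and apply_cmdline() are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if 'opt' appears in 'cmdline' as a whole, space-separated word. */
static bool has_option(const char *cmdline, const char *opt)
{
	size_t len = strlen(opt);
	const char *p = cmdline;

	assert(len > 0);
	while ((p = strstr(p, opt)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
		p++;	/* keep scanning after a partial match */
	}
	return false;
}

struct cet_caps {
	bool shstk;
	bool ibt;
};

/* Apply no_user_shstk/no_user_ibt to the hardware capabilities 'hw'. */
static struct cet_caps apply_cmdline(struct cet_caps hw, const char *cmdline)
{
	struct cet_caps c = hw;

	if (has_option(cmdline, "no_user_shstk"))
		c.shstk = false;
	if (has_option(cmdline, "no_user_ibt"))
		c.ibt = false;
	/* IBT depends on SHSTK, so clearing SHSTK also clears IBT. */
	if (!c.shstk)
		c.ibt = false;
	return c;
}
```

Under this model "no_user_ibt" alone leaves shadow stack enabled, while
"no_user_shstk" turns both features off, matching the note in the
documentation patch.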

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v25:
- Remove software-defined X86_FEATURE_CET.

v24:
- Update #ifdef placement to reflect Kconfig changes of splitting shadow stack and ibt.

 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/cpu/common.c                | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..a8df907e8017 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_CET_BIT		23 /* enable Control-flow Enforcement */
+#define X86_CR4_CET		_BITUL(X86_CR4_CET_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 64b805bd6a54..714dd97870ba 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -505,6 +505,14 @@ static __init int setup_disable_pku(char *arg)
 __setup("nopku", setup_disable_pku);
 #endif /* CONFIG_X86_64 */
 
+static __always_inline void setup_cet(struct cpuinfo_x86 *c)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return;
+
+	cr4_set_bits(X86_CR4_CET);
+}
+
 /*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
@@ -1249,6 +1257,11 @@ static void __init cpu_parse_early_param(void)
 	if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
 		setup_clear_cpu_cap(X86_FEATURE_XSAVES);
 
+	if (cmdline_find_option_bool(boot_command_line, "no_user_shstk"))
+		setup_clear_cpu_cap(X86_FEATURE_SHSTK);
+	if (cmdline_find_option_bool(boot_command_line, "no_user_ibt"))
+		setup_clear_cpu_cap(X86_FEATURE_IBT);
+
 	arglen = cmdline_find_option(boot_command_line, "clearcpuid", arg, sizeof(arg));
 	if (arglen <= 0)
 		return;
@@ -1590,6 +1603,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
 	x86_init_rdrand(c);
 	setup_pku(c);
+	setup_cet(c);
 
 	/*
 	 * Clear/Set all flags overridden by options, need do it
-- 
2.21.0



* [PATCH v28 05/32] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (3 preceding siblings ...)
  2021-07-22 20:51 ` [PATCH v28 04/32] x86/cpufeatures: Introduce CPU setup and option parsing for CET Yu-cheng Yu
@ 2021-07-22 20:51 ` Yu-cheng Yu
  2021-08-09 16:46   ` Borislav Petkov
  2021-07-22 20:51 ` [PATCH v28 06/32] x86/cet: Add control-protection fault handler Yu-cheng Yu
                   ` (27 subsequent siblings)
  32 siblings, 1 reply; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)

Control-flow Enforcement Technology (CET) introduces these MSRs:

    MSR_IA32_U_CET (user-mode CET settings),
    MSR_IA32_PL3_SSP (user-mode shadow stack pointer),

    MSR_IA32_PL0_SSP (kernel-mode shadow stack pointer),
    MSR_IA32_PL1_SSP (Privilege Level 1 shadow stack pointer),
    MSR_IA32_PL2_SSP (Privilege Level 2 shadow stack pointer),
    MSR_IA32_S_CET (kernel-mode CET settings),
    MSR_IA32_INT_SSP_TAB (exception shadow stack table).

The two user-mode MSRs belong to XFEATURE_CET_USER.  The first three
kernel-mode MSRs belong to XFEATURE_CET_KERNEL.  Both XSAVES states are
supervisor states.  This means that there is no direct, unprivileged access
to these states, making it harder for an attacker to subvert CET.

For sigreturn and future ptrace() support, the shadow stack address and MSR
reserved bits are checked before being written to the supervisor states.
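
As a quick sanity check of the layout described above, the two state
components can be mirrored in user-space C.  This sketch copies the field
names and component numbers from the patch below but uses uint64_t in
place of the kernel's u64; it is an illustration, not the kernel header.

```c
#include <assert.h>
#include <stdint.h>

/* XSAVE state component numbers, as given in the patch comments. */
enum {
	XFEATURE_CET_USER = 11,
	XFEATURE_CET_KERNEL = 12,
};

#define XFEATURE_MASK_CET_USER		(1u << XFEATURE_CET_USER)
#define XFEATURE_MASK_CET_KERNEL	(1u << XFEATURE_CET_KERNEL)

/* State component 11: MSR_IA32_U_CET and MSR_IA32_PL3_SSP. */
struct cet_user_state {
	uint64_t user_cet;	/* user control-flow settings */
	uint64_t user_ssp;	/* user shadow stack pointer */
};

/* State component 12: MSR_IA32_PL0_SSP, PL1_SSP and PL2_SSP. */
struct cet_kernel_state {
	uint64_t kernel_ssp;	/* kernel shadow stack */
	uint64_t pl1_ssp;	/* privilege level 1 shadow stack */
	uint64_t pl2_ssp;	/* privilege level 2 shadow stack */
};

/* Two and three 64-bit MSR images, with no padding. */
static_assert(sizeof(struct cet_user_state) == 16, "two u64 fields");
static_assert(sizeof(struct cet_kernel_state) == 24, "three u64 fields");
```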

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v28:
- Add XFEATURE_MASK_CET_USER to XFEATURES_INIT_FPSTATE_HANDLED (Rebase to
  upstream changes).

v25:
- Update xsave_cpuid_features[].  Now CET XSAVES features depend on
  X86_FEATURE_SHSTK (vs. the software-defined X86_FEATURE_CET).

 arch/x86/include/asm/fpu/types.h  | 23 +++++++++++++++++++++--
 arch/x86/include/asm/fpu/xstate.h |  6 ++++--
 arch/x86/include/asm/msr-index.h  | 19 +++++++++++++++++++
 arch/x86/kernel/fpu/xstate.c      | 11 ++++++++++-
 4 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f5a38a5f3ae1..035eb0ec665e 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -115,8 +115,8 @@ enum xfeature {
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 	XFEATURE_PKRU,
 	XFEATURE_PASID,
-	XFEATURE_RSRVD_COMP_11,
-	XFEATURE_RSRVD_COMP_12,
+	XFEATURE_CET_USER,
+	XFEATURE_CET_KERNEL,
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
@@ -135,6 +135,8 @@ enum xfeature {
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
+#define XFEATURE_MASK_CET_USER		(1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
@@ -237,6 +239,23 @@ struct pkru_state {
 	u32				pad;
 } __packed;
 
+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+	u64 user_cet;			/* user control-flow settings */
+	u64 user_ssp;			/* user shadow stack pointer */
+};
+
+/*
+ * State component 12 is Control-flow Enforcement kernel states
+ */
+struct cet_kernel_state {
+	u64 kernel_ssp;			/* kernel shadow stack */
+	u64 pl1_ssp;			/* privilege level 1 shadow stack */
+	u64 pl2_ssp;			/* privilege level 2 shadow stack */
+};
+
 /*
  * State component 15: Architectural LBR configuration state.
  * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 109dfcc75299..18cf228ec33c 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -44,7 +44,8 @@
 	(XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
 
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+					    XFEATURE_MASK_CET_USER)
 
 /*
  * A supervisor state component may not always contain valuable information,
@@ -71,7 +72,8 @@
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+					      XFEATURE_MASK_CET_KERNEL)
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..b529f42ddaae 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -939,4 +939,23 @@
 #define MSR_VM_IGNNE                    0xc0010115
 #define MSR_VM_HSAVE_PA                 0xc0010117
 
+/* Control-flow Enforcement Technology MSRs */
+#define MSR_IA32_U_CET		0x000006a0 /* user mode cet setting */
+#define MSR_IA32_S_CET		0x000006a2 /* kernel mode cet setting */
+#define CET_SHSTK_EN		BIT_ULL(0)
+#define CET_WRSS_EN		BIT_ULL(1)
+#define CET_ENDBR_EN		BIT_ULL(2)
+#define CET_LEG_IW_EN		BIT_ULL(3)
+#define CET_NO_TRACK_EN		BIT_ULL(4)
+#define CET_SUPPRESS_DISABLE	BIT_ULL(5)
+#define CET_RESERVED		(BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
+#define CET_SUPPRESS		BIT_ULL(10)
+#define CET_WAIT_ENDBR		BIT_ULL(11)
+
+#define MSR_IA32_PL0_SSP	0x000006a4 /* kernel shadow stack pointer */
+#define MSR_IA32_PL1_SSP	0x000006a5 /* ring-1 shadow stack pointer */
+#define MSR_IA32_PL2_SSP	0x000006a6 /* ring-2 shadow stack pointer */
+#define MSR_IA32_PL3_SSP	0x000006a7 /* user shadow stack pointer */
+#define MSR_IA32_INT_SSP_TAB	0x000006a8 /* exception shadow stack table */
+
 #endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8def1b7f8fb..389bdfed03c1 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -38,6 +38,8 @@ static const char *xfeature_names[] =
 	"Processor Trace (unused)"	,
 	"Protection Keys User registers",
 	"PASID state",
+	"Control-flow User registers"	,
+	"Control-flow Kernel registers"	,
 	"unknown xstate feature"	,
 };
 
@@ -53,6 +55,8 @@ static short xsave_cpuid_features[] __initdata = {
 	X86_FEATURE_INTEL_PT,
 	X86_FEATURE_PKU,
 	X86_FEATURE_ENQCMD,
+	X86_FEATURE_SHSTK, /* XFEATURE_CET_USER */
+	X86_FEATURE_SHSTK, /* XFEATURE_CET_KERNEL */
 };
 
 /*
@@ -236,6 +240,8 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
+	print_xstate_feature(XFEATURE_MASK_CET_USER);
+	print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
 }
 
 /*
@@ -372,6 +378,7 @@ static void __init print_xstate_offset_size(void)
 	 XFEATURE_MASK_PKRU |			\
 	 XFEATURE_MASK_BNDREGS |		\
 	 XFEATURE_MASK_BNDCSR |			\
+	 XFEATURE_MASK_CET_USER |		\
 	 XFEATURE_MASK_PASID)
 
 /*
@@ -532,6 +539,8 @@ static void check_xstate_against_struct(int nr)
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
+	XCHECK_SZ(sz, nr, XFEATURE_CET_USER,   struct cet_user_state);
+	XCHECK_SZ(sz, nr, XFEATURE_CET_KERNEL, struct cet_kernel_state);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
@@ -541,7 +550,7 @@ static void check_xstate_against_struct(int nr)
 	if ((nr < XFEATURE_YMM) ||
 	    (nr >= XFEATURE_MAX) ||
 	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
-	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_LBR))) {
+	    ((nr >= XFEATURE_RSRVD_COMP_13) && (nr <= XFEATURE_LBR))) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 	}
-- 
2.21.0



* [PATCH v28 06/32] x86/cet: Add control-protection fault handler
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (4 preceding siblings ...)
  2021-07-22 20:51 ` [PATCH v28 05/32] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Yu-cheng Yu
@ 2021-07-22 20:51 ` Yu-cheng Yu
  2021-08-09 17:51   ` Borislav Petkov
  2021-07-22 20:51 ` [PATCH v28 07/32] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Yu-cheng Yu
                   ` (26 subsequent siblings)
  32 siblings, 1 reply; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works much like the general
protection fault handler: it sends SIGSEGV to the faulting task, with
si_code set to SEGV_CPERR.
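
For illustration only, a userspace consumer might tell this fault apart from
other SIGSEGV causes roughly as sketched below.  The helper name
segv_reason() is hypothetical; the fallback value 10 for SEGV_CPERR is taken
from the siginfo.h hunk in this patch and is only needed with headers that
do not define it yet.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>	/* for checks such as the ones in the note below */
#include <signal.h>
#include <string.h>

/*
 * SEGV_CPERR is introduced by this patch; older libc headers will not
 * have it.  The fallback value 10 is taken from the siginfo.h hunk.
 */
#ifndef SEGV_CPERR
#define SEGV_CPERR 10
#endif

/* Hypothetical helper: describe a SIGSEGV by its si_code. */
static inline const char *segv_reason(const siginfo_t *info)
{
	if (info->si_signo != SIGSEGV)
		return "not a SIGSEGV";
	if (info->si_code == SEGV_CPERR)
		return "control-protection fault";
	return "other segmentation fault";
}
```

A handler installed with SA_SIGINFO could pass its siginfo_t argument to
segv_reason() to decide how to report the fault.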

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
v25:
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.

 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 63 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..a90791433152 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -562,6 +562,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_SHADOW_STACK
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..9f1bdaabc246 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -113,6 +113,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_SHADOW_STACK
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 06743ec054d2..049ea3dcc6cb 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..58664374ae8a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -607,6 +608,68 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		pr_emerg("PANIC: unexpected kernel control protection fault\n");
+		die("kernel control protection fault", regs, error_code);
+		panic("Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 5a3c221f4c9d..a1a153ea3cc3 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -235,7 +235,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0



* [PATCH v28 07/32] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (5 preceding siblings ...)
  2021-07-22 20:51 ` [PATCH v28 06/32] x86/cet: Add control-protection fault handler Yu-cheng Yu
@ 2021-07-22 20:51 ` Yu-cheng Yu
  2021-07-22 20:51 ` [PATCH v28 08/32] x86/mm: Move pmd_write(), pud_write() up in the file Yu-cheng Yu
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov, Christoph Hellwig

The x86 family of processors does not directly create read-only and Dirty
PTEs.  These PTEs are created by software.  One such case is that kernel
read-only pages are historically set up as Dirty.

New processors that support Shadow Stack regard read-only and Dirty PTEs as
shadow stack pages.  This results in ambiguity between shadow stack and
kernel read-only pages.  To resolve this, remove Dirty from kernel
read-only pages.
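
The ambiguity can be seen in a small userspace model (a sketch, not kernel
code).  The MODEL_* names are hypothetical stand-ins for the x86 PTE flag
bits (Present=0, RW=1, Accessed=5, Dirty=6, Global=8, NX=63), and the
"old" and "new" values mirror the __PAGE_KERNEL_RO change in this patch.

```c
#include <assert.h>	/* for checks such as the ones in the note below */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the x86 PTE flag bits. */
#define MODEL_PRESENT	(1ULL << 0)
#define MODEL_RW	(1ULL << 1)
#define MODEL_ACCESSED	(1ULL << 5)
#define MODEL_DIRTY	(1ULL << 6)
#define MODEL_GLOBAL	(1ULL << 8)
#define MODEL_NX	(1ULL << 63)

/* __PAGE_KERNEL_RO before and after this patch. */
#define MODEL_KERNEL_RO_OLD \
	(MODEL_PRESENT | MODEL_ACCESSED | MODEL_NX | MODEL_DIRTY | MODEL_GLOBAL)
#define MODEL_KERNEL_RO_NEW \
	(MODEL_PRESENT | MODEL_ACCESSED | MODEL_NX | MODEL_GLOBAL)

/* Write=0, Dirty=1 is what shadow stack hardware treats as shadow stack. */
static inline bool model_looks_like_shstk(uint64_t flags)
{
	return (flags & (MODEL_RW | MODEL_DIRTY)) == MODEL_DIRTY;
}

/* Models set_memory_ro() after this patch: clear RW and Dirty together. */
static inline uint64_t model_set_ro(uint64_t flags)
{
	return flags & ~(MODEL_RW | MODEL_DIRTY);
}
```

In this model the old read-only value satisfies model_looks_like_shstk(),
while the new value and anything passed through model_set_ro() do not.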

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable_types.h | 6 +++---
 arch/x86/mm/pat/set_memory.c         | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..3781a79b6388 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -190,10 +190,10 @@ enum page_cache_mode {
 #define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
-#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
+#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
+#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
-#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_LARGE	 (__PP|__RW|   0|___A|__NX|___D|_PSE|___G)
 #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
 #define __PAGE_KERNEL_WP	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index ad8a5c586a35..a05e630cb531 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1940,7 +1940,7 @@ int set_memory_nx(unsigned long addr, int numpages)
 
 int set_memory_ro(unsigned long addr, int numpages)
 {
-	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
 }
 
 int set_memory_rw(unsigned long addr, int numpages)
-- 
2.21.0



* [PATCH v28 08/32] x86/mm: Move pmd_write(), pud_write() up in the file
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (6 preceding siblings ...)
  2021-07-22 20:51 ` [PATCH v28 07/32] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Yu-cheng Yu
@ 2021-07-22 20:51 ` Yu-cheng Yu
  2021-08-09 18:02   ` Borislav Petkov
  2021-07-22 20:51 ` [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW Yu-cheng Yu
                   ` (24 subsequent siblings)
  32 siblings, 1 reply; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

To prepare for the introduction of _PAGE_COW, move pmd_write() and
pud_write() up in the file, so that they can be used by other
helpers below.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..0ddeda0bc0c0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -156,6 +156,18 @@ static inline int pte_write(pte_t pte)
 	return pte_flags(pte) & _PAGE_RW;
 }
 
+#define pmd_write pmd_write
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define pud_write pud_write
+static inline int pud_write(pud_t pud)
+{
+	return pud_flags(pud) & _PAGE_RW;
+}
+
 static inline int pte_huge(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_PSE;
@@ -1099,12 +1111,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
 
-#define pmd_write pmd_write
-static inline int pmd_write(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_RW;
-}
-
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long addr,
 				       pmd_t *pmdp)
@@ -1126,12 +1132,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-#define pud_write pud_write
-static inline int pud_write(pud_t pud)
-{
-	return pud_flags(pud) & _PAGE_RW;
-}
-
 #ifndef pmdp_establish
 #define pmdp_establish pmdp_establish
 static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
-- 
2.21.0



* [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (7 preceding siblings ...)
  2021-07-22 20:51 ` [PATCH v28 08/32] x86/mm: Move pmd_write(), pud_write() up in the file Yu-cheng Yu
@ 2021-07-22 20:51 ` Yu-cheng Yu
  2021-08-16 10:43   ` Borislav Petkov
  2021-07-22 20:51 ` [PATCH v28 10/32] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
                   ` (23 subsequent siblings)
  32 siblings, 1 reply; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

There is essentially no room left in the x86 hardware PTEs on some OSes
(not Linux).  That left the hardware architects looking for a way to
represent a new memory type (shadow stack) within the existing bits.
They chose to repurpose a lightly-used state: Write=0, Dirty=1.

The reason it's lightly used is that Dirty=1 is normally set by hardware,
and hardware cannot normally set it on a Write=0 PTE.  Software must
normally be involved to create one of these PTEs, so software can simply
opt to not create them.

In places where Linux normally creates Write=0, Dirty=1, it can use the
software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY.  In other
words, whenever Linux needs to create Write=0, Dirty=1, it instead creates
Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1.  This
clearly separates shadow stack from other data, and results in the
following:

(a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
(b) A R/O page that has been COW'ed: (Write=0, Cow=1)
    The user page is in a R/O VMA, and get_user_pages() needs a writable
    copy.  The page fault handler creates a copy of the page and sets
    the new copy's PTE as Write=0 and Cow=1.
(c) A shadow stack PTE: (Write=0, Dirty=1)
(d) A shared shadow stack PTE: (Write=0, Cow=1)
    When a shadow stack page is being shared among processes (this happens
    at fork()), its PTE is made Dirty=0, so the next shadow stack access
    causes a fault, and the page is duplicated and Dirty=1 is set again.
    This is the COW equivalent for shadow stack pages, even though it's
    copy-on-access rather than copy-on-write.
(e) A page where the processor observed a Write=1 PTE, started a write, set
    Dirty=1, but then observed a Write=0 PTE.  That's possible today, but
    will not happen on processors that support shadow stack.

Define _PAGE_COW, update the pte_*() helpers, and apply the same changes
to the pmd and pud helpers.

After this, there are six free bits left in the 64-bit PTE, and no more
free bits in the 32-bit PTE (except for PAE).  Shadow Stack is not
implemented for the 32-bit kernel.
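
The Dirty/Cow shuffling described above can be modelled in userspace as
follows.  This is a simplified sketch that assumes shadow stack is enabled
(the real helpers first test X86_FEATURE_SHSTK); the MODEL_* and model_*
names are hypothetical, and only the bit positions (RW=1, Dirty=6, Cow=58)
follow the kernel definitions.

```c
#include <assert.h>	/* for checks such as the ones in the note below */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t model_pteval_t;

#define MODEL_RW	((model_pteval_t)1 << 1)
#define MODEL_DIRTY	((model_pteval_t)1 << 6)
#define MODEL_COW	((model_pteval_t)1 << 58)	/* _PAGE_BIT_SOFTW5 */
#define MODEL_DIRTY_BITS (MODEL_DIRTY | MODEL_COW)

/* Dirty means hardware Dirty=1 or software Cow=1. */
static inline bool model_pte_dirty(model_pteval_t v)
{
	return (v & MODEL_DIRTY_BITS) != 0;
}

/* A shadow stack PTE is Write=0, Dirty=1. */
static inline bool model_pte_shstk(model_pteval_t v)
{
	return (v & (MODEL_RW | MODEL_DIRTY)) == MODEL_DIRTY;
}

/* Clear RW; if dirty, move the hardware Dirty bit to the Cow bit. */
static inline model_pteval_t model_pte_wrprotect(model_pteval_t v)
{
	v &= ~MODEL_RW;
	if (model_pte_dirty(v))
		v = (v & ~MODEL_DIRTY) | MODEL_COW;
	return v;
}

/* Set RW; if dirty, move the Cow bit back to the hardware Dirty bit. */
static inline model_pteval_t model_pte_mkwrite(model_pteval_t v)
{
	v |= MODEL_RW;
	if (model_pte_dirty(v))
		v = (v | MODEL_DIRTY) & ~MODEL_COW;
	return v;
}
```

In this model, write-protecting a Write=1, Dirty=1 value yields Write=0,
Cow=1, which is still reported as dirty but is not a shadow stack PTE, and
model_pte_mkwrite() restores the original value.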

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h       | 196 ++++++++++++++++++++++++---
 arch/x86/include/asm/pgtable_types.h |  42 +++++-
 2 files changed, 217 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0ddeda0bc0c0..9f1ba76ed79a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -121,9 +121,20 @@ extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	/*
+	 * A dirty PTE has Dirty=1 or Cow=1.
+	 */
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pte_young(pte_t pte)
@@ -131,9 +142,20 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
+{
+	/*
+	 * A dirty PMD has Dirty=1 or Cow=1.
+	 */
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -141,9 +163,12 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	/*
+	 * A dirty PUD has Dirty=1 or Cow=1.
+	 */
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -153,13 +178,23 @@ static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are always writable - but not by normal
+	 * instructions, and only by shadow stack operations.  Therefore,
+	 * the W=0,D=1 test with pte_shstk().
+	 */
+	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
 }
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are always writable - but not by normal
+	 * instructions, and only by shadow stack operations.  Therefore,
+	 * the W=0,D=1 test with pmd_shstk().
+	 */
+	return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
 }
 
 #define pud_write pud_write
@@ -297,6 +332,24 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+static inline pte_t pte_mkcow(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	pte = pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_set_flags(pte, _PAGE_COW);
+}
+
+static inline pte_t pte_clear_cow(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	pte = pte_set_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
@@ -316,7 +369,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -326,7 +379,16 @@ static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_RW);
+	pte = pte_clear_flags(pte, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (RW=0, Dirty=1).  Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pte_dirty(pte))
+		pte = pte_mkcow(pte);
+	return pte;
 }
 
 static inline pte_t pte_mkexec(pte_t pte)
@@ -336,7 +398,18 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pteval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PTEs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_COW;
+
+	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	return pte_clear_cow(pte);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -346,7 +419,12 @@ static inline pte_t pte_mkyoung(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_RW);
+	pte = pte_set_flags(pte, _PAGE_RW);
+
+	if (pte_dirty(pte))
+		pte = pte_clear_cow(pte);
+
+	return pte;
 }
 
 static inline pte_t pte_mkhuge(pte_t pte)
@@ -393,6 +471,24 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+static inline pmd_t pmd_mkcow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_COW);
+}
+
+static inline pmd_t pmd_clear_cow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pmd_uffd_wp(pmd_t pmd)
 {
@@ -417,17 +513,36 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_RW);
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (RW=0, Dirty=1).  Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pmd_dirty(pmd))
+		pmd = pmd_mkcow(pmd);
+	return pmd;
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pmdval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pmd_write(pmd))
+		dirty = _PAGE_COW;
+
+	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	return pmd_clear_cow(pmd);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -447,7 +562,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_RW);
+	pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+	if (pmd_dirty(pmd))
+		pmd = pmd_clear_cow(pmd);
+	return pmd;
 }
 
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
@@ -464,6 +583,24 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
 	return native_make_pud(v & ~clear);
 }
 
+static inline pud_t pud_mkcow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_set_flags(pud, _PAGE_COW);
+}
+
+static inline pud_t pud_clear_cow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_set_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_COW);
+}
+
 static inline pud_t pud_mkold(pud_t pud)
 {
 	return pud_clear_flags(pud, _PAGE_ACCESSED);
@@ -471,17 +608,32 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_RW);
+	pud = pud_clear_flags(pud, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (RW=0, Dirty=1).  Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pud_dirty(pud))
+		pud = pud_mkcow(pud);
+	return pud;
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pud_write(pud))
+		dirty = _PAGE_COW;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -501,7 +653,11 @@ static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_RW);
+	pud = pud_set_flags(pud, _PAGE_RW);
+
+	if (pud_dirty(pud))
+		pud = pud_clear_cow(pud);
+	return pud;
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3781a79b6388..1bfab70ff9ac 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * Indicates a copy-on-write page.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_BIT_COW		_PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW		0
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -115,6 +125,36 @@
 #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
+ * from shadow stack PTEs:
+ * (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
+ * (b) A R/O page that has been COW'ed: (Write=0, Cow=1)
+ *     The user page is in a R/O VMA, and get_user_pages() needs a
+ *     writable copy.  The page fault handler creates a copy of the page
+ *     and sets the new copy's PTE as Write=0, Cow=1.
+ * (c) A shadow stack PTE: (Write=0, Dirty=1)
+ * (d) A shared (copy-on-access) shadow stack PTE: (Write=0, Cow=1)
+ *     When a shadow stack page is being shared among processes (this
+ *     happens at fork()), its PTE is cleared of _PAGE_DIRTY, so the next
+ *     shadow stack access causes a fault, and the page is duplicated and
+ *     _PAGE_DIRTY is set again.  This is the COW equivalent for shadow
+ *     stack pages, even though it's copy-on-access rather than
+ *     copy-on-write.
+ * (e) A page where the processor observed a Write=1 PTE, started a write,
+ *     set Dirty=1, but then observed a Write=0 PTE (changed by another
+ *     thread).  That's possible today, but will not happen on processors
+ *     that support shadow stack.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_COW	(_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*
-- 
2.21.0



* [PATCH v28 10/32] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (8 preceding siblings ...)
  2021-07-22 20:51 ` [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW Yu-cheng Yu
@ 2021-07-22 20:51 ` Yu-cheng Yu
  2021-07-22 20:51 ` [PATCH v28 11/32] x86/mm: Update pte_modify for _PAGE_COW Yu-cheng Yu
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov, David Airlie, Joonas Lahtinen,
	Jani Nikula, Daniel Vetter, Rodrigo Vivi, Zhenyu Wang, Zhi Wang

After the introduction of _PAGE_COW, a modified page's PTE can have either
_PAGE_DIRTY or _PAGE_COW.  Change _PAGE_DIRTY to _PAGE_DIRTY_BITS.
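
As a rough userspace illustration (not kernel code), the mask change can be
sketched as below.  The bit positions follow this series (_PAGE_COW on
_PAGE_BIT_SOFTW5, bit 58); the PAGE_* names and clear_dirty_field() are
hypothetical stand-ins.

```c
#include <stdint.h>

/* Userspace model of the bits involved; values mirror x86 but the names
 * are illustrative only. */
#define PAGE_DIRTY      (UINT64_C(1) << 6)   /* hardware Dirty bit */
#define PAGE_COW        (UINT64_C(1) << 58)  /* software bit, SOFTW5 here */
#define PAGE_DIRTY_BITS (PAGE_DIRTY | PAGE_COW)

/* Hypothetical helper: clear every bit that can mean "modified". */
static uint64_t clear_dirty_field(uint64_t val)
{
	return val & ~PAGE_DIRTY_BITS;
}
```

Clearing only PAGE_DIRTY would leave PAGE_COW behind, so an entry that was
modified and then write-protected would still look dirty.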

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
---
 drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
index cc2c05e18206..ca232c822484 100644
--- a/drivers/gpu/drm/i915/gvt/gtt.c
+++ b/drivers/gpu/drm/i915/gvt/gtt.c
@@ -1210,7 +1210,7 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
 	}
 
 	/* Clear dirty field. */
-	se->val64 &= ~_PAGE_DIRTY;
+	se->val64 &= ~_PAGE_DIRTY_BITS;
 
 	ops->clear_pse(se);
 	ops->clear_ips(se);
-- 
2.21.0



* [PATCH v28 11/32] x86/mm: Update pte_modify for _PAGE_COW
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (9 preceding siblings ...)
  2021-07-22 20:51 ` [PATCH v28 10/32] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
@ 2021-07-22 20:51 ` Yu-cheng Yu
  2021-07-22 20:51 ` [PATCH v28 12/32] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Yu-cheng Yu
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

A read-only and Dirty PTE has been used to indicate copy-on-write pages.
However, newer x86 processors also regard a read-only and Dirty PTE as a
shadow stack page.  In order to separate the two, the software-defined
_PAGE_COW is created to replace _PAGE_DIRTY for the copy-on-write case, and
pte_*() are updated.

pte_modify() changes a PTE to 'newprot', but it does not go through the
pte_*() helpers.  Introduce fixup_dirty_pte(), which sets a dirty PTE,
based on _PAGE_RW, to either _PAGE_DIRTY or _PAGE_COW.

Apply the same changes to pmd_modify().
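
The effect of the fixup can be sketched in userspace as below.  This is a
simplified model under stated assumptions: shadow stack is taken as
enabled, soft-dirty tracking is left out, and the model_*() functions and
PAGE_* names are hypothetical stand-ins for the kernel helpers.

```c
#include <stdint.h>

#define PAGE_RW         (UINT64_C(1) << 1)
#define PAGE_DIRTY      (UINT64_C(1) << 6)
#define PAGE_COW        (UINT64_C(1) << 58)
#define PAGE_DIRTY_BITS (PAGE_DIRTY | PAGE_COW)

/* Model of pte_mkclean(): drop both "modified" bits. */
static uint64_t model_mkclean(uint64_t pte)
{
	return pte & ~PAGE_DIRTY_BITS;
}

/* Model of pte_mkdirty() with shadow stack enabled: a writable PTE gets
 * the hardware Dirty bit, a read-only PTE gets the software COW bit. */
static uint64_t model_mkdirty(uint64_t pte)
{
	return pte | ((pte & PAGE_RW) ? PAGE_DIRTY : PAGE_COW);
}

/* Model of fixup_dirty_pte(): re-derive the dirty bit from the new RW. */
static uint64_t model_fixup_dirty(uint64_t pte)
{
	if (pte & PAGE_DIRTY_BITS)
		pte = model_mkdirty(model_mkclean(pte));
	return pte;
}
```

In this model a PTE that ends up read-only and Dirty after the protection
change is rewritten to read-only and Cow, and a PTE that ends up writable
and Cow is rewritten to writable and Dirty.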

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 37 ++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 9f1ba76ed79a..cf7316e968df 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -771,6 +771,23 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
+static inline pteval_t fixup_dirty_pte(pteval_t pteval)
+{
+	pte_t pte = __pte(pteval);
+
+	/*
+	 * Fix up potential shadow stack page flags because the RO, Dirty
+	 * PTE is special.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pte_dirty(pte)) {
+			pte = pte_mkclean(pte);
+			pte = pte_mkdirty(pte);
+		}
+	}
+	return pte_val(pte);
+}
+
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	pteval_t val = pte_val(pte), oldval = val;
@@ -781,16 +798,36 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 */
 	val &= _PAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+	val = fixup_dirty_pte(val);
 	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
 	return __pte(val);
 }
 
+static inline int pmd_write(pmd_t pmd);
+static inline pmdval_t fixup_dirty_pmd(pmdval_t pmdval)
+{
+	pmd_t pmd = __pmd(pmdval);
+
+	/*
+	 * Fix up potential shadow stack page flags because the RO, Dirty
+	 * PMD is special.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pmd_dirty(pmd)) {
+			pmd = pmd_mkclean(pmd);
+			pmd = pmd_mkdirty(pmd);
+		}
+	}
+	return pmd_val(pmd);
+}
+
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
 	pmdval_t val = pmd_val(pmd), oldval = val;
 
 	val &= _HPAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+	val = fixup_dirty_pmd(val);
 	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
 	return __pmd(val);
 }
-- 
2.21.0



* [PATCH v28 12/32] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (10 preceding siblings ...)
  2021-07-22 20:51 ` [PATCH v28 11/32] x86/mm: Update pte_modify for _PAGE_COW Yu-cheng Yu
@ 2021-07-22 20:51 ` Yu-cheng Yu
  2021-08-16 16:01   ` Borislav Petkov
  2021-07-22 20:52 ` [PATCH v28 13/32] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Yu-cheng Yu
                   ` (20 subsequent siblings)
  32 siblings, 1 reply; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

When Shadow Stack is introduced, [R/O + _PAGE_DIRTY] PTE is reserved for
shadow stack.  Copy-on-write PTEs have [R/O + _PAGE_COW].

When a PTE goes from [R/W + _PAGE_DIRTY] to [R/O + _PAGE_COW], it could
become a transient shadow stack PTE in two cases:

The first case is that some processors can start a write but end up seeing
a read-only PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE.  However, this will not occur on processors supporting
Shadow Stack, and a TLB flush is not necessary.

The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW
non-atomically, a transient shadow stack PTE can be created as a result.
Thus, prevent that with cmpxchg.
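
The retry loop can be sketched in userspace with C11 atomics, as below.
This is only a model: atomic_compare_exchange_weak() stands in for the
kernel's try_cmpxchg(), and model_wrprotect() is a hypothetical
simplification of pte_wrprotect() with shadow stack enabled.

```c
#include <stdatomic.h>
#include <stdint.h>

#define PAGE_RW    (UINT64_C(1) << 1)
#define PAGE_DIRTY (UINT64_C(1) << 6)
#define PAGE_COW   (UINT64_C(1) << 58)

/* Model of pte_wrprotect(): clear RW and move Dirty to Cow. */
static uint64_t model_wrprotect(uint64_t pte)
{
	uint64_t was_dirty = pte & PAGE_DIRTY;

	pte &= ~(PAGE_RW | PAGE_DIRTY);
	if (was_dirty)
		pte |= PAGE_COW;
	return pte;
}

/* Model of ptep_set_wrprotect(): compute the new value from a snapshot
 * and publish it only if the entry has not changed in the meantime.  On
 * failure the compare-exchange refreshes 'old' and the loop recomputes. */
static void model_set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t next;

	do {
		next = model_wrprotect(old);
	} while (!atomic_compare_exchange_weak(ptep, &old, next));
}
```

Because the whole entry is swapped in one step, no observer can see the
intermediate Write=0, Dirty=1 value that clearing RW first and moving the
dirty bit afterwards would expose.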

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights into the issue.  Jann Horn provided the cmpxchg solution.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index cf7316e968df..df4ce715560a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1278,6 +1278,24 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+	/*
+	 * If Shadow Stack is enabled, pte_wrprotect() moves _PAGE_DIRTY
+	 * to _PAGE_COW (see comments at pte_wrprotect()).
+	 * When a thread reads a RW=1, Dirty=0 PTE and before changing it
+	 * to RW=0, Dirty=0, another thread could have written to the page
+	 * and the PTE is RW=1, Dirty=1 now.  Use try_cmpxchg() to detect
+	 * PTE changes and update old_pte, then try again.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pte_t old_pte, new_pte;
+
+		old_pte = READ_ONCE(*ptep);
+		do {
+			new_pte = pte_wrprotect(old_pte);
+		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+		return;
+	}
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
@@ -1322,6 +1340,24 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+	/*
+	 * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
+	 * to _PAGE_COW (see comments at pmd_wrprotect()).
+	 * When a thread reads a RW=1, Dirty=0 PMD and before changing it
+	 * to RW=0, Dirty=0, another thread could have written to the page
+	 * and the PMD is RW=1, Dirty=1 now.  Use try_cmpxchg() to detect
+	 * PMD changes and update old_pmd, then try again.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pmd_t old_pmd, new_pmd;
+
+		old_pmd = READ_ONCE(*pmdp);
+		do {
+			new_pmd = pmd_wrprotect(old_pmd);
+		} while (!try_cmpxchg((pmdval_t *)pmdp, (pmdval_t *)&old_pmd, pmd_val(new_pmd)));
+
+		return;
+	}
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-- 
2.21.0



* [PATCH v28 13/32] mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (11 preceding siblings ...)
  2021-07-22 20:51 ` [PATCH v28 12/32] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 14/32] mm: Introduce VM_SHADOW_STACK for shadow stack memory Yu-cheng Yu
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Axel Rasmussen, Peter Xu

To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT_5 (37) and keep all
VM_HIGH_ARCH_BITs together, move VM_UFFD_MINOR_BIT from 37 to 38.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..1d4e5012c27d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -370,7 +370,7 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT	37
+# define VM_UFFD_MINOR_BIT	38
 # define VM_UFFD_MINOR		BIT(VM_UFFD_MINOR_BIT)	/* UFFD minor faults */
 #else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 # define VM_UFFD_MINOR		VM_NONE
-- 
2.21.0



* [PATCH v28 14/32] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (12 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 13/32] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-08-16 16:35   ` Borislav Petkov
  2021-07-22 20:52 ` [PATCH v28 15/32] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
                   ` (18 subsequent siblings)
  32 siblings, 1 reply; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

A shadow stack PTE must be read-only and have _PAGE_DIRTY set.  However,
read-only and Dirty PTEs also exist for copy-on-write (COW) pages.  These
two cases are handled differently for page faults.  Introduce
VM_SHADOW_STACK to track shadow stack VMAs.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
 Documentation/filesystems/proc.rst | 1 +
 arch/x86/mm/mmap.c                 | 2 ++
 fs/proc/task_mmu.c                 | 3 +++
 include/linux/mm.h                 | 8 ++++++++
 4 files changed, 14 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 042c418f4090..3473f1aa7e89 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -553,6 +553,7 @@ encoded manner. The codes are the following:
     mt    arm64 MTE allocation tags are enabled
     um    userfaultfd missing tracking
     uw    userfaultfd wr-protect tracking
+    ss    shadow stack page
     ==    =======================================
 
 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index c90c20904a60..f3f52c5e2fd6 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
 
 const char *arch_vma_name(struct vm_area_struct *vma)
 {
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return "[shadow stack]";
 	return NULL;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..02c70198b989 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -662,6 +662,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
+		[ilog2(VM_SHADOW_STACK)]= "ss",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1d4e5012c27d..4a9985e50819 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -319,11 +319,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -339,6 +341,12 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_X86_SHADOW_STACK
+# define VM_SHADOW_STACK	VM_HIGH_ARCH_5
+#else
+# define VM_SHADOW_STACK	VM_NONE
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
-- 
2.21.0



* [PATCH v28 15/32] x86/mm: Shadow Stack page fault error checking
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (13 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 14/32] mm: Introduce VM_SHADOW_STACK for shadow stack memory Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 16/32] x86/mm: Update maybe_mkwrite() for shadow stack Yu-cheng Yu
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

Shadow stack accesses are those that are performed by the CPU where it
expects to encounter a shadow stack mapping.  These accesses are performed
implicitly by CALL/RET at the site of the shadow stack pointer, and
explicitly by shadow stack management instructions like WRUSSQ.

Shadow stack accesses to shadow stack mappings can fault in normal, valid
operation just like regular accesses to regular mappings.  Shadow stacks
need some of the same features, such as delayed allocation, swap and
copy-on-write.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping.

In handling a shadow stack page fault, verify it occurs within a shadow
stack mapping.  It is always an error otherwise.  For valid shadow stack
accesses, set FAULT_FLAG_WRITE to effect copy-on-write.  Because clearing
_PAGE_DIRTY (vs. _PAGE_RW) is used to trigger the fault, shadow stack read
fault and shadow stack write fault are not differentiated and both are
handled as a write access.
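
The two checks described above can be sketched in userspace as below.
The PF_* values mirror the x86 page fault error code bits (bit 1 for
write, bit 6 for shadow stack); the VMF_* flags and model_*() functions
are hypothetical stand-ins, and all other access checks are left out.

```c
#include <stdbool.h>

#define PF_WRITE         (1u << 1)  /* models X86_PF_WRITE */
#define PF_SHSTK         (1u << 6)  /* models X86_PF_SHSTK */

#define VMF_WRITE        (1u << 0)  /* stands in for VM_WRITE */
#define VMF_SHADOW_STACK (1u << 1)  /* stands in for VM_SHADOW_STACK */

/* Model of the access_error() additions: true means "bad access". */
static bool model_access_error(unsigned int error_code, unsigned int vm_flags)
{
	/* A shadow stack access is valid only inside a shadow stack VMA. */
	if (error_code & PF_SHSTK)
		return !(vm_flags & VMF_SHADOW_STACK);

	/* An ordinary write needs a writable VMA. */
	if (error_code & PF_WRITE)
		return !(vm_flags & VMF_WRITE);

	return false;
}

/* Model of the do_user_addr_fault() addition: shadow stack faults are
 * always handled as writes so that copy-on-write is triggered. */
static bool model_fault_is_write(unsigned int error_code)
{
	return (error_code & (PF_SHSTK | PF_WRITE)) != 0;
}
```

Note that the shadow stack check returns early, so a shadow stack fault
never reaches the ordinary write check in this model.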

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/trap_pf.h |  2 ++
 arch/x86/mm/fault.c            | 19 +++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..afa524325e55 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -11,6 +11,7 @@
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
  *   bit 5 ==				1: protection keys block access
+ *   bit 6 ==				1: shadow stack access fault
  *   bit 15 ==				1: SGX MMU page-fault
  */
 enum x86_pf_error_code {
@@ -20,6 +21,7 @@ enum x86_pf_error_code {
 	X86_PF_RSVD	=		1 << 3,
 	X86_PF_INSTR	=		1 << 4,
 	X86_PF_PK	=		1 << 5,
+	X86_PF_SHSTK	=		1 << 6,
 	X86_PF_SGX	=		1 << 15,
 };
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b2eefdefc108..ad3350297e4b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1100,6 +1100,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
+	/*
+	 * Verify a shadow stack access is within a shadow stack VMA.
+	 * It is always an error otherwise.  Normal data access to a
+	 * shadow stack area is checked in the cases that follow.
+	 */
+	if (error_code & X86_PF_SHSTK) {
+		if (!(vma->vm_flags & VM_SHADOW_STACK))
+			return 1;
+		return 0;
+	}
+
 	if (error_code & X86_PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
@@ -1293,6 +1304,14 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
+	/*
+	 * Clearing _PAGE_DIRTY is used to detect shadow stack access.
+	 * This method cannot distinguish shadow stack read vs. write.
+	 * For valid shadow stack accesses, set FAULT_FLAG_WRITE to effect
+	 * copy-on-write.
+	 */
+	if (error_code & X86_PF_SHSTK)
+		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_INSTR)
-- 
2.21.0



* [PATCH v28 16/32] x86/mm: Update maybe_mkwrite() for shadow stack
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (14 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 15/32] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-08-16 17:03   ` Borislav Petkov
  2021-07-22 20:52 ` [PATCH v28 17/32] mm: Fixup places that call pte_mkwrite() directly Yu-cheng Yu
                   ` (16 subsequent siblings)
  32 siblings, 1 reply; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

When serving a page fault, maybe_mkwrite() makes a PTE writable if its vma
has VM_WRITE.

A shadow stack vma has VM_SHADOW_STACK.  Its PTEs have _PAGE_DIRTY, but not
_PAGE_RW.  In fork(), _PAGE_DIRTY is cleared to cause copy-on-write, and in
the page fault handler, _PAGE_DIRTY is restored and the shadow stack page
is writable again.

Introduce an x86 version of maybe_mkwrite(), which sets proper PTE bits
according to VM flags.

Apply the same changes to maybe_pmd_mkwrite().
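
The resulting PTE states can be sketched in userspace as below.  This is
a model under assumptions: model_mkwrite_shstk() assumes that
pte_mkwrite_shstk() leaves the PTE as Write=0, Dirty=1, Cow=0, and the
VMF_* flags and model_*() functions are hypothetical stand-ins.

```c
#include <stdint.h>

#define PAGE_RW          (UINT64_C(1) << 1)
#define PAGE_DIRTY       (UINT64_C(1) << 6)
#define PAGE_COW         (UINT64_C(1) << 58)

#define VMF_WRITE        (1u << 0)  /* stands in for VM_WRITE */
#define VMF_SHADOW_STACK (1u << 1)  /* stands in for VM_SHADOW_STACK */

/* Model of pte_mkwrite(): set RW, and turn Cow back into Dirty. */
static uint64_t model_mkwrite(uint64_t pte)
{
	pte |= PAGE_RW;
	if (pte & PAGE_COW)
		pte = (pte & ~PAGE_COW) | PAGE_DIRTY;
	return pte;
}

/* Assumed model of pte_mkwrite_shstk(): Write=0, Dirty=1, Cow=0. */
static uint64_t model_mkwrite_shstk(uint64_t pte)
{
	return (pte & ~(PAGE_RW | PAGE_COW)) | PAGE_DIRTY;
}

/* Model of the x86 maybe_mkwrite(): pick the PTE form from VMA flags. */
static uint64_t model_maybe_mkwrite(uint64_t pte, unsigned int vm_flags)
{
	if (vm_flags & VMF_WRITE)
		return model_mkwrite(pte);
	if (vm_flags & VMF_SHADOW_STACK)
		return model_mkwrite_shstk(pte);
	return pte;
}
```

For a PTE that fork() left as Write=0, Cow=1, the model yields Write=0,
Dirty=1 in a shadow stack VMA and Write=1, Dirty=1 in a writable VMA.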

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/pgtable.h |  6 ++++++
 arch/x86/mm/pgtable.c          | 20 ++++++++++++++++++++
 include/linux/mm.h             |  2 ++
 mm/huge_memory.c               |  2 ++
 4 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index df4ce715560a..bfe4ea2b652d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -280,6 +280,9 @@ static inline int pmd_trans_huge(pmd_t pmd)
 	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
 }
 
+#define maybe_pmd_mkwrite maybe_pmd_mkwrite
+extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline int pud_trans_huge(pud_t pud)
 {
@@ -1632,6 +1635,9 @@ static inline bool arch_faults_on_old_pte(void)
 	return false;
 }
 
+#define maybe_mkwrite maybe_mkwrite
+extern pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma);
+
 #endif	/* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3364fe62b903..ba449d12ec32 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -610,6 +610,26 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 }
 #endif
 
+pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	else if (likely(vma->vm_flags & VM_SHADOW_STACK))
+		pte = pte_mkwrite_shstk(pte);
+	return pte;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pmd = pmd_mkwrite(pmd);
+	else if (likely(vma->vm_flags & VM_SHADOW_STACK))
+		pmd = pmd_mkwrite_shstk(pmd);
+	return pmd;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4a9985e50819..4548f75cef14 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1015,12 +1015,14 @@ void free_compound_page(struct page *page);
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
  * that do not have writing enabled, when used by access_process_vm.
  */
+#ifndef maybe_mkwrite
 static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
 		pte = pte_mkwrite(pte);
 	return pte;
 }
+#endif
 
 vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index afff3ac87067..c8dd5913884e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -491,12 +491,14 @@ static int __init setup_transparent_hugepage(char *str)
 }
 __setup("transparent_hugepage=", setup_transparent_hugepage);
 
+#ifndef maybe_pmd_mkwrite
 pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
 		pmd = pmd_mkwrite(pmd);
 	return pmd;
 }
+#endif
 
 #ifdef CONFIG_MEMCG
 static inline struct deferred_split *get_deferred_split_queue(struct page *page)
-- 
2.21.0



* [PATCH v28 17/32] mm: Fixup places that call pte_mkwrite() directly
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (15 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 16/32] x86/mm: Update maybe_mkwrite() for shadow stack Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 18/32] mm: Add guard pages around a shadow stack Yu-cheng Yu
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

When serving a page fault, maybe_mkwrite() makes a PTE writable if it is in
a writable vma.  A shadow stack vma is writable, but its PTEs need
_PAGE_DIRTY to be set to become writable.  For this reason, maybe_mkwrite()
has been updated.

There are a few places that call pte_mkwrite() directly, with the same
effect as maybe_mkwrite().  These sites need to be updated for shadow
stack as well.  Thus, change them to maybe_mkwrite():

- do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE directly
  and call pte_mkwrite(), which is the same as maybe_mkwrite().  Change
  them to maybe_mkwrite().

- In do_numa_page(), if the numa entry was writable, then pte_mkwrite()
  is called directly.  Fix it by doing maybe_mkwrite().  Make the same
  changes to do_huge_pmd_numa_page().

- In change_pte_range(), pte_mkwrite() is called directly.  Replace it with
  maybe_mkwrite().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v25:
- Apply same changes to do_huge_pmd_numa_page() as to do_numa_page().

 mm/huge_memory.c | 2 +-
 mm/memory.c      | 5 ++---
 mm/migrate.c     | 3 +--
 mm/mprotect.c    | 2 +-
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c8dd5913884e..b9a6fc7af693 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1515,7 +1515,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
 	pmd = pmd_mkyoung(pmd);
 	if (was_writable)
-		pmd = pmd_mkwrite(pmd);
+		pmd = maybe_pmd_mkwrite(pmd, vma);
 	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
 	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 	spin_unlock(vmf->ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 747a01d495f2..c82a8f38fb04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3781,8 +3781,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 
 	entry = mk_pte(page, vma->vm_page_prot);
 	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
@@ -4407,7 +4406,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
 	if (was_writable)
-		pte = pte_mkwrite(pte);
+		pte = maybe_mkwrite(pte, vma);
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index 34a9ad3e0a4f..f8c1ce0c187b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2794,8 +2794,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		}
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
-		if (vma->vm_flags & VM_WRITE)
-			entry = pte_mkwrite(pte_mkdirty(entry));
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..9b424f2fd3a9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -135,7 +135,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (dirty_accountable && pte_dirty(ptent) &&
 					(pte_soft_dirty(ptent) ||
 					 !(vma->vm_flags & VM_SOFTDIRTY))) {
-				ptent = pte_mkwrite(ptent);
+				ptent = maybe_mkwrite(ptent, vma);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			pages++;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v28 18/32] mm: Add guard pages around a shadow stack.
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (16 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 17/32] mm: Fixup places that call pte_mkwrite() directly Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 19/32] mm/mmap: Add shadow stack pages to memory accounting Yu-cheng Yu
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

INCSSP(Q/D) increments the shadow stack pointer and 'pops and discards' the
first and the last elements in the range, effectively touching those memory
areas.

The maximum distance INCSSPQ can move the shadow stack pointer is
255 * 8 = 2040 bytes, and for INCSSPD it is 255 * 4 = 1020 bytes.  Both are
well below PAGE_SIZE.  Thus, putting a guard page at each end of a shadow
stack prevents INCSSP, CALL, and RET from going beyond it.
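
The arithmetic above, together with the wrap-around clamps used by the
x86 vm_start_gap()/vm_end_gap() in this patch, can be checked with a small
user-space sketch.  MODEL_PAGE_SIZE assumes 4-KB pages, and the function
names are hypothetical.

```c
#define MODEL_PAGE_SIZE	4096UL	/* assumption: 4-KB pages */

/* Largest SSP movement of one INCSSP: at most 255 slots of slot_size bytes. */
static unsigned long incssp_max_bytes(unsigned long slot_size)
{
	return 255UL * slot_size;
}

/* Same clamp as vm_start_gap(): subtract the gap, clamp to 0 on wrap. */
static unsigned long model_start_gap(unsigned long vm_start, unsigned long gap)
{
	unsigned long start = vm_start - gap;

	if (start > vm_start)	/* unsigned underflow */
		start = 0;
	return start;
}

/* Same clamp as vm_end_gap(): add the gap, clamp to -PAGE_SIZE on wrap. */
static unsigned long model_end_gap(unsigned long vm_end, unsigned long gap)
{
	unsigned long end = vm_end + gap;

	if (end < vm_end)	/* unsigned overflow */
		end = -MODEL_PAGE_SIZE;
	return end;
}
```

Since incssp_max_bytes(8) is 2040, a single INCSSPQ starting inside the
shadow stack cannot skip over a 4096-byte guard page without touching it.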

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v25:
- Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.

v24:
- Instead of changing vm_*_gap(), create x86-specific versions.

 arch/x86/include/asm/page_types.h |  7 +++++
 arch/x86/mm/mmap.c                | 46 +++++++++++++++++++++++++++++++
 include/linux/mm.h                |  4 +++
 3 files changed, 57 insertions(+)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index a506a411474d..e1533fdc08b4 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -73,6 +73,13 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
 
 extern void initmem_init(void);
 
+#define vm_start_gap vm_start_gap
+struct vm_area_struct;
+extern unsigned long vm_start_gap(struct vm_area_struct *vma);
+
+#define vm_end_gap vm_end_gap
+extern unsigned long vm_end_gap(struct vm_area_struct *vma);
+
 #endif	/* !__ASSEMBLY__ */
 
 #endif	/* _ASM_X86_PAGE_DEFS_H */
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index f3f52c5e2fd6..81f9325084d3 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -250,3 +250,49 @@ bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
 		return false;
 	return true;
 }
+
+/*
+ * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).  INCSSPQ
+ * moves shadow stack pointer up to 255 * 8 = ~2 KB (~1KB for INCSSPD) and
+ * touches the first and the last element in the range, which triggers a
+ * page fault if the range is not in a shadow stack.  Because of this,
+ * creating 4-KB guard pages around a shadow stack prevents these
+ * instructions from going beyond.
+ */
+#define SHADOW_STACK_GUARD_GAP PAGE_SIZE
+
+unsigned long vm_start_gap(struct vm_area_struct *vma)
+{
+	unsigned long vm_start = vma->vm_start;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSDOWN)
+		gap = stack_guard_gap;
+	else if (vma->vm_flags & VM_SHADOW_STACK)
+		gap = SHADOW_STACK_GUARD_GAP;
+
+	if (gap != 0) {
+		vm_start -= gap;
+		if (vm_start > vma->vm_start)
+			vm_start = 0;
+	}
+	return vm_start;
+}
+
+unsigned long vm_end_gap(struct vm_area_struct *vma)
+{
+	unsigned long vm_end = vma->vm_end;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSUP)
+		gap = stack_guard_gap;
+	else if (vma->vm_flags & VM_SHADOW_STACK)
+		gap = SHADOW_STACK_GUARD_GAP;
+
+	if (gap != 0) {
+		vm_end += gap;
+		if (vm_end < vma->vm_end)
+			vm_end = -PAGE_SIZE;
+	}
+	return vm_end;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4548f75cef14..354f38d21eed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2729,6 +2729,7 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
 	return vma;
 }
 
+#ifndef vm_start_gap
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_start = vma->vm_start;
@@ -2740,7 +2741,9 @@ static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 	}
 	return vm_start;
 }
+#endif
 
+#ifndef vm_end_gap
 static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_end = vma->vm_end;
@@ -2752,6 +2755,7 @@ static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
 	}
 	return vm_end;
 }
+#endif
 
 static inline unsigned long vma_pages(struct vm_area_struct *vma)
 {
-- 
2.21.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v28 19/32] mm/mmap: Add shadow stack pages to memory accounting
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (17 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 18/32] mm: Add guard pages around a shadow stack Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 20/32] mm: Update can_follow_write_pte() for shadow stack Yu-cheng Yu
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

Account shadow stack pages to stack memory.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v26:
- Remove redundant #ifdef CONFIG_MMU.

v25:
- Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().

v24:
- Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().
- Change VM_SHSTK to VM_SHADOW_STACK.

 arch/x86/include/asm/pgtable.h | 3 +++
 arch/x86/mm/pgtable.c          | 5 +++++
 include/linux/pgtable.h        | 7 +++++++
 mm/mmap.c                      | 5 +++++
 4 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index bfe4ea2b652d..0983a91b464c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1638,6 +1638,9 @@ static inline bool arch_faults_on_old_pte(void)
 #define maybe_mkwrite maybe_mkwrite
 extern pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma);
 
+#define is_shadow_stack_mapping is_shadow_stack_mapping
+extern bool is_shadow_stack_mapping(vm_flags_t vm_flags);
+
 #endif	/* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ba449d12ec32..945f6b5a42e5 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -888,3 +888,8 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #endif /* CONFIG_X86_64 */
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+bool is_shadow_stack_mapping(vm_flags_t vm_flags)
+{
+	return vm_flags & VM_SHADOW_STACK;
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index d147480cdefc..eca0a7b80b3e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1508,6 +1508,13 @@ static inline bool arch_has_pfn_modify_check(void)
 }
 #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
 
+#ifndef is_shadow_stack_mapping
+static inline bool is_shadow_stack_mapping(vm_flags_t vm_flags)
+{
+	return false;
+}
+#endif
+
 /*
  * Architecture PAGE_KERNEL_* fallbacks
  *
diff --git a/mm/mmap.c b/mm/mmap.c
index ca54d36d203a..6be9ff4007ab 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1721,6 +1721,9 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	if (file && is_file_hugepages(file))
 		return 0;
 
+	if (is_shadow_stack_mapping(vm_flags))
+		return 1;
+
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
@@ -3370,6 +3373,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 		mm->stack_vm += npages;
 	else if (is_data_mapping(flags))
 		mm->data_vm += npages;
+	else if (is_shadow_stack_mapping(flags))
+		mm->stack_vm += npages;
 }
 
 static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v28 20/32] mm: Update can_follow_write_pte() for shadow stack
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (18 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 19/32] mm/mmap: Add shadow stack pages to memory accounting Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 21/32] mm/mprotect: Exclude shadow stack from preserve_write Yu-cheng Yu
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

can_follow_write_pte() ensures a read-only page is COWed by checking the
FOLL_COW flag, and uses pte_dirty() to verify that the flag is still valid.

Like a writable data page, a shadow stack page is writable, and becomes
read-only during copy-on-write, but it is always dirty.  Thus, in the
can_follow_write_pte() check, it belongs to the writable page case and
should be excluded from the read-only page pte_dirty() check.  Apply
the same changes to can_follow_write_pmd().
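
The decision order of the reworked check can be modeled in user space as
follows.  The flag values and the model_* name are hypothetical; the booleans
stand in for what pte_write(), pte_dirty(), and is_shadow_stack_mapping()
would report.

```c
#include <stdbool.h>

/* Hypothetical flag values for this model; the kernel's values differ. */
#define MODEL_FOLL_FORCE	0x1u
#define MODEL_FOLL_COW		0x2u

static bool model_can_follow_write(bool pte_is_write, bool pte_is_dirty,
				   unsigned int flags, bool shadow_stack_vma)
{
	const unsigned int need = MODEL_FOLL_FORCE | MODEL_FOLL_COW;

	if (pte_is_write)
		return true;
	/* The forced-write path needs both FOLL_FORCE and FOLL_COW. */
	if ((flags & need) != need)
		return false;
	if (!pte_is_dirty)
		return false;
	/* A shadow stack PTE is always dirty, so Dirty proves nothing here. */
	if (shadow_stack_vma)
		return false;
	return true;
}
```

For an ordinary mapping the result matches the old one-line expression; only
the non-writable, dirty, shadow stack case changes from true to false.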

While at it, also split the long line into smaller ones.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v26:
- Instead of passing vm_flags, pass down vma pointer to can_follow_write_*().

v25:
- Split long line into smaller ones.

v24:
- Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().

 mm/gup.c         | 16 ++++++++++++----
 mm/huge_memory.c | 16 ++++++++++++----
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..f2d49731035e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -478,10 +478,18 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
  * FOLL_FORCE can write to even unwritable pte's, but only
  * after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
+					struct vm_area_struct *vma)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	if (pte_write(pte))
+		return true;
+	if ((flags & (FOLL_FORCE | FOLL_COW)) != (FOLL_FORCE | FOLL_COW))
+		return false;
+	if (!pte_dirty(pte))
+		return false;
+	if (is_shadow_stack_mapping(vma->vm_flags))
+		return false;
+	return true;
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -524,7 +532,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags, vma)) {
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b9a6fc7af693..d35acb59dde9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1346,10 +1346,18 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
  * FOLL_FORCE can write to even unwritable pmd's, but only
  * after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags,
+					struct vm_area_struct *vma)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	if (pmd_write(pmd))
+		return true;
+	if ((flags & (FOLL_FORCE | FOLL_COW)) != (FOLL_FORCE | FOLL_COW))
+		return false;
+	if (!pmd_dirty(pmd))
+		return false;
+	if (is_shadow_stack_mapping(vma->vm_flags))
+		return false;
+	return true;
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1362,7 +1370,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags, vma))
 		goto out;
 
 	/* Avoid dumping huge zero page */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v28 21/32] mm/mprotect: Exclude shadow stack from preserve_write
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (19 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 20/32] mm: Update can_follow_write_pte() for shadow stack Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 22/32] mm: Re-introduce vm_flags to do_mmap() Yu-cheng Yu
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

In change_pte_range(), when a PTE is changed for prot_numa, _PAGE_RW is
preserved to avoid the additional write fault after the NUMA hinting fault.
However, pte_write() now includes both normal writable and shadow stack
(RW=0, Dirty=1) PTEs, but the latter does not have _PAGE_RW and has no need
to preserve it.

Exclude shadow stack from preserve_write test, and apply the same change to
change_huge_pmd().
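
A minimal sketch of the resulting preserve_write computation, assuming a
simplified two-bit PTE in which pte_write() is true for RW=1 or for the
shadow stack encoding (RW=0, Dirty=1).  The bit values and model_* names are
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bit values for this model only. */
#define MODEL_PAGE_RW		0x1u
#define MODEL_PAGE_DIRTY	0x2u

/* pte_write() as described above: normal writable or (RW=0, Dirty=1). */
static bool model_pte_write(unsigned int pte)
{
	if (pte & MODEL_PAGE_RW)
		return true;
	return (pte & MODEL_PAGE_DIRTY) != 0;
}

static bool model_preserve_write(bool prot_numa, unsigned int pte,
				 bool shadow_stack_vma)
{
	bool preserve_write = prot_numa && model_pte_write(pte);

	/* A shadow stack PTE has no _PAGE_RW to preserve. */
	if (shadow_stack_vma)
		preserve_write = false;
	return preserve_write;
}
```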

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
v25:
- Move is_shadow_stack_mapping() to a separate line.

v24:
- Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().

 mm/huge_memory.c | 7 +++++++
 mm/mprotect.c    | 7 +++++++
 2 files changed, 14 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d35acb59dde9..bb38ecdf992f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1776,6 +1776,13 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		return 0;
 
 	preserve_write = prot_numa && pmd_write(*pmd);
+
+	/*
+	 * Preserve only normal writable huge PMD, but not shadow
+	 * stack (RW=0, Dirty=1).
+	 */
+	if (is_shadow_stack_mapping(vma->vm_flags))
+		preserve_write = false;
 	ret = 1;
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9b424f2fd3a9..94cb799216ec 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,6 +77,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			pte_t ptent;
 			bool preserve_write = prot_numa && pte_write(oldpte);
 
+			/*
+			 * Preserve only normal writable PTE, but not shadow
+			 * stack (RW=0, Dirty=1).
+			 */
+			if (is_shadow_stack_mapping(vma->vm_flags))
+				preserve_write = false;
+
 			/*
 			 * Avoid trapping faults against the zero or KSM
 			 * pages. See similar comment in change_huge_pmd.
-- 
2.21.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v28 22/32] mm: Re-introduce vm_flags to do_mmap()
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (20 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 21/32] mm/mprotect: Exclude shadow stack from preserve_write Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 23/32] x86/cet/shstk: Add user-mode shadow stack support Yu-cheng Yu
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Peter Collingbourne, Kirill A . Shutemov, Andrew Morton

There were no more callers passing vm_flags to do_mmap(), and vm_flags was
removed from the function's input by:

    commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now.  Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap().  Thus, re-introduce vm_flags to do_mmap().
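
The effect of turning the local assignment into "vm_flags |= ..." can be
sketched as below.  The flag values and the model_* name are hypothetical,
and the computed part is reduced to two bits for brevity.

```c
/* Hypothetical flag values for this model only. */
#define MODEL_VM_READ		0x001UL
#define MODEL_VM_MAYREAD	0x010UL
#define MODEL_VM_SHADOW_STACK	0x100UL

/*
 * The caller-supplied vm_flags are kept and the bits derived from prot are
 * OR-ed in, instead of overwriting the variable as the old code did.
 */
static unsigned long model_calc_vm_flags(unsigned long caller_vm_flags,
					 int prot_read)
{
	unsigned long vm_flags = caller_vm_flags;

	vm_flags |= (prot_read ? MODEL_VM_READ : 0) | MODEL_VM_MAYREAD;
	return vm_flags;
}
```

Existing callers pass 0 and get the same flags as before; only a caller that
passes MODEL_VM_SHADOW_STACK sees that bit in the result.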

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: linux-mm@kvack.org
---
 fs/aio.c           |  2 +-
 include/linux/mm.h |  3 ++-
 ipc/shm.c          |  2 +-
 mm/mmap.c          | 10 +++++-----
 mm/nommu.c         |  4 ++--
 mm/util.c          |  2 +-
 6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 76ce0cc3ee4e..92e09b0863ad 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -526,7 +526,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
 				 PROT_READ | PROT_WRITE,
-				 MAP_SHARED, 0, &unused, NULL);
+				 MAP_SHARED, 0, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 354f38d21eed..07e642af59d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2617,7 +2617,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf);
 extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e376ca..fb7a3a230b79 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1556,7 +1556,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 6be9ff4007ab..100db6e46831 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1406,11 +1406,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
  */
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
-			unsigned long flags, unsigned long pgoff,
-			unsigned long *populate, struct list_head *uf)
+			unsigned long flags, vm_flags_t vm_flags,
+			unsigned long pgoff, unsigned long *populate,
+			struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
-	vm_flags_t vm_flags;
 	int pkey = 0;
 
 	*populate = 0;
@@ -1470,7 +1470,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
@@ -3036,7 +3036,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 
 	file = get_file(vma->vm_file);
 	ret = do_mmap(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+			prot, flags, 0, pgoff, &populate, NULL);
 	fput(file);
 out:
 	mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index 3a93d4054810..5b6dcf42659a 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1061,6 +1061,7 @@ unsigned long do_mmap(struct file *file,
 			unsigned long len,
 			unsigned long prot,
 			unsigned long flags,
+			vm_flags_t vm_flags,
 			unsigned long pgoff,
 			unsigned long *populate,
 			struct list_head *uf)
@@ -1068,7 +1069,6 @@ unsigned long do_mmap(struct file *file,
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	struct rb_node *rb;
-	vm_flags_t vm_flags;
 	unsigned long capabilities, result;
 	int ret;
 
@@ -1087,7 +1087,7 @@ unsigned long do_mmap(struct file *file,
 
 	/* we've determined that we can make the mapping, now translate what we
 	 * now know into VMA flags */
-	vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+	vm_flags |= determine_vm_flags(file, prot, flags, capabilities);
 
 	/* we're going to need to record the mapping */
 	region = kmem_cache_zalloc(vm_region_jar, GFP_KERNEL);
diff --git a/mm/util.c b/mm/util.c
index 9043d03750a7..9db194fc2030 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -516,7 +516,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	if (!ret) {
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
-		ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
 			      &uf);
 		mmap_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v28 23/32] x86/cet/shstk: Add user-mode shadow stack support
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (21 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 22/32] mm: Re-introduce vm_flags to do_mmap() Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 24/32] x86/process: Change copy_thread() argument 'arg' to 'stack_size' Yu-cheng Yu
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu

Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with the VM_SHADOW_STACK flag
and has a fixed size of min(RLIMIT_STACK, 4GB).
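
The size computation in shstk_setup() can be reproduced in user space as
follows.  MODEL_PAGE_SIZE assumes 4-KB pages, and the model_* names are
hypothetical; an unlimited RLIMIT_STACK is represented by UINT64_MAX.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE	4096ULL		/* assumption: 4-KB pages */
#define MODEL_SZ_4G	(4ULL << 30)

/* min(RLIMIT_STACK, 4GB), rounded up to a whole number of pages. */
static uint64_t model_shstk_size(uint64_t rlimit_stack)
{
	uint64_t size = rlimit_stack < MODEL_SZ_4G ? rlimit_stack : MODEL_SZ_4G;

	/* Equivalent of round_up() for a power-of-two page size. */
	return (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
}
```

With the common 8-MB stack limit this yields an 8-MB shadow stack, and an
unlimited stack is capped at 4 GB.  The patch then sets the initial shadow
stack pointer to the top of the mapping, addr + size.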

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v28:
- Update shstk_setup() with wrmsrl_safe().  Return success when shadow
  stack feature is not present, because this is a setup init function
  and when the feature is not present, no setup is necessary.

v27:
- Change 'struct cet_status' to 'struct thread_shstk', and change member
  types from unsigned long to u64.
- Re-order local variables in reverse order of length.
- WARN_ON_ONCE() when vm_munmap() fails.

 arch/x86/include/asm/cet.h       |  30 +++++++
 arch/x86/include/asm/processor.h |   5 ++
 arch/x86/kernel/Makefile         |   1 +
 arch/x86/kernel/shstk.c          | 134 +++++++++++++++++++++++++++++++
 4 files changed, 170 insertions(+)
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/kernel/shstk.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
new file mode 100644
index 000000000000..6432baf4de1f
--- /dev/null
+++ b/arch/x86/include/asm/cet.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CET_H
+#define _ASM_X86_CET_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+
+/*
+ * Per-thread CET status
+ */
+struct thread_shstk {
+	u64	base;
+	u64	size;
+};
+
+#ifdef CONFIG_X86_SHADOW_STACK
+int shstk_setup(void);
+void shstk_free(struct task_struct *p);
+void shstk_disable(void);
+#else
+static inline int shstk_setup(void) { return 0; }
+static inline void shstk_free(struct task_struct *p) {}
+static inline void shstk_disable(void) {}
+#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f3020c54e2cb..10497634b7a4 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -27,6 +27,7 @@ struct vm86;
 #include <asm/unwind_hints.h>
 #include <asm/vmxfeatures.h>
 #include <asm/vdso/processor.h>
+#include <asm/cet.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -527,6 +528,10 @@ struct thread_struct {
 	 */
 	u32			pkru;
 
+#ifdef CONFIG_X86_SHADOW_STACK
+	struct thread_shstk	shstk;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3e625c61f008..9e064845e497 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -150,6 +150,7 @@ obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
 obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)		+= sev.o
+obj-$(CONFIG_X86_SHADOW_STACK)		+= shstk.o
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
new file mode 100644
index 000000000000..5993aa8db338
--- /dev/null
+++ b/arch/x86/kernel/shstk.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * shstk.c - Intel shadow stack support
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Yu-cheng Yu <yu-cheng.yu@intel.com>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <asm/msr.h>
+#include <asm/fpu/internal.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/cet.h>
+
+static void start_update_msrs(void)
+{
+	fpregs_lock();
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		fpregs_restore_userregs();
+}
+
+static void end_update_msrs(void)
+{
+	fpregs_unlock();
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, populate;
+
+	mmap_write_lock(mm);
+	addr = do_mmap(NULL, 0, size, PROT_READ, flags, VM_SHADOW_STACK, 0,
+		       &populate, NULL);
+	mmap_write_unlock(mm);
+
+	return addr;
+}
+
+int shstk_setup(void)
+{
+	struct thread_shstk *shstk = &current->thread.shstk;
+	unsigned long addr, size;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return 0;
+
+	size = round_up(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G), PAGE_SIZE);
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	start_update_msrs();
+	err = wrmsrl_safe(MSR_IA32_PL3_SSP, addr + size);
+	if (!err)
+		wrmsrl_safe(MSR_IA32_U_CET, CET_SHSTK_EN);
+	end_update_msrs();
+
+	if (!err) {
+		shstk->base = addr;
+		shstk->size = size;
+	}
+
+	return err;
+}
+
+void shstk_free(struct task_struct *tsk)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+	    !shstk->size ||
+	    !shstk->base)
+		return;
+
+	if (!tsk->mm)
+		return;
+
+	while (1) {
+		int r;
+
+		r = vm_munmap(shstk->base, shstk->size);
+
+		/*
+		 * vm_munmap() returns -EINTR when mmap_lock is held by
+		 * something else, and that lock should not be held for a
+		 * long time.  Retry in that case.
+		 */
+		if (r == -EINTR) {
+			cond_resched();
+			continue;
+		}
+
+		/*
+		 * For all other types of vm_munmap() failure, either the
+		 * system is out of memory or there is a bug.
+		 */
+		WARN_ON_ONCE(r);
+		break;
+	}
+
+	shstk->base = 0;
+	shstk->size = 0;
+}
+
+void shstk_disable(void)
+{
+	struct thread_shstk *shstk = &current->thread.shstk;
+	u64 msr_val;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+	    !shstk->size ||
+	    !shstk->base)
+		return;
+
+	start_update_msrs();
+	rdmsrl(MSR_IA32_U_CET, msr_val);
+	wrmsrl(MSR_IA32_U_CET, msr_val & ~CET_SHSTK_EN);
+	wrmsrl(MSR_IA32_PL3_SSP, 0);
+	end_update_msrs();
+
+	shstk_free(current);
+}
-- 
2.21.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v28 24/32] x86/process: Change copy_thread() argument 'arg' to 'stack_size'
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (22 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 23/32] x86/cet/shstk: Add user-mode shadow stack support Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 25/32] x86/cet/shstk: Handle thread shadow stack Yu-cheng Yu
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu

The single call site of copy_thread() passes stack size in 'arg'.  To make
this clear and in preparation for using this argument for shadow stack
allocation, change 'arg' to 'stack_size'.  No functional changes.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/kernel/process.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1d9463e3096b..e6e4d8bc9023 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -116,8 +116,9 @@ static int set_new_tls(struct task_struct *p, unsigned long tls)
 		return do_set_thread_area_64(p, ARCH_SET_FS, tls);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+int copy_thread(unsigned long clone_flags, unsigned long sp,
+		unsigned long stack_size, struct task_struct *p,
+		unsigned long tls)
 {
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
@@ -158,7 +159,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		p->thread.pkru = pkru_get_init_value();
 		memset(childregs, 0, sizeof(struct pt_regs));
-		kthread_frame_init(frame, sp, arg);
+		kthread_frame_init(frame, sp, stack_size);
 		return 0;
 	}
 
@@ -191,7 +192,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 		 */
 		childregs->sp = 0;
 		childregs->ip = 0;
-		kthread_frame_init(frame, sp, arg);
+		kthread_frame_init(frame, sp, stack_size);
 		return 0;
 	}
 
-- 
2.21.0



* [PATCH v28 25/32] x86/cet/shstk: Handle thread shadow stack
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (23 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 24/32] x86/process: Change copy_thread() argument 'arg' to 'stack_size' Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 21:05   ` Dave Hansen
  2021-07-22 20:52 ` [PATCH v28 26/32] x86/cet/shstk: Introduce shadow stack token setup/verify routines Yu-cheng Yu
                   ` (7 subsequent siblings)
  32 siblings, 1 reply; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu

For clone() with CLONE_VM, except vfork, the child and the parent must have
separate shadow stacks.  Thus, the kernel allocates a new shadow stack for
the child and frees it on thread exit.

Use the stack_size passed from the clone3() syscall as the thread shadow
stack size.  In compat mode, the thread shadow stack size is further reduced
to 1/4, which allows more threads to run in a 32-bit address space.

The earlier version of clone() does not pass stack_size.  In that case, use
the RLIMIT_STACK size, capped at 4 GB.
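
As a rough illustration of the sizing rules above, here is a user-space
sketch (not the kernel code).  thread_shstk_size(), SKETCH_PAGE_SIZE and
SKETCH_SZ_4G are hypothetical names, and a 4 KB page size is assumed:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL        /* assumption: 4 KB pages */
#define SKETCH_SZ_4G     (4ULL << 30)

/*
 * Hypothetical model of the sizing rules in the commit message:
 * fall back to RLIMIT_STACK (capped at 4 GB) when no stack size is
 * passed, take 1/4 in compat mode, then round up to whole pages.
 */
static uint64_t thread_shstk_size(uint64_t stack_size, uint64_t rlimit_stack,
				  int compat)
{
	/* Old clone() passes no stack size. */
	if (!stack_size)
		stack_size = rlimit_stack < SKETCH_SZ_4G ?
			     rlimit_stack : SKETCH_SZ_4G;

	/* Compat-mode threads share a small address space. */
	if (compat)
		stack_size /= 4;

	/* The round-up below must not overflow. */
	assert(stack_size <= UINT64_MAX - (SKETCH_PAGE_SIZE - 1));

	return (stack_size + SKETCH_PAGE_SIZE - 1) /
	       SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
}
```

With an 8 MB RLIMIT_STACK and no stack size passed, this model yields an
8 MB shadow stack for a 64-bit thread and 2 MB for a compat-mode thread.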

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v28:
- Split out copy_thread() argument name changes to a new patch.
- Add compatibility for earlier clone(), which does not pass stack_size.
- Add comment for get_xsave_addr(), explain the handling of null return
  value.

 arch/x86/include/asm/cet.h         |  5 +++
 arch/x86/include/asm/mmu_context.h |  3 ++
 arch/x86/kernel/process.c          |  6 +++
 arch/x86/kernel/shstk.c            | 63 +++++++++++++++++++++++++++++-
 4 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 6432baf4de1f..4314a41ab3c9 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -17,10 +17,15 @@ struct thread_shstk {
 
 #ifdef CONFIG_X86_SHADOW_STACK
 int shstk_setup(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+			     unsigned long stack_size);
 void shstk_free(struct task_struct *p);
 void shstk_disable(void);
 #else
 static inline int shstk_setup(void) { return 0; }
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+					   unsigned long clone_flags,
+					   unsigned long stack_size) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
 static inline void shstk_disable(void) {}
 #endif
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..e1dd083261a5 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -12,6 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
 #include <asm/debugreg.h>
+#include <asm/cet.h>
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -146,6 +147,8 @@ do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		shstk_free(tsk);		\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e6e4d8bc9023..bade6a594d63 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -43,6 +43,7 @@
 #include <asm/io_bitmap.h>
 #include <asm/proto.h>
 #include <asm/frame.h>
+#include <asm/cet.h>
 
 #include "process.h"
 
@@ -103,6 +104,7 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
@@ -200,6 +202,10 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	if (clone_flags & CLONE_SETTLS)
 		ret = set_new_tls(p, tls);
 
+	/* Allocate a new shadow stack for pthread */
+	if (!ret)
+		ret = shstk_alloc_thread_stack(p, clone_flags, stack_size);
+
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 5993aa8db338..a3fecd608388 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -75,6 +75,61 @@ int shstk_setup(void)
 	return err;
 }
 
+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+			     unsigned long stack_size)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+	struct cet_user_state *state;
+	unsigned long addr;
+
+	/*
+	 * Earlier clone() does not pass stack_size.  Use RLIMIT_STACK and
+	 * cap to 4 GB.
+	 */
+	if (!stack_size)
+		stack_size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
+
+	if (!shstk->size)
+		return 0;
+
+	/*
+	 * For CLONE_VM, except vfork, the child needs a separate shadow
+	 * stack.
+	 */
+	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+		return 0;
+
+	/*
+	 * This is in the clone() syscall, and fpu__copy() has already copied
+	 * xstates from the parent.  If get_xsave_addr() returns null, then
+	 * XFEATURE_CET_USER is still in init state, which is an error.
+	 */
+	state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
+	if (!state)
+		return -EINVAL;
+
+	/*
+	 * Compat-mode pthreads share a limited address space.
+	 * If each function call takes an average of four slots of
+	 * stack space, allocate 1/4 of stack size for shadow stack.
+	 */
+	if (in_compat_syscall())
+		stack_size /= 4;
+
+	stack_size = round_up(stack_size, PAGE_SIZE);
+	addr = alloc_shstk(stack_size);
+	if (IS_ERR_VALUE(addr)) {
+		shstk->base = 0;
+		shstk->size = 0;
+		return PTR_ERR((void *)addr);
+	}
+
+	state->user_ssp = (u64)(addr + stack_size);
+	shstk->base = addr;
+	shstk->size = stack_size;
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -84,7 +139,13 @@ void shstk_free(struct task_struct *tsk)
 	    !shstk->base)
 		return;
 
-	if (!tsk->mm)
+	/*
+	 * When fork() with CLONE_VM fails, the child (tsk) already has a
+	 * shadow stack allocated, and exit_thread() calls this function to
+	 * free it.  In this case the parent (current) and the child share
+	 * the same mm struct.
+	 */
+	if (!tsk->mm || tsk->mm != current->mm)
 		return;
 
 	while (1) {
-- 
2.21.0



* [PATCH v28 26/32] x86/cet/shstk: Introduce shadow stack token setup/verify routines
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (24 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 25/32] x86/cet/shstk: Handle thread shadow stack Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 21:15   ` Dave Hansen
  2021-07-22 20:52 ` [PATCH v28 27/32] x86/cet/shstk: Handle signals for shadow stack Yu-cheng Yu
                   ` (6 subsequent siblings)
  32 siblings, 1 reply; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu

A shadow stack restore token marks a restore point of the shadow stack, and
the address in a token must point directly above the token, which is within
the same shadow stack.  This is distinctly different from other pointers
on the shadow stack, since those pointers point to executable code.

The restore token can be used as extra protection for signal handling.
To deliver a signal, create a shadow stack restore token and put the token
and the signal restorer address on the shadow stack.  In sigreturn, verify
the token and restore the shadow stack pointer from it.

Introduce token setup and verify routines.  Also introduce WRUSS, which is
a kernel-mode instruction that writes directly to the user shadow stack.  It
is used to construct the user signal stack as described above.
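
The token encoding can be modeled in user space roughly as follows.  This
is only a sketch of the checks, not the kernel routines: make_token(),
token_addr_for() and check_token() are hypothetical names, the functions
return -1 where the kernel returns an errno, and the address-range check
against TASK_SIZE_MAX is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOKEN_MODE_64	(1ULL << 0)	/* bit 0: token is for a 64-bit task */
#define TOKEN_BUSY	(1ULL << 1)	/* bit 1: busy flag */

/* Address at which the token for shadow stack pointer 'ssp' is stored. */
static uint64_t token_addr_for(uint64_t ssp)
{
	return (ssp & ~7ULL) - 8;
}

/* Token value: the restore address, with the mode bit set for 64-bit. */
static uint64_t make_token(bool ia32, uint64_t ssp)
{
	return ia32 ? ssp : (ssp | TOKEN_MODE_64);
}

/*
 * Check a token read from 'token_addr'.  On success, store the restore
 * address in *new_ssp and return 0; otherwise return -1.
 */
static int check_token(bool ia32, uint64_t token_addr, uint64_t token,
		       uint64_t *new_ssp)
{
	bool shstk32 = !(token & TOKEN_MODE_64);

	assert(new_ssp != NULL);

	if (ia32 != shstk32)		/* mode flag must match the task */
		return -1;
	if (token & TOKEN_BUSY)		/* busy tokens are rejected */
		return -1;

	token &= ~3ULL;			/* mask out the flag bits */

	if (!ia32 && (token & 7))	/* 64-bit restore address: 8-aligned */
		return -1;
	if (token_addr_for(token) != token_addr)
		return -1;		/* must point directly above itself */

	*new_ssp = token;
	return 0;
}
```

For example, a 64-bit task with a shadow stack pointer of 0x70001000 gets
the token value 0x70001001 stored at 0x70000ff8.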

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v28:
- Add comments for get_xsave_addr().

v27:
- For shstk_check_rstor_token(), instead of an input param, use current
  shadow stack pointer.
- In response to comments, fix/simplify a few syntax/format issues.

v25:
- Update inline assembly syntax, use %[].
- Change token address from (unsigned long) to (u64/u32 __user *).
- Change -EPERM to -EFAULT.

 arch/x86/include/asm/cet.h           |   7 ++
 arch/x86/include/asm/special_insns.h |  30 ++++++
 arch/x86/kernel/shstk.c              | 138 +++++++++++++++++++++++++++
 3 files changed, 175 insertions(+)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 4314a41ab3c9..aa533700ba31 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -21,6 +21,9 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
 			     unsigned long stack_size);
 void shstk_free(struct task_struct *p);
 void shstk_disable(void);
+int shstk_setup_rstor_token(bool ia32, unsigned long restorer,
+			    unsigned long *new_ssp);
+int shstk_check_rstor_token(bool ia32, unsigned long *new_ssp);
 #else
 static inline int shstk_setup(void) { return 0; }
 static inline int shstk_alloc_thread_stack(struct task_struct *p,
@@ -28,6 +31,10 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
 					   unsigned long stack_size) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
 static inline void shstk_disable(void) {}
+static inline int shstk_setup_rstor_token(bool ia32, unsigned long restorer,
+					  unsigned long *new_ssp) { return 0; }
+static inline int shstk_check_rstor_token(bool ia32,
+					  unsigned long *new_ssp) { return 0; }
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index f3fbb84ff8a7..c6df3773b44c 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -222,6 +222,36 @@ static inline void clwb(volatile void *__p)
 		: [pax] "a" (p));
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static inline int write_user_shstk_32(u32 __user *addr, u32 val)
+{
+	if (WARN_ONCE(!IS_ENABLED(CONFIG_IA32_EMULATION) &&
+		      !IS_ENABLED(CONFIG_X86_X32),
+		      "%s used but not supported.\n", __func__)) {
+		return -EFAULT;
+	}
+
+	asm_volatile_goto("1: wrussd %[val], (%[addr])\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: [addr] "r" (addr), [val] "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EFAULT;
+}
+
+static inline int write_user_shstk_64(u64 __user *addr, u64 val)
+{
+	asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: [addr] "r" (addr), [val] "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EFAULT;
+}
+#endif /* CONFIG_X86_SHADOW_STACK */
+
 #define nop() asm volatile ("nop")
 
 static inline void serialize(void)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index a3fecd608388..89c7da3cdb92 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -20,6 +20,7 @@
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/types.h>
 #include <asm/cet.h>
+#include <asm/special_insns.h>
 
 static void start_update_msrs(void)
 {
@@ -193,3 +194,140 @@ void shstk_disable(void)
 
 	shstk_free(current);
 }
+
+static unsigned long get_user_shstk_addr(void)
+{
+	struct fpu *fpu = &current->thread.fpu;
+	unsigned long ssp = 0;
+
+	fpregs_lock();
+
+	if (fpregs_state_valid(fpu, smp_processor_id())) {
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+	} else {
+		struct cet_user_state *p;
+
+		/*
+		 * When !fpregs_state_valid() and get_xsave_addr() returns
+		 * null, XFEATURE_CET_USER is in init state.  Shadow stack
+		 * pointer is null in this case, so return zero.
+		 */
+		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
+		if (p)
+			ssp = p->user_ssp;
+	}
+
+	fpregs_unlock();
+
+	return ssp;
+}
+
+/*
+ * Create a restore token on the shadow stack.  A token is always 8 bytes
+ * in size and aligned to 8 bytes.
+ */
+static int create_rstor_token(bool ia32, unsigned long ssp,
+			       unsigned long *token_addr)
+{
+	unsigned long addr;
+
+	/* Aligned to 8 is aligned to 4, so test 8 first */
+	if ((!ia32 && !IS_ALIGNED(ssp, 8)) || !IS_ALIGNED(ssp, 4))
+		return -EINVAL;
+
+	addr = ALIGN_DOWN(ssp, 8) - 8;
+
+	/* Is the token for 64-bit? */
+	if (!ia32)
+		ssp |= BIT(0);
+
+	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+		return -EFAULT;
+
+	*token_addr = addr;
+
+	return 0;
+}
+
+/*
+ * Create a restore token on shadow stack, and then push the user-mode
+ * function return address.
+ */
+int shstk_setup_rstor_token(bool ia32, unsigned long ret_addr,
+			    unsigned long *new_ssp)
+{
+	struct thread_shstk *shstk = &current->thread.shstk;
+	unsigned long ssp, token_addr;
+	int err;
+
+	if (!shstk->size)
+		return 0;
+
+	if (!ret_addr)
+		return -EINVAL;
+
+	ssp = get_user_shstk_addr();
+	if (!ssp)
+		return -EINVAL;
+
+	err = create_rstor_token(ia32, ssp, &token_addr);
+	if (err)
+		return err;
+
+	if (ia32) {
+		ssp = token_addr - sizeof(u32);
+		err = write_user_shstk_32((u32 __user *)ssp, (u32)ret_addr);
+	} else {
+		ssp = token_addr - sizeof(u64);
+		err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
+	}
+
+	if (!err)
+		*new_ssp = ssp;
+
+	return err;
+}
+
+/*
+ * Verify that the current shadow stack pointer points to a valid token,
+ * and then set *new_ssp according to the token.
+ */
+int shstk_check_rstor_token(bool proc32, unsigned long *new_ssp)
+{
+	unsigned long token_addr;
+	unsigned long token;
+	bool shstk32;
+
+	token_addr = get_user_shstk_addr();
+
+	if (get_user(token, (unsigned long __user *)token_addr))
+		return -EFAULT;
+
+	/* Is mode flag correct? */
+	shstk32 = !(token & BIT(0));
+	if (proc32 ^ shstk32)
+		return -EINVAL;
+
+	/* Is busy flag set? */
+	if (token & BIT(1))
+		return -EINVAL;
+
+	/* Mask out flags */
+	token &= ~3UL;
+
+	/*
+	 * Restore address aligned?
+	 */
+	if ((!proc32 && !IS_ALIGNED(token, 8)) || !IS_ALIGNED(token, 4))
+		return -EINVAL;
+
+	/*
+	 * Token placed properly?
+	 */
+	if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
+		return -EINVAL;
+
+	*new_ssp = token;
+
+	return 0;
+}
-- 
2.21.0



* [PATCH v28 27/32] x86/cet/shstk: Handle signals for shadow stack
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (25 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 26/32] x86/cet/shstk: Introduce shadow stack token setup/verify routines Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 28/32] ELF: Introduce arch_setup_elf_property() Yu-cheng Yu
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu

A signal handler (if not changing ucontext) returns to the restorer, and
the restorer calls sigreturn.  Thus, when setting up a signal frame, the
kernel:

- installs a shadow stack restore token pointing to the current shadow
  stack address, and

- installs the restorer address below the restore token.

In sigreturn, the restore token is verified and the shadow stack pointer is
restored.
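
The delivery/return round trip can be pictured with a small user-space
model of a 64-bit shadow stack.  This is a sketch under simplifying
assumptions, not the kernel implementation: struct shstk_model and the
three functions are hypothetical, the shadow stack pointer is assumed to be
8-byte aligned, and -1 stands in for the kernel's error codes:

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 16

struct shstk_model {
	uint64_t base;		/* lowest address of the simulated stack */
	uint64_t slot[SLOTS];	/* slot[i] lives at base + 8 * i */
	uint64_t ssp;		/* current shadow stack pointer */
};

static uint64_t *slot_at(struct shstk_model *s, uint64_t addr)
{
	assert(addr >= s->base && addr < s->base + 8 * SLOTS && addr % 8 == 0);
	return &s->slot[(addr - s->base) / 8];
}

/* Signal delivery: token pointing at the old ssp, restorer below it. */
static void deliver_signal(struct shstk_model *s, uint64_t restorer)
{
	uint64_t token_addr = s->ssp - 8;

	*slot_at(s, token_addr) = s->ssp | 1;	/* bit 0: 64-bit token */
	*slot_at(s, token_addr - 8) = restorer;
	s->ssp = token_addr - 8;
}

/* The handler's RET pops the restorer address off the shadow stack. */
static uint64_t handler_return(struct shstk_model *s)
{
	uint64_t ret = *slot_at(s, s->ssp);

	s->ssp += 8;
	return ret;
}

/* sigreturn: ssp now points at the token; verify it and restore ssp. */
static int sigreturn_model(struct shstk_model *s)
{
	uint64_t token = *slot_at(s, s->ssp);

	if ((token & 3) != 1)		/* must be 64-bit and not busy */
		return -1;
	token &= ~3ULL;
	if (token - 8 != s->ssp)	/* must point directly above itself */
		return -1;

	s->ssp = token;
	return 0;
}
```

In this model, a corrupted token makes sigreturn_model() fail, which
corresponds to the badframe path in the real sigreturn.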

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
---
v27:
- Eliminate saving shadow stack pointer to signal context.

v25:
- Update commit log/comments for the sc_ext struct.
- Use restorer address already calculated.
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
- Eliminate writing to MSR_IA32_U_CET for shadow stack.
- Change wrmsrl() to wrmsrl_safe() and handle error.

 arch/x86/ia32/ia32_signal.c | 25 +++++++++++++++++-----
 arch/x86/include/asm/cet.h  |  4 ++++
 arch/x86/kernel/shstk.c     | 42 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/signal.c    | 13 ++++++++++++
 4 files changed, 79 insertions(+), 5 deletions(-)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 5e3d9b7fd5fb..d7a30bc98e66 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -35,6 +35,7 @@
 #include <asm/sigframe.h>
 #include <asm/sighandling.h>
 #include <asm/smap.h>
+#include <asm/cet.h>
 
 static inline void reload_segments(struct sigcontext_32 *sc)
 {
@@ -113,6 +114,10 @@ COMPAT_SYSCALL_DEFINE0(sigreturn)
 
 	if (ia32_restore_sigcontext(regs, &frame->sc))
 		goto badframe;
+
+	if (restore_signal_shadow_stack())
+		goto badframe;
+
 	return regs->ax;
 
 badframe:
@@ -138,6 +143,9 @@ COMPAT_SYSCALL_DEFINE0(rt_sigreturn)
 	if (ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext))
 		goto badframe;
 
+	if (restore_signal_shadow_stack())
+		goto badframe;
+
 	if (compat_restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
@@ -262,6 +270,9 @@ int ia32_setup_frame(int sig, struct ksignal *ksig,
 			restorer = &frame->retcode;
 	}
 
+	if (setup_signal_shadow_stack(1, restorer))
+		return -EFAULT;
+
 	if (!user_access_begin(frame, sizeof(*frame)))
 		return -EFAULT;
 
@@ -319,6 +330,15 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig,
 
 	frame = get_sigframe(ksig, regs, sizeof(*frame), &fp);
 
+	if (ksig->ka.sa.sa_flags & SA_RESTORER)
+		restorer = ksig->ka.sa.sa_restorer;
+	else
+		restorer = current->mm->context.vdso +
+			vdso_image_32.sym___kernel_rt_sigreturn;
+
+	if (setup_signal_shadow_stack(1, restorer))
+		return -EFAULT;
+
 	if (!user_access_begin(frame, sizeof(*frame)))
 		return -EFAULT;
 
@@ -334,11 +354,6 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig,
 	unsafe_put_user(0, &frame->uc.uc_link, Efault);
 	unsafe_compat_save_altstack(&frame->uc.uc_stack, regs->sp, Efault);
 
-	if (ksig->ka.sa.sa_flags & SA_RESTORER)
-		restorer = ksig->ka.sa.sa_restorer;
-	else
-		restorer = current->mm->context.vdso +
-			vdso_image_32.sym___kernel_rt_sigreturn;
 	unsafe_put_user(ptr_to_compat(restorer), &frame->pretcode, Efault);
 
 	/*
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index aa533700ba31..2f7940d68ce3 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -24,6 +24,8 @@ void shstk_disable(void);
 int shstk_setup_rstor_token(bool ia32, unsigned long restorer,
 			    unsigned long *new_ssp);
 int shstk_check_rstor_token(bool ia32, unsigned long *new_ssp);
+int setup_signal_shadow_stack(int ia32, void __user *restorer);
+int restore_signal_shadow_stack(void);
 #else
 static inline int shstk_setup(void) { return 0; }
 static inline int shstk_alloc_thread_stack(struct task_struct *p,
@@ -35,6 +37,8 @@ static inline int shstk_setup_rstor_token(bool ia32, unsigned long restorer,
 					  unsigned long *new_ssp) { return 0; }
 static inline int shstk_check_rstor_token(bool ia32,
 					  unsigned long *new_ssp) { return 0; }
+static inline int setup_signal_shadow_stack(int ia32, void __user *restorer) { return 0; }
+static inline int restore_signal_shadow_stack(void) { return 0; }
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 89c7da3cdb92..b3d64cfa28eb 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -331,3 +331,45 @@ int shstk_check_rstor_token(bool proc32, unsigned long *new_ssp)
 
 	return 0;
 }
+
+int setup_signal_shadow_stack(int ia32, void __user *restorer)
+{
+	struct thread_shstk *shstk = &current->thread.shstk;
+	unsigned long new_ssp;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) || !shstk->size)
+		return 0;
+
+	err = shstk_setup_rstor_token(ia32, (unsigned long)restorer,
+				      &new_ssp);
+	if (err)
+		return err;
+
+	start_update_msrs();
+	err = wrmsrl_safe(MSR_IA32_PL3_SSP, new_ssp);
+	end_update_msrs();
+
+	return err;
+}
+
+int restore_signal_shadow_stack(void)
+{
+	struct thread_shstk *shstk = &current->thread.shstk;
+	int ia32 = in_ia32_syscall();
+	unsigned long new_ssp;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) || !shstk->size)
+		return 0;
+
+	err = shstk_check_rstor_token(ia32, &new_ssp);
+	if (err)
+		return err;
+
+	start_update_msrs();
+	err = wrmsrl_safe(MSR_IA32_PL3_SSP, new_ssp);
+	end_update_msrs();
+
+	return err;
+}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index f4d21e470083..661e46803b84 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -46,6 +46,7 @@
 #include <asm/syscall.h>
 #include <asm/sigframe.h>
 #include <asm/signal.h>
+#include <asm/cet.h>
 
 #ifdef CONFIG_X86_64
 /*
@@ -471,6 +472,9 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig,
 	frame = get_sigframe(&ksig->ka, regs, sizeof(struct rt_sigframe), &fp);
 	uc_flags = frame_uc_flags(regs);
 
+	if (setup_signal_shadow_stack(0, ksig->ka.sa.sa_restorer))
+		return -EFAULT;
+
 	if (!user_access_begin(frame, sizeof(*frame)))
 		return -EFAULT;
 
@@ -576,6 +580,9 @@ static int x32_setup_rt_frame(struct ksignal *ksig,
 
 	uc_flags = frame_uc_flags(regs);
 
+	if (setup_signal_shadow_stack(0, ksig->ka.sa.sa_restorer))
+		return -EFAULT;
+
 	if (!user_access_begin(frame, sizeof(*frame)))
 		return -EFAULT;
 
@@ -674,6 +681,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	if (restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
 		goto badframe;
 
+	if (restore_signal_shadow_stack())
+		goto badframe;
+
 	if (restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
@@ -932,6 +942,9 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn)
 	if (restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
 		goto badframe;
 
+	if (restore_signal_shadow_stack())
+		goto badframe;
+
 	if (compat_restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
-- 
2.21.0



* [PATCH v28 28/32] ELF: Introduce arch_setup_elf_property()
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (26 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 27/32] x86/cet/shstk: Handle signals for shadow stack Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 29/32] x86/cet/shstk: Add arch_prctl functions for shadow stack Yu-cheng Yu
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Catalin Marinas, Mark Brown

An ELF file's .note.gnu.property indicates arch features supported by the
file.  These features are extracted by arch_parse_elf_property() and stored
in 'arch_elf_state'.

Introduce x86 feature definitions and arch_setup_elf_property(), which
enables such features.  The first use-case of this function is Shadow
Stack.
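
The parse step and the shadow stack check can be sketched in user space as
below.  The GNU_PROPERTY_* values are the ones this patch adds to
include/uapi/linux/elf.h; struct elf_state_sketch, parse_property() and
wants_shstk() are hypothetical stand-ins, and -1 is returned where the
patch returns -ENOEXEC:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002u
#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001u
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002u

/* Hypothetical stand-in for struct arch_elf_state. */
struct elf_state_sketch {
	unsigned int gnu_property;
};

/*
 * Record the feature bits of one .note.gnu.property entry.  Entries of
 * other types are ignored; a malformed size is rejected.
 */
static int parse_property(uint32_t type, const void *data, size_t datasz,
			  struct elf_state_sketch *state)
{
	assert(state != NULL);

	if (type != GNU_PROPERTY_X86_FEATURE_1_AND)
		return 0;

	if (datasz != sizeof(unsigned int))
		return -1;

	memcpy(&state->gnu_property, data, sizeof(unsigned int));
	return 0;
}

/* The setup step enables shadow stack only if the file asks for it. */
static int wants_shstk(const struct elf_state_sketch *state)
{
	return (state->gnu_property & GNU_PROPERTY_X86_FEATURE_1_SHSTK) != 0;
}
```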

ARM64 is the other arch that has ARCH_USE_GNU_PROPERTY and
arch_parse_elf_property().  Add arch_setup_elf_property() for it.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Brown <broonie@kernel.org>
---
v27:
- Make X86_64 select ARCH_USE_GNU_PROPERTY and ARCH_BINFMT_ELF_STATE and
  remove #ifdef's.
- Add link to x86-64-psABI document.

 arch/arm64/include/asm/elf.h |  5 +++++
 arch/x86/Kconfig             |  2 ++
 arch/x86/include/asm/elf.h   | 11 +++++++++++
 arch/x86/kernel/process_64.c | 27 +++++++++++++++++++++++++++
 fs/binfmt_elf.c              |  4 ++++
 include/linux/elf.h          |  6 ++++++
 include/uapi/linux/elf.h     | 14 ++++++++++++++
 7 files changed, 69 insertions(+)

diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index 8d1c8dcb87fd..d37bc7915935 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -281,6 +281,11 @@ static inline int arch_parse_elf_property(u32 type, const void *data,
 	return 0;
 }
 
+static inline int arch_setup_elf_property(struct arch_elf_state *arch)
+{
+	return 0;
+}
+
 static inline int arch_elf_pt_proc(void *ehdr, void *phdr,
 				   struct file *f, bool is_interp,
 				   struct arch_elf_state *state)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index de992d3408b2..7a2991457497 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86_64
 	select ARCH_HAS_SHADOW_STACK
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_USE_GNU_PROPERTY
 	select HAVE_ARCH_SOFT_DIRTY
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
@@ -61,6 +62,7 @@ config X86
 	select ACPI_LEGACY_TABLES_LOOKUP	if ACPI
 	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
 	select ARCH_32BIT_OFF_T			if X86_32
+	select ARCH_BINFMT_ELF_STATE
 	select ARCH_CLOCKSOURCE_INIT
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM)
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 29fea180a665..3281a3d01bd2 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -394,6 +394,17 @@ extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
 
 extern bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs);
 
+struct arch_elf_state {
+	unsigned int gnu_property;
+};
+
+#define INIT_ARCH_ELF_STATE {	\
+	.gnu_property = 0,	\
+}
+
+#define arch_elf_pt_proc(ehdr, phdr, elf, interp, state) (0)
+#define arch_check_elf(ehdr, interp, interp_ehdr, state) (0)
+
 /* Do not change the values. See get_align_mask() */
 enum align_flags {
 	ALIGN_VA_32	= BIT(0),
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ec0d836a13b1..4271963fdd8c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -861,3 +861,30 @@ unsigned long KSTK_ESP(struct task_struct *task)
 {
 	return task_pt_regs(task)->sp;
 }
+
+int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
+			    bool compat, struct arch_elf_state *state)
+{
+	if (type != GNU_PROPERTY_X86_FEATURE_1_AND)
+		return 0;
+
+	if (datasz != sizeof(unsigned int))
+		return -ENOEXEC;
+
+	state->gnu_property = *(unsigned int *)data;
+	return 0;
+}
+
+int arch_setup_elf_property(struct arch_elf_state *state)
+{
+	int r = 0;
+
+#ifdef CONFIG_X86_SHADOW_STACK
+	memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
+
+	if (state->gnu_property & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+		r = shstk_setup();
+#endif
+
+	return r;
+}
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 439ed81e755a..6a1936dab6b4 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1248,6 +1248,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 
 	set_binfmt(&elf_format);
 
+	retval = arch_setup_elf_property(&arch_state);
+	if (retval < 0)
+		goto out;
+
 #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
 	retval = ARCH_SETUP_ADDITIONAL_PAGES(bprm, elf_ex, !!interpreter);
 	if (retval < 0)
diff --git a/include/linux/elf.h b/include/linux/elf.h
index c9a46c4e183b..be04d15e937f 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -92,9 +92,15 @@ static inline int arch_parse_elf_property(u32 type, const void *data,
 {
 	return 0;
 }
+
+static inline int arch_setup_elf_property(struct arch_elf_state *arch)
+{
+	return 0;
+}
 #else
 extern int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
 				   bool compat, struct arch_elf_state *arch);
+extern int arch_setup_elf_property(struct arch_elf_state *arch);
 #endif
 
 #ifdef CONFIG_ARCH_HAVE_ELF_PROT
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 61bf4774b8f2..f50b3ce7bb75 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -456,4 +456,18 @@ typedef struct elf64_note {
 /* Bits for GNU_PROPERTY_AARCH64_FEATURE_1_BTI */
 #define GNU_PROPERTY_AARCH64_FEATURE_1_BTI	(1U << 0)
 
+/*
+ * See the x86-64 psABI at:
+ * https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/x86-64-psABI
+ * .note.gnu.property types for x86:
+ */
+/* 0xc0000000 and 0xc0000001 are reserved */
+#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002
+
+/* Bits for GNU_PROPERTY_X86_FEATURE_1_AND */
+#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001
+#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002
+#define GNU_PROPERTY_X86_FEATURE_1_VALID (GNU_PROPERTY_X86_FEATURE_1_IBT | \
+					   GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+
 #endif /* _UAPI_LINUX_ELF_H */
-- 
2.21.0



* [PATCH v28 29/32] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (27 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 28/32] ELF: Introduce arch_setup_elf_property() Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 30/32] mm: Move arch_calc_vm_prot_bits() to arch/x86/include/asm/mman.h Yu-cheng Yu
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu

arch_prctl(ARCH_X86_CET_STATUS, u64 *args)
    Get CET feature status.

    The parameter 'args' is a pointer to a user buffer.  The kernel returns
    the following information:

    *args = shadow stack/IBT status
    *(args + 1) = shadow stack base address
    *(args + 2) = shadow stack size

    32-bit binaries use the same interface, but only the lower 32 bits of
    each item.

arch_prctl(ARCH_X86_CET_DISABLE, unsigned int features)
    Disable CET features specified in 'features'.  Return -EPERM if CET is
    locked.

arch_prctl(ARCH_X86_CET_LOCK)
    Lock in CET features.

Also change do_arch_prctl_common()'s parameter 'cpuid_enabled' to
'arg2', as it is now also passed to prctl_cet().
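
As an illustration only (not part of this patch), the three-word buffer
returned by ARCH_X86_CET_STATUS could be decoded in user space roughly as
follows.  The struct and helper names are hypothetical; the bit values are
the GNU_PROPERTY_X86_FEATURE_1_* definitions added by this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined by this series in include/uapi/linux/elf.h. */
#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001U
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002U

/* Hypothetical container for the decoded status; not in the patch. */
struct cet_status {
	bool shstk_enabled;
	bool ibt_enabled;
	uint64_t shstk_base;
	uint64_t shstk_size;
};

/*
 * Decode the buffer layout described above:
 *   args[0] = shadow stack/IBT status bits
 *   args[1] = shadow stack base address
 *   args[2] = shadow stack size
 */
static struct cet_status decode_cet_status(const uint64_t args[3])
{
	struct cet_status st;

	st.shstk_enabled = (args[0] & GNU_PROPERTY_X86_FEATURE_1_SHSTK) != 0;
	st.ibt_enabled = (args[0] & GNU_PROPERTY_X86_FEATURE_1_IBT) != 0;
	st.shstk_base = args[1];
	st.shstk_size = args[2];
	return st;
}
```

A caller would presumably fill the buffer with something like
syscall(SYS_arch_prctl, ARCH_X86_CET_STATUS, buf), which can only succeed
on a kernel carrying this series.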

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/cet.h        |  7 ++++
 arch/x86/include/uapi/asm/prctl.h |  4 +++
 arch/x86/kernel/Makefile          |  1 +
 arch/x86/kernel/cet_prctl.c       | 60 +++++++++++++++++++++++++++++++
 arch/x86/kernel/process.c         |  6 ++--
 5 files changed, 75 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cet_prctl.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 2f7940d68ce3..c76a85fbd59f 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -13,6 +13,7 @@ struct task_struct;
 struct thread_shstk {
 	u64	base;
 	u64	size;
+	u64	locked:1;
 };
 
 #ifdef CONFIG_X86_SHADOW_STACK
@@ -41,6 +42,12 @@ static inline int setup_signal_shadow_stack(int ia32, void __user *restorer) { r
 static inline int restore_signal_shadow_stack(void) { return 0; }
 #endif
 
+#ifdef CONFIG_X86_SHADOW_STACK
+int prctl_cet(int option, u64 arg2);
+#else
+static inline int prctl_cet(int option, u64 arg2) { return -EINVAL; }
+#endif
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..9245bf629120 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -14,4 +14,8 @@
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
 
+#define ARCH_X86_CET_STATUS		0x3001
+#define ARCH_X86_CET_DISABLE		0x3002
+#define ARCH_X86_CET_LOCK		0x3003
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9e064845e497..39e826b5cabd 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -151,6 +151,7 @@ obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)		+= sev.o
 obj-$(CONFIG_X86_SHADOW_STACK)		+= shstk.o
+obj-$(CONFIG_X86_SHADOW_STACK)		+= cet_prctl.o
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
new file mode 100644
index 000000000000..b426d200e070
--- /dev/null
+++ b/arch/x86/kernel/cet_prctl.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/errno.h>
+#include <linux/uaccess.h>
+#include <linux/prctl.h>
+#include <linux/compat.h>
+#include <linux/mman.h>
+#include <linux/elfcore.h>
+#include <linux/processor.h>
+#include <asm/prctl.h>
+#include <asm/cet.h>
+
+/* See Documentation/x86/intel_cet.rst. */
+
+static int cet_copy_status_to_user(struct thread_shstk *shstk, u64 __user *ubuf)
+{
+	u64 buf[3] = {};
+
+	if (shstk->size) {
+		buf[0] |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
+		buf[1] = shstk->base;
+		buf[2] = shstk->size;
+	}
+
+	return copy_to_user(ubuf, buf, sizeof(buf));
+}
+
+int prctl_cet(int option, u64 arg2)
+{
+	struct thread_shstk *shstk;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return -EOPNOTSUPP;
+
+	shstk = &current->thread.shstk;
+
+	if (option == ARCH_X86_CET_STATUS)
+		return cet_copy_status_to_user(shstk, (u64 __user *)arg2);
+
+	switch (option) {
+	case ARCH_X86_CET_DISABLE:
+		if (shstk->locked)
+			return -EPERM;
+
+		if (arg2 & ~GNU_PROPERTY_X86_FEATURE_1_VALID)
+			return -EINVAL;
+		if (arg2 & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+			shstk_disable();
+		return 0;
+
+	case ARCH_X86_CET_LOCK:
+		if (arg2)
+			return -EINVAL;
+		shstk->locked = 1;
+		return 0;
+
+	default:
+		return -ENOSYS;
+	}
+}
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bade6a594d63..7d8ccebdcab1 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -1006,14 +1006,14 @@ unsigned long get_wchan(struct task_struct *p)
 }
 
 long do_arch_prctl_common(struct task_struct *task, int option,
-			  unsigned long cpuid_enabled)
+			  unsigned long arg2)
 {
 	switch (option) {
 	case ARCH_GET_CPUID:
 		return get_cpuid_mode();
 	case ARCH_SET_CPUID:
-		return set_cpuid_mode(task, cpuid_enabled);
+		return set_cpuid_mode(task, arg2);
 	}
 
-	return -EINVAL;
+	return prctl_cet(option, arg2);
 }
-- 
2.21.0



* [PATCH v28 30/32] mm: Move arch_calc_vm_prot_bits() to arch/x86/include/asm/mman.h
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (28 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 29/32] x86/cet/shstk: Add arch_prctl functions for shadow stack Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 31/32] mm: Update arch_validate_flags() to test vma anonymous Yu-cheng Yu
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

To prepare for the introduction of PROT_SHADOW_STACK and to be consistent
with other architectures, move arch_vm_get_page_prot() and
arch_calc_vm_prot_bits() to arch/x86/include/asm/mman.h.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/mman.h      | 30 ++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/mman.h | 28 +++-------------------------
 2 files changed, 33 insertions(+), 25 deletions(-)
 create mode 100644 arch/x86/include/asm/mman.h

diff --git a/arch/x86/include/asm/mman.h b/arch/x86/include/asm/mman.h
new file mode 100644
index 000000000000..629f6c81263a
--- /dev/null
+++ b/arch/x86/include/asm/mman.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_MMAN_H
+#define _ASM_X86_MMAN_H
+
+#include <linux/mm.h>
+#include <uapi/asm/mman.h>
+
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ *
+ * Only override these if Protection Keys are available
+ * (which is only on 64-bit).
+ */
+#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
+		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define arch_calc_vm_prot_bits(prot, key) (		\
+		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
+		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
+		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
+		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
+#endif
+
+#endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index d4a8d0424bfb..f28fa4acaeaf 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -1,31 +1,9 @@
 /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-#ifndef _ASM_X86_MMAN_H
-#define _ASM_X86_MMAN_H
+#ifndef _UAPI_ASM_X86_MMAN_H
+#define _UAPI_ASM_X86_MMAN_H
 
 #define MAP_32BIT	0x40		/* only give out 32bit addresses */
 
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
-/*
- * Take the 4 protection key bits out of the vma->vm_flags
- * value and turn them in to the bits that we can put in
- * to a pte.
- *
- * Only override these if Protection Keys are available
- * (which is only on 64-bit).
- */
-#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
-		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
-		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
-		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
-		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
-
-#define arch_calc_vm_prot_bits(prot, key) (		\
-		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
-		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
-		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
-		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
-#endif
-
 #include <asm-generic/mman.h>
 
-#endif /* _ASM_X86_MMAN_H */
+#endif /* _UAPI_ASM_X86_MMAN_H */
-- 
2.21.0



* [PATCH v28 31/32] mm: Update arch_validate_flags() to test vma anonymous
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (29 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 30/32] mm: Move arch_calc_vm_prot_bits() to arch/x86/include/asm/mman.h Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 20:52 ` [PATCH v28 32/32] mm: Introduce PROT_SHADOW_STACK for shadow stack Yu-cheng Yu
  2021-07-22 21:08 ` [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Dave Hansen
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov, Catalin Marinas,
	Vincenzo Frascino, Will Deacon

When newer VM flags are being created, such as VM_MTE, it becomes necessary
for mmap/mprotect to verify whether certain flags are being applied to an
anonymous VMA.

To solve this, one approach is adding a VM flag to track that MAP_ANONYMOUS
is specified [1], and then using the flag in arch_validate_flags().

Another approach is passing the VMA to arch_validate_flags(), and checking
vma_is_anonymous().

To prepare for the introduction of PROT_SHADOW_STACK, which creates a shadow
stack mapping and can be applied only to an anonymous VMA, update
arch_validate_flags() to pass in the VMA.

[1] commit 9f3419315f3c ("arm64: mte: Add PROT_MTE support to mmap() and mprotect()")
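
As a rough user-space model of the second approach (illustrative only, not
kernel code): once the VMA is available to the validation hook, the
anonymous check can sit next to the flag check.  The struct, the sketch_*
names, and the flag bit below are hypothetical stand-ins; only the
"no vm_ops means anonymous" rule mirrors the kernel's vma_is_anonymous().

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Placeholder bit for illustration; the real VM_SHADOW_STACK value is
 * defined elsewhere in the series.
 */
#define SKETCH_VM_SHADOW_STACK	(1UL << 0)

/* Hypothetical stand-in for struct vm_area_struct. */
struct sketch_vma {
	const void *vm_ops;	/* NULL for an anonymous mapping */
};

/* Mirrors the kernel's vma_is_anonymous(): no vm_ops means anonymous. */
static bool sketch_vma_is_anonymous(const struct sketch_vma *vma)
{
	return vma->vm_ops == NULL;
}

/* Reject a shadow stack flag on anything but an anonymous VMA. */
static bool sketch_validate_flags(const struct sketch_vma *vma,
				  unsigned long vm_flags)
{
	if ((vm_flags & SKETCH_VM_SHADOW_STACK) &&
	    !sketch_vma_is_anonymous(vma))
		return false;

	return true;
}
```

With only vm_flags as input, the same check would need an extra flag to
record that the mapping is anonymous, which is the first approach above.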

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/mman.h | 4 ++--
 arch/sparc/include/asm/mman.h | 4 ++--
 include/linux/mman.h          | 2 +-
 mm/mmap.c                     | 2 +-
 mm/mprotect.c                 | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
index e3e28f7daf62..7c45e7578f78 100644
--- a/arch/arm64/include/asm/mman.h
+++ b/arch/arm64/include/asm/mman.h
@@ -74,7 +74,7 @@ static inline bool arch_validate_prot(unsigned long prot,
 }
 #define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr)
 
-static inline bool arch_validate_flags(unsigned long vm_flags)
+static inline bool arch_validate_flags(struct vm_area_struct *vma, unsigned long vm_flags)
 {
 	if (!system_supports_mte())
 		return true;
@@ -82,6 +82,6 @@ static inline bool arch_validate_flags(unsigned long vm_flags)
 	/* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */
 	return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
 }
-#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
+#define arch_validate_flags(vma, vm_flags) arch_validate_flags(vma, vm_flags)
 
 #endif /* ! __ASM_MMAN_H__ */
diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
index 274217e7ed70..0ec4975f167d 100644
--- a/arch/sparc/include/asm/mman.h
+++ b/arch/sparc/include/asm/mman.h
@@ -60,11 +60,11 @@ static inline int sparc_validate_prot(unsigned long prot, unsigned long addr)
 	return 1;
 }
 
-#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
+#define arch_validate_flags(vma, vm_flags) arch_validate_flags(vma, vm_flags)
 /* arch_validate_flags() - Ensure combination of flags is valid for a
  *	VMA.
  */
-static inline bool arch_validate_flags(unsigned long vm_flags)
+static inline bool arch_validate_flags(struct vm_area_struct *vma, unsigned long vm_flags)
 {
 	/* If ADI is being enabled on this VMA, check for ADI
 	 * capability on the platform and ensure VMA is suitable
diff --git a/include/linux/mman.h b/include/linux/mman.h
index ebb09a964272..b6a9414e806c 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -116,7 +116,7 @@ static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
  *
  * Returns true if the VM_* flags are valid.
  */
-static inline bool arch_validate_flags(unsigned long flags)
+static inline bool arch_validate_flags(struct vm_area_struct *vma, unsigned long flags)
 {
 	return true;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 100db6e46831..fe7afd968087 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1853,7 +1853,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	}
 
 	/* Allow architectures to sanity-check the vm_flags */
-	if (!arch_validate_flags(vma->vm_flags)) {
+	if (!arch_validate_flags(vma, vma->vm_flags)) {
 		error = -EINVAL;
 		if (file)
 			goto unmap_and_free_vma;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94cb799216ec..e826ecb68e3a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -621,7 +621,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		}
 
 		/* Allow architectures to sanity-check the new flags */
-		if (!arch_validate_flags(newflags)) {
+		if (!arch_validate_flags(vma, newflags)) {
 			error = -EINVAL;
 			goto out;
 		}
-- 
2.21.0



* [PATCH v28 32/32] mm: Introduce PROT_SHADOW_STACK for shadow stack
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (30 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 31/32] mm: Update arch_validate_flags() to test vma anonymous Yu-cheng Yu
@ 2021-07-22 20:52 ` Yu-cheng Yu
  2021-07-22 21:08 ` [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Dave Hansen
  32 siblings, 0 replies; 62+ messages in thread
From: Yu-cheng Yu @ 2021-07-22 20:52 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Kirill A . Shutemov

There are three possible options to create a shadow stack allocation API:
an arch_prctl, a new syscall, or adding PROT_SHADOW_STACK to mmap() and
mprotect().  Each has its advantages and compromises.

An arch_prctl() is the least intrusive.  However, the existing x86
arch_prctl() takes only two parameters.  Multiple parameters must be
passed in a memory buffer.  There is a proposal to pass more parameters in
registers [1], but no active discussion on that.

A new syscall minimizes compatibility issues and offers an extensible
framework to other architectures, but this will likely result in some
overlap with mmap()/mprotect().

The introduction of PROT_SHADOW_STACK to mmap()/mprotect() takes advantage
of existing APIs.  The x86-specific PROT_SHADOW_STACK is translated to
VM_SHADOW_STACK and a shadow stack mapping is created without reinventing
the wheel.  There are potential pitfalls though.  The most obvious one
would be using this as a bypass to shadow stack protection.  However, the
attacker would have to get to the syscall first.

[1] https://lore.kernel.org/lkml/20200828121624.108243-1-hjl.tools@gmail.com/
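
For illustration, the protection rules this patch enforces can be modelled
in user space as below.  This is a sketch under assumptions, not the kernel
implementation: the SK_* names and the shstk_enabled parameter are
hypothetical (the kernel checks current->thread.shstk.size instead).  The
numeric values follow the Linux UAPI, with 0x10 being the
PROT_SHADOW_STACK value proposed here.

```c
#include <stdbool.h>

/* Values match the Linux UAPI; 0x10 is the value proposed by this patch. */
#define SK_PROT_READ		0x1UL
#define SK_PROT_WRITE		0x2UL
#define SK_PROT_EXEC		0x4UL
#define SK_PROT_SEM		0x8UL
#define SK_PROT_SHADOW_STACK	0x10UL

/*
 * Model of the validation rules:
 *  - unknown protection bits are rejected;
 *  - PROT_SHADOW_STACK requires shadow stack to be enabled for the task;
 *  - PROT_SHADOW_STACK and PROT_WRITE are mutually exclusive.
 */
static bool sketch_validate_prot(unsigned long prot, bool shstk_enabled)
{
	const unsigned long valid = SK_PROT_READ | SK_PROT_WRITE |
				    SK_PROT_EXEC | SK_PROT_SEM |
				    SK_PROT_SHADOW_STACK;

	if (prot & ~valid)
		return false;

	if (prot & SK_PROT_SHADOW_STACK) {
		if (!shstk_enabled)
			return false;
		if (prot & SK_PROT_WRITE)
			return false;
	}

	return true;
}
```

Under this model, a request such as PROT_READ | PROT_SHADOW_STACK is
accepted only from a task that already runs with shadow stack enabled.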

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/mman.h      | 60 +++++++++++++++++++++++++++++++-
 arch/x86/include/uapi/asm/mman.h |  2 ++
 include/linux/mm.h               |  1 +
 3 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mman.h b/arch/x86/include/asm/mman.h
index 629f6c81263a..b77933923b9a 100644
--- a/arch/x86/include/asm/mman.h
+++ b/arch/x86/include/asm/mman.h
@@ -20,11 +20,69 @@
 		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
 
-#define arch_calc_vm_prot_bits(prot, key) (		\
+#define pkey_vm_prot_bits(prot, key) (			\
 		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
 		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
 		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
+#else
+#define pkey_vm_prot_bits(prot, key) (0)
 #endif
 
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+						   unsigned long pkey)
+{
+	unsigned long vm_prot_bits = pkey_vm_prot_bits(prot, pkey);
+
+	if (prot & PROT_SHADOW_STACK)
+		vm_prot_bits |= VM_SHADOW_STACK;
+
+	return vm_prot_bits;
+}
+
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
+
+#ifdef CONFIG_X86_SHADOW_STACK
+static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
+{
+	unsigned long valid = PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM |
+			      PROT_SHADOW_STACK;
+
+	if (prot & ~valid)
+		return false;
+
+	if (prot & PROT_SHADOW_STACK) {
+		if (!current->thread.shstk.size)
+			return false;
+
+		/*
+		 * A shadow stack mapping is indirectly writable only by
+		 * the CALL and WRUSS instructions, not by other write
+		 * instructions.  PROT_SHADOW_STACK and PROT_WRITE are
+		 * mutually exclusive.
+		 */
+		if (prot & PROT_WRITE)
+			return false;
+	}
+
+	return true;
+}
+
+#define arch_validate_prot arch_validate_prot
+
+static inline bool arch_validate_flags(struct vm_area_struct *vma, unsigned long vm_flags)
+{
+	/*
+	 * Shadow stack must be anonymous and not shared.
+	 */
+	if ((vm_flags & VM_SHADOW_STACK) && !vma_is_anonymous(vma))
+		return false;
+
+	return true;
+}
+
+#define arch_validate_flags(vma, vm_flags) arch_validate_flags(vma, vm_flags)
+
+#endif /* CONFIG_X86_SHADOW_STACK */
+
 #endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index f28fa4acaeaf..4c36b263cf0a 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -4,6 +4,8 @@
 
 #define MAP_32BIT	0x40		/* only give out 32bit addresses */
 
+#define PROT_SHADOW_STACK	0x10	/* shadow stack pages */
+
 #include <asm-generic/mman.h>
 
 #endif /* _UAPI_ASM_X86_MMAN_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 07e642af59d3..041e7e8ff702 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -349,6 +349,7 @@ extern unsigned int kobjsize(const void *objp);
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
+# define VM_ARCH_CLEAR	VM_SHADOW_STACK
 #elif defined(CONFIG_PPC)
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
-- 
2.21.0



* Re: [PATCH v28 25/32] x86/cet/shstk: Handle thread shadow stack
  2021-07-22 20:52 ` [PATCH v28 25/32] x86/cet/shstk: Handle thread shadow stack Yu-cheng Yu
@ 2021-07-22 21:05   ` Dave Hansen
  2021-07-23 17:30     ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2021-07-22 21:05 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe

On 7/22/21 1:52 PM, Yu-cheng Yu wrote:
> +	if (!stack_size)
> +		stack_size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
> +
> +	if (!shstk->size)
> +		return 0;
> +
> +	/*
> +	 * For CLONE_VM, except vfork, the child needs a separate shadow
> +	 * stack.
> +	 */
> +	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
> +		return 0;
> +
> +	/*
> +	 * This is in clone() syscall and fpu__copy() already copies xstates
> +	 * from the parent.  If get_xsave_addr() returns null, then XFEATURE_
> +	 * CET_USER is still in init state, which certainly is an error.
> +	 */
> +	state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
> +	if (!state)
> +		return -EINVAL;

I don't care much for that comment.

This code is meant to copy shadow stack config information into children
when it is already enabled.  We *just* checked for that above in the
"shstk->size" check.  The fact that this is called from clone() is
irrelevant.  The shadow stack enabling status *is*.

I think I'd rather this be more along the lines of:

	/*
	 * 'tsk' is configured with a shadow stack and the fpu.state is
	 * up to date since it was just copied from the parent.  There
	 * must be a valid non-init CET state location in the buffer.
	 */

There is also a strong enough assumption violation that I'd probably
WARN() in addition to returning -EINVAL.
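
For reference, the flag test in the quoted hunk can be modelled in user
space as follows.  This is only a sketch: the helper name and the SK_*
macros are hypothetical, and the numeric values are those of CLONE_VM and
CLONE_VFORK in the Linux UAPI headers.

```c
#include <stdbool.h>

/* Same values as CLONE_VM and CLONE_VFORK in the Linux UAPI headers. */
#define SK_CLONE_VM	0x00000100UL
#define SK_CLONE_VFORK	0x00004000UL

/*
 * A child gets a separate shadow stack only when the parent has one and
 * the child shares the address space (CLONE_VM) without being a vfork()
 * child.  A plain fork() child gets a copied address space, and a vfork()
 * child runs while the parent is suspended, so neither needs a new one.
 */
static bool child_needs_new_shstk(unsigned long clone_flags,
				  bool parent_shstk_enabled)
{
	if (!parent_shstk_enabled)
		return false;

	return (clone_flags & (SK_CLONE_VFORK | SK_CLONE_VM)) == SK_CLONE_VM;
}
```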


* Re: [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack
  2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (31 preceding siblings ...)
  2021-07-22 20:52 ` [PATCH v28 32/32] mm: Introduce PROT_SHADOW_STACK for shadow stack Yu-cheng Yu
@ 2021-07-22 21:08 ` Dave Hansen
  2021-07-23 17:28   ` Yu, Yu-cheng
  32 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2021-07-22 21:08 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe

On 7/22/21 1:51 PM, Yu-cheng Yu wrote:
> Linux distributions with CET are available now, and Intel processors with CET
> are already on the market.  It would be nice if CET support can be accepted
> into the kernel.
> 
> Changes in v28:
> - Rebase to Linus tree v5.14-rc2.
> - Patch #1: Update Document to indicate no-user-shstk also disables IBT.
> - Patch #23: Update shstk_setup() with wrmsrl_safe().  Update return value.
> - Patch #25: Split out copy_thread() changes.  Add support for old clone().
>   Add comments.
> - Add comments for get_xsave_addr() (Patch #25, #26).

Could you characterize where this whole thing is?

Are we at the point where the feedback is slowing down?  What kind of
feedback are you getting?  How stable is the ABI versus the last revision?


* Re: [PATCH v28 26/32] x86/cet/shstk: Introduce shadow stack token setup/verify routines
  2021-07-22 20:52 ` [PATCH v28 26/32] x86/cet/shstk: Introduce shadow stack token setup/verify routines Yu-cheng Yu
@ 2021-07-22 21:15   ` Dave Hansen
  2021-07-23 18:01     ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2021-07-22 21:15 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe

On 7/22/21 1:52 PM, Yu-cheng Yu wrote:
> +	if (fpregs_state_valid(fpu, smp_processor_id())) {
> +		rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +	} else {
> +		struct cet_user_state *p;
> +
> +		/*
> +		 * When !fpregs_state_valid() and get_xsave_addr() returns
> +		 * null, XFEATURE_CET_USER is in init state.  Shadow stack
> +		 * pointer is null in this case, so return zero.
> +		 */
> +		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
> +		if (p)
> +			ssp = p->user_ssp;
> +	}
> +
> +	fpregs_unlock();

Why are we even calling into this code if shadow stacks might be
disabled?  Seems like we should have just errored out long before
getting here.


* Re: [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack
  2021-07-22 21:08 ` [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Dave Hansen
@ 2021-07-23 17:28   ` Yu, Yu-cheng
  0 siblings, 0 replies; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-07-23 17:28 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar, Dave Martin,
	Weijiang Yang, Pengfei Xu, Haitao Huang, Rick P Edgecombe

On 7/22/2021 2:08 PM, Dave Hansen wrote:
> On 7/22/21 1:51 PM, Yu-cheng Yu wrote:
>> Linux distributions with CET are available now, and Intel processors with CET
>> are already on the market.  It would be nice if CET support can be accepted
>> into the kernel.
>>
>> Changes in v28:
>> - Rebase to Linus tree v5.14-rc2.
>> - Patch #1: Update Document to indicate no-user-shstk also disables IBT.
>> - Patch #23: Update shstk_setup() with wrmsrl_safe().  Update return value.
>> - Patch #25: Split out copy_thread() changes.  Add support for old clone().
>>    Add comments.
>> - Add comments for get_xsave_addr() (Patch #25, #26).
> 
> Could you characterize where this whole thing is?
> 
> Are we at the point where the feedback is slowing down?  What kind of
> feedback are you getting?  How stable is the ABI versus the last revision?
> 

The ABI has not changed since the last version, except for the addition of
shadow stack support for legacy clone().  This does not destabilize the
ABI.

Looking back at recent feedback:

- Boris gave many comments on code flow, syntax, etc.  Those are all
addressed.

- Andy L. commented on the signal handling part, especially the
introduction of a ucontext extension.  That is eliminated and now there
is the UC_WAIT_ENDBR flag.

- Kirill commented on a few issues in the mm patches.  Those are addressed.

- Peter Z. requested splitting shadow stack and IBT.  That is done.

As for running/testing of the series, overall it is stable.

Yu-cheng


* Re: [PATCH v28 25/32] x86/cet/shstk: Handle thread shadow stack
  2021-07-22 21:05   ` Dave Hansen
@ 2021-07-23 17:30     ` Yu, Yu-cheng
  0 siblings, 0 replies; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-07-23 17:30 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar, Dave Martin,
	Weijiang Yang, Pengfei Xu, Haitao Huang, Rick P Edgecombe

On 7/22/2021 2:05 PM, Dave Hansen wrote:
> On 7/22/21 1:52 PM, Yu-cheng Yu wrote:
>> +	if (!stack_size)
>> +		stack_size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
>> +
>> +	if (!shstk->size)
>> +		return 0;
>> +
>> +	/*
>> +	 * For CLONE_VM, except vfork, the child needs a separate shadow
>> +	 * stack.
>> +	 */
>> +	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
>> +		return 0;
>> +
>> +	/*
>> +	 * This is in clone() syscall and fpu__copy() already copies xstates
>> +	 * from the parent.  If get_xsave_addr() returns null, then XFEATURE_
>> +	 * CET_USER is still in init state, which certainly is an error.
>> +	 */
>> +	state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
>> +	if (!state)
>> +		return -EINVAL;
> 
> I don't care much for that comment.
> 
> This code is meant to copy shadow stack config information into children
> when it is already enabled.  We *just* checked for that above in the
> "shstk->size" check.  The fact that this is called from clone() is
> irrelevant.  The shadow stack enabling status *is*.
> 
> I think I'd rather this be more along the lines of:
> 
> 	/*
> 	 * 'tsk' is configured with a shadow stack and the fpu.state is
> 	 * up to date since it was just copied from the parent.  There
> 	 * must be a valid non-init CET state location in the buffer.
> 	 */
> 
> There is also a strong enough assumption violation that I'd probably
> WARN() in addition to returning -EINVAL.
> 

Yes, I will update the comment and put in the WARN().

Thanks,
Yu-cheng

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 26/32] x86/cet/shstk: Introduce shadow stack token setup/verify routines
  2021-07-22 21:15   ` Dave Hansen
@ 2021-07-23 18:01     ` Yu, Yu-cheng
  0 siblings, 0 replies; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-07-23 18:01 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe

On 7/22/2021 2:15 PM, Dave Hansen wrote:
> On 7/22/21 1:52 PM, Yu-cheng Yu wrote:
>> +	if (fpregs_state_valid(fpu, smp_processor_id())) {
>> +		rdmsrl(MSR_IA32_PL3_SSP, ssp);
>> +	} else {
>> +		struct cet_user_state *p;
>> +
>> +		/*
>> +		 * When !fpregs_state_valid() and get_xsave_addr() returns
>> +		 * null, XFEATURE_CET_USER is in init state.  Shadow stack
>> +		 * pointer is null in this case, so return zero.
>> +		 */
>> +		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
>> +		if (p)
>> +			ssp = p->user_ssp;
>> +	}
>> +
>> +	fpregs_unlock();
> 
> Why are we even calling into this code if shadow stacks might be
> disabled?  Seems like we should have just errored out long before
> getting here.
> 

That is true.  When this function is called, shadow stack is enabled.
If get_xsave_addr() returns null, it is possible that the xstate is corrupted.
Maybe I can update the comments to explain it?

Thanks,
Yu-cheng

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 04/32] x86/cpufeatures: Introduce CPU setup and option parsing for CET
  2021-07-22 20:51 ` [PATCH v28 04/32] x86/cpufeatures: Introduce CPU setup and option parsing for CET Yu-cheng Yu
@ 2021-08-09 16:06   ` Borislav Petkov
  2021-08-10 15:39     ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-09 16:06 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe

On Thu, Jul 22, 2021 at 01:51:51PM -0700, Yu-cheng Yu wrote:
>  /*
>   * Some CPU features depend on higher CPUID levels, which may not always
>   * be available due to CPUID level capping or broken virtualization
> @@ -1249,6 +1257,11 @@ static void __init cpu_parse_early_param(void)
>  	if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
>  		setup_clear_cpu_cap(X86_FEATURE_XSAVES);
>  
> +	if (cmdline_find_option_bool(boot_command_line, "no_user_shstk"))
> +		setup_clear_cpu_cap(X86_FEATURE_SHSTK);
> +	if (cmdline_find_option_bool(boot_command_line, "no_user_ibt"))
> +		setup_clear_cpu_cap(X86_FEATURE_IBT);

Patch 1 says:

"Disabling shadow stack also disables IBT."

I don't see that here.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 05/32] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2021-07-22 20:51 ` [PATCH v28 05/32] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Yu-cheng Yu
@ 2021-08-09 16:46   ` Borislav Petkov
  2021-08-10 15:50     ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-09 16:46 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe

On Thu, Jul 22, 2021 at 01:51:52PM -0700, Yu-cheng Yu wrote:
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index a7c413432b33..b529f42ddaae 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -939,4 +939,23 @@
>  #define MSR_VM_IGNNE                    0xc0010115
>  #define MSR_VM_HSAVE_PA                 0xc0010117
>  
> +/* Control-flow Enforcement Technology MSRs */
> +#define MSR_IA32_U_CET		0x000006a0 /* user mode cet setting */
> +#define MSR_IA32_S_CET		0x000006a2 /* kernel mode cet setting */
> +#define CET_SHSTK_EN		BIT_ULL(0)
> +#define CET_WRSS_EN		BIT_ULL(1)
> +#define CET_ENDBR_EN		BIT_ULL(2)
> +#define CET_LEG_IW_EN		BIT_ULL(3)
> +#define CET_NO_TRACK_EN		BIT_ULL(4)
> +#define CET_SUPPRESS_DISABLE	BIT_ULL(5)
> +#define CET_RESERVED		(BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
> +#define CET_SUPPRESS		BIT_ULL(10)
> +#define CET_WAIT_ENDBR		BIT_ULL(11)
> +
> +#define MSR_IA32_PL0_SSP	0x000006a4 /* kernel shadow stack pointer */
> +#define MSR_IA32_PL1_SSP	0x000006a5 /* ring-1 shadow stack pointer */
> +#define MSR_IA32_PL2_SSP	0x000006a6 /* ring-2 shadow stack pointer */
> +#define MSR_IA32_PL3_SSP	0x000006a7 /* user shadow stack pointer */
> +#define MSR_IA32_INT_SSP_TAB	0x000006a8 /* exception shadow stack table */
> +
>  #endif /* _ASM_X86_MSR_INDEX_H */

Merge the following hunk on top of yours pls:

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b529f42ddaae..14ce136bcfa8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -362,6 +362,26 @@
 
 
 #define MSR_CORE_PERF_LIMIT_REASONS	0x00000690
+
+/* Control-flow Enforcement Technology MSRs */
+#define MSR_IA32_U_CET			0x000006a0 /* user mode cet setting */
+#define MSR_IA32_S_CET			0x000006a2 /* kernel mode cet setting */
+#define CET_SHSTK_EN			BIT_ULL(0)
+#define CET_WRSS_EN			BIT_ULL(1)
+#define CET_ENDBR_EN			BIT_ULL(2)
+#define CET_LEG_IW_EN			BIT_ULL(3)
+#define CET_NO_TRACK_EN			BIT_ULL(4)
+#define CET_SUPPRESS_DISABLE		BIT_ULL(5)
+#define CET_RESERVED			(BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
+#define CET_SUPPRESS			BIT_ULL(10)
+#define CET_WAIT_ENDBR			BIT_ULL(11)
+
+#define MSR_IA32_PL0_SSP		0x000006a4 /* kernel shadow stack pointer */
+#define MSR_IA32_PL1_SSP		0x000006a5 /* ring-1 shadow stack pointer */
+#define MSR_IA32_PL2_SSP		0x000006a6 /* ring-2 shadow stack pointer */
+#define MSR_IA32_PL3_SSP		0x000006a7 /* user shadow stack pointer */
+#define MSR_IA32_INT_SSP_TAB		0x000006a8 /* exception shadow stack table */
+
 #define MSR_GFX_PERF_LIMIT_REASONS	0x000006B0
 #define MSR_RING_PERF_LIMIT_REASONS	0x000006B1
 
@@ -939,23 +959,4 @@
 #define MSR_VM_IGNNE                    0xc0010115
 #define MSR_VM_HSAVE_PA                 0xc0010117
 
-/* Control-flow Enforcement Technology MSRs */
-#define MSR_IA32_U_CET		0x000006a0 /* user mode cet setting */
-#define MSR_IA32_S_CET		0x000006a2 /* kernel mode cet setting */
-#define CET_SHSTK_EN		BIT_ULL(0)
-#define CET_WRSS_EN		BIT_ULL(1)
-#define CET_ENDBR_EN		BIT_ULL(2)
-#define CET_LEG_IW_EN		BIT_ULL(3)
-#define CET_NO_TRACK_EN		BIT_ULL(4)
-#define CET_SUPPRESS_DISABLE	BIT_ULL(5)
-#define CET_RESERVED		(BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
-#define CET_SUPPRESS		BIT_ULL(10)
-#define CET_WAIT_ENDBR		BIT_ULL(11)
-
-#define MSR_IA32_PL0_SSP	0x000006a4 /* kernel shadow stack pointer */
-#define MSR_IA32_PL1_SSP	0x000006a5 /* ring-1 shadow stack pointer */
-#define MSR_IA32_PL2_SSP	0x000006a6 /* ring-2 shadow stack pointer */
-#define MSR_IA32_PL3_SSP	0x000006a7 /* user shadow stack pointer */
-#define MSR_IA32_INT_SSP_TAB	0x000006a8 /* exception shadow stack table */
-
 #endif /* _ASM_X86_MSR_INDEX_H */
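
As a rough user-space illustration of how the bit definitions above combine,
the sketch below decodes a U_CET/S_CET-style value.  The bit names and
positions are copied from the hunk; BIT_ULL() is redefined locally, and
cet_has_reserved_bits() and cet_describe() are hypothetical helpers that
exist only in this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BIT_ULL(n)		(1ULL << (n))

/* Bit names and positions copied from the hunk above. */
#define CET_SHSTK_EN		BIT_ULL(0)
#define CET_WRSS_EN		BIT_ULL(1)
#define CET_ENDBR_EN		BIT_ULL(2)
#define CET_RESERVED		(BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))

/* Hypothetical helper: true if any reserved bit is set in the value. */
static int cet_has_reserved_bits(uint64_t val)
{
	return (val & CET_RESERVED) != 0;
}

/* Hypothetical helper: print which of the enable bits are set. */
static void cet_describe(uint64_t val)
{
	printf("shstk=%d wrss=%d endbr=%d\n",
	       !!(val & CET_SHSTK_EN), !!(val & CET_WRSS_EN),
	       !!(val & CET_ENDBR_EN));
}
```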


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 06/32] x86/cet: Add control-protection fault handler
  2021-07-22 20:51 ` [PATCH v28 06/32] x86/cet: Add control-protection fault handler Yu-cheng Yu
@ 2021-08-09 17:51   ` Borislav Petkov
  2021-08-10 16:06     ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-09 17:51 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Michael Kerrisk

On Thu, Jul 22, 2021 at 01:51:53PM -0700, Yu-cheng Yu wrote:
> +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +{
> +	struct task_struct *tsk;
> +
> +	if (!user_mode(regs)) {
> +		pr_emerg("PANIC: unexpected kernel control protection fault\n");

No need for that call...

> +		die("kernel control protection fault", regs, error_code);

... as this one can say "unexpected" in the string too.

> +		panic("Machine halted.");
> +	}
> +
> +	cond_local_irq_enable(regs);
> +
> +	if (!boot_cpu_has(X86_FEATURE_SHSTK))

cpu_feature_enabled()

> +		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
> +
> +	tsk = current;
> +	tsk->thread.error_code = error_code;
> +	tsk->thread.trap_nr = X86_TRAP_CP;
> +
> +	/*
> +	 * Ratelimit to prevent log spamming.
> +	 */
> +	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> +	    __ratelimit(&cpf_rate)) {
> +		unsigned long ssp;
> +		int cpf_type;
> +
> +		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
> +
> +		rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
> +			 tsk->comm, task_pid_nr(tsk),
> +			 regs->ip, regs->sp, ssp, error_code,
> +			 control_protection_err[cpf_type]);
> +		print_vma_addr(KERN_CONT " in ", regs->ip);
> +		pr_cont("\n");
> +	}
> +
> +	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> +	cond_local_irq_disable(regs);
> +}
> +#endif
> +
>  static bool do_int3(struct pt_regs *regs)
>  {
>  	int res;
> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> index 5a3c221f4c9d..a1a153ea3cc3 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -235,7 +235,8 @@ typedef struct siginfo {
>  #define SEGV_ADIPERR	7	/* Precise MCD exception */
>  #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
>  #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
> -#define NSIGSEGV	9
> +#define SEGV_CPERR	10	/* Control protection fault */
> +#define NSIGSEGV	10
>  
>  /*
>   * SIGBUS si_codes
> -- 

Was there a manpage patch for the user-visible bits?

I seem to remember something flying by very vaguely ...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 08/32] x86/mm: Move pmd_write(), pud_write() up in the file
  2021-07-22 20:51 ` [PATCH v28 08/32] x86/mm: Move pmd_write(), pud_write() up in the file Yu-cheng Yu
@ 2021-08-09 18:02   ` Borislav Petkov
  0 siblings, 0 replies; 62+ messages in thread
From: Borislav Petkov @ 2021-08-09 18:02 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On Thu, Jul 22, 2021 at 01:51:55PM -0700, Yu-cheng Yu wrote:
> To prepare for the introduction of _PAGE_COW, move pmd_write() and
> pud_write() up in the file, so that they can be used by other
> helpers below.

Add

"No functional changes."

here.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 04/32] x86/cpufeatures: Introduce CPU setup and option parsing for CET
  2021-08-09 16:06   ` Borislav Petkov
@ 2021-08-10 15:39     ` Yu, Yu-cheng
  2021-08-10 16:51       ` Borislav Petkov
  0 siblings, 1 reply; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-08-10 15:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe

On 8/9/2021 9:06 AM, Borislav Petkov wrote:
> On Thu, Jul 22, 2021 at 01:51:51PM -0700, Yu-cheng Yu wrote:
>>   /*
>>    * Some CPU features depend on higher CPUID levels, which may not always
>>    * be available due to CPUID level capping or broken virtualization
>> @@ -1249,6 +1257,11 @@ static void __init cpu_parse_early_param(void)
>>   	if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
>>   		setup_clear_cpu_cap(X86_FEATURE_XSAVES);
>>   
>> +	if (cmdline_find_option_bool(boot_command_line, "no_user_shstk"))
>> +		setup_clear_cpu_cap(X86_FEATURE_SHSTK);
>> +	if (cmdline_find_option_bool(boot_command_line, "no_user_ibt"))
>> +		setup_clear_cpu_cap(X86_FEATURE_IBT);
> 
> Patch 1 says:
> 
> "Disabling shadow stack also disables IBT."
> 
> I don't see that here.
> 

We have X86_FEATURE_IBT dependent on X86_FEATURE_SHSTK (patch #3).

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 05/32] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2021-08-09 16:46   ` Borislav Petkov
@ 2021-08-10 15:50     ` Yu, Yu-cheng
  0 siblings, 0 replies; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-08-10 15:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe

On 8/9/2021 9:46 AM, Borislav Petkov wrote:
> On Thu, Jul 22, 2021 at 01:51:52PM -0700, Yu-cheng Yu wrote:
>> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
>> index a7c413432b33..b529f42ddaae 100644
>> --- a/arch/x86/include/asm/msr-index.h
>> +++ b/arch/x86/include/asm/msr-index.h
>> @@ -939,4 +939,23 @@
>>   #define MSR_VM_IGNNE                    0xc0010115
>>   #define MSR_VM_HSAVE_PA                 0xc0010117
>>   
>> +/* Control-flow Enforcement Technology MSRs */
>> +#define MSR_IA32_U_CET		0x000006a0 /* user mode cet setting */
>> +#define MSR_IA32_S_CET		0x000006a2 /* kernel mode cet setting */
>> +#define CET_SHSTK_EN		BIT_ULL(0)
>> +#define CET_WRSS_EN		BIT_ULL(1)
>> +#define CET_ENDBR_EN		BIT_ULL(2)
>> +#define CET_LEG_IW_EN		BIT_ULL(3)
>> +#define CET_NO_TRACK_EN		BIT_ULL(4)
>> +#define CET_SUPPRESS_DISABLE	BIT_ULL(5)
>> +#define CET_RESERVED		(BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
>> +#define CET_SUPPRESS		BIT_ULL(10)
>> +#define CET_WAIT_ENDBR		BIT_ULL(11)
>> +
>> +#define MSR_IA32_PL0_SSP	0x000006a4 /* kernel shadow stack pointer */
>> +#define MSR_IA32_PL1_SSP	0x000006a5 /* ring-1 shadow stack pointer */
>> +#define MSR_IA32_PL2_SSP	0x000006a6 /* ring-2 shadow stack pointer */
>> +#define MSR_IA32_PL3_SSP	0x000006a7 /* user shadow stack pointer */
>> +#define MSR_IA32_INT_SSP_TAB	0x000006a8 /* exception shadow stack table */
>> +
>>   #endif /* _ASM_X86_MSR_INDEX_H */
> 
> Merge the following hunk on top of yours pls:
> 

I will do that.

Yu-cheng

> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index b529f42ddaae..14ce136bcfa8 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -362,6 +362,26 @@
>   
>   
>   #define MSR_CORE_PERF_LIMIT_REASONS	0x00000690
> +
> +/* Control-flow Enforcement Technology MSRs */
> +#define MSR_IA32_U_CET			0x000006a0 /* user mode cet setting */
> +#define MSR_IA32_S_CET			0x000006a2 /* kernel mode cet setting */
> +#define CET_SHSTK_EN			BIT_ULL(0)
> +#define CET_WRSS_EN			BIT_ULL(1)
> +#define CET_ENDBR_EN			BIT_ULL(2)
> +#define CET_LEG_IW_EN			BIT_ULL(3)
> +#define CET_NO_TRACK_EN			BIT_ULL(4)
> +#define CET_SUPPRESS_DISABLE		BIT_ULL(5)
> +#define CET_RESERVED			(BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
> +#define CET_SUPPRESS			BIT_ULL(10)
> +#define CET_WAIT_ENDBR			BIT_ULL(11)
> +
> +#define MSR_IA32_PL0_SSP		0x000006a4 /* kernel shadow stack pointer */
> +#define MSR_IA32_PL1_SSP		0x000006a5 /* ring-1 shadow stack pointer */
> +#define MSR_IA32_PL2_SSP		0x000006a6 /* ring-2 shadow stack pointer */
> +#define MSR_IA32_PL3_SSP		0x000006a7 /* user shadow stack pointer */
> +#define MSR_IA32_INT_SSP_TAB		0x000006a8 /* exception shadow stack table */
> +
>   #define MSR_GFX_PERF_LIMIT_REASONS	0x000006B0
>   #define MSR_RING_PERF_LIMIT_REASONS	0x000006B1
>   
> @@ -939,23 +959,4 @@
>   #define MSR_VM_IGNNE                    0xc0010115
>   #define MSR_VM_HSAVE_PA                 0xc0010117
>   
> -/* Control-flow Enforcement Technology MSRs */
> -#define MSR_IA32_U_CET		0x000006a0 /* user mode cet setting */
> -#define MSR_IA32_S_CET		0x000006a2 /* kernel mode cet setting */
> -#define CET_SHSTK_EN		BIT_ULL(0)
> -#define CET_WRSS_EN		BIT_ULL(1)
> -#define CET_ENDBR_EN		BIT_ULL(2)
> -#define CET_LEG_IW_EN		BIT_ULL(3)
> -#define CET_NO_TRACK_EN		BIT_ULL(4)
> -#define CET_SUPPRESS_DISABLE	BIT_ULL(5)
> -#define CET_RESERVED		(BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
> -#define CET_SUPPRESS		BIT_ULL(10)
> -#define CET_WAIT_ENDBR		BIT_ULL(11)
> -
> -#define MSR_IA32_PL0_SSP	0x000006a4 /* kernel shadow stack pointer */
> -#define MSR_IA32_PL1_SSP	0x000006a5 /* ring-1 shadow stack pointer */
> -#define MSR_IA32_PL2_SSP	0x000006a6 /* ring-2 shadow stack pointer */
> -#define MSR_IA32_PL3_SSP	0x000006a7 /* user shadow stack pointer */
> -#define MSR_IA32_INT_SSP_TAB	0x000006a8 /* exception shadow stack table */
> -
>   #endif /* _ASM_X86_MSR_INDEX_H */
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 06/32] x86/cet: Add control-protection fault handler
  2021-08-09 17:51   ` Borislav Petkov
@ 2021-08-10 16:06     ` Yu, Yu-cheng
  0 siblings, 0 replies; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-08-10 16:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe, Michael Kerrisk

On 8/9/2021 10:51 AM, Borislav Petkov wrote:
> On Thu, Jul 22, 2021 at 01:51:53PM -0700, Yu-cheng Yu wrote:
[...]
>> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
>> index 5a3c221f4c9d..a1a153ea3cc3 100644
>> --- a/include/uapi/asm-generic/siginfo.h
>> +++ b/include/uapi/asm-generic/siginfo.h
>> @@ -235,7 +235,8 @@ typedef struct siginfo {
>>   #define SEGV_ADIPERR	7	/* Precise MCD exception */
>>   #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
>>   #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
>> -#define NSIGSEGV	9
>> +#define SEGV_CPERR	10	/* Control protection fault */
>> +#define NSIGSEGV	10
>>   
>>   /*
>>    * SIGBUS si_codes
>> -- 
> 
> Was there a manpage patch for the user-visible bits?
> 
> I seem to remember something flying by very vaguely ...
> 

Yes, man page patches:

https://lore.kernel.org/linux-man/20210226172634.26905-1-yu-cheng.yu@intel.com/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 04/32] x86/cpufeatures: Introduce CPU setup and option parsing for CET
  2021-08-10 15:39     ` Yu, Yu-cheng
@ 2021-08-10 16:51       ` Borislav Petkov
  0 siblings, 0 replies; 62+ messages in thread
From: Borislav Petkov @ 2021-08-10 16:51 UTC (permalink / raw)
  To: Yu, Yu-cheng
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe

On Tue, Aug 10, 2021 at 08:39:00AM -0700, Yu, Yu-cheng wrote:
> We have X86_FEATURE_IBT dependent on X86_FEATURE_SHSTK (patch #3).

Ah, do_clear_cpu_cap() will handle the deps, missed that.
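
A toy user-space model of that dependency handling, for illustration only:
clearing one feature also clears whatever depends on it.  The enum values,
the deps[] table and clear_feature_with_deps() are hypothetical; the only
dependency modelled is the IBT-on-SHSTK one mentioned above, and the real
kernel code is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature numbers, valid for this model only. */
enum { FEAT_SHSTK, FEAT_IBT, FEAT_COUNT };

struct dep {
	int feature;	/* the dependent feature */
	int depends;	/* what it depends on */
};

/* IBT depends on SHSTK, as stated in the reply above. */
static const struct dep deps[] = {
	{ FEAT_IBT, FEAT_SHSTK },
};

static bool enabled[FEAT_COUNT] = { true, true };

/* Clear a feature, then keep clearing anything that depended on it. */
static void clear_feature_with_deps(int feature)
{
	bool changed;

	enabled[feature] = false;

	/* Iterate to a fixed point so chains of dependencies are handled. */
	do {
		changed = false;
		for (size_t i = 0; i < sizeof(deps) / sizeof(deps[0]); i++) {
			if (!enabled[deps[i].depends] && enabled[deps[i].feature]) {
				enabled[deps[i].feature] = false;
				changed = true;
			}
		}
	} while (changed);
}
```

In this model, clearing FEAT_IBT leaves FEAT_SHSTK alone, while clearing
FEAT_SHSTK takes FEAT_IBT down with it.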

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-07-22 20:51 ` [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW Yu-cheng Yu
@ 2021-08-16 10:43   ` Borislav Petkov
  2021-08-17 18:24     ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-16 10:43 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On Thu, Jul 22, 2021 at 01:51:56PM -0700, Yu-cheng Yu wrote:
> @@ -153,13 +178,23 @@ static inline int pud_young(pud_t pud)
>  
>  static inline int pte_write(pte_t pte)
>  {
> -	return pte_flags(pte) & _PAGE_RW;
> +	/*
> +	 * Shadow stack pages are always writable - but not by normal
> +	 * instructions, and only by shadow stack operations.  Therefore,
> +	 * the W=0,D=1 test with pte_shstk().
> +	 */
> +	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);

Well, this is weird: if some kernel code queries a shstk page and this
here function says it is writable but then goes and tries to write into
it and that write fails, then it'll confuse the user.

IOW, from where I'm standing, that should be:

	return (pte_flags(pte) & _PAGE_RW) && !pte_shstk(pte);

as in, a writable page is one which has _PAGE_RW and is *not* a
shadow stack page, because the latter is special and not really writable.

Hmmm?
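
To make the difference concrete, here is a small user-space model of the two
predicates.  It assumes the usual x86 bit positions (bit 1 for R/W, bit 6 for
Dirty) and ignores the CPU feature check; the model_*() names are
hypothetical and exist only in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 PTE bit positions: bit 1 is R/W, bit 6 is Dirty. */
#define PAGE_RW		(1ULL << 1)
#define PAGE_DIRTY	(1ULL << 6)

/* Model of pte_shstk(): W=0, D=1 marks a shadow stack page. */
static bool model_pte_shstk(uint64_t flags)
{
	return (flags & (PAGE_RW | PAGE_DIRTY)) == PAGE_DIRTY;
}

/* The version in the patch: shadow stack pages count as writable. */
static bool model_pte_write_patch(uint64_t flags)
{
	return (flags & PAGE_RW) || model_pte_shstk(flags);
}

/* The alternative suggested in this review. */
static bool model_pte_write_review(uint64_t flags)
{
	return (flags & PAGE_RW) && !model_pte_shstk(flags);
}
```

In this model the two versions disagree only for the W=0,D=1 encoding.  Since
a PTE with _PAGE_RW set can never match the W=0,D=1 test, the second version
returns the same result as the plain _PAGE_RW check.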

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 12/32] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2021-07-22 20:51 ` [PATCH v28 12/32] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Yu-cheng Yu
@ 2021-08-16 16:01   ` Borislav Petkov
  2021-08-17 18:33     ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-16 16:01 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On Thu, Jul 22, 2021 at 01:51:59PM -0700, Yu-cheng Yu wrote:
> When Shadow Stack is introduced, [R/O + _PAGE_DIRTY] PTE is reserved for
> shadow stack.  Copy-on-write PTEs have [R/O + _PAGE_COW].
> 
> When a PTE goes from [R/W + _PAGE_DIRTY] to [R/O + _PAGE_COW], it could
> become a transient shadow stack PTE in two cases:
> 
> The first case is that some processors can start a write but end up seeing
> a read-only PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE.  However, this will not occur on processors supporting
> Shadow Stack, and a TLB flush is not necessary.
> 
> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
> atomically, a transient shadow stack PTE can be created as a result.
> Thus, prevent that with cmpxchg.
> 
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue.  Jann Horn provided the cmpxchg solution.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index cf7316e968df..df4ce715560a 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1278,6 +1278,24 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pte_t *ptep)
>  {
> +	/*
> +	 * If Shadow Stack is enabled, pte_wrprotect() moves _PAGE_DIRTY
> +	 * to _PAGE_COW (see comments at pte_wrprotect()).
> +	 * When a thread reads a RW=1, Dirty=0 PTE and before changing it
> +	 * to RW=0, Dirty=0, another thread could have written to the page
> +	 * and the PTE is RW=1, Dirty=1 now.  Use try_cmpxchg() to detect
> +	 * PTE changes and update old_pte, then try again.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pte_t old_pte, new_pte;
> +
> +		old_pte = READ_ONCE(*ptep);
> +		do {
> +			new_pte = pte_wrprotect(old_pte);
> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> +
> +		return;
> +	}
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>  }
>  
> @@ -1322,6 +1340,24 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pmd_t *pmdp)
>  {
> +	/*
> +	 * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
> +	 * to _PAGE_COW (see comments at pmd_wrprotect()).
> +	 * When a thread reads a RW=1, Dirty=0 PMD and before changing it
> +	 * to RW=0, Dirty=0, another thread could have written to the page
> +	 * and the PMD is RW=1, Dirty=1 now.  Use try_cmpxchg() to detect
> +	 * PMD changes and update old_pmd, then try again.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pmd_t old_pmd, new_pmd;
> +
> +		old_pmd = READ_ONCE(*pmdp);
> +		do {
> +			new_pmd = pmd_wrprotect(old_pmd);
> +		} while (!try_cmpxchg((pmdval_t *)pmdp, (pmdval_t *)&old_pmd, pmd_val(new_pmd)));

Why is that try_cmpxchg() call casting its operands instead of accessing
the members directly, like the pte one above?

I.e., why aren't you doing here the same thing as above:

		...
		} while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));

?
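
For readers less familiar with the retry loop, here is a user-space sketch of
the same idea using C11 atomics in place of try_cmpxchg().  The PAGE_COW bit
position is an assumption made for this model, and model_wrprotect() and
model_set_wrprotect() are hypothetical names, not the kernel helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PAGE_RW		(1ULL << 1)
#define PAGE_DIRTY	(1ULL << 6)
#define PAGE_COW	(1ULL << 58)	/* assumed software bit, model only */

/* Model of wrprotect with shadow stack: clear RW, move Dirty to COW. */
static uint64_t model_wrprotect(uint64_t pte)
{
	pte &= ~PAGE_RW;
	if (pte & PAGE_DIRTY) {
		pte &= ~PAGE_DIRTY;
		pte |= PAGE_COW;
	}
	return pte;
}

/*
 * Retry until the value the new entry was computed from is still the
 * value in memory.  On failure, atomic_compare_exchange_weak() stores
 * the current value into 'oldval', so the next iteration recomputes
 * from fresh data.
 */
static void model_set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t oldval = atomic_load(ptep);
	uint64_t newval;

	do {
		newval = model_wrprotect(oldval);
	} while (!atomic_compare_exchange_weak(ptep, &oldval, newval));
}
```

Because the whole entry is swapped in one step, no intermediate RW=0,Dirty=1
value is ever visible in memory.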

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 14/32] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2021-07-22 20:52 ` [PATCH v28 14/32] mm: Introduce VM_SHADOW_STACK for shadow stack memory Yu-cheng Yu
@ 2021-08-16 16:35   ` Borislav Petkov
  2021-08-17 18:35     ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-16 16:35 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On Thu, Jul 22, 2021 at 01:52:01PM -0700, Yu-cheng Yu wrote:
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index eb97468dfe4c..02c70198b989 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -662,6 +662,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
>  		[ilog2(VM_UFFD_MINOR)]	= "ui",
>  #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> +#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
> +		[ilog2(VM_SHADOW_STACK)]= "ss",
> +#endif


ERROR: spaces required around that '=' (ctx:VxW)
#109: FILE: fs/proc/task_mmu.c:666:
+		[ilog2(VM_SHADOW_STACK)]= "ss",
 		                        ^

total: 1 errors, 0 warnings, 49 lines checked

Please integrate scripts/checkpatch.pl into your patch creation
workflow. Some of the warnings/errors *actually* make sense.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 16/32] x86/mm: Update maybe_mkwrite() for shadow stack
  2021-07-22 20:52 ` [PATCH v28 16/32] x86/mm: Update maybe_mkwrite() for shadow stack Yu-cheng Yu
@ 2021-08-16 17:03   ` Borislav Petkov
  2021-08-17 18:36     ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-16 17:03 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On Thu, Jul 22, 2021 at 01:52:03PM -0700, Yu-cheng Yu wrote:
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 3364fe62b903..ba449d12ec32 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -610,6 +610,26 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
>  }
>  #endif
>  
> +pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> +	if (likely(vma->vm_flags & VM_WRITE))
> +		pte = pte_mkwrite(pte);
> +	else if (likely(vma->vm_flags & VM_SHADOW_STACK))
> +		pte = pte_mkwrite_shstk(pte);
> +	return pte;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> +{
> +	if (likely(vma->vm_flags & VM_WRITE))
> +		pmd = pmd_mkwrite(pmd);
> +	else if (likely(vma->vm_flags & VM_SHADOW_STACK))
> +		pmd = pmd_mkwrite_shstk(pmd);

What are all those likely()ies here for?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-08-16 10:43   ` Borislav Petkov
@ 2021-08-17 18:24     ` Yu, Yu-cheng
  2021-08-17 19:54       ` Borislav Petkov
  0 siblings, 1 reply; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-08-17 18:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On 8/16/2021 3:43 AM, Borislav Petkov wrote:
> On Thu, Jul 22, 2021 at 01:51:56PM -0700, Yu-cheng Yu wrote:
>> @@ -153,13 +178,23 @@ static inline int pud_young(pud_t pud)
>>   
>>   static inline int pte_write(pte_t pte)
>>   {
>> -	return pte_flags(pte) & _PAGE_RW;
>> +	/*
>> +	 * Shadow stack pages are always writable - but not by normal
>> +	 * instructions, and only by shadow stack operations.  Therefore,
>> +	 * the W=0,D=1 test with pte_shstk().
>> +	 */
>> +	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
> 
> Well, this is weird: if some kernel code queries a shstk page and this
> here function says it is writable but then goes and tries to write into
> it and that write fails, then it'll confuse the user.
> 
> IOW, from where I'm standing, that should be:
> 
> 	return (pte_flags(pte) & _PAGE_RW) && !pte_shstk(pte);
> 
> as in, a writable page is one which has _PAGE_RW and it is *not* a
> shadow stack page because the latter is special and not really writable.
> 
> Hmmm?
> 

Indeed, this can be looked at in a few ways.  We can visualize
pte_write() as 'CPU can write to it with MOV' or 'CPU can write to it
with any opcode'.  Depending on which of the two pte_write() means,
copy-on-write code can be adjusted accordingly.

Yu-cheng

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 12/32] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2021-08-16 16:01   ` Borislav Petkov
@ 2021-08-17 18:33     ` Yu, Yu-cheng
  0 siblings, 0 replies; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-08-17 18:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On 8/16/2021 9:01 AM, Borislav Petkov wrote:
> On Thu, Jul 22, 2021 at 01:51:59PM -0700, Yu-cheng Yu wrote:
>> When Shadow Stack is introduced, [R/O + _PAGE_DIRTY] PTE is reserved for
>> shadow stack.  Copy-on-write PTEs have [R/O + _PAGE_COW].
>>
>> When a PTE goes from [R/W + _PAGE_DIRTY] to [R/O + _PAGE_COW], it could
>> become a transient shadow stack PTE in two cases:
>>
>> The first case is that some processors can start a write but end up seeing
>> a read-only PTE by the time they get to the Dirty bit, creating a transient
>> shadow stack PTE.  However, this will not occur on processors supporting
>> Shadow Stack, and a TLB flush is not necessary.
>>
>> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
>> atomically, a transient shadow stack PTE can be created as a result.
>> Thus, prevent that with cmpxchg.
>>
>> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
>> insights to the issue.  Jann Horn provided the cmpxchg solution.
>>
>> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
>> Reviewed-by: Kees Cook <keescook@chromium.org>
>> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---
>>   arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 36 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index cf7316e968df..df4ce715560a 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1278,6 +1278,24 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>>   static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>   				      unsigned long addr, pte_t *ptep)
>>   {
>> +	/*
>> +	 * If Shadow Stack is enabled, pte_wrprotect() moves _PAGE_DIRTY
>> +	 * to _PAGE_COW (see comments at pte_wrprotect()).
>> +	 * When a thread reads a RW=1, Dirty=0 PTE and before changing it
>> +	 * to RW=0, Dirty=0, another thread could have written to the page
>> +	 * and the PTE is RW=1, Dirty=1 now.  Use try_cmpxchg() to detect
>> +	 * PTE changes and update old_pte, then try again.
>> +	 */
>> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
>> +		pte_t old_pte, new_pte;
>> +
>> +		old_pte = READ_ONCE(*ptep);
>> +		do {
>> +			new_pte = pte_wrprotect(old_pte);
>> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
>> +
>> +		return;
>> +	}
>>   	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>>   }
>>   
>> @@ -1322,6 +1340,24 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
>>   static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>>   				      unsigned long addr, pmd_t *pmdp)
>>   {
>> +	/*
>> +	 * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
>> +	 * to _PAGE_COW (see comments at pmd_wrprotect()).
>> +	 * When a thread reads a RW=1, Dirty=0 PMD and before changing it
>> +	 * to RW=0, Dirty=0, another thread could have written to the page
>> +	 * and the PMD is RW=1, Dirty=1 now.  Use try_cmpxchg() to detect
>> +	 * PMD changes and update old_pmd, then try again.
>> +	 */
>> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
>> +		pmd_t old_pmd, new_pmd;
>> +
>> +		old_pmd = READ_ONCE(*pmdp);
>> +		do {
>> +			new_pmd = pmd_wrprotect(old_pmd);
>> +		} while (!try_cmpxchg((pmdval_t *)pmdp, (pmdval_t *)&old_pmd, pmd_val(new_pmd)));
> 
> Why is that try_cmpxchg() call doing casting to its operands instead of
> like the pte one above?
> 
> I.e., why aren't you doing here the same thing as above:
> 
> 		...
> 		} while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
> 
> ?

If !(CONFIG_PGTABLE_LEVELS > 2), we don't have pmd_t.pmd.
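
A user-space sketch of why the cast is needed in that case, assuming a
folded layout in which pmd_t only wraps the upper level.  The types, the
PAGE_COW bit position, make_pmd() and try_cmpxchg64() are stand-ins
written for this illustration (the last one uses the GCC/Clang
__atomic builtins), and pmd_wrprotect() is reduced to the Dirty-to-COW
move:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pmdval_t;

/*
 * Folded layout (assumed for illustration): pmd_t wraps the upper level,
 * so its only member is .pud and there is no .pmd to take the address of.
 */
typedef struct { pmdval_t pud; } pud_t;
typedef struct { pud_t pud; } pmd_t;

#define PAGE_RW    (UINT64_C(1) << 1)
#define PAGE_DIRTY (UINT64_C(1) << 6)
#define PAGE_COW   (UINT64_C(1) << 58)	/* software bit; position assumed */

static pmdval_t pmd_val(pmd_t pmd) { return pmd.pud.pud; }
static pmd_t make_pmd(pmdval_t v) { pmd_t p = { { v } }; return p; }

/* Simplified pmd_wrprotect(): clear Write, move Dirty to COW. */
static pmd_t pmd_wrprotect(pmd_t pmd)
{
	pmdval_t v = pmd_val(pmd) & ~PAGE_RW;

	if (v & PAGE_DIRTY)
		v = (v & ~PAGE_DIRTY) | PAGE_COW;
	return make_pmd(v);
}

/* Stand-in for try_cmpxchg(): on failure, *old is refreshed from *ptr. */
static bool try_cmpxchg64(pmdval_t *ptr, pmdval_t *old, pmdval_t new_val)
{
	return __atomic_compare_exchange_n(ptr, old, new_val, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

static void pmdp_set_wrprotect(pmd_t *pmdp)
{
	pmd_t old_pmd, new_pmd;

	assert(pmdp);
	/* Relaxed load in place of READ_ONCE(). */
	old_pmd = make_pmd(__atomic_load_n((pmdval_t *)pmdp, __ATOMIC_RELAXED));
	do {
		/* Recompute from the latest value seen, then retry. */
		new_pmd = pmd_wrprotect(old_pmd);
	} while (!try_cmpxchg64((pmdval_t *)pmdp, (pmdval_t *)&old_pmd,
				pmd_val(new_pmd)));
}
```

The casts work for either layout because the value is the first (and
only) scalar inside the struct, whereas &pmdp->pmd compiles only when
the member exists.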

> 
> Thx.
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 14/32] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2021-08-16 16:35   ` Borislav Petkov
@ 2021-08-17 18:35     ` Yu, Yu-cheng
  0 siblings, 0 replies; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-08-17 18:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On 8/16/2021 9:35 AM, Borislav Petkov wrote:
> On Thu, Jul 22, 2021 at 01:52:01PM -0700, Yu-cheng Yu wrote:
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index eb97468dfe4c..02c70198b989 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -662,6 +662,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>>   #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
>>   		[ilog2(VM_UFFD_MINOR)]	= "ui",
>>   #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
>> +#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
>> +		[ilog2(VM_SHADOW_STACK)]= "ss",
>> +#endif
> 
> 
> ERROR: spaces required around that '=' (ctx:VxW)
> #109: FILE: fs/proc/task_mmu.c:666:
> +		[ilog2(VM_SHADOW_STACK)]= "ss",
>   		                        ^
> 
> total: 1 errors, 0 warnings, 49 lines checked
> 
> Please integrate scripts/checkpatch.pl into your patch creation
> workflow. Some of the warnings/errors *actually* make sense.
> 

I will add a space there.
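
For reference, a small user-space sketch of such a flag-to-mnemonic
table with the spacing checkpatch asks for.  The bit numbers, the table
size and format_vma_flags() are assumptions made for this example, not
the code in fs/proc/task_mmu.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VM_READ_BIT		0
#define VM_WRITE_BIT		1
#define VM_SHADOW_STACK_BIT	37	/* assumed for illustration */
#define NR_FLAG_BITS		64

/* Designated initializers, with spaces on both sides of '='. */
static const char mnemonics[NR_FLAG_BITS][3] = {
	[VM_READ_BIT]		= "rd",
	[VM_WRITE_BIT]		= "wr",
	[VM_SHADOW_STACK_BIT]	= "ss",
};

/* Append "xx " for every set bit that has a mnemonic; returns buf. */
static char *format_vma_flags(uint64_t flags, char *buf, size_t len)
{
	size_t pos = 0;

	for (int i = 0; i < NR_FLAG_BITS; i++) {
		if (!(flags & (UINT64_C(1) << i)) || !mnemonics[i][0])
			continue;
		if (pos + 4 > len)	/* two letters, a space and the NUL */
			break;
		buf[pos++] = mnemonics[i][0];
		buf[pos++] = mnemonics[i][1];
		buf[pos++] = ' ';
	}
	if (len)
		buf[pos] = '\0';
	return buf;
}
```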

Thanks,
Yu-cheng

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 16/32] x86/mm: Update maybe_mkwrite() for shadow stack
  2021-08-16 17:03   ` Borislav Petkov
@ 2021-08-17 18:36     ` Yu, Yu-cheng
  0 siblings, 0 replies; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-08-17 18:36 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On 8/16/2021 10:03 AM, Borislav Petkov wrote:
> On Thu, Jul 22, 2021 at 01:52:03PM -0700, Yu-cheng Yu wrote:
>> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>> index 3364fe62b903..ba449d12ec32 100644
>> --- a/arch/x86/mm/pgtable.c
>> +++ b/arch/x86/mm/pgtable.c
>> @@ -610,6 +610,26 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
>>   }
>>   #endif
>>   
>> +pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>> +{
>> +	if (likely(vma->vm_flags & VM_WRITE))
>> +		pte = pte_mkwrite(pte);
>> +	else if (likely(vma->vm_flags & VM_SHADOW_STACK))
>> +		pte = pte_mkwrite_shstk(pte);
>> +	return pte;
>> +}
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>> +{
>> +	if (likely(vma->vm_flags & VM_WRITE))
>> +		pmd = pmd_mkwrite(pmd);
>> +	else if (likely(vma->vm_flags & VM_SHADOW_STACK))
>> +		pmd = pmd_mkwrite_shstk(pmd);
> 
> What are all those likely()ies here for?
> 

I will remove those.
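
A user-space sketch of how the helper might read without the
annotations.  likely() is only a branch-prediction hint
(__builtin_expect()), so dropping it does not change behaviour.  The
flag and bit values and the simplified pte_mkwrite*() helpers below are
assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define VM_WRITE	(UINT64_C(1) << 1)
#define VM_SHADOW_STACK	(UINT64_C(1) << 37)	/* position assumed */

#define PAGE_RW		(UINT64_C(1) << 1)
#define PAGE_DIRTY	(UINT64_C(1) << 6)

struct vm_area_struct { uint64_t vm_flags; };
typedef struct { uint64_t pte; } pte_t;

/* Simplified: a normal writable PTE has Write=1. */
static pte_t pte_mkwrite(pte_t pte)
{
	pte.pte |= PAGE_RW;
	return pte;
}

/* Simplified: a shadow stack PTE is Write=0, Dirty=1. */
static pte_t pte_mkwrite_shstk(pte_t pte)
{
	pte.pte &= ~PAGE_RW;
	pte.pte |= PAGE_DIRTY;
	return pte;
}

static pte_t maybe_mkwrite(pte_t pte, const struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_WRITE)
		pte = pte_mkwrite(pte);
	else if (vma->vm_flags & VM_SHADOW_STACK)
		pte = pte_mkwrite_shstk(pte);
	return pte;
}
```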

Thanks,
Yu-cheng

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-08-17 18:24     ` Yu, Yu-cheng
@ 2021-08-17 19:54       ` Borislav Petkov
  2021-08-17 20:13         ` Andy Lutomirski
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-17 19:54 UTC (permalink / raw)
  To: Yu, Yu-cheng
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On Tue, Aug 17, 2021 at 11:24:29AM -0700, Yu, Yu-cheng wrote:
> Indeed, this can be looked at in a few ways.  We can visualize pte_write()
> as 'CPU can write to it with MOV' or 'CPU can write to it with any opcodes'.
> Depending on whatever pte_write() is, copy-on-write code can be adjusted
> accordingly.

Can be?

I think you should exclude shadow stack pages from being writable
and treat them as read-only. How the CPU writes them is immaterial -
pte/pmd_write() is used by normal kernel code to query whether the page
is writable or not by any instruction - not by the CPU.

And since normal kernel code cannot write shadow stack pages, then for
that code those pages are read-only.

If special kernel code using shadow stack management insns needs
to modify a shadow stack, then it can check whether a page is
pte/pmd_shstk() but that code is special anyway.

Hell, a shadow stack page is (Write=0, Dirty=1) so calling it writable
			      ^^^^^^^
is simply wrong.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-08-17 19:54       ` Borislav Petkov
@ 2021-08-17 20:13         ` Andy Lutomirski
  2021-08-17 20:24           ` Borislav Petkov
  0 siblings, 1 reply; 62+ messages in thread
From: Andy Lutomirski @ 2021-08-17 20:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Yu, Yu-cheng, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov



> On Aug 17, 2021, at 12:53 PM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Tue, Aug 17, 2021 at 11:24:29AM -0700, Yu, Yu-cheng wrote:
>> Indeed, this can be looked at in a few ways.  We can visualize pte_write()
>> as 'CPU can write to it with MOV' or 'CPU can write to it with any opcodes'.
>> Depending on whatever pte_write() is, copy-on-write code can be adjusted
>> accordingly.
> 
> Can be?
> 
> I think you should exclude shadow stack pages from being writable
> and treat them as read-only. How the CPU writes them is immaterial -
> pte/pmd_write() is used by normal kernel code to query whether the page
> is writable or not by any instruction - not by the CPU.
> 
> And since normal kernel code cannot write shadow stack pages, then for
> that code those pages are read-only.
> 
> If special kernel code using shadow stack management insns needs
> to modify a shadow stack, then it can check whether a page is
> pte/pmd_shstk() but that code is special anyway.
> 
> Hell, a shadow stack page is (Write=0, Dirty=1) so calling it writable
>                  ^^^^^^^
> is simply wrong.

But it *is* writable using WRUSS, and it’s also writable by CALL, WRSS, etc.

Now if the mm code tries to write protect it and expects sensible
semantics, the results could be interesting. At the very least, someone
would need to validate that RET reading a read only shadow stack page
does the right thing.

> 
> Thx.
> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-08-17 20:13         ` Andy Lutomirski
@ 2021-08-17 20:24           ` Borislav Petkov
  2021-08-17 20:51             ` Andy Lutomirski
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-17 20:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Yu, Yu-cheng, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe, Kirill A . Shutemov

On Tue, Aug 17, 2021 at 01:13:09PM -0700, Andy Lutomirski wrote:
> > If special kernel code using shadow stack management insns needs
> > to modify a shadow stack, then it can check whether a page is
> > pte/pmd_shstk() but that code is special anyway.
> > 
> > Hell, a shadow stack page is (Write=0, Dirty=1) so calling it writable
> >                  ^^^^^^^
> > is simply wrong.
> 
> But it *is* writable using WRUSS, and it’s also writable by CALL,

Well, if we have to be precise, CALL doesn't write it directly - it
causes the shadow stack to be written as part of CALL's execution. Yeah
yeah, potato potato.

> WRSS, etc.

Thus the "special kernel code" thing above. I've left it in instead of
snipping it.

> Now if the mm code tries to write protect it and expects sensible
> semantics, the results could be interesting. At the very least,
> someone would need to validate that RET reading a read only shadow
> stack page does the right thing.

Huh?

A shadow stack page is RO (W=0).

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-08-17 20:24           ` Borislav Petkov
@ 2021-08-17 20:51             ` Andy Lutomirski
  2021-08-17 21:01               ` Borislav Petkov
  0 siblings, 1 reply; 62+ messages in thread
From: Andy Lutomirski @ 2021-08-17 20:51 UTC (permalink / raw)
  To: Borislav Petkov, luto
  Cc: Yu-cheng Yu, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Linux Kernel Mailing List,
	linux-doc, linux-mm, linux-arch, Linux API, Arnd Bergmann,
	Balbir Singh, Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra (Intel),
	Randy Dunlap, Shankar, Ravi V, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Kirill A. Shutemov

On Tue, Aug 17, 2021, at 1:24 PM, Borislav Petkov wrote:
> On Tue, Aug 17, 2021 at 01:13:09PM -0700, Andy Lutomirski wrote:
> > > If special kernel code using shadow stack management insns needs
> > > to modify a shadow stack, then it can check whether a page is
> > > pte/pmd_shstk() but that code is special anyway.
> > > 
> > > Hell, a shadow stack page is (Write=0, Dirty=1) so calling it writable
> > >                  ^^^^^^^
> > > is simply wrong.
> > 
> > But it *is* writable using WRUSS, and it’s also writable by CALL,
> 
> Well, if we have to be precise, CALL doesn't write it directly - it
> causes the shadow stack to be written as part of CALL's execution. Yeah
> yeah, potato potato.

Potahto.

> 
> > WRSS, etc.
> 
> Thus the "special kernel code" thing above. I've left it in instead of
> snipping it.
> 

WRSS can be used from user mode depending on the configuration.

> > Now if the mm code tries to write protect it and expects sensible
> > semantics, the results could be interesting. At the very least,
> > someone would need to validate that RET reading a read only shadow
> > stack page does the right thing.
> 
> Huh?
> 
> A shadow stack page is RO (W=0).

Double-you shmouble-you.  You can't write it with MOV, but you can write
it from user code and from kernel code.  As far as the mm is concerned,
I think it should be considered writable.

Although... anyone who tries to copy_to_user() it is going to be a bit
surprised.  Hmm.

> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-08-17 20:51             ` Andy Lutomirski
@ 2021-08-17 21:01               ` Borislav Petkov
  2021-08-18 16:38                 ` Yu, Yu-cheng
  0 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2021-08-17 21:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: luto, Yu-cheng Yu, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Linux Kernel Mailing List,
	linux-doc, linux-mm, linux-arch, Linux API, Arnd Bergmann,
	Balbir Singh, Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra (Intel),
	Randy Dunlap, Shankar, Ravi V, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Kirill A. Shutemov

On Tue, Aug 17, 2021 at 01:51:52PM -0700, Andy Lutomirski wrote:
> WRSS can be used from user mode depending on the configuration.

My point being, if you're going to do shadow stack management
operations, you should check whether the target you're writing to is a
shadow stack page. Clearly userspace can't do that but userspace will
get notified of that pretty timely.

> Double-you shmouble-you. You can't write it with MOV, but you can
> write it from user code and from kernel code. As far as the mm is
> concerned, I think it should be considered writable.

Because?

> Although... anyone who tries to copy_to_user() it is going to be a bit
> surprised. Hmm.

Ok, so you see the confusion.

In any case, I don't think you can simply look at a shadow stack page as
a simple writable page. There are cases where it is going to be fun.

So why are we even saying that a shadow stack page is writable? Why
can't we simply say that a shadow stack page is, well, something
special?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-08-17 21:01               ` Borislav Petkov
@ 2021-08-18 16:38                 ` Yu, Yu-cheng
  2021-08-21 16:27                   ` Borislav Petkov
  0 siblings, 1 reply; 62+ messages in thread
From: Yu, Yu-cheng @ 2021-08-18 16:38 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski
  Cc: luto, the arch/x86 maintainers, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Linux Kernel Mailing List, linux-doc, linux-mm,
	linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra (Intel),
	Randy Dunlap, Shankar, Ravi V, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Kirill A. Shutemov

On 8/17/2021 2:01 PM, Borislav Petkov wrote:
> On Tue, Aug 17, 2021 at 01:51:52PM -0700, Andy Lutomirski wrote:
>> WRSS can be used from user mode depending on the configuration.
> 
> My point being, if you're going to do shadow stack management
> operations, you should check whether the target you're writing to is a
> shadow stack page. Clearly userspace can't do that but userspace will
> get notified of that pretty timely.
> 
>> Double-you shmouble-you. You can't write it with MOV, but you can
>> write it from user code and from kernel code. As far as the mm is
>> concerned, I think it should be considered writable.
> 
> Because?
> 
>> Although... anyone who tries to copy_to_user() it is going to be a bit
>> surprised. Hmm.
> 
> Ok, so you see the confusion.
> 

copy_to_user() can run into normal read-only areas too.  The caller can 
handle that just fine.

> In any case, I don't think you can simply look at a shadow stack page as
> a simple writable page. There are cases where it is going to be fun.
> 
> So why are we even saying that a shadow stack page is writable? Why
> can't we simply say that a shadow stack page is, well, something
> special?
> 

We can visualize the type of a mm area by looking at vma->vm_flags, e.g. 
maybe_mkwrite(), and PTE macros as lower-level operatives.  These two 
have some relations but not one-to-one.  Note that a PTE in a writable 
area is not always pte_write().

I have considered and implemented both options: a shadow stack PTE that
is pte_write() and one that is not.  Treating a shadow stack PTE as
pte_write() results in fewer arch_* macros and less confusion in the
copy-on-write code.  That is one more thing to consider.

Thanks,
Yu-cheng

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW
  2021-08-18 16:38                 ` Yu, Yu-cheng
@ 2021-08-21 16:27                   ` Borislav Petkov
  0 siblings, 0 replies; 62+ messages in thread
From: Borislav Petkov @ 2021-08-21 16:27 UTC (permalink / raw)
  To: Yu, Yu-cheng
  Cc: Andy Lutomirski, luto, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Linux Kernel Mailing List,
	linux-doc, linux-mm, linux-arch, Linux API, Arnd Bergmann,
	Balbir Singh, Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra (Intel),
	Randy Dunlap, Shankar, Ravi V, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe, Kirill A. Shutemov

On Wed, Aug 18, 2021 at 09:38:30AM -0700, Yu, Yu-cheng wrote:
> We can visualize the type of a mm area by looking at vma->vm_flags, e.g.

visualize?

> maybe_mkwrite(), and PTE macros as lower-level operatives.  These two have
> some relations but not one-to-one.  Note that a PTE in a writable area is
> not always pte_write().
> 
> I have considered and implemented both options: a shadow stack PTE that
> is pte_write() and one that is not.  Treating a shadow stack PTE as
> pte_write() results in fewer arch_* macros and less confusion in the
> copy-on-write code.  That is one more thing to consider.

Ok, even though I'm still not 100% convinced by both amluto's and your
arguments. Let's try it and see what happens...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2021-08-21 16:26 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
2021-07-22 20:51 [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-07-22 20:51 ` [PATCH v28 01/32] Documentation/x86: Add CET description Yu-cheng Yu
2021-07-22 20:51 ` [PATCH v28 02/32] x86/cet/shstk: Add Kconfig option for Shadow Stack Yu-cheng Yu
2021-07-22 20:51 ` [PATCH v28 03/32] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) Yu-cheng Yu
2021-07-22 20:51 ` [PATCH v28 04/32] x86/cpufeatures: Introduce CPU setup and option parsing for CET Yu-cheng Yu
2021-08-09 16:06   ` Borislav Petkov
2021-08-10 15:39     ` Yu, Yu-cheng
2021-08-10 16:51       ` Borislav Petkov
2021-07-22 20:51 ` [PATCH v28 05/32] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Yu-cheng Yu
2021-08-09 16:46   ` Borislav Petkov
2021-08-10 15:50     ` Yu, Yu-cheng
2021-07-22 20:51 ` [PATCH v28 06/32] x86/cet: Add control-protection fault handler Yu-cheng Yu
2021-08-09 17:51   ` Borislav Petkov
2021-08-10 16:06     ` Yu, Yu-cheng
2021-07-22 20:51 ` [PATCH v28 07/32] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Yu-cheng Yu
2021-07-22 20:51 ` [PATCH v28 08/32] x86/mm: Move pmd_write(), pud_write() up in the file Yu-cheng Yu
2021-08-09 18:02   ` Borislav Petkov
2021-07-22 20:51 ` [PATCH v28 09/32] x86/mm: Introduce _PAGE_COW Yu-cheng Yu
2021-08-16 10:43   ` Borislav Petkov
2021-08-17 18:24     ` Yu, Yu-cheng
2021-08-17 19:54       ` Borislav Petkov
2021-08-17 20:13         ` Andy Lutomirski
2021-08-17 20:24           ` Borislav Petkov
2021-08-17 20:51             ` Andy Lutomirski
2021-08-17 21:01               ` Borislav Petkov
2021-08-18 16:38                 ` Yu, Yu-cheng
2021-08-21 16:27                   ` Borislav Petkov
2021-07-22 20:51 ` [PATCH v28 10/32] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
2021-07-22 20:51 ` [PATCH v28 11/32] x86/mm: Update pte_modify for _PAGE_COW Yu-cheng Yu
2021-07-22 20:51 ` [PATCH v28 12/32] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Yu-cheng Yu
2021-08-16 16:01   ` Borislav Petkov
2021-08-17 18:33     ` Yu, Yu-cheng
2021-07-22 20:52 ` [PATCH v28 13/32] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 14/32] mm: Introduce VM_SHADOW_STACK for shadow stack memory Yu-cheng Yu
2021-08-16 16:35   ` Borislav Petkov
2021-08-17 18:35     ` Yu, Yu-cheng
2021-07-22 20:52 ` [PATCH v28 15/32] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 16/32] x86/mm: Update maybe_mkwrite() for shadow stack Yu-cheng Yu
2021-08-16 17:03   ` Borislav Petkov
2021-08-17 18:36     ` Yu, Yu-cheng
2021-07-22 20:52 ` [PATCH v28 17/32] mm: Fixup places that call pte_mkwrite() directly Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 18/32] mm: Add guard pages around a shadow stack Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 19/32] mm/mmap: Add shadow stack pages to memory accounting Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 20/32] mm: Update can_follow_write_pte() for shadow stack Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 21/32] mm/mprotect: Exclude shadow stack from preserve_write Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 22/32] mm: Re-introduce vm_flags to do_mmap() Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 23/32] x86/cet/shstk: Add user-mode shadow stack support Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 24/32] x86/process: Change copy_thread() argument 'arg' to 'stack_size' Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 25/32] x86/cet/shstk: Handle thread shadow stack Yu-cheng Yu
2021-07-22 21:05   ` Dave Hansen
2021-07-23 17:30     ` Yu, Yu-cheng
2021-07-22 20:52 ` [PATCH v28 26/32] x86/cet/shstk: Introduce shadow stack token setup/verify routines Yu-cheng Yu
2021-07-22 21:15   ` Dave Hansen
2021-07-23 18:01     ` Yu, Yu-cheng
2021-07-22 20:52 ` [PATCH v28 27/32] x86/cet/shstk: Handle signals for shadow stack Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 28/32] ELF: Introduce arch_setup_elf_property() Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 29/32] x86/cet/shstk: Add arch_prctl functions for shadow stack Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 30/32] mm: Move arch_calc_vm_prot_bits() to arch/x86/include/asm/mman.h Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 31/32] mm: Update arch_validate_flags() to test vma anonymous Yu-cheng Yu
2021-07-22 20:52 ` [PATCH v28 32/32] mm: Introduce PROT_SHADOW_STACK for shadow stack Yu-cheng Yu
2021-07-22 21:08 ` [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Dave Hansen
2021-07-23 17:28   ` Yu, Yu-cheng
