* [PATCH v5 00/39] Shadow stacks for userspace
@ 2023-01-19 21:22 Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 01/39] Documentation/x86: Add CET shadow stack description Rick Edgecombe
                   ` (41 more replies)
  0 siblings, 42 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

Hi,

This series implements Shadow Stacks for userspace using x86's Control-flow 
Enforcement Technology (CET). CET consists of two related security features: 
shadow stacks and indirect branch tracking. This series implements just the 
shadow stack part of this feature, and just for userspace.

The main use case for shadow stack is providing protection against return 
oriented programming attacks. It works by maintaining a secondary (shadow) 
stack using a special memory type that has protections against modification. 
When executing a CALL instruction, the processor pushes the return address to 
both the normal stack and to the special permission shadow stack. Upon RET, 
the processor pops the shadow stack copy and compares it to the normal stack 
copy. For more details, see the coverletter from v1 [0].
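
As a rough illustration of the CALL/RET behavior described above, here is a
toy software model. It is a sketch only: the struct and function names are
made up for this example, and real shadow stacks are maintained by the
processor, not by code like this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_DEPTH 16

/*
 * Hypothetical toy model: a normal stack that software may overwrite and
 * a shadow stack that only sim_call()/sim_ret() are meant to touch.
 */
struct sim_cpu {
	uint64_t normal[STACK_DEPTH];
	uint64_t shadow[STACK_DEPTH];
	size_t sp;	/* number of entries on the normal stack */
	size_t ssp;	/* number of entries on the shadow stack */
};

/* CALL: push the return address to both stacks. Returns false when full. */
bool sim_call(struct sim_cpu *cpu, uint64_t ret_addr)
{
	if (cpu->sp == STACK_DEPTH || cpu->ssp == STACK_DEPTH)
		return false;
	cpu->normal[cpu->sp++] = ret_addr;
	cpu->shadow[cpu->ssp++] = ret_addr;
	return true;
}

/*
 * RET: pop both copies and compare them. A false return on a mismatch
 * stands in for the fault the hardware would raise.
 */
bool sim_ret(struct sim_cpu *cpu, uint64_t *ret_addr)
{
	uint64_t n, s;

	if (cpu->sp == 0 || cpu->ssp == 0)
		return false;
	n = cpu->normal[--cpu->sp];
	s = cpu->shadow[--cpu->ssp];
	if (n != s)
		return false;
	*ret_addr = n;
	return true;
}
```

Overwriting an entry in normal[] between sim_call() and sim_ret() makes
sim_ret() fail, which is the ROP scenario the feature is aimed at.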

The main change in this version is the removal of the attempt to prevent 32 bit 
signals from being registered with shadow stack enabled. Peterz originally 
raised the issue that shadow stack support in 32 bit signals was in a half 
working state. The reason for that was 32 bit signals are not easy to support 
for shadow stack, and also there is not a huge demand for shadow stack support 
in 32 bit apps using 32 bit emulation on 64 bit kernels. At that point the 
solution was to prevent shadow stack from being enabled on 32 bit processes. 
But Peterz pointed out that 64 bit apps can transition to 32 bit outside of kernel
interaction by making a far call to a 32 bit segment.

So the next solution was to prevent 32 bit signals from being registered when
shadow stack was enabled. This turned out to be hard to do, due to signals
being per-process and shadow stack being per task.

But it turns out this far call scenario was already mostly not possible due to 
the HW not supporting shadow stacks located outside of the 32 bit address space 
when in 32 bit mode. During the transition to 32 bit mode with an SSP pointing 
outside of the 32 bit address space, HW generates a #GP which in turn triggers 
a segfault. So basically there is already a barrier in place for this far call 
scenario for the most part. Creation of shadow stack memory is tightly 
controlled, so the solution in this version is just to *ensure* that shadow 
stacks can never be allocated in the 32 bit address space. For more information 
see the new patch: "x86/mm: Introduce MAP_ABOVE4G", and the documentation in 
patch 1.
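
The invariant can be sketched in a few lines of C. This is only a model of
the reasoning above, not code from the series; SZ_4G and both function names
are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_4G (1ULL << 32)	/* first address outside the 32 bit space */

/*
 * True if [start, start + len) lies entirely at or above 4 GiB.
 * Empty and wrapping ranges are rejected.
 */
bool mapping_above_4g(uint64_t start, uint64_t len)
{
	return len != 0 && start >= SZ_4G && start + len > start;
}

/*
 * Model of the hardware behavior described above: entering 32 bit mode
 * with an SSP outside the 32 bit address space raises #GP.
 */
bool enter_32bit_raises_gp(uint64_t ssp)
{
	return ssp >= SZ_4G;
}
```

If every shadow stack mapping satisfies mapping_above_4g(), then any SSP
inside one satisfies enter_32bit_raises_gp(), so the far call case always
ends in a fault.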

Additionally:
 - A smattering of small changes from Boris and Kees
 - Fixed my spellcheck setup and then fixed a bunch of spelling issues in the
   commit logs.
 - An update to the pte_modify() PAGE_COW solution
 
I left tested-by tags in place per discussion with testers. Testers, please
retest.

Previous version [1].

[0] https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/lkml/20221203003606.6838-1-rick.p.edgecombe@intel.com/

Kirill A. Shutemov (1):
  x86: Introduce userspace API for shadow stack

Mike Rapoport (1):
  x86/shstk: Add ARCH_SHSTK_UNLOCK

Rick Edgecombe (14):
  x86/fpu: Add helper for modifying xstate
  x86/mm: Introduce _PAGE_COW
  x86/mm: Start actually marking _PAGE_COW
  mm: Handle faultless write upgrades for shstk
  mm: Don't allow write GUPs to shadow stack memory
  x86/mm: Introduce MAP_ABOVE4G
  mm: Warn on shadow stack memory in wrong vma
  x86/shstk: Introduce map_shadow_stack syscall
  x86/shstk: Support WRSS for userspace
  x86: Expose thread features in /proc/$PID/status
  x86/shstk: Wire in shadow stack interface
  selftests/x86: Add shadow stack test
  x86/fpu: Add helper for initing features
  x86/shstk: Add ARCH_SHSTK_STATUS

Yu-cheng Yu (23):
  Documentation/x86: Add CET shadow stack description
  x86/shstk: Add Kconfig option for shadow stack
  x86/cpufeatures: Add CPU feature flags for shadow stacks
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86: Add user control-protection fault handler
  x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  x86/mm: Move pmd_write(), pud_write() up in the file
  x86/mm: Update pte_modify for _PAGE_COW
  x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
    transition from _PAGE_DIRTY to _PAGE_COW
  mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  mm: Introduce VM_SHADOW_STACK for shadow stack memory
  x86/mm: Check shadow stack page fault errors
  x86/mm: Update maybe_mkwrite() for shadow stack
  mm: Fixup places that call pte_mkwrite() directly
  mm: Add guard pages around a shadow stack.
  mm/mmap: Add shadow stack pages to memory accounting
  mm: Re-introduce vm_flags to do_mmap()
  x86/shstk: Add user-mode shadow stack support
  x86/shstk: Handle thread shadow stack
  x86/shstk: Introduce routines modifying shstk
  x86/shstk: Handle signals for shadow stack
  x86: Add PTRACE interface for shadow stack

 Documentation/filesystems/proc.rst            |   1 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/shstk.rst                   | 176 +++++
 arch/arm/kernel/signal.c                      |   2 +-
 arch/arm64/kernel/signal.c                    |   2 +-
 arch/arm64/kernel/signal32.c                  |   2 +-
 arch/sparc/kernel/signal32.c                  |   2 +-
 arch/sparc/kernel/signal_64.c                 |   2 +-
 arch/x86/Kconfig                              |  24 +
 arch/x86/Kconfig.assembler                    |   5 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/include/asm/disabled-features.h      |  16 +-
 arch/x86/include/asm/fpu/api.h                |   9 +
 arch/x86/include/asm/fpu/regset.h             |   7 +-
 arch/x86/include/asm/fpu/sched.h              |   3 +-
 arch/x86/include/asm/fpu/types.h              |  16 +-
 arch/x86/include/asm/fpu/xstate.h             |   6 +-
 arch/x86/include/asm/idtentry.h               |   2 +-
 arch/x86/include/asm/mmu_context.h            |   2 +
 arch/x86/include/asm/msr.h                    |  11 +
 arch/x86/include/asm/pgtable.h                | 338 ++++++++-
 arch/x86/include/asm/pgtable_types.h          |  65 +-
 arch/x86/include/asm/processor.h              |   8 +
 arch/x86/include/asm/shstk.h                  |  40 ++
 arch/x86/include/asm/special_insns.h          |  13 +
 arch/x86/include/asm/tlbflush.h               |   3 +-
 arch/x86/include/asm/trap_pf.h                |   2 +
 arch/x86/include/asm/traps.h                  |  12 +
 arch/x86/include/uapi/asm/mman.h              |   4 +
 arch/x86/include/uapi/asm/prctl.h             |  12 +
 arch/x86/kernel/Makefile                      |   4 +
 arch/x86/kernel/cet.c                         | 152 ++++
 arch/x86/kernel/cpu/common.c                  |  35 +-
 arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
 arch/x86/kernel/cpu/proc.c                    |  23 +
 arch/x86/kernel/fpu/core.c                    |  59 +-
 arch/x86/kernel/fpu/regset.c                  |  87 +++
 arch/x86/kernel/fpu/xstate.c                  | 148 ++--
 arch/x86/kernel/fpu/xstate.h                  |   6 +
 arch/x86/kernel/idt.c                         |   2 +-
 arch/x86/kernel/process.c                     |  18 +-
 arch/x86/kernel/process_64.c                  |   9 +-
 arch/x86/kernel/ptrace.c                      |  12 +
 arch/x86/kernel/shstk.c                       | 492 +++++++++++++
 arch/x86/kernel/signal.c                      |   1 +
 arch/x86/kernel/signal_32.c                   |   2 +-
 arch/x86/kernel/signal_64.c                   |   8 +-
 arch/x86/kernel/sys_x86_64.c                  |   6 +-
 arch/x86/kernel/traps.c                       |  87 ---
 arch/x86/mm/fault.c                           |  38 +
 arch/x86/mm/pat/set_memory.c                  |   2 +-
 arch/x86/mm/pgtable.c                         |   6 +
 arch/x86/xen/enlighten_pv.c                   |   2 +-
 arch/x86/xen/xen-asm.S                        |   2 +-
 fs/aio.c                                      |   2 +-
 fs/proc/array.c                               |   6 +
 fs/proc/task_mmu.c                            |   3 +
 include/linux/mm.h                            |  59 +-
 include/linux/mman.h                          |   4 +
 include/linux/pgtable.h                       |  35 +
 include/linux/proc_fs.h                       |   2 +
 include/linux/syscalls.h                      |   1 +
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/asm-generic/unistd.h             |   2 +-
 include/uapi/linux/elf.h                      |   2 +
 ipc/shm.c                                     |   2 +-
 kernel/sys_ni.c                               |   1 +
 mm/gup.c                                      |   2 +-
 mm/huge_memory.c                              |  12 +-
 mm/memory.c                                   |   7 +-
 mm/migrate_device.c                           |   4 +-
 mm/mmap.c                                     |  12 +-
 mm/nommu.c                                    |   4 +-
 mm/userfaultfd.c                              |  10 +-
 mm/util.c                                     |   2 +-
 tools/testing/selftests/x86/Makefile          |   4 +-
 .../testing/selftests/x86/test_shadow_stack.c | 667 ++++++++++++++++++
 78 files changed, 2578 insertions(+), 259 deletions(-)
 create mode 100644 Documentation/x86/shstk.rst
 create mode 100644 arch/x86/include/asm/shstk.h
 create mode 100644 arch/x86/kernel/cet.c
 create mode 100644 arch/x86/kernel/shstk.c
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

-- 
2.17.1




* [PATCH v5 01/39] Documentation/x86: Add CET shadow stack description
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:38   ` Kees Cook
  2023-01-19 21:22 ` [PATCH v5 02/39] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
                   ` (40 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Introduce a new document on Control-flow Enforcement Technology (CET).

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v5:
 - Literal format tweaks (Bagas Sanjaya)
 - Update EOPNOTSUPP text due to unification after comment from (Kees)
 - Update 32 bit signal support with new behavior
 - Remove capitalization on shadow stack (Boris)
 - Fix typo

v4:
 - Drop clearcpuid piece (Boris)
 - Add some info about 32 bit

v3:
 - Clarify kernel IBT is supported by the kernel. (Kees, Andrew Cooper)
 - Clarify which arch_prctl's can take multiple bits. (Kees)
 - Describe ASLR characteristics of thread shadow stacks. (Kees)
 - Add exec section. (Andrew Cooper)
 - Fix some capitalization (Bagas Sanjaya)
 - Update new location of enablement status proc.
 - Add info about new user_shstk software capability.
 - Add more info about what the kernel pushes to the shadow stack on
   signal.

v2:
 - Updated to new arch_prctl() API
 - Add bit about new proc status

 Documentation/x86/index.rst |   1 +
 Documentation/x86/shstk.rst | 166 ++++++++++++++++++++++++++++++++++++
 2 files changed, 167 insertions(+)
 create mode 100644 Documentation/x86/shstk.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index c73d133fd37c..8ac64d7de4dc 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -22,6 +22,7 @@ x86-specific Documentation
    mtrr
    pat
    intel-hfi
+   shstk
    iommu
    intel_txt
    amd-memory-encryption
diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
new file mode 100644
index 000000000000..f2e6f323cf68
--- /dev/null
+++ b/Documentation/x86/shstk.rst
@@ -0,0 +1,166 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Control-flow Enforcement Technology (CET) Shadow Stack
+======================================================
+
+CET Background
+==============
+
+Control-flow Enforcement Technology (CET) is a term referring to several
+related x86 processor features that provide protection against control
+flow hijacking attacks. The HW feature itself can be set up to protect
+both applications and the kernel.
+
+CET introduces shadow stack and indirect branch tracking (IBT). Shadow stack
+is a secondary stack allocated from memory and cannot be directly modified by
+applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. IBT verifies indirect CALL/JMP targets are intended
+as marked by the compiler with 'ENDBR' opcodes. Not all CPUs have both shadow
+stack and indirect branch tracking. Today in the 64-bit kernel, only userspace
+shadow stack and kernel IBT are supported.
+
+Requirements to use Shadow Stack
+================================
+
+To use userspace shadow stack you need HW that supports it, a kernel
+configured with it and userspace libraries compiled with it.
+
+The kernel Kconfig option is X86_USER_SHADOW_STACK, and it can be disabled
+with the kernel parameter: nousershstk.
+
+To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or later
+are required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET. "user_shstk" means that userspace shadow stack is supported on the current
+kernel and HW.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF note and can be verified
+from readelf/llvm-readelf output::
+
+    readelf -n <application> | grep -a SHSTK
+        properties: x86 feature: SHSTK
+
+The kernel does not process these application markers directly. Applications
+or loaders must enable CET features using the arch_prctl() interface below.
+Typically this would be done in dynamic loader or static runtime objects, as is
+the case in GLIBC.
+
+Enabling arch_prctl()'s
+=======================
+
+ELF features should be enabled by the loader using the below arch_prctl's. They
+are only supported in 64 bit user applications.
+
+arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
+    Enable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
+    Disable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
+    Lock in features at their current enabled or disabled status. 'features'
+    is a mask of all features to lock. All bits set are processed, unset bits
+    are ignored. The mask is ORed with the existing value. So any feature bits
+    set here cannot be enabled or disabled afterwards.
+
+The return values are as follows. On success, return 0. On error, errno can
+be::
+
+        -EPERM if any of the passed features are locked.
+        -ENOTSUPP if the feature is not supported by the hardware or
+         kernel.
+        -EINVAL if the arguments are invalid (nonexistent feature, etc.)
+
+The supported feature bits are::
+
+    ARCH_SHSTK_SHSTK - Shadow stack
+    ARCH_SHSTK_WRSS  - WRSS
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc Status
+===========
+To check if an application is actually running with shadow stack, the
+user can read /proc/$PID/status. It will report "wrss" or "shstk"
+depending on what is enabled. The lines look like this::
+
+    x86_Thread_features: shstk wrss
+    x86_Thread_features_locked: shstk wrss
+
+Implementation of the Shadow Stack
+==================================
+
+Shadow Stack Size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. However, because
+a compat-mode application's address space is smaller, each of its threads'
+shadow stacks has a size of MIN(1/4 RLIMIT_STACK, 4 GB).
+
+Signal
+------
+
+By default, the main program and its signal handlers use the same shadow
+stack. Because the shadow stack stores only return addresses, a large
+shadow stack covers the condition that both the program stack and the
+signal alternate stack run out.
+
+When a signal happens, the old pre-signal state is pushed on the stack. When
+shadow stack is enabled, the shadow stack specific state is pushed onto the
+shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
+in a special format with bit 63 set. On sigreturn this old SSP token is
+verified and restored by the kernel. The kernel will also push the normal
+restorer address to the shadow stack to help userspace avoid a shadow stack
+violation on the sigreturn path that goes through the restorer.
+
+So the shadow stack signal frame format is as follows::
+
+    |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
+                    (bit 63 set to 1)
+    |        ...| - Other state may be added in the future
+
+
+32 bit ABI signals are not supported in shadow stack processes. Linux prevents
+32 bit execution while shadow stack is enabled by allocating shadow stacks
+outside of the 32 bit address space. When execution enters 32 bit mode, either
+via far call or returning to userspace, a #GP is generated by the hardware
+which will be delivered to the process as a segfault. When transitioning to
+userspace the registers' state will be as if the userspace ip being returned to
+caused the segfault.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow stack access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread. New shadow stacks behave like mmap() with respect to
+ASLR behavior.
+
+Exec
+----
+
+On exec, shadow stack features are disabled by the kernel. At that point,
+userspace can choose to re-enable or lock them.
-- 
2.17.1
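
The sigframe token format documented in the patch above (the old SSP with
bit 63 set) can be sketched as follows. This is an illustration only: the
macro and function names are hypothetical, and the kernel's real
verification on sigreturn is not shown in this patch and may check more
than the marker bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 63 marks a value on the shadow stack as a sigframe token. */
#define SIGFRAME_TOKEN_BIT (1ULL << 63)

/* Build the token that is pushed to the shadow stack on signal delivery. */
uint64_t make_sigframe_token(uint64_t old_ssp)
{
	return old_ssp | SIGFRAME_TOKEN_BIT;
}

/*
 * Recover the pre-signal SSP from a token. Returns false if the value
 * does not carry the marker bit, so it cannot be a sigframe token.
 */
bool parse_sigframe_token(uint64_t token, uint64_t *old_ssp)
{
	if (!(token & SIGFRAME_TOKEN_BIT))
		return false;
	*old_ssp = token & ~SIGFRAME_TOKEN_BIT;
	return true;
}
```

A userspace address never has bit 63 set, so an ordinary return address
found on the shadow stack cannot be mistaken for a token in this model.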




* [PATCH v5 02/39] x86/shstk: Add Kconfig option for shadow stack
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 01/39] Documentation/x86: Add CET shadow stack description Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:40   ` Kees Cook
  2023-01-19 21:22 ` [PATCH v5 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
                   ` (39 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Shadow stack provides protection for applications against function return
address corruption. It is active when the processor supports it, the
kernel has CONFIG_X86_USER_SHADOW_STACK enabled, and the application is built
for the feature. This is only implemented for the 64-bit kernel. When it
is enabled, legacy non-shadow stack applications continue to work, but
without protection.

Since there is another feature that utilizes CET (Kernel IBT) that will
share implementation with shadow stacks, create CONFIG_X86_CET to signify
that at least one CET feature is configured.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v5:
 - Remove capitalization of shadow stack (Boris)

v3:
 - Add X86_CET (Kees)
 - Add back WRUSS dependency (Kees)
 - Fix verbiage (Dave)
 - Change from prompt to bool (Kirill)
 - Add more to commit log

v2:
 - Remove already wrong kernel size increase info (tglx)
 - Change prompt to remove "Intel" (tglx)
 - Update line about what CPUs are supported (Dave)

Yu-cheng v25:
 - Remove X86_CET and use X86_SHADOW_STACK directly.

 arch/x86/Kconfig           | 24 ++++++++++++++++++++++++
 arch/x86/Kconfig.assembler |  5 +++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..d0037181bc15 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1851,6 +1851,11 @@ config CC_HAS_IBT
 		  (CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
 		  $(as-instr,endbr64)
 
+config X86_CET
+	def_bool n
+	help
+	  CET features configured (Shadow stack or IBT)
+
 config X86_KERNEL_IBT
 	prompt "Indirect Branch Tracking"
 	def_bool y
@@ -1858,6 +1863,7 @@ config X86_KERNEL_IBT
 	# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
 	depends on !LD_IS_LLD || LLD_VERSION >= 140000
 	select OBJTOOL
+	select X86_CET
 	help
 	  Build the kernel with support for Indirect Branch Tracking, a
 	  hardware support course-grain forward-edge Control Flow Integrity
@@ -1952,6 +1958,24 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config X86_USER_SHADOW_STACK
+	bool "X86 userspace shadow stack"
+	depends on AS_WRUSS
+	depends on X86_64
+	select ARCH_USES_HIGH_VMA_FLAGS
+	select X86_CET
+	help
+	  Shadow stack protection is a hardware feature that detects function
+	  return address corruption.  This helps mitigate ROP attacks.
+	  Applications must be enabled to use it, and old userspace does not
+	  get protection "for free".
+
+	  CPUs supporting shadow stacks were first released in 2020.
+
+	  See Documentation/x86/shstk.rst for more information.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 26b8c08e2fc4..00c79dd93651 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -19,3 +19,8 @@ config AS_TPAUSE
 	def_bool $(as-instr,tpause %ecx)
 	help
 	  Supported by binutils >= 2.31.1 and LLVM integrated assembler >= V7
+
+config AS_WRUSS
+	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
+	help
+	  Supported by binutils >= 2.31 and LLVM integrated assembler
-- 
2.17.1




* [PATCH v5 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 01/39] Documentation/x86: Add CET shadow stack description Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 02/39] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:44   ` Kees Cook
  2023-01-19 21:22 ` [PATCH v5 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
                   ` (38 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The Control-Flow Enforcement Technology contains two related features,
one of which is Shadow Stacks. Future patches will utilize this feature
for shadow stack support in KVM, so add a CPU feature flag for Shadow
Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).

To protect shadow stack state from malicious modification, the registers
are only accessible in supervisor mode. This implementation
context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
on XSAVES.

The shadow stack feature, enumerated by the CPUID bit described above,
encompasses both supervisor and userspace support for shadow stack. In
near future patches, only userspace shadow stack will be enabled. In
expectation of future supervisor shadow stack support, create a software
CPU capability to enumerate kernel utilization of userspace shadow stack
support. This user shadow stack bit should depend on the HW "shstk"
capability and that logic will be implemented in future patches.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v5:
 - Drop "shstk" from cpuinfo (Boris)
 - Remove capitalization on shadow stack (Boris)

v3:
 - Add user specific shadow stack cpu cap (Andrew Cooper)
 - Drop reviewed-bys from Boris and Kees due to the above change.

v2:
 - Remove IBT reference in commit log (Kees)
 - Describe xsaves dependency using text from (Dave)

v1:
 - Remove IBT, can be added in a follow on IBT series.

 arch/x86/include/asm/cpufeatures.h       | 2 ++
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 arch/x86/kernel/cpu/cpuid-deps.c         | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 7b319acda31a..a8551b6c8041 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -307,6 +307,7 @@
 #define X86_FEATURE_SGX_EDECCSSA	(11*32+18) /* "" SGX EDECCSSA user leaf function */
 #define X86_FEATURE_CALL_DEPTH		(11*32+19) /* "" Call depth tracking for RSB stuffing */
 #define X86_FEATURE_MSR_TSX_CTRL	(11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
+#define X86_FEATURE_USER_SHSTK		(11*32+21) /* Shadow stack support for user mode applications */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
@@ -373,6 +374,7 @@
 #define X86_FEATURE_OSPKE		(16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_WAITPKG		(16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
 #define X86_FEATURE_AVX512_VBMI2	(16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK		(16*32+ 7) /* "" Shadow stack */
 #define X86_FEATURE_GFNI		(16*32+ 8) /* Galois Field New Instructions */
 #define X86_FEATURE_VAES		(16*32+ 9) /* Vector AES */
 #define X86_FEATURE_VPCLMULQDQ		(16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dfa4fb76f4b..505f78ddca82 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -99,6 +99,12 @@
 # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
 #endif
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define DISABLE_USER_SHSTK	0
+#else
+#define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -114,7 +120,7 @@
 #define DISABLED_MASK9	(DISABLE_SGX)
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
-			 DISABLE_CALL_DEPTH_TRACKING)
+			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 #define DISABLED_MASK12	0
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index d95221117129..c3e4e5246df9 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -79,6 +79,7 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVES    },
 	{ X86_FEATURE_XFD,			X86_FEATURE_XGETBV1   },
 	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XFD       },
+	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
 	{}
 };
 
-- 
2.17.1
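
For readers unfamiliar with the feature numbering used in the patch above,
the following sketch shows how a feature value splits into a capability
word and a bit, which is why DISABLE_USER_SHSTK lands in DISABLED_MASK11.
The two feature values are copied from the patch; the helper names are
invented for this example and are not kernel APIs.

```c
#include <stdint.h>

/* Values as defined in the cpufeatures.h hunk of this patch. */
#define FEATURE_USER_SHSTK	(11 * 32 + 21)
#define FEATURE_SHSTK		(16 * 32 + 7)

/* Index of the 32 bit capability word that holds the feature. */
unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Bit mask of the feature within its word, as in (1 << (feature & 31)). */
uint32_t feature_mask(unsigned int feature)
{
	return UINT32_C(1) << (feature & 31);
}
```

Here feature_word(FEATURE_USER_SHSTK) is 11 and its mask is bit 21, while
FEATURE_SHSTK sits in word 16 at bit 7, matching the
CPUID.(EAX=7,ECX=0):ECX[bit 7] enumeration named in the commit log.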




* [PATCH v5 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (2 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:46   ` Kees Cook
  2023-01-19 21:22 ` [PATCH v5 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
                   ` (37 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Setting CR4.CET is a prerequisite for utilizing any CET features, most of
which also require setting MSRs.

Kernel IBT already enables the CET CR4 bit when it detects IBT HW support
and is configured with kernel IBT. However, future patches that enable
userspace shadow stack support will need the bit set as well. So change
the logic to enable it in either case.

Clear MSR_IA32_U_CET in cet_disable() so that it can't live to see
userspace in a new kexec-ed kernel that has CR4.CET set from kernel IBT.
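
As a rough illustration of the decision described above, here is a small
userspace model of the enable logic. It is only a sketch: struct cet_setup
and model_setup_cet() are made-up names for this example, and the two bool
inputs stand in for the kernel IBT and user shadow stack checks that the
real code performs against CPU features and Kconfig options.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ENDBR enable bit of the supervisor CET MSR (bit 2). */
#define CET_ENDBR_EN	(1ULL << 2)

/* Hypothetical result record, used only by this model. */
struct cet_setup {
	bool cr4_cet;		/* would CR4.CET end up set? */
	uint64_t s_cet;		/* value written to the supervisor CET MSR */
	bool user_shstk_cap;	/* would the user shadow stack cap be set? */
};

/* Model of the "either feature turns on CR4.CET" decision. */
static struct cet_setup model_setup_cet(bool kernel_ibt, bool user_shstk)
{
	struct cet_setup r = { false, 0, false };

	/* Neither feature is usable: leave everything off. */
	if (!kernel_ibt && !user_shstk)
		return r;

	r.user_shstk_cap = user_shstk;

	/* Only kernel IBT needs ENDBR tracking in supervisor mode. */
	r.s_cet = kernel_ibt ? CET_ENDBR_EN : 0;

	/* CR4.CET is needed by both features. */
	r.cr4_cet = true;
	return r;
}
```

In this model, model_setup_cet(false, true) reports CR4.CET set with a
zero supervisor MSR value, which is the new case this patch adds.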

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v5:
 - Remove #ifdeffery (Boris)

v4:
 - Add back dedicated command line disable: "nousershstk" (Boris)

v3:
 - Remove stray new line (Boris)
 - Simplify commit log (Andrew Cooper)

v2:
 - In the shadow stack case, go back to only setting CR4.CET if the
   kernel is compiled with user shadow stack support.
 - Clear MSR_IA32_U_CET as well. (PeterZ)

 arch/x86/kernel/cpu/common.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cec654e674ff..80507a5ba0ca 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -599,27 +599,43 @@ __noendbr void ibt_restore(u64 save)
 
 static __always_inline void setup_cet(struct cpuinfo_x86 *c)
 {
-	u64 msr = CET_ENDBR_EN;
+	bool user_shstk, kernel_ibt;
 
-	if (!HAS_KERNEL_IBT ||
-	    !cpu_feature_enabled(X86_FEATURE_IBT))
+	if (!IS_ENABLED(CONFIG_X86_CET))
 		return;
 
-	wrmsrl(MSR_IA32_S_CET, msr);
+	kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
+	user_shstk = cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+		     IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK);
+
+	if (!kernel_ibt && !user_shstk)
+		return;
+
+	if (user_shstk)
+		set_cpu_cap(c, X86_FEATURE_USER_SHSTK);
+
+	if (kernel_ibt)
+		wrmsrl(MSR_IA32_S_CET, CET_ENDBR_EN);
+	else
+		wrmsrl(MSR_IA32_S_CET, 0);
+
 	cr4_set_bits(X86_CR4_CET);
 
-	if (!ibt_selftest()) {
+	if (kernel_ibt && !ibt_selftest()) {
 		pr_err("IBT selftest: Failed!\n");
 		wrmsrl(MSR_IA32_S_CET, 0);
 		setup_clear_cpu_cap(X86_FEATURE_IBT);
-		return;
 	}
 }
 
 __noendbr void cet_disable(void)
 {
-	if (cpu_feature_enabled(X86_FEATURE_IBT))
-		wrmsrl(MSR_IA32_S_CET, 0);
+	if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
+	      cpu_feature_enabled(X86_FEATURE_SHSTK)))
+		return;
+
+	wrmsrl(MSR_IA32_S_CET, 0);
+	wrmsrl(MSR_IA32_U_CET, 0);
 }
 
 /*
@@ -1476,6 +1492,9 @@ static void __init cpu_parse_early_param(void)
 	if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
 		setup_clear_cpu_cap(X86_FEATURE_XSAVES);
 
+	if (cmdline_find_option_bool(boot_command_line, "nousershstk"))
+		setup_clear_cpu_cap(X86_FEATURE_USER_SHSTK);
+
 	arglen = cmdline_find_option(boot_command_line, "clearcpuid", arg, sizeof(arg));
 	if (arglen <= 0)
 		return;
-- 
2.17.1




* [PATCH v5 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (3 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:46   ` Kees Cook
  2023-01-19 21:22 ` [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate Rick Edgecombe
                   ` (36 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Shadow stack register state can be managed with XSAVE. The registers
can logically be separated into two groups:
        * Registers controlling user-mode operation
        * Registers controlling kernel-mode operation

The architecture has two new XSAVE state components: one for each of
those groups of registers. This lets an OS manage them separately if
it chooses. Future patches for host userspace and KVM guests will only
utilize the user-mode registers, so only configure XSAVE to save
user-mode registers. This state will add 16 bytes to the xsave buffer
size.
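
The 16 byte figure follows from the layout of the user-mode state: two
64-bit values, one for the control settings and one for the shadow stack
pointer. A minimal userspace sketch of that arithmetic, using uint64_t in
place of the kernel's u64, could look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the user-mode CET state layout. */
struct cet_user_state {
	uint64_t user_cet;	/* user control-flow settings */
	uint64_t user_ssp;	/* user shadow stack pointer */
};

/* Two 8 byte fields, no padding: 16 bytes in the xsave buffer. */
static_assert(sizeof(struct cet_user_state) == 16,
	      "CET user state should be 16 bytes");
static_assert(offsetof(struct cet_user_state, user_ssp) == 8,
	      "SSP should follow the control settings directly");
```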

Future patches will use the user-mode XSAVE area to save guest user-mode
CET state. However, VMCS includes new fields for guest CET supervisor
states. KVM can use these to save and restore guest supervisor state, so
host supervisor XSAVE support is not required.

Adding this exacerbates the already unwieldy if statement in
check_xstate_against_struct() that handles warning about un-implemented
xfeatures. So refactor these checks by having XCHECK_SZ() set a bool when
it actually checks the xfeature. This ends up exceeding 80 chars, but was
better on balance than other options explored. Pass the bool as pointer to
make it clear that XCHECK_SZ() can change the variable.

While configuring user-mode XSAVE, clarify that kernel-mode registers are not
managed by XSAVE by defining the xfeature in
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED, like is done for XFEATURE_MASK_PT.
This serves more of a documentation as code purpose, and functionally,
only enables a few safety checks.

Both XSAVE state components are supervisor states, even the state
controlling user-mode operation. This is a departure from earlier features
like protection keys where the PKRU state is a normal user
(non-supervisor) state. Having the user state be supervisor-managed
ensures there is no direct, unprivileged access to it, making it harder
for an attacker to subvert CET.
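
To make the supported/unsupported split concrete, the sketch below models
the supervisor masks in userspace. The MODEL_ names and helper functions
are invented for this example; the component numbers assume the enum order
in the diff, where the user CET state is component 11 and the kernel CET
state is component 12.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Component numbers, assumed from the xfeature enum order. */
enum model_xfeature {
	MODEL_PT	 = 8,
	MODEL_PKRU	 = 9,
	MODEL_PASID	 = 10,
	MODEL_CET_USER	 = 11,
	MODEL_CET_KERNEL = 12,
};

#define MODEL_MASK(nr)	(1ULL << (nr))

/* Supervisor states the kernel saves and restores with XSAVES. */
#define MODEL_SUPERVISOR_SUPPORTED \
	(MODEL_MASK(MODEL_PASID) | MODEL_MASK(MODEL_CET_USER))

/* Supervisor states that exist but are deliberately not managed. */
#define MODEL_SUPERVISOR_UNSUPPORTED \
	(MODEL_MASK(MODEL_PT) | MODEL_MASK(MODEL_CET_KERNEL))

/* True if the component is a supervisor state of either kind. */
static bool model_is_supervisor(int nr)
{
	uint64_t all = MODEL_SUPERVISOR_SUPPORTED | MODEL_SUPERVISOR_UNSUPPORTED;

	return (all & MODEL_MASK(nr)) != 0;
}

/* True if the component is a supervisor state managed via XSAVES. */
static bool model_is_managed_supervisor(int nr)
{
	return (MODEL_SUPERVISOR_SUPPORTED & MODEL_MASK(nr)) != 0;
}
```

Here model_is_supervisor(MODEL_CET_USER) is true even though the state
controls user-mode operation, while MODEL_PKRU stays a plain user state.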

To facilitate this privileged access, define the two user-mode CET MSRs,
and the bits defined in those MSRs relevant to future shadow stack
enablement patches.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v5:
 - Move comments from end of lines in cet_user_state struct (Boris)

v3:
 - Add missing "is" in commit log (Boris)
 - Change to case statement for struct size checking (Boris)
 - Adjust commas on xfeature_names (Kees, Boris)

v2:
 - Change name to XFEATURE_CET_KERNEL_UNUSED (peterz)

KVM refresh:
 - Reword commit log using some verbiage posted by Dave Hansen
 - Remove unlikely to be used supervisor cet xsave struct
 - Clarify that supervisor cet state is not saved by xsave
 - Remove unused supervisor MSRs

 arch/x86/include/asm/fpu/types.h  | 16 +++++-
 arch/x86/include/asm/fpu/xstate.h |  6 ++-
 arch/x86/kernel/fpu/xstate.c      | 90 +++++++++++++++----------------
 3 files changed, 61 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb7cd1139d97..26abde698fc0 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -115,8 +115,8 @@ enum xfeature {
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 	XFEATURE_PKRU,
 	XFEATURE_PASID,
-	XFEATURE_RSRVD_COMP_11,
-	XFEATURE_RSRVD_COMP_12,
+	XFEATURE_CET_USER,
+	XFEATURE_CET_KERNEL_UNUSED,
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
@@ -138,6 +138,8 @@ enum xfeature {
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
+#define XFEATURE_MASK_CET_USER		(1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL_UNUSED)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
 #define XFEATURE_MASK_XTILE_CFG		(1 << XFEATURE_XTILE_CFG)
 #define XFEATURE_MASK_XTILE_DATA	(1 << XFEATURE_XTILE_DATA)
@@ -252,6 +254,16 @@ struct pkru_state {
 	u32				pad;
 } __packed;
 
+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+	/* user control-flow settings */
+	u64 user_cet;
+	/* user shadow stack pointer */
+	u64 user_ssp;
+};
+
 /*
  * State component 15: Architectural LBR configuration state.
  * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cd3dd170e23a..d4427b88ee12 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -50,7 +50,8 @@
 #define XFEATURE_MASK_USER_DYNAMIC	XFEATURE_MASK_XTILE_DATA
 
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+					    XFEATURE_MASK_CET_USER)
 
 /*
  * A supervisor state component may not always contain valuable information,
@@ -77,7 +78,8 @@
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+					      XFEATURE_MASK_CET_KERNEL)
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 714166cc25f2..13a80521dd51 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -39,26 +39,26 @@
  */
 static const char *xfeature_names[] =
 {
-	"x87 floating point registers"	,
-	"SSE registers"			,
-	"AVX registers"			,
-	"MPX bounds registers"		,
-	"MPX CSR"			,
-	"AVX-512 opmask"		,
-	"AVX-512 Hi256"			,
-	"AVX-512 ZMM_Hi256"		,
-	"Processor Trace (unused)"	,
+	"x87 floating point registers",
+	"SSE registers",
+	"AVX registers",
+	"MPX bounds registers",
+	"MPX CSR",
+	"AVX-512 opmask",
+	"AVX-512 Hi256",
+	"AVX-512 ZMM_Hi256",
+	"Processor Trace (unused)",
 	"Protection Keys User registers",
 	"PASID state",
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"AMX Tile config"		,
-	"AMX Tile data"			,
-	"unknown xstate feature"	,
+	"Control-flow User registers",
+	"Control-flow Kernel registers (unused)",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"AMX Tile config",
+	"AMX Tile data",
+	"unknown xstate feature",
 };
 
 static unsigned short xsave_cpuid_features[] __initdata = {
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
 	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
 	[XFEATURE_PKRU]				= X86_FEATURE_PKU,
 	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
+	[XFEATURE_CET_USER]			= X86_FEATURE_SHSTK,
 	[XFEATURE_XTILE_CFG]			= X86_FEATURE_AMX_TILE,
 	[XFEATURE_XTILE_DATA]			= X86_FEATURE_AMX_TILE,
 };
@@ -276,6 +277,7 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
+	print_xstate_feature(XFEATURE_MASK_CET_USER);
 	print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
 	print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
@@ -344,6 +346,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
 	 XFEATURE_MASK_BNDREGS |		\
 	 XFEATURE_MASK_BNDCSR |			\
 	 XFEATURE_MASK_PASID |			\
+	 XFEATURE_MASK_CET_USER |		\
 	 XFEATURE_MASK_XTILE)
 
 /*
@@ -446,14 +449,15 @@ static void __init __xstate_dump_leaves(void)
 	}									\
 } while (0)
 
-#define XCHECK_SZ(sz, nr, nr_macro, __struct) do {			\
-	if ((nr == nr_macro) &&						\
-	    WARN_ONCE(sz != sizeof(__struct),				\
-		"%s: struct is %zu bytes, cpu state %d bytes\n",	\
-		__stringify(nr_macro), sizeof(__struct), sz)) {		\
+#define XCHECK_SZ(sz, nr, __struct) ({					\
+	if (WARN_ONCE(sz != sizeof(__struct),				\
+	    "[%s]: struct is %zu bytes, cpu state %d bytes\n",		\
+	    xfeature_names[nr], sizeof(__struct), sz)) {		\
 		__xstate_dump_leaves();					\
 	}								\
-} while (0)
+	true;								\
+})
+
 
 /**
  * check_xtile_data_against_struct - Check tile data state size.
@@ -527,36 +531,28 @@ static bool __init check_xstate_against_struct(int nr)
 	 * Ask the CPU for the size of the state.
 	 */
 	int sz = xfeature_size(nr);
+
 	/*
 	 * Match each CPU state with the corresponding software
 	 * structure.
 	 */
-	XCHECK_SZ(sz, nr, XFEATURE_YMM,       struct ymmh_struct);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDREGS,   struct mpx_bndreg_state);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDCSR,    struct mpx_bndcsr_state);
-	XCHECK_SZ(sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
-	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
-	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
-	XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
-
-	/* The tile data size varies between implementations. */
-	if (nr == XFEATURE_XTILE_DATA)
-		check_xtile_data_against_struct(sz);
-
-	/*
-	 * Make *SURE* to add any feature numbers in below if
-	 * there are "holes" in the xsave state component
-	 * numbers.
-	 */
-	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX) ||
-	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
-	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
+	switch (nr) {
+	case XFEATURE_YMM:	  return XCHECK_SZ(sz, nr, struct ymmh_struct);
+	case XFEATURE_BNDREGS:	  return XCHECK_SZ(sz, nr, struct mpx_bndreg_state);
+	case XFEATURE_BNDCSR:	  return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state);
+	case XFEATURE_OPMASK:	  return XCHECK_SZ(sz, nr, struct avx_512_opmask_state);
+	case XFEATURE_ZMM_Hi256:  return XCHECK_SZ(sz, nr, struct avx_512_zmm_uppers_state);
+	case XFEATURE_Hi16_ZMM:	  return XCHECK_SZ(sz, nr, struct avx_512_hi16_state);
+	case XFEATURE_PKRU:	  return XCHECK_SZ(sz, nr, struct pkru_state);
+	case XFEATURE_PASID:	  return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
+	case XFEATURE_XTILE_CFG:  return XCHECK_SZ(sz, nr, struct xtile_cfg);
+	case XFEATURE_CET_USER:	  return XCHECK_SZ(sz, nr, struct cet_user_state);
+	case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
+	default:
 		XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
 		return false;
 	}
+
 	return true;
 }
 
-- 
2.17.1




* [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (4 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:47   ` Kees Cook
  2023-02-01 11:01   ` Borislav Petkov
  2023-01-19 21:22 ` [PATCH v5 07/39] x86: Add user control-protection fault handler Rick Edgecombe
                   ` (35 subsequent siblings)
  41 siblings, 2 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

Just like user xfeatures, supervisor xfeatures can be active in the
registers or present in the task FPU buffer. If the registers are
active, the registers can be modified directly. If the registers are
not active, the modification must be performed on the task FPU buffer.

When the state is not active, the kernel could perform modifications
directly to the buffer. But in order for it to do that, it needs
to know where in the buffer the specific state it wants to modify is
located. Doing this is not robust against optimizations that compact
the FPU buffer, as each access would require computing where in the
buffer it is.

The easiest way to modify supervisor xfeature data is to force restore
the registers and write directly to the MSRs. Oftentimes this is just fine
anyway as the registers need to be restored before returning to userspace.
Do this for now, leaving buffer writing optimizations for the future.

Add a new function fpregs_lock_and_load() that can simultaneously call
fpregs_lock() and do this restore. Also perform some extra sanity
checks in this function since this will be used in non-fpu focused code.
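
The intended calling pattern is "lock and load, modify the live state,
unlock". The toy model below shows that pattern in userspace. It is not
kernel code: struct fpu_model and the model_ functions are invented for
illustration, the need_load flag stands in for TIF_NEED_FPU_LOAD, and a
plain field assignment stands in for the MSR write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one task's supervisor xstate (hypothetical). */
struct fpu_model {
	bool locked;		/* models the fpregs lock being held */
	bool need_load;		/* models TIF_NEED_FPU_LOAD */
	uint64_t buffer_ssp;	/* copy saved in the task FPU buffer */
	uint64_t reg_ssp;	/* copy live in the registers */
};

/* Take the lock and make sure the registers hold the valid state. */
static void model_lock_and_load(struct fpu_model *f)
{
	f->locked = true;
	if (f->need_load) {
		/* Valid state is in the buffer: restore it first. */
		f->reg_ssp = f->buffer_ssp;
		f->need_load = false;
	}
}

static void model_unlock(struct fpu_model *f)
{
	f->locked = false;
}

/* Caller pattern: modify the state only while the registers are live. */
static void model_set_ssp(struct fpu_model *f, uint64_t ssp)
{
	model_lock_and_load(f);
	f->reg_ssp = ssp;	/* stands in for a write to the SSP MSR */
	model_unlock(f);
}
```

The point of the design is that callers never need to know where the
state sits inside the buffer; they only ever touch the register copy.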

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Fix spelling error (Boris)
 - Don't export fpregs_lock_and_load() (Boris)

v3:
 - Rename to fpregs_lock_and_load() to match the unlocking
   fpregs_unlock(). (Kees)
 - Elaborate in comment about helper. (Dave)

v2:
 - Drop optimization of writing directly to the buffer, and change API
   accordingly.
 - fpregs_lock_and_load() suggested by tglx
 - Some commit log verbiage from dhansen

v1:
 - New patch.

 arch/x86/include/asm/fpu/api.h |  9 +++++++++
 arch/x86/kernel/fpu/core.c     | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 503a577814b2..aadc6893dcaa 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -82,6 +82,15 @@ static inline void fpregs_unlock(void)
 		preempt_enable();
 }
 
+/*
+ * FPU state gets lazily restored before returning to userspace. So when in the
+ * kernel, the valid FPU state may be kept in the buffer. This function will force
+ * restore all the fpu state to the registers early if needed, and lock them from
+ * being automatically saved/restored. Then FPU state can be modified safely in the
+ * registers, before unlocking with fpregs_unlock().
+ */
+void fpregs_lock_and_load(void);
+
 #ifdef CONFIG_X86_DEBUG_FPU
 extern void fpregs_assert_state_consistent(void);
 #else
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index dccce58201b7..7317bfd5ea36 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -753,6 +753,24 @@ void switch_fpu_return(void)
 }
 EXPORT_SYMBOL_GPL(switch_fpu_return);
 
+void fpregs_lock_and_load(void)
+{
+	/*
+	 * fpregs_lock() only disables preemption (mostly). So modifying state
+	 * in an interrupt could screw up some in progress fpregs operation,
+	 * but appear to work. Warn about it.
+	 */
+	WARN_ON_ONCE(!irq_fpu_usable());
+	WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+	fpregs_lock();
+
+	fpregs_assert_state_consistent();
+
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		fpregs_restore_userregs();
+}
+
 #ifdef CONFIG_X86_DEBUG_FPU
 /*
  * If current FPU state according to its tracking (loaded FPU context on this
-- 
2.17.1




* [PATCH v5 07/39] x86: Add user control-protection fault handler
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (5 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:50   ` Kees Cook
  2023-02-03 19:09   ` Borislav Petkov
  2023-01-19 21:22 ` [PATCH v5 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
                   ` (34 subsequent siblings)
  41 siblings, 2 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack.
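
The fault's error code says which kind of violation occurred. As a
userspace sketch of how the code is decoded in the new cet.c below (the
low 15 bits select the cause, bit 15 flags an enclave), with helper names
chosen for this example only:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CP_EC	((1UL << 15) - 1)	/* low 15 bits: cause */
#define CP_ENCL	(1UL << 15)		/* bit 15: fault inside an enclave */

/* Same strings as the cp_err table in the patch. */
static const char *const cp_names[] = {
	"unknown",
	"near ret",
	"far/iret",
	"endbranch",
	"rstorssp",
	"setssbsy",
};

/* Map an error code to a cause string; out of range maps to "unknown". */
static const char *cp_name(unsigned long error_code)
{
	unsigned long idx = error_code & CP_EC;

	if (idx >= sizeof(cp_names) / sizeof(cp_names[0]))
		idx = 0;
	return cp_names[idx];
}

/* True if the enclave bit is set in the error code. */
static bool cp_in_enclave(unsigned long error_code)
{
	return (error_code & CP_ENCL) != 0;
}
```

For the RET mismatch example above, the cause value would be 1, which
this table reports as "near ret".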

There already exists a control-protection fault handler for handling kernel
IBT faults. Refactor this fault handler into separate user and kernel
handlers, like the page fault handler. Add a control-protection handler
for usermode. To avoid ifdeffery, put them both in a new file cet.c, which
is compiled in the case of either of the two CET features supported in the
kernel: kernel IBT or user mode shadow stack. Move some static inline
functions from traps.c into a header so they can be used in cet.c.

Opportunistically fix a comment in the kernel IBT part of the fault
handler that is on the end of the line instead of preceding it.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when the feature
is missing. This unifies the behavior with the new shadow stack code, and
also prevents the kernel from crashing under this situation which is
potentially recoverable.

The control-protection fault handler works in a similar way as the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.
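
From the application side, this means a SIGSEGV handler installed with
SA_SIGINFO can tell a control-protection fault apart from other segfaults
by looking at si_code. The sketch below assumes SEGV_CPERR has the value
10, which is what the NSIGSEGV bump from 9 to 10 in this patch implies;
the fallback define and the function names are only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Assumed value; older userspace headers may not define it yet. */
#ifndef SEGV_CPERR
#define SEGV_CPERR 10
#endif

/* True if this siginfo describes a control-protection fault. */
static bool segv_is_control_protection(const siginfo_t *si)
{
	return si->si_signo == SIGSEGV && si->si_code == SEGV_CPERR;
}

/* Handler: report the cause with async-signal-safe calls, then exit. */
static void segv_handler(int sig, siginfo_t *si, void *ucontext)
{
	static const char cp_msg[] = "control protection fault\n";
	static const char other_msg[] = "other segmentation fault\n";

	(void)sig;
	(void)ucontext;

	if (segv_is_control_protection(si))
		(void)write(STDERR_FILENO, cp_msg, sizeof(cp_msg) - 1);
	else
		(void)write(STDERR_FILENO, other_msg, sizeof(other_msg) - 1);
	_exit(1);
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int install_segv_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}
```

A handler like this can only log and exit. Returning from it would
re-execute the faulting instruction and raise the fault again.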

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---

v5:
 - Move to separate file to avoid ifdeffery (Boris)
 - Improvements to commit log (Boris)
 - Rename control_protection_err (Boris)
 - Move comment from end of line in IBT fault handler (Boris)

v3:
 - Shorten user/kernel #CP handler function names (peterz)
 - Restore CP_ENDBR check to kernel handler (peterz)
 - Utilize CONFIG_X86_CET (Kees)
 - Unify "unexpected" warnings (Andrew Cooper)
 - Use 2d array for error code chars (Andrew Cooper)
 - Add comment about why to read SSP MSR before enabling interrupts

v2:
 - Integrate with kernel IBT fault handler
 - Update printed messages. (Dave)
 - Remove array_index_nospec() usage. (Dave)
 - Remove IBT messages. (Dave)
 - Add enclave error code bit processing in case it can get triggered
   somehow.
 - Add extra "unknown" in control_protection_err.

v1:
 - Update static asserts for NSIGSEGV

 arch/arm/kernel/signal.c                 |   2 +-
 arch/arm64/kernel/signal.c               |   2 +-
 arch/arm64/kernel/signal32.c             |   2 +-
 arch/sparc/kernel/signal32.c             |   2 +-
 arch/sparc/kernel/signal_64.c            |   2 +-
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/idtentry.h          |   2 +-
 arch/x86/include/asm/traps.h             |  12 ++
 arch/x86/kernel/Makefile                 |   2 +
 arch/x86/kernel/cet.c                    | 152 +++++++++++++++++++++++
 arch/x86/kernel/idt.c                    |   2 +-
 arch/x86/kernel/signal_32.c              |   2 +-
 arch/x86/kernel/signal_64.c              |   2 +-
 arch/x86/kernel/traps.c                  |  87 -------------
 arch/x86/xen/enlighten_pv.c              |   2 +-
 arch/x86/xen/xen-asm.S                   |   2 +-
 include/uapi/asm-generic/siginfo.h       |   3 +-
 17 files changed, 186 insertions(+), 100 deletions(-)
 create mode 100644 arch/x86/kernel/cet.c

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index be279fd48248..4bced22213d5 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1176,7 +1176,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 505f78ddca82..652e366b68a0 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
 #define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
 #endif
 
+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT	0
+#else
+#define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -128,7 +134,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK18	0
+#define DISABLED_MASK18	(DISABLE_IBT)
 #define DISABLED_MASK19	0
 #define DISABLED_MASK20	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..69e26f48d027 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..75e0dabf0c45 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,16 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 				      struct stack_info *info);
 #endif
 
+static inline void cond_local_irq_enable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_enable();
+}
+
+static inline void cond_local_irq_disable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_disable();
+}
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index dd61752f4c96..92446f1dedd7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -144,6 +144,8 @@ obj-$(CONFIG_CFI_CLANG)			+= cfi.o
 
 obj-$(CONFIG_CALL_THUNKS)		+= callthunks.o
 
+obj-$(CONFIG_X86_CET)			+= cet.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..33d7d119be26
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/ptrace.h>
+#include <asm/bugs.h>
+#include <asm/traps.h>
+
+enum cp_error_code {
+	CP_EC        = (1 << 15) - 1,
+
+	CP_RET       = 1,
+	CP_IRET      = 2,
+	CP_ENDBR     = 3,
+	CP_RSTRORSSP = 4,
+	CP_SETSSBSY  = 5,
+
+	CP_ENCL	     = 1 << 15,
+};
+
+static const char cp_err[][10] = {
+	[0] = "unknown",
+	[1] = "near ret",
+	[2] = "far/iret",
+	[3] = "endbranch",
+	[4] = "rstorssp",
+	[5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+	unsigned int cpec = error_code & CP_EC;
+
+	if (cpec >= ARRAY_SIZE(cp_err))
+		cpec = 0;
+	return cp_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+	WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+		     user_mode(regs) ? "user mode" : "kernel mode",
+		     cp_err_string(error_code));
+}
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/*
+	 * An exception was just taken from userspace. Since interrupts are disabled
+	 * here, no scheduling should have messed with the registers yet and they
+	 * will be whatever is live in userspace. So read the SSP before enabling
+	 * interrupts so locking the fpregs to do it later is not required.
+	 */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 cp_err_string(error_code),
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+
+static __ro_after_init bool ibt_fatal = true;
+
+/* code label defined in asm below */
+extern void ibt_selftest_ip(void);
+
+static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	if ((error_code & CP_EC) != CP_ENDBR) {
+		do_unexpected_cp(regs, error_code);
+		return;
+	}
+
+	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
+		regs->ax = 0;
+		return;
+	}
+
+	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
+	if (!ibt_fatal) {
+		printk(KERN_DEFAULT CUT_HERE);
+		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+		return;
+	}
+	BUG();
+}
+
+/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
+noinline bool ibt_selftest(void)
+{
+	unsigned long ret;
+
+	asm ("	lea ibt_selftest_ip(%%rip), %%rax\n\t"
+	     ANNOTATE_RETPOLINE_SAFE
+	     "	jmp *%%rax\n\t"
+	     "ibt_selftest_ip:\n\t"
+	     UNWIND_HINT_FUNC
+	     ANNOTATE_NOENDBR
+	     "	nop\n\t"
+
+	     : "=a" (ret) : : "memory");
+
+	return !ret;
+}
+
+static int __init ibt_setup(char *str)
+{
+	if (!strcmp(str, "off"))
+		setup_clear_cpu_cap(X86_FEATURE_IBT);
+
+	if (!strcmp(str, "warn"))
+		ibt_fatal = false;
+
+	return 1;
+}
+
+__setup("ibt=", ibt_setup);
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (user_mode(regs)) {
+		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+			do_user_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	} else {
+		if (cpu_feature_enabled(X86_FEATURE_IBT))
+			do_kernel_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	}
+}
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 9027fc088f97..c12624bc82a3 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -402,7 +402,7 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 13a1e6083837..0e808c72bf7e 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -403,7 +403,7 @@ void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d317dc3d06a3..18fb9d620824 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -77,18 +77,6 @@
 
 DECLARE_BITMAP(system_vectors, NR_VECTORS);
 
-static inline void cond_local_irq_enable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_enable();
-}
-
-static inline void cond_local_irq_disable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_disable();
-}
-
 __always_inline int is_valid_bugaddr(unsigned long addr)
 {
 	if (addr < TASK_SIZE_MAX)
@@ -213,81 +201,6 @@ DEFINE_IDTENTRY(exc_overflow)
 	do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
 }
 
-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
-enum cp_error_code {
-	CP_EC        = (1 << 15) - 1,
-
-	CP_RET       = 1,
-	CP_IRET      = 2,
-	CP_ENDBR     = 3,
-	CP_RSTRORSSP = 4,
-	CP_SETSSBSY  = 5,
-
-	CP_ENCL	     = 1 << 15,
-};
-
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
-{
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
-	}
-
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
-		return;
-
-	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
-		regs->ax = 0;
-		return;
-	}
-
-	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
-	if (!ibt_fatal) {
-		printk(KERN_DEFAULT CUT_HERE);
-		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
-		return;
-	}
-	BUG();
-}
-
-/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
-noinline bool ibt_selftest(void)
-{
-	unsigned long ret;
-
-	asm ("	lea ibt_selftest_ip(%%rip), %%rax\n\t"
-	     ANNOTATE_RETPOLINE_SAFE
-	     "	jmp *%%rax\n\t"
-	     "ibt_selftest_ip:\n\t"
-	     UNWIND_HINT_FUNC
-	     ANNOTATE_NOENDBR
-	     "	nop\n\t"
-
-	     : "=a" (ret) : : "memory");
-
-	return !ret;
-}
-
-static int __init ibt_setup(char *str)
-{
-	if (!strcmp(str, "off"))
-		setup_clear_cpu_cap(X86_FEATURE_IBT);
-
-	if (!strcmp(str, "warn"))
-		ibt_fatal = false;
-
-	return 1;
-}
-
-__setup("ibt=", ibt_setup);
-
-#endif /* CONFIG_X86_KERNEL_IBT */
-
 #ifdef CONFIG_X86_F00F_BUG
 void handle_invalid_op(struct pt_regs *regs)
 #else
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index bb59cc6ddb2d..9c29cd5393cc 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -640,7 +640,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 4a184f6e4e4d..7cdcb4ce6976 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1




* [PATCH v5 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (6 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 07/39] x86: Add user control-protection fault handler Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:52   ` Kees Cook
  2023-01-19 21:22 ` [PATCH v5 09/39] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
                   ` (33 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu, Christoph Hellwig

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages.

In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1
if HW dirty tracking is not needed, because if the Dirty bit is not already
set the CPU has to set Dirty=1 when the memory gets written to. This
creates additional work for the CPU. So traditional wisdom was to simply
set the Dirty bit whenever you didn't care about it. However, it was never
really very helpful for read-only kernel memory.

When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
any functional benefit either, so to reduce ambiguity between shadow stack
and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.
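
As a rough illustration of the ambiguity described above, the following
userspace sketch models a PTE as a plain 64-bit value. The bit positions
(Write is bit 1, Dirty is bit 6) match the x86 layout, but the helper
names pte_looks_like_shstk(), make_ro_old() and make_ro_new() are
hypothetical and only stand in for the old and new set_memory_ro()
behavior; this is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* x86 PTE layout: Write is bit 1, Dirty is bit 6. */
#define PTE_RW    (UINT64_C(1) << 1)
#define PTE_DIRTY (UINT64_C(1) << 6)

/* Hypothetical helper: would shadow-stack-capable HW see a shadow stack PTE? */
static inline int pte_looks_like_shstk(uint64_t pte)
{
	return (pte & (PTE_RW | PTE_DIRTY)) == PTE_DIRTY;
}

/* Hypothetical model of the old set_memory_ro(): clears Write only. */
static inline uint64_t make_ro_old(uint64_t pte)
{
	return pte & ~PTE_RW;
}

/* Hypothetical model of the new set_memory_ro(): clears Write and Dirty. */
static inline uint64_t make_ro_new(uint64_t pte)
{
	return pte & ~(PTE_RW | PTE_DIRTY);
}

/* Self-check: a Write=1,Dirty=1 mapping made read-only both ways. */
static inline void model_selfcheck(void)
{
	uint64_t pte = PTE_RW | PTE_DIRTY;

	assert(pte_looks_like_shstk(make_ro_old(pte)));   /* ambiguous */
	assert(!pte_looks_like_shstk(make_ro_new(pte)));  /* plain RO */
}
```

In this model, clearing only Write leaves a PTE that is indistinguishable
from shadow stack memory, while clearing both bits does not.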

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
---

v5:
 - Spelling and grammar in commit log (Boris)

v3:
 - Update commit log (Andrew Cooper, Peterz)

v2:
 - Normalize PTE bit descriptions between patches

 arch/x86/include/asm/pgtable_types.h | 6 +++---
 arch/x86/mm/pat/set_memory.c         | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 447d4bee25c4..0646ad00178b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -192,10 +192,10 @@ enum page_cache_mode {
 #define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
-#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
+#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
+#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
-#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_LARGE	 (__PP|__RW|   0|___A|__NX|___D|_PSE|___G)
 #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
 #define __PAGE_KERNEL_WP	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 356758b7d4b4..d41706ad29db 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2073,7 +2073,7 @@ int set_memory_nx(unsigned long addr, int numpages)
 
 int set_memory_ro(unsigned long addr, int numpages)
 {
-	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
 }
 
 int set_memory_rox(unsigned long addr, int numpages)
-- 
2.17.1




* [PATCH v5 09/39] x86/mm: Move pmd_write(), pud_write() up in the file
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (7 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
                   ` (32 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

To prepare for the introduction of _PAGE_COW, move pmd_write() and
pud_write() up in the file, so that they can be used by other
helpers below.  No functional changes.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0564edd24ffb..b39f16c0d507 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -160,6 +160,18 @@ static inline int pte_write(pte_t pte)
 	return pte_flags(pte) & _PAGE_RW;
 }
 
+#define pmd_write pmd_write
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define pud_write pud_write
+static inline int pud_write(pud_t pud)
+{
+	return pud_flags(pud) & _PAGE_RW;
+}
+
 static inline int pte_huge(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_PSE;
@@ -1120,12 +1132,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
 
-#define pmd_write pmd_write
-static inline int pmd_write(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_RW;
-}
-
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long addr,
 				       pmd_t *pmdp)
@@ -1155,12 +1161,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-#define pud_write pud_write
-static inline int pud_write(pud_t pud)
-{
-	return pud_flags(pud) & _PAGE_RW;
-}
-
 #ifndef pmdp_establish
 #define pmdp_establish pmdp_establish
 static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
-- 
2.17.1




* [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (8 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 09/39] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:55   ` Kees Cook
                     ` (2 more replies)
  2023-01-19 21:22 ` [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW Rick Edgecombe
                   ` (31 subsequent siblings)
  41 siblings, 3 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

Some OSes have a greater dependence on software available bits in PTEs than
Linux. That left the hardware architects looking for a way to represent a
new memory type (shadow stack) within the existing bits. They chose to
repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
shadow stack memory, Linux should avoid creating memory with this PTE bit
combination unless it intends for it to be shadow stack.

The reason it's lightly used is that Dirty=1 is normally set by HW
_before_ a write. A write with a Write=0 PTE would typically only generate
a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
supports shadow stacks will no longer exhibit this oddity.

So that leaves Write=0,Dirty=1 PTEs created in software. To avoid
inadvertently creating shadow stack memory, in places where Linux
normally creates Write=0,Dirty=1, it can use the software-defined
_PAGE_COW in place of the hardware _PAGE_DIRTY. In other words, whenever
Linux needs to create Write=0,Dirty=1, it instead creates Write=0,Cow=1,
except for shadow stack, which is Write=0,Dirty=1.
Further differentiated by VMA flags, these PTE bit combinations would be
set as follows for various types of memory:

(Write=0,Cow=1,Dirty=0):
 - A modified, copy-on-write (COW) page. Previously when a typical
   anonymous writable mapping was made COW via fork(), the kernel would
   mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
   happens in copy_present_pte().
 - A R/O page that has been COW'ed. The user page is in a R/O VMA,
   and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
   handler creates a copy of the page and sets the new copy's PTE as
   Write=0 and Cow=1.
 - A shared shadow stack PTE. When a shadow stack page is being shared
   among processes (this happens at fork()), its PTE is made Dirty=0, so
   the next shadow stack access causes a fault, and the page is
   duplicated and Dirty=1 is set again. This is the COW equivalent for
   shadow stack pages, even though it's copy-on-access rather than
   copy-on-write.

(Write=0,Cow=0,Dirty=1):
 - A shadow stack PTE.
 - A Cow PTE created when a processor without shadow stack support set
   Dirty=1.
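
The transitions between these states can be sketched in userspace as
follows. This is a simplified model, not the kernel implementation: PTEs
are plain 64-bit values, the shstk_enabled flag is a stand-in for
cpu_feature_enabled(X86_FEATURE_USER_SHSTK), and the model_* names are
hypothetical. The Cow bit is placed at bit 58, matching
_PAGE_BIT_SOFTW5 in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PTE_RW    ((pteval_t)1 << 1)   /* hardware Write bit */
#define PTE_DIRTY ((pteval_t)1 << 6)   /* hardware Dirty bit */
#define PTE_COW   ((pteval_t)1 << 58)  /* software Cow bit (assumed SOFTW5) */

/* Stand-in for cpu_feature_enabled(X86_FEATURE_USER_SHSTK). */
static bool shstk_enabled = true;

/* Dirty=1 -> Cow=1: keep the "modified" information off the HW Dirty bit. */
static inline pteval_t model_pte_mkcow(pteval_t pte)
{
	if (!shstk_enabled)
		return pte;
	pte &= ~PTE_DIRTY;
	return pte | PTE_COW;
}

/* Cow=1 -> Dirty=1: the reverse transition. */
static inline pteval_t model_pte_clear_cow(pteval_t pte)
{
	if (!shstk_enabled)
		return pte;
	pte |= PTE_DIRTY;
	return pte & ~PTE_COW;
}

/* What the hardware treats as shadow stack: Write=0,Dirty=1. */
static inline bool model_hw_shstk(pteval_t pte)
{
	return (pte & (PTE_RW | PTE_DIRTY)) == PTE_DIRTY;
}

static inline void model_selfcheck(void)
{
	pteval_t pte = model_pte_mkcow(PTE_DIRTY);

	assert(pte == PTE_COW);
	assert(!model_hw_shstk(pte));
	assert(model_pte_clear_cow(pte) == PTE_DIRTY);
}
```

In this model, a Write=0,Cow=1 PTE is never seen by the hardware as shadow
stack, while the reverse transition restores Write=0,Dirty=1.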

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
because shadow stacks are not enabled there.

Implement only the infrastructure for _PAGE_COW. Changes to start
creating _PAGE_COW PTEs will follow once other pieces are in place.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Fix log, comments and whitespace (Boris)
 - Remove capitalization on shadow stack (Boris)

v4:
 - Teach pte_flags_need_flush() about _PAGE_COW bit
 - Break apart patch for better bisectability

v3:
 - Add comment around _PAGE_TABLE in response to comment
   from (Andrew Cooper)
 - Check for PSE in pmd_shstk (Andrew Cooper)
 - Get to the point quicker in commit log (Andrew Cooper)
 - Clarify and reorder commit log for why the PTE bit examples have
   multiple entries. Apply same changes for comment. (peterz)
 - Fix comment that implied dirty bit for COW was a specific x86 thing
   (peterz)
 - Fix swapping of Write/Dirty (PeterZ)

v2:
 - Update commit log with comments (Dave Hansen)
 - Add comments in code to explain pte modification code better (Dave)
 - Clarify info on the meaning of various Write,Cow,Dirty combinations

 arch/x86/include/asm/pgtable.h       | 78 ++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_types.h | 59 +++++++++++++++++++--
 arch/x86/include/asm/tlbflush.h      |  3 +-
 3 files changed, 134 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b39f16c0d507..6d2f612c04b5 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -301,6 +301,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+/*
+ * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in the case
+ * of X86_FEATURE_USER_SHSTK, the software COW bit is used, since the
+ * Dirty=1,Write=0 will result in the memory being treated as shadow stack
+ * by the HW. So when creating COW memory, the software bit
+ * _PAGE_BIT_COW is used. The following functions pte_mkcow() and
+ * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
+ * transition it to the shadow stack compatible version of COW (Cow=1).
+ */
+static inline pte_t pte_mkcow(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pte;
+
+	pte = pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_set_flags(pte, _PAGE_COW);
+}
+
+static inline pte_t pte_clear_cow(pte_t pte)
+{
+	/*
+	 * _PAGE_COW is unnecessary on !X86_FEATURE_USER_SHSTK kernels, since
+	 * the HW dirty bit can be used without creating shadow stack memory.
+	 * See the _PAGE_COW definition for more details.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pte;
+
+	/*
+	 * PTE is getting copied-on-write, so it will be dirtied
+	 * if writable, or made shadow stack if shadow stack and
+	 * being copied on access. Set the dirty bit for both
+	 * cases.
+	 */
+	pte = pte_set_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
@@ -413,6 +451,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_mkcow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pmd;
+
+	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_clear_cow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pmd;
+
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pmd_uffd_wp(pmd_t pmd)
 {
@@ -484,6 +542,26 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
 	return native_make_pud(v & ~clear);
 }
 
+/* See comments above pte_mkcow() */
+static inline pud_t pud_mkcow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pud;
+
+	pud = pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_set_flags(pud, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pud_t pud_clear_cow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pud;
+
+	pud = pud_set_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_COW);
+}
+
 static inline pud_t pud_mkold(pud_t pud)
 {
 	return pud_clear_flags(pud, _PAGE_ACCESSED);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0646ad00178b..5c3f942865d9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * Indicates a copy-on-write page.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_BIT_COW		_PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW		0
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -117,6 +127,40 @@
 #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
+ * from shadow stack PTEs:
+ *
+ * (Write=0,Cow=1,Dirty=0):
+ *  - A modified, copy-on-write (COW) page. Previously when a typical
+ *    anonymous writable mapping was made COW via fork(), the kernel would
+ *    mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
+ *    happens in copy_present_pte().
+ *  - A R/O page that has been COW'ed. The user page is in a R/O VMA,
+ *    and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
+ *    handler creates a copy of the page and sets the new copy's PTE as
+ *    Write=0 and Cow=1.
+ *  - A shared shadow stack PTE. When a shadow stack page is being shared
+ *    among processes (this happens at fork()), its PTE is made Dirty=0, so
+ *    the next shadow stack access causes a fault, and the page is
+ *    duplicated and Dirty=1 is set again. This is the COW equivalent for
+ *    shadow stack pages, even though it's copy-on-access rather than
+ *    copy-on-write.
+ *
+ * (Write=0,Cow=0,Dirty=1):
+ *  - A shadow stack PTE.
+ *  - A Cow PTE created when a processor without shadow stack support set
+ *    Dirty=1.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_COW	(_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*
@@ -186,12 +230,17 @@ enum page_cache_mode {
 #define PAGE_READONLY	     __pg(__PP|   0|_USR|___A|__NX|   0|   0|   0)
 #define PAGE_READONLY_EXEC   __pg(__PP|   0|_USR|___A|   0|   0|   0|   0)
 
-#define __PAGE_KERNEL		 (__PP|__RW|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_EXEC	 (__PP|__RW|   0|___A|   0|___D|   0|___G)
-#define _KERNPG_TABLE_NOENC	 (__PP|__RW|   0|___A|   0|___D|   0|   0)
-#define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
+/*
+ * Page tables need to have Write=1 in order for any lower PTEs to be
+ * writable. This includes shadow stack memory (Write=0, Dirty=1).
+ */
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
+#define _KERNPG_TABLE_NOENC	 (__PP|__RW|   0|___A|   0|___D|   0|   0)
+#define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
+
+#define __PAGE_KERNEL		 (__PP|__RW|   0|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_EXEC	 (__PP|__RW|   0|___A|   0|___D|   0|___G)
 #define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..9429da70d689 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -273,7 +273,8 @@ static inline bool pte_flags_need_flush(unsigned long oldflags,
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
 					_PAGE_ACCESSED;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
-					_PAGE_SOFTW3 | _PAGE_SOFTW4;
+					_PAGE_SOFTW3 | _PAGE_SOFTW4 |
+					_PAGE_COW;
 	const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
 			  _PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
 			  _PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
-- 
2.17.1




* [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (9 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:57   ` Kees Cook
  2023-02-09 14:08   ` Borislav Petkov
  2023-01-19 21:22 ` [PATCH v5 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Rick Edgecombe
                   ` (30 subsequent siblings)
  41 siblings, 2 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The Write=0,Dirty=1 PTE has been used to indicate copy-on-write pages.
However, newer x86 processors also regard a Write=0,Dirty=1 PTE as a
shadow stack page. In order to separate the two, the hardware
_PAGE_DIRTY is changed to the software-defined _PAGE_COW for the
copy-on-write case, and pte_*() are updated to do this.

pte_modify() takes a "raw" pgprot_t which was not necessarily created
with any of the existing PTE bit helpers. That means that it can return a
pte_t with Write=0,Dirty=1, a shadow stack PTE, when it did not intend to
create one.

pte_modify() changes a PTE to 'newprot', but it doesn't use the pte_*()
helpers. Modify it to also move _PAGE_DIRTY to _PAGE_COW. Do this by
using the pte_mkdirty() helper. Since pte_mkdirty() also sets the soft
dirty bit, extract a helper that optionally doesn't set
_PAGE_SOFT_DIRTY. This helper will allow future logic for deciding when to
move _PAGE_DIRTY to _PAGE_COW to live in one place.

Apply the same changes to pmd_modify().
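
The shape of the change can be sketched in userspace as below. This is a
simplified model under stated assumptions: PTEs are plain 64-bit values,
CHG_MASK_MODEL is a reduced, hypothetical stand-in for _PAGE_CHG_MASK,
flip_protnone_guard() is omitted, and the model_* names are not kernel
functions. Soft dirty is placed at bit 11, matching _PAGE_BIT_SOFTW3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PTE_PRESENT    ((pteval_t)1 << 0)
#define PTE_RW         ((pteval_t)1 << 1)
#define PTE_ACCESSED   ((pteval_t)1 << 5)
#define PTE_DIRTY      ((pteval_t)1 << 6)
#define PTE_SOFT_DIRTY ((pteval_t)1 << 11)

/* Reduced stand-in for _PAGE_CHG_MASK: bits preserved across a modify. */
#define PFN_MASK_MODEL UINT64_C(0x000ffffffffff000)
#define CHG_MASK_MODEL (PFN_MASK_MODEL | PTE_ACCESSED | PTE_DIRTY | \
			PTE_SOFT_DIRTY)

/* Model of __pte_mkdirty(): soft dirty is only set when asked for. */
static inline pteval_t model_mkdirty(pteval_t pte, bool soft)
{
	pteval_t dirty = PTE_DIRTY;

	if (soft)
		dirty |= PTE_SOFT_DIRTY;
	return pte | dirty;
}

/* Model of pte_modify(): Dirty is dropped by the mask, then re-applied. */
static inline pteval_t model_pte_modify(pteval_t pte, pteval_t newprot)
{
	const pteval_t mask_no_dirty = CHG_MASK_MODEL & ~PTE_DIRTY;
	pteval_t val = pte & mask_no_dirty;

	val |= newprot & ~mask_no_dirty;

	/* Re-apply Dirty through the helper, without touching soft dirty. */
	if (pte & PTE_DIRTY)
		val = model_mkdirty(val, false);
	return val;
}

static inline void model_selfcheck(void)
{
	pteval_t old = UINT64_C(0x1000) | PTE_PRESENT | PTE_RW | PTE_DIRTY;
	pteval_t new = model_pte_modify(old, PTE_PRESENT);

	assert(new & PTE_DIRTY);
	assert(!(new & PTE_SOFT_DIRTY));
}
```

Routing the dirty state through one helper means a later decision to set
Cow instead of Dirty would only need to change model_mkdirty() in this
sketch, which is the point made above about keeping that logic in one
place.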

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Fix pte_modify() again, to not lose _PAGE_DIRTY, but still not set
   _PAGE_SOFT_DIRTY as was fixed in v4.

v4:
 - Fix an issue in soft-dirty test, where pte_modify() would detect
   _PAGE_COW in pte_dirty() and set the soft dirty bit in pte_mkdirty().

v2:
 - Update commit log with text and suggestions from (Dave Hansen)
 - Drop fixup_dirty_pte() in favor of clearing the HW dirty bit along
   with the _PAGE_CHG_MASK masking, then calling pte_mkdirty() (Dave
   Hansen)

 arch/x86/include/asm/pgtable.h | 64 +++++++++++++++++++++++++++++-----
 1 file changed, 56 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6d2f612c04b5..7942eff2af50 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -392,9 +392,19 @@ static inline pte_t pte_mkexec(pte_t pte)
 	return pte_clear_flags(pte, _PAGE_NX);
 }
 
+static inline pte_t __pte_mkdirty(pte_t pte, bool soft)
+{
+	pteval_t dirty = _PAGE_DIRTY;
+
+	if (soft)
+		dirty |= _PAGE_SOFT_DIRTY;
+
+	return pte_set_flags(pte, dirty);
+}
+
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	return __pte_mkdirty(pte, true);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -503,9 +513,19 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 	return pmd_clear_flags(pmd, _PAGE_RW);
 }
 
+static inline pmd_t __pmd_mkdirty(pmd_t pmd, bool soft)
+{
+	pmdval_t dirty = _PAGE_DIRTY;
+
+	if (soft)
+		dirty |= _PAGE_SOFT_DIRTY;
+
+	return pmd_set_flags(pmd, dirty);
+}
+
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	return __pmd_mkdirty(pmd, true);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -715,26 +735,54 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
+	pteval_t _page_chg_mask_no_dirty = _PAGE_CHG_MASK & ~_PAGE_DIRTY;
 	pteval_t val = pte_val(pte), oldval = val;
+	pte_t pte_result;
 
 	/*
 	 * Chop off the NX bit (if present), and add the NX portion of
 	 * the newprot (if present):
 	 */
-	val &= _PAGE_CHG_MASK;
-	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+	val &= _page_chg_mask_no_dirty;
+	val |= check_pgprot(newprot) & ~_page_chg_mask_no_dirty;
 	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
-	return __pte(val);
+
+	pte_result = __pte(val);
+
+	/*
+	 * Dirty bit is not preserved above so it can be done
+	 * in a special way for the shadow stack case, where it
+	 * may need to set _PAGE_COW. __pte_mkdirty() will do this in
+	 * the case of shadow stack.
+	 */
+	if (pte_dirty(pte))
+		pte_result = __pte_mkdirty(pte_result, false);
+
+	return pte_result;
 }
 
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
+	pteval_t _hpage_chg_mask_no_dirty = _HPAGE_CHG_MASK & ~_PAGE_DIRTY;
 	pmdval_t val = pmd_val(pmd), oldval = val;
+	pmd_t pmd_result;
 
-	val &= _HPAGE_CHG_MASK;
-	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+	val &= _hpage_chg_mask_no_dirty;
+	val |= check_pgprot(newprot) & ~_hpage_chg_mask_no_dirty;
 	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
-	return __pmd(val);
+
+	pmd_result = __pmd(val);
+
+	/*
+	 * Dirty bit is not preserved above so it can be done
+	 * in a special way for the shadow stack case, where it
+	 * may need to set _PAGE_COW. __pmd_mkdirty() will do this in
+	 * the case of shadow stack.
+	 */
+	if (pmd_dirty(pmd))
+		pmd_result = __pmd_mkdirty(pmd_result, false);
+
+	return pmd_result;
 }
 
 /*
-- 
2.17.1




* [PATCH v5 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (10 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:58   ` Kees Cook
  2023-01-19 21:22 ` [PATCH v5 13/39] x86/mm: Start actually marking _PAGE_COW Rick Edgecombe
                   ` (29 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When shadow stack is in use, Write=0,Dirty=1 PTEs are reserved for
shadow stack. Copy-on-write PTEs then have Write=0,Cow=1.

When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
become a transient shadow stack PTE in two cases:

1. Some processors can start a write but end up seeing a Write=0 PTE by
   the time they get to the Dirty bit, creating a transient shadow stack
   PTE. However, this will not occur on processors supporting shadow
   stack, and a TLB flush is not necessary.

2. When _PAGE_DIRTY is replaced with _PAGE_COW non-atomically, a transient
   shadow stack PTE can be created as a result. Thus, prevent that with
   cmpxchg.
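
For readers unfamiliar with the pattern, case 2 can be sketched in user space
with C11 atomics. This is a hedged illustration, not the kernel code:
atomic_compare_exchange_weak() plays the role of try_cmpxchg(), and the flag
bit positions (the Cow bit in particular) and the function names are
assumptions made for the example. The point is that Write is cleared and
Dirty is moved to Cow in one atomic update, so no intermediate
Write=0,Dirty=1 value is ever stored.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative flag layout; the Cow bit position is an assumption. */
#define PG_RW    (UINT64_C(1) << 1)
#define PG_DIRTY (UINT64_C(1) << 6)
#define PG_COW   (UINT64_C(1) << 58)

/* Pure transformation: clear Write and move hardware Dirty to Cow. */
static inline uint64_t wrprotect_val(uint64_t pte)
{
	pte &= ~PG_RW;
	if (pte & PG_DIRTY)
		pte = (pte & ~PG_DIRTY) | PG_COW;
	return pte;
}

/*
 * Retry until the entry is swapped without anyone else changing it in
 * between. On failure, 'old' is refreshed with the current value, so a
 * Dirty bit set concurrently is picked up by the next iteration.
 */
static inline void set_wrprotect_atomic(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t next;

	do {
		next = wrprotect_val(old);
	} while (!atomic_compare_exchange_weak(ptep, &old, next));
}

/* Small driver: write-protect one entry and return the stored result. */
static inline uint64_t wrprotect_demo(uint64_t initial)
{
	_Atomic uint64_t pte = initial;

	set_wrprotect_atomic(&pte);
	return atomic_load(&pte);
}
```

A plain clear_bit() on the Write bit followed by a separate Dirty to Cow move
would leave a window between the two stores, which is what the loop avoids.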

In the case of pmdp_set_wrprotect(), for nopmd configs the ->pmd operated
on does not exist and the logic would need to be different. Although the
extra functionality will normally be optimized out when user shadow
stacks are not configured, also exclude it in the preprocessor stage so
that it will still compile. User shadow stack is not supported there by
Linux anyway. Leave the cpu_feature_enabled() check so that the
functionality also gets disabled based on runtime detection of the
feature.

Similarly, compile it out in ptep_set_wrprotect() due to a clang warning
on i386. Like above, the code path should get optimized out on i386
since shadow stack is not supported on 32 bit kernels, but this makes
the compiler happy.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue. Jann Horn provided the cmpxchg solution.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Commit log verbiage and formatting (Boris)
 - Remove capitalization on shadow stack (Boris)
 - Fix i386 warning on recent clang

v3:
 - Remove unnecessary #ifdef (Dave Hansen)

v2:
 - Compile out some code due to clang build error
 - Clarify commit log (dhansen)
 - Normalize PTE bit descriptions between patches (dhansen)
 - Update comment with text from (dhansen)

Yu-cheng v30:
 - Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELS > 2 (Borislav Petkov).

 arch/x86/include/asm/pgtable.h | 37 ++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7942eff2af50..c5047eb5f406 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1232,6 +1232,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	/*
+	 * Avoid accidentally creating shadow stack PTEs
+	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
+	 * the hardware setting Dirty=1.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+		pte_t old_pte, new_pte;
+
+		old_pte = READ_ONCE(*ptep);
+		do {
+			new_pte = pte_wrprotect(old_pte);
+		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+		return;
+	}
+#endif
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
@@ -1284,6 +1301,26 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	/*
+	 * If shadow stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
+	 * to _PAGE_COW (see comments at pmd_wrprotect()).
+	 * When a thread reads a RW=1, Dirty=0 PMD and before changing it
+	 * to RW=0, Dirty=0, another thread could have written to the page
+	 * and the PMD is RW=1, Dirty=1 now.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+		pmd_t old_pmd, new_pmd;
+
+		old_pmd = READ_ONCE(*pmdp);
+		do {
+			new_pmd = pmd_wrprotect(old_pmd);
+		} while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
+
+		return;
+	}
+#endif
+
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-- 
2.17.1




* [PATCH v5 13/39] x86/mm: Start actually marking _PAGE_COW
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (11 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 14/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
                   ` (28 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

The recently introduced _PAGE_COW should be used instead of the HW Dirty
bit whenever a PTE is Write=0, in order to not inadvertently create
shadow stack PTEs. Update pte_mk*() helpers to do this, and apply the same
changes to pmd and pud.
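
A rough user-space model of the helper behaviour may help when reading the
diff. It is a sketch under stated assumptions: shadow stack is treated as
always enabled, the Cow bit position is illustrative, and mkcow()/clear_cow()
are modelled from the comments in this patch (move Dirty to Cow, and back)
rather than copied from the earlier patch that defines them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag layout; the Cow bit position is an assumption. */
#define PG_RW    (UINT64_C(1) << 1)
#define PG_DIRTY (UINT64_C(1) << 6)
#define PG_COW   (UINT64_C(1) << 58)

/* Dirty if either the hardware Dirty bit or the software Cow bit is set. */
static inline bool is_dirty(uint64_t pte)
{
	return (pte & (PG_DIRTY | PG_COW)) != 0;
}

/* Shadow stack encoding: Write=0 together with hardware Dirty=1. */
static inline bool is_shstk(uint64_t pte)
{
	return (pte & (PG_RW | PG_DIRTY)) == PG_DIRTY;
}

static inline uint64_t mkcow(uint64_t pte)
{
	return (pte & ~PG_DIRTY) | PG_COW;
}

static inline uint64_t clear_cow(uint64_t pte)
{
	return (pte & ~PG_COW) | PG_DIRTY;
}

/* Clearing Write must not leave hardware Dirty behind. */
static inline uint64_t wrprotect(uint64_t pte)
{
	pte &= ~PG_RW;
	return is_dirty(pte) ? mkcow(pte) : pte;
}

/* Setting Write moves a saved Cow bit back to hardware Dirty. */
static inline uint64_t mkwrite(uint64_t pte)
{
	pte |= PG_RW;
	return is_dirty(pte) ? clear_cow(pte) : pte;
}

/* The only helper that deliberately produces Write=0,Dirty=1. */
static inline uint64_t mkwrite_shstk(uint64_t pte)
{
	return clear_cow(pte);
}
```

In this model a wrprotect()/mkwrite() round trip on a dirty entry returns the
original value, and only mkwrite_shstk() yields an entry for which is_shstk()
is true.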

Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v4:
 - Break apart patch for better bisectability

 arch/x86/include/asm/pgtable.h | 125 ++++++++++++++++++++++++++++-----
 1 file changed, 107 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index c5047eb5f406..e96558abc8ec 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -124,9 +124,17 @@ extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return false;
+
+	return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pte_young(pte_t pte)
@@ -134,9 +142,18 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return false;
+
+	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
+	       (_PAGE_DIRTY | _PAGE_PSE);
 }
 
 #define pmd_young pmd_young
@@ -145,9 +162,9 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -157,13 +174,21 @@ static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
 }
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
 }
 
 #define pud_write pud_write
@@ -374,7 +399,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -384,7 +409,16 @@ static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_RW);
+	pte = pte_clear_flags(pte, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pte_dirty(pte))
+		pte = pte_mkcow(pte);
+	return pte;
 }
 
 static inline pte_t pte_mkexec(pte_t pte)
@@ -396,6 +430,10 @@ static inline pte_t __pte_mkdirty(pte_t pte, bool soft)
 {
 	pteval_t dirty = _PAGE_DIRTY;
 
+	/* Avoid creating Dirty=1,Write=0 PTEs */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_COW;
+
 	if (soft)
 		dirty |= _PAGE_SOFT_DIRTY;
 
@@ -407,6 +445,12 @@ static inline pte_t pte_mkdirty(pte_t pte)
 	return __pte_mkdirty(pte, true);
 }
 
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	/* pte_clear_cow() also sets Dirty=1 */
+	return pte_clear_cow(pte);
+}
+
 static inline pte_t pte_mkyoung(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_ACCESSED);
@@ -414,7 +458,12 @@ static inline pte_t pte_mkyoung(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_RW);
+	pte = pte_set_flags(pte, _PAGE_RW);
+
+	if (pte_dirty(pte))
+		pte = pte_clear_cow(pte);
+
+	return pte;
 }
 
 static inline pte_t pte_mkhuge(pte_t pte)
@@ -505,18 +554,30 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_RW);
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pmd_dirty(pmd))
+		pmd = pmd_mkcow(pmd);
+	return pmd;
 }
 
 static inline pmd_t __pmd_mkdirty(pmd_t pmd, bool soft)
 {
 	pmdval_t dirty = _PAGE_DIRTY;
 
+	/* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pmd_write(pmd))
+		dirty = _PAGE_COW;
+
 	if (soft)
 		dirty |= _PAGE_SOFT_DIRTY;
 
@@ -528,6 +589,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return __pmd_mkdirty(pmd, true);
 }
 
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	return pmd_clear_cow(pmd);
+}
+
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_DEVMAP);
@@ -545,7 +611,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_RW);
+	pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+	if (pmd_dirty(pmd))
+		pmd = pmd_clear_cow(pmd);
+	return pmd;
 }
 
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
@@ -589,17 +659,32 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_RW);
+	pud = pud_clear_flags(pud, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pud_dirty(pud))
+		pud = pud_mkcow(pud);
+	return pud;
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pud_write(pud))
+		dirty = _PAGE_COW;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -619,7 +704,11 @@ static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_RW);
+	pud = pud_set_flags(pud, _PAGE_RW);
+
+	if (pud_dirty(pud))
+		pud = pud_clear_cow(pud);
+	return pud;
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
-- 
2.17.1




* [PATCH v5 14/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (12 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 13/39] x86/mm: Start actually marking _PAGE_COW Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 15/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
                   ` (27 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu, Peter Xu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

Future patches will introduce a new VM flag VM_SHADOW_STACK that will be
VM_HIGH_ARCH_BIT_5. VM_HIGH_ARCH_BIT_0 through VM_HIGH_ARCH_BIT_4 are
bits 32-36, and bit 37 is the unrelated VM_UFFD_MINOR_BIT. For the sake
of order, make all VM_HIGH_ARCH_BITs stay together by moving
VM_UFFD_MINOR_BIT from 37 to 38. This will allow VM_SHADOW_STACK to be
introduced as 37.
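
The resulting layout can be checked with a small stand-alone sketch. The bit
numbers below are taken from the description and the mm.h context in this
series; BIT64() and the two helper functions are hypothetical names used only
for the illustration.

```c
#include <stdint.h>

#define BIT64(n) (UINT64_C(1) << (n))

#define VM_HIGH_ARCH_BIT_0 32
#define VM_HIGH_ARCH_BIT_1 33
#define VM_HIGH_ARCH_BIT_2 34
#define VM_HIGH_ARCH_BIT_3 35
#define VM_HIGH_ARCH_BIT_4 36
#define VM_HIGH_ARCH_BIT_5 37	/* to be used by VM_SHADOW_STACK */
#define VM_UFFD_MINOR_BIT  38	/* moved up from 37 by this patch */

/* Mask covering the contiguous run of high arch bits, 32 through 37. */
static inline uint64_t vm_high_arch_mask(void)
{
	uint64_t mask = 0;

	for (int bit = VM_HIGH_ARCH_BIT_0; bit <= VM_HIGH_ARCH_BIT_5; bit++)
		mask |= BIT64(bit);
	return mask;
}

/* Non-zero if the userfaultfd minor bit collides with an arch bit. */
static inline int uffd_minor_overlaps_arch_bits(void)
{
	return (vm_high_arch_mask() & BIT64(VM_UFFD_MINOR_BIT)) != 0;
}
```

With VM_UFFD_MINOR_BIT left at 37 the overlap check would report a collision
with VM_HIGH_ARCH_BIT_5; at 38 it does not.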

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7afc86d50442..82a9a4903651 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -366,7 +366,7 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT	37
+# define VM_UFFD_MINOR_BIT	38
 # define VM_UFFD_MINOR		BIT(VM_UFFD_MINOR_BIT)	/* UFFD minor faults */
 #else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 # define VM_UFFD_MINOR		VM_NONE
-- 
2.17.1




* [PATCH v5 15/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (13 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 14/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 16/39] x86/mm: Check shadow stack page fault errors Rick Edgecombe
                   ` (26 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

A shadow stack PTE must be read-only and have _PAGE_DIRTY set. However,
read-only and Dirty PTEs also exist for copy-on-write (COW) pages. These
two cases are handled differently for page faults. Introduce
VM_SHADOW_STACK to track shadow stack VMAs.
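
Since the new flag is exported through the VmFlags line of
/proc/<pid>/smaps with the "ss" mnemonic, userspace can detect shadow stack
VMAs by scanning that line. The following is a minimal sketch of such a
check; vmflags_has() is a hypothetical helper written for this illustration
and operates on one line of text that the caller has already read.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'line' is a "VmFlags:" line from smaps that contains
 * the given mnemonic (for example "ss") as a whole token.
 */
static inline bool vmflags_has(const char *line, const char *code)
{
	static const char prefix[] = "VmFlags:";
	size_t code_len = strlen(code);
	const char *p;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;

	p = line + sizeof(prefix) - 1;
	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of next token */
		if (len > 0 && len == code_len && strncmp(p, code, len) == 0)
			return true;
		p += len;
	}
	return false;
}
```

A caller would typically read smaps line by line with fgets() and pass each
line to vmflags_has(line, "ss").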

Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v3:
 - Drop arch specific change in arch_vma_name(). The memory can show as
   anonymous (Kirill)
 - Change CONFIG_ARCH_HAS_SHADOW_STACK to CONFIG_X86_USER_SHADOW_STACK
   in show_smap_vma_flags() (Boris)

 Documentation/filesystems/proc.rst | 1 +
 fs/proc/task_mmu.c                 | 3 +++
 include/linux/mm.h                 | 8 ++++++++
 3 files changed, 12 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e224b6d5b642..115843e8cce3 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -564,6 +564,7 @@ encoded manner. The codes are the following:
     mt    arm64 MTE allocation tags are enabled
     um    userfaultfd missing tracking
     uw    userfaultfd wr-protect tracking
+    ss    shadow stack page
     ==    =======================================
 
 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..982126ffdbae 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+		[ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 82a9a4903651..824e730b21af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,11 +315,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -335,6 +337,12 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+# define VM_SHADOW_STACK	VM_HIGH_ARCH_5
+#else
+# define VM_SHADOW_STACK	VM_NONE
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
-- 
2.17.1




* [PATCH v5 16/39] x86/mm: Check shadow stack page fault errors
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (14 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 15/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  0:59   ` Kees Cook
  2023-01-19 21:22 ` [PATCH v5 17/39] x86/mm: Update maybe_mkwrite() for shadow stack Rick Edgecombe
                   ` (25 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The CPU performs "shadow stack accesses" when it expects to encounter
shadow stack mappings. These accesses can be implicit (via CALL/RET
instructions) or explicit (instructions like WRSS).

Shadow stack accesses to shadow-stack mappings can result in faults in
normal, valid operation just like regular accesses to regular mappings.
Shadow stacks need some of the same features, such as delayed allocation, swap,
and copy-on-write. The kernel needs to use faults to implement those
features.

The architecture has concepts of both shadow stack reads and shadow stack
writes. Any shadow stack access to non-shadow stack memory will generate
a fault with the shadow stack error code bit set.

This means that, unlike normal write protection, the fault handler needs
to create a type of memory that can be written to (with instructions that
generate shadow stack writes), even to fulfill a read access. So in the
case of COW memory, the COW needs to take place even with a shadow stack
read. Otherwise the page will be left (shadow stack) writable in
userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
for shadow stack accesses, even if the access was a shadow stack read.

To make this clearer, consider the following example.
If a process has a shadow stack, and forks, the shadow stack PTEs will
become read-only due to COW. If the CPU in one process performs a shadow
stack read access to the shadow stack, for example executing a RET and
causing the CPU to read the shadow stack copy of the return address, then
in order for the fault to be resolved the PTE will need to be set with
shadow stack permissions. But then the memory would be changeable from
userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
COW, otherwise the shared page would be changeable from both processes.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping. Also, generate the errors for invalid shadow stack accesses.
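
The rules above can be summarized in a small user-space model. It is a sketch
rather than the fault handler itself: the error code bits match the
trap_pf.h values used in this patch, while the VMA_* and FLT_* constants and
both function names are stand-ins invented for the illustration, and the
read-side and protection-key checks are omitted.

```c
#include <stdbool.h>

/* Page fault error code bits (bit 1: write, bit 4: fetch, bit 6: shstk). */
#define PF_WRITE (1u << 1)
#define PF_INSTR (1u << 4)
#define PF_SHSTK (1u << 6)

/* Stand-in VMA and fault flags; values are arbitrary for this sketch. */
#define VMA_WRITE        (1u << 0)
#define VMA_SHADOW_STACK (1u << 1)
#define FLT_WRITE        (1u << 0)
#define FLT_INSTRUCTION  (1u << 1)

/* Return true if the access must be rejected for this VMA. */
static inline bool access_error_model(unsigned int error_code,
				      unsigned int vm_flags)
{
	if (error_code & PF_SHSTK) {
		/* Shadow stack accesses are only valid on shadow stack VMAs. */
		if (!(vm_flags & VMA_SHADOW_STACK))
			return true;
		if (!(vm_flags & VMA_WRITE))
			return true;
		return false;
	}

	if (error_code & PF_WRITE) {
		/* Ordinary writes may not target shadow stack VMAs. */
		if (vm_flags & VMA_SHADOW_STACK)
			return true;
		if (!(vm_flags & VMA_WRITE))
			return true;
		return false;
	}

	return false;	/* read-side checks omitted in this sketch */
}

/* Shadow stack reads and writes are both handled as write faults. */
static inline unsigned int fault_flags_model(unsigned int error_code)
{
	unsigned int flags = 0;

	if (error_code & (PF_SHSTK | PF_WRITE))
		flags |= FLT_WRITE;
	if (error_code & PF_INSTR)
		flags |= FLT_INSTRUCTION;
	return flags;
}
```

In the fork example, a RET in the child produces a fault with only the shadow
stack bit set. The model accepts it for a shadow stack VMA and reports a
write fault, which is what lets the COW path run before the page is mapped
with shadow stack permissions.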

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Add description of COW example (Boris)
 - Replace "permissioned" (Boris)
 - Remove capitalization of shadow stack (Boris)

v4:
 - Further improve comment talking about FAULT_FLAG_WRITE (Peterz)

v3:
 - Improve comment talking about using FAULT_FLAG_WRITE (Peterz)

v2:
 - Update commit log with verbiage/feedback from Dave Hansen
 - Clarify reasoning for FAULT_FLAG_WRITE for all shadow stack accesses
 - Update comments with some verbiage from Dave Hansen

 arch/x86/include/asm/trap_pf.h |  2 ++
 arch/x86/mm/fault.c            | 38 ++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..afa524325e55 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -11,6 +11,7 @@
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
  *   bit 5 ==				1: protection keys block access
+ *   bit 6 ==				1: shadow stack access fault
  *   bit 15 ==				1: SGX MMU page-fault
  */
 enum x86_pf_error_code {
@@ -20,6 +21,7 @@ enum x86_pf_error_code {
 	X86_PF_RSVD	=		1 << 3,
 	X86_PF_INSTR	=		1 << 4,
 	X86_PF_PK	=		1 << 5,
+	X86_PF_SHSTK	=		1 << 6,
 	X86_PF_SGX	=		1 << 15,
 };
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..070b50c87415 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1138,8 +1138,22 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
+	/*
+	 * Shadow stack accesses (PF_SHSTK=1) are only permitted to
+	 * shadow stack VMAs. All other accesses result in an error.
+	 */
+	if (error_code & X86_PF_SHSTK) {
+		if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
+			return 1;
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return 1;
+		return 0;
+	}
+
 	if (error_code & X86_PF_WRITE) {
 		/* write, present and write, not present: */
+		if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
+			return 1;
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
 			return 1;
 		return 0;
@@ -1331,6 +1345,30 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
+	/*
+	 * When a page becomes COW it changes from a shadow stack permission
+	 * page (Write=0,Dirty=1) to (Write=0,Dirty=0,CoW=1), which is simply
+	 * read-only to the CPU. When shadow stack is enabled, a RET would
+	 * normally pop the shadow stack by reading it with a "shadow stack
+	 * read" access. However, in the COW case the shadow stack memory does
+	 * not have shadow stack permissions, it is read-only. So it will
+	 * generate a fault.
+	 *
+	 * For conventionally writable pages, a read can be serviced with a
+	 * read only PTE, and COW would not have to happen. But for shadow
+	 * stack, there isn't the concept of read-only shadow stack memory.
+	 * If it is shadow stack permission, it can be modified via CALL and
+	 * RET instructions. So COW needs to happen before any memory can be
+	 * mapped with shadow stack permissions.
+	 *
+	 * Shadow stack accesses (read or write) need to be serviced with
+	 * shadow stack permission memory, so in the case of a shadow stack
+	 * read access, treat it as a WRITE fault so both COW will happen and
+	 * the write fault path will tickle maybe_mkwrite() and map the memory
+	 * shadow stack.
+	 */
+	if (error_code & X86_PF_SHSTK)
+		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_INSTR)
-- 
2.17.1




* [PATCH v5 17/39] x86/mm: Update maybe_mkwrite() for shadow stack
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (15 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 16/39] x86/mm: Check shadow stack page fault errors Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk Rick Edgecombe
                   ` (24 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When serving a page fault, maybe_mkwrite() makes a PTE writable if there is
a write access to it, and its vma has VM_WRITE. Shadow stack accesses to
shadow stack vma's are also treated as write accesses by the fault handler.
This is because mapping memory as shadow stack makes it writable via some
instructions (CALL and RET, for example), so COW has to happen even for
shadow stack reads.

So maybe_mkwrite() should continue to set VM_WRITE vma's as normally
writable, but also set VM_WRITE|VM_SHADOW_STACK vma's as shadow stack.

Do this by adding a pte_mkwrite_shstk() and a cross-arch stub. Check for
VM_SHADOW_STACK in maybe_mkwrite() and call pte_mkwrite_shstk()
accordingly.

Apply the same changes to maybe_pmd_mkwrite().
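
As a rough userspace sketch of the decision described above (not the
kernel code itself): the flag values and the toy_* names below are
made up for illustration, and the PTE is reduced to two booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the kernel's real bit assignments differ. */
#define VM_WRITE        0x2UL
#define VM_SHADOW_STACK 0x1000UL

/* Toy PTE: only the two properties this sketch cares about. */
struct toy_pte {
	bool write;	/* conventionally writable */
	bool shstk;	/* mapped as shadow stack */
};

static inline struct toy_pte toy_pte_mkwrite(struct toy_pte pte)
{
	pte.write = true;
	return pte;
}

static inline struct toy_pte toy_pte_mkwrite_shstk(struct toy_pte pte)
{
	pte.shstk = true;
	return pte;
}

/* Same branch order as the updated maybe_mkwrite(). */
static inline struct toy_pte toy_maybe_mkwrite(struct toy_pte pte,
					       unsigned long vm_flags)
{
	if (!(vm_flags & VM_WRITE))
		return pte;

	if (vm_flags & VM_SHADOW_STACK)
		return toy_pte_mkwrite_shstk(pte);

	return toy_pte_mkwrite(pte);
}
```

In this model a VM_WRITE|VM_SHADOW_STACK vma never yields a
conventionally writable PTE, and a vma without VM_WRITE is left alone.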

Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v3:
 - Remove unneeded define for maybe_mkwrite (Peterz)
 - Switch to cleaner version of maybe_mkwrite() (Peterz)

v2:
 - Change to handle shadow stacks that are VM_WRITE|VM_SHADOW_STACK
 - Ditch arch specific maybe_mkwrite(), and make the code generic
 - Move do_anonymous_page() to next patch (Kirill)

Yu-cheng v29:
 - Remove likely()'s.

 arch/x86/include/asm/pgtable.h |  2 ++
 include/linux/mm.h             | 13 ++++++++++---
 include/linux/pgtable.h        | 14 ++++++++++++++
 mm/huge_memory.c               | 10 +++++++---
 4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e96558abc8ec..45b1a8f058fe 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -445,6 +445,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
 	return __pte_mkdirty(pte, true);
 }
 
+#define pte_mkwrite_shstk pte_mkwrite_shstk
 static inline pte_t pte_mkwrite_shstk(pte_t pte)
 {
 	/* pte_clear_cow() also sets Dirty=1 */
@@ -589,6 +590,7 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return __pmd_mkdirty(pmd, true);
 }
 
+#define pmd_mkwrite_shstk pmd_mkwrite_shstk
 static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
 {
 	return pmd_clear_cow(pmd);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 824e730b21af..e15d2fc04007 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1106,12 +1106,19 @@ void free_compound_page(struct page *page);
  * servicing faults for write access.  In the normal case, do always want
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
  * that do not have writing enabled, when used by access_process_vm.
+ *
+ * If a vma is shadow stack (a type of writable memory), mark the pte shadow
+ * stack.
  */
 static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
+	if (!(vma->vm_flags & VM_WRITE))
+		return pte;
+
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return pte_mkwrite_shstk(pte);
+
+	return pte_mkwrite(pte);
 }
 
 vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1159b25b0542..14a820a45a37 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -532,6 +532,20 @@ static inline pte_t pte_sw_mkyoung(pte_t pte)
 #define pte_sw_mkyoung	pte_sw_mkyoung
 #endif
 
+#ifndef pte_mkwrite_shstk
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	return pte;
+}
+#endif
+
+#ifndef pmd_mkwrite_shstk
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abe6cfd92ffa..fbb8beb9265e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -553,9 +553,13 @@ __setup("transparent_hugepage=", setup_transparent_hugepage);
 
 pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
-	if (likely(vma->vm_flags & VM_WRITE))
-		pmd = pmd_mkwrite(pmd);
-	return pmd;
+	if (!(vma->vm_flags & VM_WRITE))
+		return pmd;
+
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return pmd_mkwrite_shstk(pmd);
+
+	return pmd_mkwrite(pmd);
 }
 
 #ifdef CONFIG_MEMCG
-- 
2.17.1




* [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (16 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 17/39] x86/mm: Update maybe_mkwrite() for shadow stack Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-23  9:50   ` David Hildenbrand
  2023-01-19 21:22 ` [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
                   ` (23 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, David Hildenbrand, Yu-cheng Yu

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

Since shadow stack memory can be changed from userspace, it is both
VM_SHADOW_STACK and VM_WRITE. But it should not be made conventionally
writable (i.e. pte_mkwrite()). So some code that calls pte_mkwrite() needs
to be adjusted.

One such case is when memory is made writable without an actual write
fault. This happens in some mprotect operations, and also prot_numa faults.
In both cases code checks whether it should be made (conventionally)
writable by calling vma_wants_manual_pte_write_upgrade().

One way to fix this would be to have the code actually check if the memory
is also VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
since most memory won't be shadow stack, just have simpler logic and skip
this optimization by changing vma_wants_manual_pte_write_upgrade() to not
return true for VM_SHADOW_STACK memory. This will simply handle all
cases of this type.
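
A minimal userspace sketch of the resulting predicate, covering only
the private-mapping branch. The flag values and the toy_ name are
illustrative assumptions; the VM_SHARED branch, which goes through
vma_wants_writenotify() in the kernel, is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the kernel's real bit assignments differ. */
#define VM_WRITE        0x2UL
#define VM_SHARED       0x8UL
#define VM_SHADOW_STACK 0x1000UL

/*
 * Private mappings only: report whether a PTE may be upgraded to
 * conventionally writable without taking a write fault.
 */
static inline bool toy_wants_manual_write_upgrade(unsigned long vm_flags)
{
	assert(!(vm_flags & VM_SHARED));	/* shared case not modeled */

	return (vm_flags & VM_WRITE) && !(vm_flags & VM_SHADOW_STACK);
}
```

Shadow stack vmas simply fall back to the fault path, where
maybe_mkwrite() picks the right kind of PTE.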

Cc: David Hildenbrand <david@redhat.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Update solution after the recent removal of pte_savedwrite()

v4:
 - Add "why" to comments in code (Peterz)

Yu-cheng v25:
 - Move is_shadow_stack_mapping() to a separate line.

Yu-cheng v24:
 - Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().

 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e15d2fc04007..139a682d243b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2181,7 +2181,7 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
 	 */
 	if (vma->vm_flags & VM_SHARED)
 		return vma_wants_writenotify(vma, vma->vm_page_prot);
-	return !!(vma->vm_flags & VM_WRITE);
+	return (vma->vm_flags & VM_WRITE) && !(vma->vm_flags & VM_SHADOW_STACK);
 
 }
 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-- 
2.17.1




* [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (17 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-20  1:01   ` Kees Cook
  2023-02-14  0:09   ` Deepak Gupta
  2023-01-19 21:22 ` [PATCH v5 20/39] mm: Add guard pages around a shadow stack Rick Edgecombe
                   ` (22 subsequent siblings)
  41 siblings, 2 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

With the introduction of shadow stack memory there are two ways a pte can
be writable: regular writable memory and shadow stack memory.

In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
where a PTE is made writable. However, there are places where pte_mkwrite()
is called directly and the logic should now also create a shadow stack PTE
in the case of a shadow stack VMA.

- do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
  directly and call pte_mkwrite(). Teach them about pte_mkwrite_shstk()

- When userfaultfd is creating a PTE after userspace handles the fault
  it calls pte_mkwrite() directly. Teach it about pte_mkwrite_shstk()

To make the code cleaner, introduce is_shstk_write() which simplifies
checking for VM_WRITE | VM_SHADOW_STACK together.

In other cases where pte_mkwrite() is called directly, the VMA will not
be VM_SHADOW_STACK, and so shadow stack memory should not be created.
 - In the case of pte_savedwrite(), shadow stack VMA's are excluded.
 - In the case of the "dirty_accountable" optimization in mprotect(),
   shadow stack VMA's won't be VM_SHARED, so it is not necessary.
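
The open-coded pattern used at these call sites can be sketched in
userspace roughly as follows. The flag values, the enum and the toy_
names are illustrative assumptions, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the kernel's real bit assignments differ. */
#define VM_WRITE        0x2UL
#define VM_SHADOW_STACK 0x1000UL

enum toy_pte_kind { TOY_PTE_READONLY, TOY_PTE_WRITABLE, TOY_PTE_SHSTK };

/* Both flags must be set, mirroring is_shstk_write(). */
static inline bool toy_is_shstk_write(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHADOW_STACK | VM_WRITE)) ==
	       (VM_SHADOW_STACK | VM_WRITE);
}

/* Which kind of PTE a new anonymous page would get in this model. */
static inline enum toy_pte_kind toy_new_pte_kind(unsigned long vm_flags)
{
	if (toy_is_shstk_write(vm_flags))
		return TOY_PTE_SHSTK;
	else if (vm_flags & VM_WRITE)
		return TOY_PTE_WRITABLE;

	return TOY_PTE_READONLY;
}
```

The shadow stack test has to come first, because a shadow stack vma
also satisfies the plain VM_WRITE check.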

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v5:
 - Fix typo in commit log

v3:
 - Restore do_anonymous_page() that accidentally moved commits (Kirill)
 - Open code maybe_mkwrite() cases from v2, so the behavior doesn't change
   to mark non-writable PTEs dirty. (Nadav)

v2:
 - Updated commit log with comments from Dave Hansen
 - Dave also suggested (I understood) to maybe tweak vm_get_page_prot()
   to avoid having to call maybe_mkwrite(). After playing around with
   this I opted to *not* do this. Shadow stack memory is
   effectively writable, so having the default permissions be writable
   ended up mapping the zero page as writable and other surprises. So
   creating shadow stack memory needs to be done with manual logic
   like pte_mkwrite().
 - Drop change in change_pte_range() because it couldn't actually trigger
   for shadow stack VMAs.
 - Clarify reasoning for skipped cases of pte_mkwrite().

Yu-cheng v25:
 - Apply same changes to do_huge_pmd_numa_page() as to do_numa_page().

 arch/x86/include/asm/pgtable.h |  3 +++
 arch/x86/mm/pgtable.c          |  6 ++++++
 include/linux/pgtable.h        |  7 +++++++
 mm/memory.c                    |  5 ++++-
 mm/migrate_device.c            |  4 +++-
 mm/userfaultfd.c               | 10 +++++++---
 6 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 45b1a8f058fe..87d3068734ec 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -951,6 +951,9 @@ static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
 }
 #endif  /* CONFIG_PAGE_TABLE_ISOLATION */
 
+#define is_shstk_write is_shstk_write
+extern bool is_shstk_write(unsigned long vm_flags);
+
 #endif	/* __ASSEMBLY__ */
 
 
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e4f499eb0f29..d103945ba502 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -880,3 +880,9 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #endif /* CONFIG_X86_64 */
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+bool is_shstk_write(unsigned long vm_flags)
+{
+	return (vm_flags & (VM_SHADOW_STACK | VM_WRITE)) ==
+	       (VM_SHADOW_STACK | VM_WRITE);
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 14a820a45a37..49ce1f055242 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1578,6 +1578,13 @@ static inline bool arch_has_pfn_modify_check(void)
 }
 #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
 
+#ifndef is_shstk_write
+static inline bool is_shstk_write(unsigned long vm_flags)
+{
+	return false;
+}
+#endif
+
 /*
  * Architecture PAGE_KERNEL_* fallbacks
  *
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..5e5107232a26 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4088,7 +4088,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 
 	entry = mk_pte(page, vma->vm_page_prot);
 	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
+
+	if (is_shstk_write(vma->vm_flags))
+		entry = pte_mkwrite_shstk(pte_mkdirty(entry));
+	else if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 721b2365dbca..53d417683e01 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -645,7 +645,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 			goto abort;
 		}
 		entry = mk_pte(page, vma->vm_page_prot);
-		if (vma->vm_flags & VM_WRITE)
+		if (is_shstk_write(vma->vm_flags))
+			entry = pte_mkwrite_shstk(pte_mkdirty(entry));
+		else if (vma->vm_flags & VM_WRITE)
 			entry = pte_mkwrite(pte_mkdirty(entry));
 	}
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0499907b6f1a..832f0250ca61 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	int ret;
 	pte_t _dst_pte, *dst_pte;
 	bool writable = dst_vma->vm_flags & VM_WRITE;
+	bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
 	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
 	bool page_in_cache = page_mapping(page);
 	spinlock_t *ptl;
@@ -84,9 +85,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		writable = false;
 	}
 
-	if (writable)
-		_dst_pte = pte_mkwrite(_dst_pte);
-	else
+	if (writable) {
+		if (shstk)
+			_dst_pte = pte_mkwrite_shstk(_dst_pte);
+		else
+			_dst_pte = pte_mkwrite(_dst_pte);
+	} else
 		/*
 		 * We need this to make sure write bit removed; as mk_pte()
 		 * could return a pte with write bit set.
-- 
2.17.1




* [PATCH v5 20/39] mm: Add guard pages around a shadow stack
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (18 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-19 21:22 ` [PATCH v5 21/39] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
                   ` (21 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

The architecture of shadow stack constrains the ability of userspace to
move the shadow stack pointer (SSP) in order to prevent corrupting or
switching to other shadow stacks. The RSTORSSP instruction can move the SSP to
different shadow stacks, but it requires a specially placed token in order
to do this. However, the architecture does not prevent incrementing the
stack pointer to wander onto an adjacent shadow stack. To prevent this in
software, enforce guard pages at the beginning of shadow stack vmas, such
that there will always be a gap between adjacent shadow stacks.
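
A userspace sketch of how the start gap could be computed, including
the wrap-around clamp for vmas near address zero. The flag values, the
4 KB page size, the 256-page regular stack gap and the toy_ names are
assumptions made for illustration.

```c
#include <assert.h>

/* Illustrative values; the kernel's real definitions differ. */
#define VM_GROWSDOWN    0x100UL
#define VM_SHADOW_STACK 0x1000UL
#define TOY_PAGE_SIZE   4096UL

/* Assumed gap for regular stacks: 256 pages. */
static const unsigned long toy_stack_guard_gap = 256UL * TOY_PAGE_SIZE;

static inline unsigned long toy_start_gap(unsigned long vm_flags)
{
	if (vm_flags & VM_GROWSDOWN)
		return toy_stack_guard_gap;

	if (vm_flags & VM_SHADOW_STACK)
		return TOY_PAGE_SIZE;

	return 0;
}

/* Lowest address that must stay free below a vma starting at vm_start. */
static inline unsigned long toy_vm_start_gap(unsigned long vm_start,
					     unsigned long vm_flags)
{
	unsigned long start = vm_start - toy_start_gap(vm_flags);

	/* Unsigned subtraction wrapped around: clamp to zero. */
	if (start > vm_start)
		start = 0;

	return start;
}
```

For a shadow stack vma at 0x10000 this reserves 0xf000-0xffff, so a
neighbouring mapping cannot be placed directly below it.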

Make the gap big enough so that no userspace SSP changing operations
(besides RSTORSSP) can move the SSP from one stack to the next. The
SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and
RET can move the SSP by a maximum of 8 bytes, at which point the shadow
stack would be accessed.

The INCSSP instruction can also increment the shadow stack pointer. It
is the shadow stack analog of an instruction like:

	addq    $0x80, %rsp

However, there is one important difference between an ADD on %rsp and
INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
of the first and last elements that were "popped". It can be thought of
as acting like this:

	READ_ONCE(ssp);       // read+discard top element on stack
	ssp += nr_to_pop * 8; // move the shadow stack
	READ_ONCE(ssp-8);     // read+discard last popped stack element

The maximum distance INCSSP can move the SSP is 2040 bytes, before it
would read the memory. Therefore a single page gap will be enough to
prevent any operation from shifting the SSP to an adjacent stack, since
it would have to land in the gap at least once, causing a fault.
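
The arithmetic can be checked with a small userspace model. It follows
the three-step description above rather than the hardware
specification, and the toy_ names, the addresses and the idea of
representing shadow stacks as address ranges are assumptions for
illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SSP_SLOT_BYTES   8UL	/* one 64-bit return address */
#define INCSSP_MAX_SLOTS 255UL	/* largest pop count, per the text above */

struct toy_range { unsigned long start, end; };	/* [start, end) */

static inline unsigned long toy_incssp_max_bytes(void)
{
	return INCSSP_MAX_SLOTS * SSP_SLOT_BYTES;
}

/* True if the 8-byte slot at addr lies inside some shadow stack range. */
static inline bool toy_is_shstk(const struct toy_range *maps, size_t n,
				unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= maps[i].start &&
		    addr + SSP_SLOT_BYTES <= maps[i].end)
			return true;
	return false;
}

/* Returns false (a "fault") if either read misses shadow stack memory. */
static inline bool toy_incssp(unsigned long *ssp, unsigned long nr_to_pop,
			      const struct toy_range *maps, size_t n)
{
	unsigned long next;

	assert(nr_to_pop >= 1 && nr_to_pop <= INCSSP_MAX_SLOTS);
	next = *ssp + nr_to_pop * SSP_SLOT_BYTES;

	if (!toy_is_shstk(maps, n, *ssp))		/* first element */
		return false;
	if (!toy_is_shstk(maps, n, next - SSP_SLOT_BYTES)) /* last popped */
		return false;

	*ssp = next;
	return true;
}
```

With two ranges separated by one 4096-byte page, the largest step of
2040 bytes from the last slot of the lower range ends with a read
inside the gap and faults in this model. With the two ranges directly
adjacent, the same step succeeds and leaves the SSP in the neighbouring
range.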

This could be accomplished by using VM_GROWSDOWN, but this has a
downside. The behavior would allow shadow stacks to grow, which is
unneeded and adds a strange difference to how most regular stacks work.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v5:
 - Fix typo in commit log

v4:
 - Drop references to 32 bit instructions
 - Switch to generic code to drop __weak (Peterz)

v2:
 - Use __weak instead of #ifdef (Dave Hansen)
 - Only have start gap on shadow stack (Andy Luto)
 - Create stack_guard_start_gap() to not duplicate code
   in an arch version of vm_start_gap() (Dave Hansen)
 - Improve commit log partly with verbiage from (Dave Hansen)

Yu-cheng v25:
 - Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.

 include/linux/mm.h | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 139a682d243b..3f980d4823ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2987,15 +2987,36 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
 	return mtree_load(&mm->mm_mt, addr);
 }
 
+static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_GROWSDOWN)
+		return stack_guard_gap;
+
+	/*
+	 * Shadow stack pointer is moved by CALL, RET, and INCSSPQ.
+	 * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
+	 * and touches the first and the last element in the range, which
+	 * triggers a page fault if the range is not in a shadow stack.
+	 * Because of this, creating 4-KB guard pages around a shadow
+	 * stack prevents these instructions from going beyond.
+	 *
+	 * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
+	 * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
+	 */
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return PAGE_SIZE;
+
+	return 0;
+}
+
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
+	unsigned long gap = stack_guard_start_gap(vma);
 	unsigned long vm_start = vma->vm_start;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
+	vm_start -= gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
 	return vm_start;
 }
 
-- 
2.17.1




* [PATCH v5 21/39] mm/mmap: Add shadow stack pages to memory accounting
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (19 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 20/39] mm: Add guard pages around a shadow stack Rick Edgecombe
@ 2023-01-19 21:22 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 22/39] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
                   ` (20 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

Account shadow stack pages to stack memory.
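
A userspace sketch of why the position of the new check in the chain
matters. The classification helpers are simplified restatements, and
the flag values, struct and toy_ names are assumptions for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the kernel's real bit assignments differ. */
#define VM_WRITE        0x2UL
#define VM_EXEC         0x4UL
#define VM_SHARED       0x8UL
#define VM_STACK        0x100UL		/* stands in for VM_GROWSDOWN */
#define VM_SHADOW_STACK 0x1000UL

struct toy_mm { long total_vm, exec_vm, stack_vm, data_vm; };

static inline bool toy_is_exec_mapping(unsigned long f)
{
	return (f & (VM_EXEC | VM_WRITE | VM_STACK)) == VM_EXEC;
}

static inline bool toy_is_stack_mapping(unsigned long f)
{
	return (f & VM_STACK) == VM_STACK;
}

static inline bool toy_is_data_mapping(unsigned long f)
{
	return (f & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
}

static inline void toy_vm_stat_account(struct toy_mm *mm,
				       unsigned long flags, long npages)
{
	mm->total_vm += npages;

	if (toy_is_exec_mapping(flags))
		mm->exec_vm += npages;
	else if (toy_is_stack_mapping(flags))
		mm->stack_vm += npages;
	else if (flags & VM_SHADOW_STACK)	/* must precede the data check */
		mm->stack_vm += npages;
	else if (toy_is_data_mapping(flags))
		mm->data_vm += npages;
}
```

Because a shadow stack vma is also VM_WRITE, it would be counted as
data in this model if the VM_SHADOW_STACK test came after the data
test.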

Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v3:
 - Remove unneeded VM_SHADOW_STACK check in accountable_mapping()
   (Kirill)

v2:
 - Remove is_shadow_stack_mapping() and just change it to directly bitwise
   and VM_SHADOW_STACK.

Yu-cheng v26:
 - Remove redundant #ifdef CONFIG_MMU.

Yu-cheng v25:
 - Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().

 mm/mmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 425a9349e610..9f85596cce31 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3290,6 +3290,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 		mm->exec_vm += npages;
 	else if (is_stack_mapping(flags))
 		mm->stack_vm += npages;
+	else if (flags & VM_SHADOW_STACK)
+		mm->stack_vm += npages;
 	else if (is_data_mapping(flags))
 		mm->data_vm += npages;
 }
-- 
2.17.1




* [PATCH v5 22/39] mm: Re-introduce vm_flags to do_mmap()
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (20 preceding siblings ...)
  2023-01-19 21:22 ` [PATCH v5 21/39] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
                   ` (19 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

No caller was passing vm_flags to do_mmap() any more, so vm_flags was
removed from the function's parameters by:

    commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now.  Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap().  Thus, re-introduce vm_flags to do_mmap().
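
A small userspace sketch of the effect of turning the assignment into
an OR: flags supplied by the caller survive alongside the computed
ones. All TOY_ and toy_ names and values are illustrative assumptions,
and the prot translation is reduced to read and write.

```c
#include <assert.h>

/* Illustrative values; the kernel's real definitions differ. */
#define TOY_PROT_READ   0x1UL
#define TOY_PROT_WRITE  0x2UL
#define VM_READ         0x1UL
#define VM_WRITE        0x2UL
#define VM_MAYREAD      0x10UL
#define VM_MAYWRITE     0x20UL
#define VM_MAYEXEC      0x40UL
#define VM_SHADOW_STACK 0x1000UL

/* Reduced stand-in for the prot-to-vm_flags translation. */
static inline unsigned long toy_calc_vm_prot_bits(unsigned long prot)
{
	unsigned long bits = 0;

	if (prot & TOY_PROT_READ)
		bits |= VM_READ;
	if (prot & TOY_PROT_WRITE)
		bits |= VM_WRITE;
	return bits;
}

/* vm_flags now comes from the caller and is OR-ed, not overwritten. */
static inline unsigned long toy_do_mmap_flags(unsigned long prot,
					      unsigned long vm_flags)
{
	vm_flags |= toy_calc_vm_prot_bits(prot) |
		    VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
	return vm_flags;
}
```

Existing callers pass 0 and get the same result as before, while a
shadow stack allocation can pass VM_SHADOW_STACK and have it retained.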

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: linux-mm@kvack.org
---
 fs/aio.c           |  2 +-
 include/linux/mm.h |  3 ++-
 ipc/shm.c          |  2 +-
 mm/mmap.c          | 10 +++++-----
 mm/nommu.c         |  4 ++--
 mm/util.c          |  2 +-
 6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 562916d85cba..279c75ec6a05 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -554,7 +554,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
 				 PROT_READ | PROT_WRITE,
-				 MAP_SHARED, 0, &unused, NULL);
+				 MAP_SHARED, 0, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3f980d4823ad..6e1796ee7e1a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2897,7 +2897,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf);
 extern int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
 			 bool downgrade);
diff --git a/ipc/shm.c b/ipc/shm.c
index bd2fcc4d454e..1c5476bfec8b 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1662,7 +1662,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 9f85596cce31..350bf156fcae 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1238,11 +1238,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
  */
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
-			unsigned long flags, unsigned long pgoff,
-			unsigned long *populate, struct list_head *uf)
+			unsigned long flags, vm_flags_t vm_flags,
+			unsigned long pgoff, unsigned long *populate,
+			struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
-	vm_flags_t vm_flags;
 	int pkey = 0;
 
 	validate_mm(mm);
@@ -1303,7 +1303,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
@@ -2877,7 +2877,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 
 	file = get_file(vma->vm_file);
 	ret = do_mmap(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+			prot, flags, 0, pgoff, &populate, NULL);
 	fput(file);
 out:
 	mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index 5b83938ecb67..3642a3e01265 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1042,6 +1042,7 @@ unsigned long do_mmap(struct file *file,
 			unsigned long len,
 			unsigned long prot,
 			unsigned long flags,
+			vm_flags_t vm_flags,
 			unsigned long pgoff,
 			unsigned long *populate,
 			struct list_head *uf)
@@ -1049,7 +1050,6 @@ unsigned long do_mmap(struct file *file,
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	struct rb_node *rb;
-	vm_flags_t vm_flags;
 	unsigned long capabilities, result;
 	int ret;
 	MA_STATE(mas, &current->mm->mm_mt, 0, 0);
@@ -1069,7 +1069,7 @@ unsigned long do_mmap(struct file *file,
 
 	/* we've determined that we can make the mapping, now translate what we
 	 * now know into VMA flags */
-	vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+	vm_flags |= determine_vm_flags(file, prot, flags, capabilities);
 
 
 	/* we're going to need to record the mapping */
diff --git a/mm/util.c b/mm/util.c
index b56c92fb910f..77867bf9959a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -517,7 +517,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	if (!ret) {
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
-		ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
 			      &uf);
 		mmap_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
-- 
2.17.1




* [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (21 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 22/39] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-23  9:10   ` David Hildenbrand
  2023-01-19 21:23 ` [PATCH v5 24/39] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
                   ` (18 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

Shadow stack memory is writable only in very specific, controlled ways.
However, since it is writable, the kernel treats it as such. As a result
there remain many ways for userspace to trigger the kernel to write to
shadow stacks via get_user_pages(, FOLL_WRITE) operations. To make this a
little less exposed, block writable GUPs for shadow stack VMAs.

Still allow FOLL_FORCE to write through shadow stack protections, as it
does for read-only protections.
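
As a rough illustration of the rule described above, here is a minimal
userspace sketch of the write check. The flag values and the name
gup_write_check() are made up for the example and do not match the kernel's
real definitions. The real check_vma_flags() also performs further checks
under FOLL_FORCE, which are omitted here.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag values only; the kernel's real bit assignments differ. */
#define VM_WRITE        0x1UL
#define VM_SHADOW_STACK 0x2UL
#define FOLL_WRITE      0x1U
#define FOLL_FORCE      0x2U

/* Hypothetical helper: returns 0 if the GUP may proceed, -EFAULT if not. */
static int gup_write_check(unsigned long vm_flags, unsigned int gup_flags)
{
	bool write = gup_flags & FOLL_WRITE;

	/* Read-only GUPs are not affected by this rule. */
	if (!write)
		return 0;

	/* A shadow stack VMA is handled like a VMA without VM_WRITE. */
	if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
		/* Only FOLL_FORCE may write through the protection. */
		if (!(gup_flags & FOLL_FORCE))
			return -EFAULT;
	}

	return 0;
}
```

With these stand-in values, a plain FOLL_WRITE request against a shadow stack
VMA fails with -EFAULT, while FOLL_WRITE | FOLL_FORCE is still let through.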

Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v3:
 - Add comment in __pte_access_permitted() (Dave)
 - Remove unneeded shadow stack specific check in
   __pte_access_permitted() (Jann)

 arch/x86/include/asm/pgtable.h | 5 +++++
 mm/gup.c                       | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 87d3068734ec..425ded5dd6ec 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1671,6 +1671,11 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
 {
 	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
 
+	/*
+	 * Write=0,Dirty=1 PTEs are shadow stack, which the kernel
+	 * shouldn't generally allow access to, but since they
+	 * are already Write=0, the below logic covers both cases.
+	 */
 	if (write)
 		need_pte_bits |= _PAGE_RW;
 
diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..bfd33d9edb89 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -982,7 +982,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		return -EFAULT;
 
 	if (write) {
-		if (!(vm_flags & VM_WRITE)) {
+		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
 			/* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
-- 
2.17.1




* [PATCH v5 24/39] x86/mm: Introduce MAP_ABOVE4G
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (22 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 25/39] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
                   ` (17 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One of the properties is that the shadow stack pointer (SSP), which is a
CPU register that points to the shadow stack like the stack pointer points
to the stack, can't be pointing outside of the 32 bit address space when
the CPU is executing in 32 bit mode. It is desirable to prevent executing
in 32 bit mode when shadow stack is enabled because the kernel can't easily
support 32 bit signals.

On x86 it is possible to transition to 32 bit mode without any special
interaction with the kernel, by doing a "far call" to a 32 bit segment.
So the shadow stack implementation can use this address space behavior
as a feature, by enforcing that shadow stack memory is always created
outside of the 32 bit address space. This way userspace will trigger a
general protection fault which will in turn trigger a segfault if it
tries to transition to 32 bit mode with shadow stack enabled.

This provides a clean error generating border for the user if they
attempt to do 32 bit mode shadow stack, rather than leave the kernel in a
half working state for userspace to be surprised by.

So to allow future shadow stack enabling patches to map shadow stacks
out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior
is pretty much like MAP_32BIT, except that it has the opposite address
range. There are a few differences though.

If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the
MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit
syscall.

Since the default search behavior is top down, the normal kaslr base can
be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add its
own randomization in the bottom up case.

For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G
both are potentially valid, so both are used. In the bottom up search
path, the default behavior is already consistent with MAP_ABOVE4G since
mmap base should be above 4GB.
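
The top down lower bound selection can be sketched in userspace as follows.
This only models the low_limit choice in the top down search path. The
function name topdown_low_limit() and the 4 KiB page size are assumptions made
for the example, and the in_32bit_syscall() result is passed in as a plain
boolean.

```c
#include <stdbool.h>

#define MAP_ABOVE4G      0x80           /* value introduced by this patch */
#define SKETCH_PAGE_SIZE 0x1000ULL      /* assumption: 4 KiB pages */
#define SKETCH_4G        0x100000000ULL /* first address above 4GB */

/* Hypothetical helper: lowest address the top down search may return. */
static unsigned long long topdown_low_limit(unsigned long flags,
					    bool in_32bit_syscall)
{
	/* MAP_ABOVE4G is ignored for 32 bit syscalls. */
	if (!in_32bit_syscall && (flags & MAP_ABOVE4G))
		return SKETCH_4G;

	return SKETCH_PAGE_SIZE;
}
```

For a 64 bit caller that passes MAP_ABOVE4G the search floor becomes
0x100000000. In every other case it stays at one page.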

Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB.
So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode
with shadow stack enabled would usually segfault anyway. These are already
pretty decent guard rails, but the addition of MAP_ABOVE4G is a small
amount of complexity spent to make them more complete.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - New patch

 arch/x86/include/uapi/asm/mman.h | 1 +
 arch/x86/kernel/sys_x86_64.c     | 6 +++++-
 include/linux/mman.h             | 4 ++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 775dbd3aff73..5a0256e73f1e 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_MMAN_H
 
 #define MAP_32BIT	0x40		/* only give out 32bit addresses */
+#define MAP_ABOVE4G	0x80		/* only map above 4GB */
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 #define arch_calc_vm_prot_bits(prot, key) (		\
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 8cc653ffdccd..06378b5682c1 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -193,7 +193,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
-	info.low_limit = PAGE_SIZE;
+	if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
+		info.low_limit = 0x100000000;
+	else
+		info.low_limit = PAGE_SIZE;
+
 	info.high_limit = get_mmap_base(0);
 
 	/*
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 58b3abd457a3..32156daa985a 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -15,6 +15,9 @@
 #ifndef MAP_32BIT
 #define MAP_32BIT 0
 #endif
+#ifndef MAP_ABOVE4G
+#define MAP_ABOVE4G 0
+#endif
 #ifndef MAP_HUGE_2MB
 #define MAP_HUGE_2MB 0
 #endif
@@ -50,6 +53,7 @@
 		| MAP_STACK \
 		| MAP_HUGETLB \
 		| MAP_32BIT \
+		| MAP_ABOVE4G \
 		| MAP_HUGE_2MB \
 		| MAP_HUGE_1GB)
 
-- 
2.17.1




* [PATCH v5 25/39] mm: Warn on shadow stack memory in wrong vma
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (23 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 24/39] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-20  1:01   ` Kees Cook
  2023-01-19 21:23 ` [PATCH v5 26/39] x86: Introduce userspace API for shadow stack Rick Edgecombe
                   ` (16 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
treated as shadow stack by the CPU, but this combination used to be created by
the kernel on x86. Previous patches have changed the kernel to now avoid
creating these PTEs unless they are for shadow stack memory. In case any
missed corners of the kernel are still creating PTEs like this for
non-shadow stack memory, and to catch any re-introductions of the logic,
warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
stack VMAs when they are being zapped. This won't catch transient cases
but should have decent coverage. It will be compiled out when shadow
stack is not configured.

In order to check if a pte is shadow stack in core mm code, add default
implementations for pte_shstk() and pmd_shstk().
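
To make the warning condition concrete, here is a small userspace sketch of
the Write=0, Dirty=1 test and of the check done at zap time. The names
sk_pte_is_shstk() and sk_zap_should_warn() are invented for the example. The
bit positions used are the hardware R/W (bit 1) and Dirty (bit 6) bits of an
x86 PTE. The real pte_shstk() additionally returns false when
X86_FEATURE_USER_SHSTK is not enabled, which is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hardware x86 PTE bits: R/W is bit 1, Dirty is bit 6. */
#define SK_PAGE_RW    (1ULL << 1)
#define SK_PAGE_DIRTY (1ULL << 6)

/* Hypothetical helper: true for the Write=0, Dirty=1 combination. */
static bool sk_pte_is_shstk(uint64_t pteval)
{
	return (pteval & (SK_PAGE_RW | SK_PAGE_DIRTY)) == SK_PAGE_DIRTY;
}

/* Hypothetical helper: condition under which the zap path would warn. */
static bool sk_zap_should_warn(uint64_t pteval, bool vma_is_shadow_stack)
{
	return !vma_is_shadow_stack && sk_pte_is_shstk(pteval);
}
```

A Write=0, Dirty=1 value found in an ordinary VMA trips the condition, while
the same value in a shadow stack VMA, or a Write=1, Dirty=1 value anywhere,
does not.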

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Fix typo in commit log

v3:
 - New patch

 arch/x86/include/asm/pgtable.h |  2 ++
 include/linux/pgtable.h        | 14 ++++++++++++++
 mm/huge_memory.c               |  2 ++
 mm/memory.c                    |  2 ++
 4 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 425ded5dd6ec..356f1d43e403 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -129,6 +129,7 @@ static inline bool pte_dirty(pte_t pte)
 	return pte_flags(pte) & _PAGE_DIRTY_BITS;
 }
 
+#define pte_shstk pte_shstk
 static inline bool pte_shstk(pte_t pte)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
@@ -147,6 +148,7 @@ static inline bool pmd_dirty(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
 }
 
+#define pmd_shstk pmd_shstk
 static inline bool pmd_shstk(pmd_t pmd)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 49ce1f055242..04d0bc466e43 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -539,6 +539,20 @@ static inline pte_t pte_mkwrite_shstk(pte_t pte)
 }
 #endif
 
+#ifndef pte_shstk
+static inline bool pte_shstk(pte_t pte)
+{
+	return false;
+}
+#endif
+
+#ifndef pmd_shstk
+static inline bool pmd_shstk(pmd_t pmd)
+{
+	return false;
+}
+#endif
+
 #ifndef pmd_mkwrite_shstk
 static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fbb8beb9265e..5bd71da75dec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1700,6 +1700,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 */
 	orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
 						tlb->fullmm);
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+			pmd_shstk(orig_pmd));
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	if (vma_is_special_huge(vma)) {
 		if (arch_needs_pgtable_deposit())
diff --git a/mm/memory.c b/mm/memory.c
index 5e5107232a26..c4cc38baffc5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1381,6 +1381,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+					pte_shstk(ptent));
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
 						      ptent);
-- 
2.17.1




* [PATCH v5 26/39] x86: Introduce userspace API for shadow stack
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (24 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 25/39] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-20  1:04   ` Kees Cook
  2023-01-19 21:23 ` [PATCH v5 27/39] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
                   ` (15 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Add three new arch_prctl() handles:

 - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified
   feature. Returns 0 on success or an error.

 - ARCH_SHSTK_LOCK prevents future disabling or enabling of the
   specified feature. Returns 0 on success or an error.

The features are handled per-thread and inherited over fork(2)/clone(2),
but reset on exec().

This is a preparation patch. It does not implement any features.
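
The intended enable/disable/lock semantics can be sketched as a small
userspace state machine. This is only a model: struct sk_thread and
sk_shstk_prctl() are invented names, the ptrace check is left out, and the
enable and disable branches simply set and clear bits, which is an assumption
since this patch only stubs them out with -EINVAL. __builtin_popcountl() is a
GCC/Clang builtin standing in for hweight_long().

```c
#include <errno.h>

#define ARCH_SHSTK_ENABLE	0x5001
#define ARCH_SHSTK_DISABLE	0x5002
#define ARCH_SHSTK_LOCK		0x5003

/* Stand-in for the two fields this patch adds to thread_struct. */
struct sk_thread {
	unsigned long features;
	unsigned long features_locked;
};

/* Hypothetical model of the arch_prctl() handling described above. */
static long sk_shstk_prctl(struct sk_thread *t, int option,
			   unsigned long features)
{
	if (option == ARCH_SHSTK_LOCK) {
		t->features_locked |= features;
		return 0;
	}

	/* Locked features can no longer be enabled or disabled. */
	if (features & t->features_locked)
		return -EPERM;

	/* Only one feature may be changed per call. */
	if (__builtin_popcountl(features) > 1)
		return -EINVAL;

	if (option == ARCH_SHSTK_DISABLE) {
		t->features &= ~features;	/* assumption: plain clear */
		return 0;
	}

	if (option == ARCH_SHSTK_ENABLE) {
		t->features |= features;	/* assumption: plain set */
		return 0;
	}

	return -EINVAL;
}
```

In this model, enabling a feature and then locking it makes a later disable
request fail with -EPERM and leaves the feature bit set.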

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[tweaked with feedback from tglx]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v4:
 - Remove references to CET and replace with shadow stack (Peterz)

v3:
 - Move shstk.c Makefile changes earlier (Kees)
 - Add #ifdef around features_locked and features (Kees)
 - Encapsulate features reset earlier in reset_thread_features() so
   features and features_locked are not referenced in code that would be
   compiled !CONFIG_X86_USER_SHADOW_STACK. (Kees)
 - Fix typo in commit log (Kees)
 - Switch arch_prctl() numbers to avoid conflict with LAM

v2:
 - Only allow one enable/disable per call (tglx)
 - Return error code like a normal arch_prctl() (Alexander Potapenko)
 - Make CET only (tglx)

 arch/x86/include/asm/processor.h  |  6 +++++
 arch/x86/include/asm/shstk.h      | 21 +++++++++++++++
 arch/x86/include/uapi/asm/prctl.h |  6 +++++
 arch/x86/kernel/Makefile          |  2 ++
 arch/x86/kernel/process_64.c      |  7 ++++-
 arch/x86/kernel/shstk.c           | 44 +++++++++++++++++++++++++++++++
 6 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/shstk.h
 create mode 100644 arch/x86/kernel/shstk.c

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 4e35c66edeb7..e0734f417273 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -28,6 +28,7 @@ struct vm86;
 #include <asm/unwind_hints.h>
 #include <asm/vmxfeatures.h>
 #include <asm/vdso/processor.h>
+#include <asm/shstk.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -475,6 +476,11 @@ struct thread_struct {
 	 */
 	u32			pkru;
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	unsigned long		features;
+	unsigned long		features_locked;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
new file mode 100644
index 000000000000..58f9ee675be0
--- /dev/null
+++ b/arch/x86/include/asm/shstk.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHSTK_H
+#define _ASM_X86_SHSTK_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+long shstk_prctl(struct task_struct *task, int option, unsigned long features);
+void reset_thread_features(void);
+#else
+static inline long shstk_prctl(struct task_struct *task, int option,
+			     unsigned long features) { return -EINVAL; }
+static inline void reset_thread_features(void) {}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_SHSTK_H */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..b2b3b7200b2d 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,10 @@
 #define ARCH_MAP_VDSO_32		0x2002
 #define ARCH_MAP_VDSO_64		0x2003
 
+/* Don't use 0x3001-0x3004 because of old glibcs */
+
+#define ARCH_SHSTK_ENABLE		0x5001
+#define ARCH_SHSTK_DISABLE		0x5002
+#define ARCH_SHSTK_LOCK			0x5003
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 92446f1dedd7..b366641703e3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -146,6 +146,8 @@ obj-$(CONFIG_CALL_THUNKS)		+= callthunks.o
 
 obj-$(CONFIG_X86_CET)			+= cet.o
 
+obj-$(CONFIG_X86_USER_SHADOW_STACK)	+= shstk.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4e34b3b68ebd..71094c8a305f 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -514,6 +514,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 		load_gs_index(__USER_DS);
 	}
 
+	reset_thread_features();
+
 	loadsegment(fs, 0);
 	loadsegment(es, _ds);
 	loadsegment(ds, _ds);
@@ -830,7 +832,10 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_MAP_VDSO_64:
 		return prctl_map_vdso(&vdso_image_64, arg2);
 #endif
-
+	case ARCH_SHSTK_ENABLE:
+	case ARCH_SHSTK_DISABLE:
+	case ARCH_SHSTK_LOCK:
+		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
new file mode 100644
index 000000000000..41ed6552e0a5
--- /dev/null
+++ b/arch/x86/kernel/shstk.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * shstk.c - Intel shadow stack support
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Yu-cheng Yu <yu-cheng.yu@intel.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <asm/prctl.h>
+
+void reset_thread_features(void)
+{
+	current->thread.features = 0;
+	current->thread.features_locked = 0;
+}
+
+long shstk_prctl(struct task_struct *task, int option, unsigned long features)
+{
+	if (option == ARCH_SHSTK_LOCK) {
+		task->thread.features_locked |= features;
+		return 0;
+	}
+
+	/* Don't allow via ptrace */
+	if (task != current)
+		return -EINVAL;
+
+	/* Do not allow to change locked features */
+	if (features & task->thread.features_locked)
+		return -EPERM;
+
+	/* Only support enabling/disabling one feature at a time. */
+	if (hweight_long(features) > 1)
+		return -EINVAL;
+
+	if (option == ARCH_SHSTK_DISABLE) {
+		return -EINVAL;
+	}
+
+	/* Handle ARCH_SHSTK_ENABLE */
+	return -EINVAL;
+}
-- 
2.17.1




* [PATCH v5 27/39] x86/shstk: Add user-mode shadow stack support
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (25 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 26/39] x86: Introduce userspace API for shadow stack Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-20  1:05   ` Kees Cook
  2023-01-19 21:23 ` [PATCH v5 28/39] x86/shstk: Handle thread shadow stack Rick Edgecombe
                   ` (14 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
and has a fixed size of min(RLIMIT_STACK, 4GB).
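
The size rule can be sketched in userspace as below. The names are invented
for the example, a 4 KiB page size is assumed, and the RLIMIT_STACK value is
passed in as a parameter rather than read with rlimit().

```c
#define SK_PAGE_SIZE	4096ULL		/* assumption: 4 KiB pages */
#define SK_SZ_4G	(1ULL << 32)

/* Round x up to the next multiple of the page size. */
static unsigned long long sk_page_align(unsigned long long x)
{
	return (x + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1);
}

/*
 * Hypothetical model of the size selection: an explicit size is only
 * page aligned, while a size of 0 means "use min(RLIMIT_STACK, 4GB)".
 */
static unsigned long long sk_adjust_shstk_size(unsigned long long size,
					       unsigned long long rlimit_stack)
{
	if (size)
		return sk_page_align(size);

	return sk_page_align(rlimit_stack < SK_SZ_4G ? rlimit_stack : SK_SZ_4G);
}
```

With an 8 MiB stack limit the default shadow stack is 8 MiB. With an
unlimited stack limit it is capped at 4GB.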

Keep the task's shadow stack address and size in thread_struct. This will
be copied when cloning new threads, but needs to be cleared during exec,
so add a function to do this.

Do not support IA32 emulation or x32.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v5:
 - Switch to EOPNOTSUPP
 - Use MAP_ABOVE4G
 - Move set_clr_bits_msrl() to patch where it is first used

v4:
 - Just set MSR_IA32_U_CET when disabling shadow stack, since we don't
   have IBT yet. (Peterz)

v3:
 - Use define for set_clr_bits_msrl() (Kees)
 - Make some functions static (Kees)
 - Change feature_foo() to features_foo() (Kees)
 - Centralize shadow stack size rlimit checks (Kees)
 - Disable x32 support

v2:
 - Get rid of unnecessary shstk->base checks
 - Don't support IA32 emulation

 arch/x86/include/asm/processor.h  |   2 +
 arch/x86/include/asm/shstk.h      |   7 ++
 arch/x86/include/uapi/asm/prctl.h |   3 +
 arch/x86/kernel/shstk.c           | 146 ++++++++++++++++++++++++++++++
 4 files changed, 158 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index e0734f417273..3c257a1a0757 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -479,6 +479,8 @@ struct thread_struct {
 #ifdef CONFIG_X86_USER_SHADOW_STACK
 	unsigned long		features;
 	unsigned long		features_locked;
+
+	struct thread_shstk	shstk;
 #endif
 
 	/* Floating point and extended processor state */
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 58f9ee675be0..f40414a982e8 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -8,12 +8,19 @@
 struct task_struct;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
+struct thread_shstk {
+	u64	base;
+	u64	size;
+};
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features);
 void reset_thread_features(void);
+void shstk_free(struct task_struct *p);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			     unsigned long features) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
+static inline void shstk_free(struct task_struct *p) {}
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index b2b3b7200b2d..7dfd9dc00509 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -26,4 +26,7 @@
 #define ARCH_SHSTK_DISABLE		0x5002
 #define ARCH_SHSTK_LOCK			0x5003
 
+/* ARCH_SHSTK_ features bits */
+#define ARCH_SHSTK_SHSTK		(1ULL <<  0)
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 41ed6552e0a5..f39e5d3b9303 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -8,14 +8,160 @@
 
 #include <linux/sched.h>
 #include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <asm/msr.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/shstk.h>
+#include <asm/special_insns.h>
+#include <asm/fpu/api.h>
 #include <asm/prctl.h>
 
+static bool features_enabled(unsigned long features)
+{
+	return current->thread.features & features;
+}
+
+static void features_set(unsigned long features)
+{
+	current->thread.features |= features;
+}
+
+static void features_clr(unsigned long features)
+{
+	current->thread.features &= ~features;
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, unused;
+
+	mmap_write_lock(mm);
+	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
+		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+
+	mmap_write_unlock(mm);
+
+	return addr;
+}
+
+static unsigned long adjust_shstk_size(unsigned long size)
+{
+	if (size)
+		return PAGE_ALIGN(size);
+
+	return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+}
+
+static void unmap_shadow_stack(u64 base, u64 size)
+{
+	while (1) {
+		int r;
+
+		r = vm_munmap(base, size);
+
+		/*
+		 * vm_munmap() returns -EINTR when mmap_lock is held by
+		 * something else, and that lock should not be held for a
+		 * long time.  Retry it for the case.
+		 */
+		if (r == -EINTR) {
+			cond_resched();
+			continue;
+		}
+
+		/*
+		 * For all other types of vm_munmap() failure, either the
+		 * system is out of memory or there is bug.
+		 */
+		WARN_ON_ONCE(r);
+		break;
+	}
+}
+
+static int shstk_setup(void)
+{
+	struct thread_shstk *shstk = &current->thread.shstk;
+	unsigned long addr, size;
+
+	/* Already enabled */
+	if (features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	/* Also not supported for 32 bit and x32 */
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_32bit_syscall())
+		return -EOPNOTSUPP;
+
+	size = adjust_shstk_size(0);
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+	wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
+	fpregs_unlock();
+
+	shstk->base = addr;
+	shstk->size = size;
+	features_set(ARCH_SHSTK_SHSTK);
+
+	return 0;
+}
+
 void reset_thread_features(void)
 {
+	memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
 	current->thread.features = 0;
 	current->thread.features_locked = 0;
 }
 
+void shstk_free(struct task_struct *tsk)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !features_enabled(ARCH_SHSTK_SHSTK))
+		return;
+
+	if (!tsk->mm)
+		return;
+
+	unmap_shadow_stack(shstk->base, shstk->size);
+}
+
+
+static int shstk_disable(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	/* Already disabled? */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	fpregs_lock_and_load();
+	/* Disable WRSS too when disabling shadow stack */
+	wrmsrl(MSR_IA32_U_CET, 0);
+	wrmsrl(MSR_IA32_PL3_SSP, 0);
+	fpregs_unlock();
+
+	shstk_free(current);
+	features_clr(ARCH_SHSTK_SHSTK);
+
+	return 0;
+}
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_SHSTK_LOCK) {
-- 
2.17.1




* [PATCH v5 28/39] x86/shstk: Handle thread shadow stack
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (26 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 27/39] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 29/39] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
                   ` (13 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-CET case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as long as
the child doesn't return from the vfork()/CLONE_VFORK calling function and
sticks to a limited set of operations, the parent and child can share the
same stack.

For shadow stack, these scenarios present similar sharing problems. For the
CLONE_VM case, the child and the parent must have separate shadow stacks.
Instead of changing clone to take a shadow stack, have the kernel just
allocate one and switch to it.

Use stack_size passed from clone3() syscall for thread shadow stack size. A
compat-mode thread shadow stack size is further reduced to 1/4. This
allows more threads to run in a 32-bit address space. clone() does not
pass stack_size, which was added with clone3(). In that case, use the
RLIMIT_STACK size, capped at 4 GB.

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing as long as it doesn't return from the vfork()
and overwrite up the shadow stack. The child can safely overwrite down
the shadow stack, as the parent can just overwrite this later. So CET does
not add any additional limitations for vfork().

Userspace implementing posix vfork() can actually prevent the child from
returning from the vfork() calling function, using CET. Glibc does this
by adjusting the shadow stack pointer in the child, so that the child
receives a #CP if it tries to return from vfork() calling function.

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared with the
parent.

During this operation, the shadow stack pointer of the new thread needs
to be updated to point to the newly allocated shadow stack. Since the
ability to do this is confined to the FPU subsystem, change
fpu_clone() to take the new shadow stack pointer, and update it
internally inside the FPU subsystem. This part was suggested by Thomas
Gleixner.

Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v3:
 - Fix update_fpu_shstk() stub (Mike Rapoport)
 - Fix chunks around alloc_shstk() in wrong patch (Kees)
 - Fix stack_size/flags swap (Kees)
 - Use centralized stack size logic (Kees)

v2:
 - Have fpu_clone() take new shadow stack pointer and update SSP in
   xsave buffer for new task. (tglx)

v1:
 - Expand commit log.
 - Add more comments.
 - Switch to xsave helpers.

Yu-cheng v30:
 - Update comments about clone()/clone3(). (Borislav Petkov)

 arch/x86/include/asm/fpu/sched.h   |  3 +-
 arch/x86/include/asm/mmu_context.h |  2 ++
 arch/x86/include/asm/shstk.h       |  7 +++++
 arch/x86/kernel/fpu/core.c         | 41 +++++++++++++++++++++++++++-
 arch/x86/kernel/process.c          | 18 +++++++++++-
 arch/x86/kernel/shstk.c            | 44 ++++++++++++++++++++++++++++--
 6 files changed, 110 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index b2486b2cbc6e..54c9c2fd1907 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -11,7 +11,8 @@
 
 extern void save_fpregs_to_fpstate(struct fpu *fpu);
 extern void fpu__drop(struct fpu *fpu);
-extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal);
+extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+		      unsigned long shstk_addr);
 extern void fpu_flush_thread(void);
 
 /*
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index e01aa74a6de7..9714f08d941b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -147,6 +147,8 @@ do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		shstk_free(tsk);		\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index f40414a982e8..172a69052770 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -15,11 +15,18 @@ struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long features);
 void reset_thread_features(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+			     unsigned long stack_size,
+			     unsigned long *shstk_addr);
 void shstk_free(struct task_struct *p);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			     unsigned long features) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+					   unsigned long clone_flags,
+					   unsigned long stack_size,
+					   unsigned long *shstk_addr) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 7317bfd5ea36..c72262479f03 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
 	}
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+	struct cet_user_state *xstate;
+
+	/* If ssp update is not needed. */
+	if (!ssp)
+		return 0;
+
+	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+				XFEATURE_CET_USER);
+
+	/*
+	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+	 * stack and the fpu state should be up to date since it was just copied
+	 * from the parent in fpu_clone(). So there must be a valid non-init CET
+	 * state location in the buffer.
+	 */
+	if (WARN_ON_ONCE(!xstate))
+		return 1;
+
+	xstate->user_ssp = (u64)ssp;
+
+	return 0;
+}
+#else
+static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
+{
+	return 0;
+}
+#endif
+
 /* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+	      unsigned long ssp)
 {
 	struct fpu *src_fpu = &current->thread.fpu;
 	struct fpu *dst_fpu = &dst->thread.fpu;
@@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
 	if (use_xsave())
 		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
 
+	/*
+	 * Update shadow stack pointer, in case it changed during clone.
+	 */
+	if (update_fpu_shstk(dst, ssp))
+		return 1;
+
 	trace_x86_fpu_copy_src(src_fpu);
 	trace_x86_fpu_copy_dst(dst_fpu);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e57cd31bfec4..13a0a81d70b9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -48,6 +48,7 @@
 #include <asm/frame.h>
 #include <asm/unwind.h>
 #include <asm/tdx.h>
+#include <asm/shstk.h>
 
 #include "process.h"
 
@@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
@@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
+	unsigned long shstk_addr = 0;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-	fpu_clone(p, clone_flags, args->fn);
+	/* Allocate a new shadow stack for pthread if needed */
+	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
+				       &shstk_addr);
+	if (ret)
+		return ret;
+
+	fpu_clone(p, clone_flags, args->fn, shstk_addr);
 
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
@@ -220,6 +229,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
+	/*
+	 * If copy_thread() is failing, don't leak the shadow stack possibly
+	 * allocated in shstk_alloc_thread_stack() above.
+	 */
+	if (ret)
+		shstk_free(p);
+
 	return ret;
 }
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index f39e5d3b9303..111ea56115d2 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -47,7 +47,7 @@ static unsigned long alloc_shstk(unsigned long size)
 	unsigned long addr, unused;
 
 	mmap_write_lock(mm);
-	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
 		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 
 	mmap_write_unlock(mm);
@@ -126,6 +126,40 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+			     unsigned long stack_size, unsigned long *shstk_addr)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long addr, size;
+
+	/*
+	 * If shadow stack is not enabled on the new thread, skip any
+	 * switch to a new shadow stack.
+	 */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	/*
+	 * For CLONE_VM, except vfork, the child needs a separate shadow
+	 * stack.
+	 */
+	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+		return 0;
+
+
+	size = adjust_shstk_size(stack_size);
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	shstk->base = addr;
+	shstk->size = size;
+
+	*shstk_addr = addr + size;
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -134,7 +168,13 @@ void shstk_free(struct task_struct *tsk)
 	    !features_enabled(ARCH_SHSTK_SHSTK))
 		return;
 
-	if (!tsk->mm)
+	/*
+	 * When fork() with CLONE_VM fails, the child (tsk) already has a
+	 * shadow stack allocated, and exit_thread() calls this function to
+	 * free it.  In this case the parent (current) and the child share
+	 * the same mm struct.
+	 */
+	if (!tsk->mm || tsk->mm != current->mm)
 		return;
 
 	unmap_shadow_stack(shstk->base, shstk->size);
-- 
2.17.1




* [PATCH v5 29/39] x86/shstk: Introduce routines modifying shstk
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (27 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 28/39] x86/shstk: Handle thread shadow stack Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-20  1:05   ` Kees Cook
  2023-01-19 21:23 ` [PATCH v5 30/39] x86/shstk: Handle signals for shadow stack Rick Edgecombe
                   ` (12 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Shadow stacks are normally written to via CALL/RET or specific CET
instructions like RSTORSSP/SAVEPREVSSP. However, during some Linux
operations the kernel will need to write to them directly using the
ring-0 only WRUSS instruction.

A shadow stack restore token marks a restore point of the shadow stack, and
the address in a token must point directly above the token, which is within
the same shadow stack. This is distinctly different from other pointers
on the shadow stack, since those pointers point to executable code.
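
As a rough illustration of that layout, the following userspace sketch
computes where a token goes and what it contains, following the
arithmetic of create_rstor_token() in the diff below. It only computes
values; nothing is written to a real shadow stack, and the sketch_*
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FRAME_SIZE 8

/*
 * For a shadow stack pointer 'ssp', the token occupies the 8 bytes just
 * below it and holds 'ssp' itself, i.e. the address directly above the
 * token. Bit 0 marks the token as 64-bit.
 * Returns 0 on success, -1 if 'ssp' is not 8-byte aligned.
 */
static int sketch_make_rstor_token(uint64_t ssp, uint64_t *token_addr,
				   uint64_t *token_val)
{
	assert(token_addr != NULL && token_val != NULL);

	if (ssp & 7)
		return -1;

	*token_addr = ssp - SKETCH_FRAME_SIZE;
	*token_val = ssp | 1;
	return 0;
}
```

For ssp = 0x7f0000002000 the token lives at 0x7f0000001ff8 and holds
0x7f0000002001.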

Introduce token setup and verify routines. Also introduce WRUSS, which is
a kernel-mode instruction but writes directly to user shadow stack.

In future patches that enable shadow stack to work with signals, the kernel
will need something to denote the point in the stack where sigreturn may be
called. This will prevent attackers calling sigreturn at arbitrary places
in the stack, in order to help prevent SROP attacks.

To do this, something that can only be written by the kernel needs to be
placed on the shadow stack. This can be accomplished by setting bit 63 in
the frame written to the shadow stack. Userspace return addresses can't
have this bit set as it is in the kernel range. It also can't be a
valid restore token.
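
The bit 63 encoding can be modelled in userspace as follows. This is a
sketch of the put_shstk_data()/get_shstk_data() logic from the diff
below, operating on an ordinary variable instead of a real shadow stack
slot; the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BIT63 (1ULL << 63)

/* Store 'data' with bit 63 set. Data that already has bit 63 set is refused. */
static int sketch_put_shstk_data(uint64_t *slot, uint64_t data)
{
	assert(slot != NULL);

	if (data & SKETCH_BIT63)
		return -1;

	*slot = data | SKETCH_BIT63;
	return 0;
}

/* Accept only values marked with bit 63, and strip the mark again. */
static int sketch_get_shstk_data(uint64_t slot, uint64_t *data)
{
	assert(data != NULL);

	if (!(slot & SKETCH_BIT63))
		return -1;

	*data = slot & ~SKETCH_BIT63;
	return 0;
}
```

A plain userspace return address or a restore token (bit 63 clear) is
rejected by sketch_get_shstk_data(), which is the property the signal
handling patch relies on.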

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---

v5:
 - Fix typo in commit log

v3:
 - Drop shstk_check_rstor_token()
 - Fail put_shstk_data() if bit 63 is set in the data (Kees)
 - Add comment in create_rstor_token() (Kees)
 - Pull in create_rstor_token() changes from future patch (Kees)

v2:
 - Add data helpers for writing to shadow stack.

v1:
 - Use xsave helpers.

 arch/x86/include/asm/special_insns.h | 13 +++++
 arch/x86/kernel/shstk.c              | 73 ++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index de48d1389936..d6cd9344f6c7 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -202,6 +202,19 @@ static inline void clwb(volatile void *__p)
 		: [pax] "a" (p));
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static inline int write_user_shstk_64(u64 __user *addr, u64 val)
+{
+	asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: [addr] "r" (addr), [val] "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EFAULT;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
 #define nop() asm volatile ("nop")
 
 static inline void serialize(void)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 111ea56115d2..3e470917eb0b 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -25,6 +25,8 @@
 #include <asm/fpu/api.h>
 #include <asm/prctl.h>
 
+#define SS_FRAME_SIZE 8
+
 static bool features_enabled(unsigned long features)
 {
 	return current->thread.features & features;
@@ -40,6 +42,35 @@ static void features_clr(unsigned long features)
 	current->thread.features &= ~features;
 }
 
+/*
+ * Create a restore token on the shadow stack.  A token is always 8-byte
+ * and aligned to 8.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+	unsigned long addr;
+
+	/* Token must be aligned */
+	if (!IS_ALIGNED(ssp, 8))
+		return -EINVAL;
+
+	addr = ssp - SS_FRAME_SIZE;
+
+	/*
+	 * SSP is aligned, so reserved bits and mode bit are a zero, just mark
+	 * the token 64-bit.
+	 */
+	ssp |= BIT(0);
+
+	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+		return -EFAULT;
+
+	if (token_addr)
+		*token_addr = addr;
+
+	return 0;
+}
+
 static unsigned long alloc_shstk(unsigned long size)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
@@ -160,6 +191,48 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 	return 0;
 }
 
+static unsigned long get_user_shstk_addr(void)
+{
+	unsigned long long ssp;
+
+	fpregs_lock_and_load();
+
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	fpregs_unlock();
+
+	return ssp;
+}
+
+static int put_shstk_data(u64 __user *addr, u64 data)
+{
+	if (WARN_ON_ONCE(data & BIT(63)))
+		return -EINVAL;
+
+	/*
+	 * Mark the high bit so that the sigframe can't be processed as a
+	 * return address.
+	 */
+	if (write_user_shstk_64(addr, data | BIT(63)))
+		return -EFAULT;
+	return 0;
+}
+
+static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
+{
+	unsigned long ldata;
+
+	if (unlikely(get_user(ldata, addr)))
+		return -EFAULT;
+
+	if (!(ldata & BIT(63)))
+		return -EINVAL;
+
+	*data = ldata & ~BIT(63);
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
-- 
2.17.1




* [PATCH v5 30/39] x86/shstk: Handle signals for shadow stack
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (28 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 29/39] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 31/39] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
                   ` (11 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When a signal is handled, the context is normally pushed to the stack
before the handler runs. For shadow stacks, since the shadow stack only
tracks return addresses, there isn't any state that needs to be pushed.
However, there are still a few things that need to be done. These things
are visible to userspace and will be kernel ABI for shadow stacks.

One is to make sure the restorer address is written to shadow stack, since
the signal handler (if not changing ucontext) returns to the restorer, and
the restorer calls sigreturn. So add the restorer on the shadow stack
before handling the signal, so there is not a conflict when the signal
handler returns to the restorer.

The other thing to do is to place some type of checkable token on the
thread's shadow stack before handling the signal and check it during
sigreturn. This is an extra layer of protection to hamper attackers
calling sigreturn manually as in SROP-like attacks.

For this token we can use the shadow stack data format defined earlier.
Have the data pushed be the previous SSP. In the future the sigreturn
might want to return back to a different stack. Storing the SSP (instead
of a restore offset or something) allows for future functionality that
may want to restore to a different stack.

So, when handling a signal, push:
 - the previous SSP, in the shadow stack data format
 - the restorer address, below that SSP entry.

In sigreturn, verify SSP is stored in the data format and pop the shadow
stack.
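
The push and pop sequence can be traced with a small userspace model. It
is a sketch only: the shadow stack is an ordinary array, the RET executed
by the signal handler is modelled by sketch_ret(), and every sketch_*
name is hypothetical. The layout follows setup_signal_shadow_stack() and
restore_signal_shadow_stack() in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BIT63 (1ULL << 63)
#define SKETCH_FRAME 8
#define SKETCH_SLOTS 16

struct sketch_shstk {
	uint64_t base;                 /* lowest address of the model stack */
	uint64_t slots[SKETCH_SLOTS];
	uint64_t ssp;                  /* current shadow stack pointer */
};

/* Map a model address to its slot, or NULL if unaligned or out of range. */
static uint64_t *sketch_slot(struct sketch_shstk *s, uint64_t addr)
{
	if ((addr & 7) || addr < s->base ||
	    addr >= s->base + SKETCH_SLOTS * SKETCH_FRAME)
		return NULL;
	return &s->slots[(addr - s->base) / SKETCH_FRAME];
}

/* Signal delivery: push the old SSP (bit 63 set), then the restorer. */
static int sketch_signal_setup(struct sketch_shstk *s, uint64_t restorer)
{
	uint64_t *data = sketch_slot(s, s->ssp - SKETCH_FRAME);
	uint64_t *ret = sketch_slot(s, s->ssp - 2 * SKETCH_FRAME);

	if (!data || !ret)
		return -1;

	*data = s->ssp | SKETCH_BIT63;
	*ret = restorer;
	s->ssp -= 2 * SKETCH_FRAME;
	return 0;
}

/* Models the handler's RET: pop one return address (0 if out of range). */
static uint64_t sketch_ret(struct sketch_shstk *s)
{
	uint64_t *slot = sketch_slot(s, s->ssp);

	if (!slot)
		return 0;
	s->ssp += SKETCH_FRAME;
	return *slot;
}

/* sigreturn: the entry at SSP must carry bit 63 and an aligned SSP. */
static int sketch_sigreturn(struct sketch_shstk *s)
{
	uint64_t *slot = sketch_slot(s, s->ssp);
	uint64_t prev;

	if (!slot || !(*slot & SKETCH_BIT63))
		return -1;

	prev = *slot & ~SKETCH_BIT63;
	if (prev & 7)
		return -1;

	s->ssp = prev;
	return 0;
}
```

In this model, calling sketch_sigreturn() while SSP still points at the
restorer address fails, because that entry does not carry bit 63. That
corresponds to the SROP hardening described above.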

Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
---

v3:
 - Drop shstk_setup_rstor_token() (Kees)
 - Drop x32 signal support, since x32 support is dropped

v2:
 - Switch to new shstk signal format

v1:
 - Use xsave helpers.
 - Expand commit log.

Yu-cheng v27:
 - Eliminate saving shadow stack pointer to signal context.

 arch/x86/include/asm/shstk.h |  5 ++
 arch/x86/kernel/shstk.c      | 98 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/signal.c     |  1 +
 arch/x86/kernel/signal_64.c  |  6 +++
 4 files changed, 110 insertions(+)

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 172a69052770..746c040f7cb6 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct ksignal;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
 struct thread_shstk {
@@ -19,6 +20,8 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
 			     unsigned long stack_size,
 			     unsigned long *shstk_addr);
 void shstk_free(struct task_struct *p);
+int setup_signal_shadow_stack(struct ksignal *ksig);
+int restore_signal_shadow_stack(void);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			     unsigned long features) { return -EINVAL; }
@@ -28,6 +31,8 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
 					   unsigned long stack_size,
 					   unsigned long *shstk_addr) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
+static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
+static inline int restore_signal_shadow_stack(void) { return 0; }
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 3e470917eb0b..56e7ca8e42cc 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -233,6 +233,104 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
 	return 0;
 }
 
+static int shstk_push_sigframe(unsigned long *ssp)
+{
+	unsigned long target_ssp = *ssp;
+
+	/* Token must be aligned */
+	if (!IS_ALIGNED(*ssp, 8))
+		return -EINVAL;
+
+	if (!IS_ALIGNED(target_ssp, 8))
+		return -EINVAL;
+
+	*ssp -= SS_FRAME_SIZE;
+	if (put_shstk_data((void *__user)*ssp, target_ssp))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int shstk_pop_sigframe(unsigned long *ssp)
+{
+	unsigned long token_addr;
+	int err;
+
+	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
+	if (unlikely(err))
+		return err;
+
+	/* Restore SSP aligned? */
+	if (unlikely(!IS_ALIGNED(token_addr, 8)))
+		return -EINVAL;
+
+	/* SSP in userspace? */
+	if (unlikely(token_addr >= TASK_SIZE_MAX))
+		return -EINVAL;
+
+	*ssp = token_addr;
+
+	return 0;
+}
+
+int setup_signal_shadow_stack(struct ksignal *ksig)
+{
+	void __user *restorer = ksig->ka.sa.sa_restorer;
+	unsigned long ssp;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	if (!restorer)
+		return -EINVAL;
+
+	ssp = get_user_shstk_addr();
+	if (unlikely(!ssp))
+		return -EINVAL;
+
+	err = shstk_push_sigframe(&ssp);
+	if (unlikely(err))
+		return err;
+
+	/* Push restorer address */
+	ssp -= SS_FRAME_SIZE;
+	err = write_user_shstk_64((u64 __user *)ssp, (u64)restorer);
+	if (unlikely(err))
+		return -EFAULT;
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
+
+	return 0;
+}
+
+int restore_signal_shadow_stack(void)
+{
+	unsigned long ssp;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	ssp = get_user_shstk_addr();
+	if (unlikely(!ssp))
+		return -EINVAL;
+
+	err = shstk_pop_sigframe(&ssp);
+	if (unlikely(err))
+		return err;
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 004cb30b7419..356253e85ce9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -40,6 +40,7 @@
 #include <asm/syscall.h>
 #include <asm/sigframe.h>
 #include <asm/signal.h>
+#include <asm/shstk.h>
 
 static inline int is_ia32_compat_frame(struct ksignal *ksig)
 {
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 0e808c72bf7e..cacf2ede6217 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -175,6 +175,9 @@ int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 	frame = get_sigframe(ksig, regs, sizeof(struct rt_sigframe), &fp);
 	uc_flags = frame_uc_flags(regs);
 
+	if (setup_signal_shadow_stack(ksig))
+		return -EFAULT;
+
 	if (!user_access_begin(frame, sizeof(*frame)))
 		return -EFAULT;
 
@@ -260,6 +263,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
 		goto badframe;
 
+	if (restore_signal_shadow_stack())
+		goto badframe;
+
 	if (restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
-- 
2.17.1




* [PATCH v5 31/39] x86/shstk: Introduce map_shadow_stack syscall
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (29 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 30/39] x86/shstk: Handle signals for shadow stack Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-20  1:07   ` Kees Cook
  2023-01-19 21:23 ` [PATCH v5 32/39] x86/shstk: Support WRSS for userspace Rick Edgecombe
                   ` (10 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads. However, in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace to allocate and
pivot to userspace-managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be set
up with a restore token so that userspace can pivot to them via the
RSTORSSP instruction. But the security design of shadow stacks is that
they should not be written to except in limited circumstances. This
presents a problem for userspace: how can it provision this special data
without allowing the shadow stack to be generally writable?

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a few downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
   ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
   restore tokens being written into the middle of pre-used shadow stacks.
   It is ideal to prevent restore tokens being added at arbitrary
   locations, so the check was to make sure the shadow stack had never been
   written to.
3. It stood out from the rest of the madvise flags, as more of a direct
   action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and set up new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
set up its own shadow stacks using the WRSS instruction. To this end,
provide a flag so that stacks can be optionally set up securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
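
After such a call, userspace needs to know where the restore token ended
up. Judging from alloc_shstk() in the diff below, which calls
create_rstor_token(mapped_addr + size, NULL), the token sits in the last
8 bytes of the requested size. The following sketch only computes that
location from the returned address; it makes no syscall, and
sketch_token_location() and struct sketch_token are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_token {
	uint64_t addr;  /* where the kernel wrote the token */
	uint64_t val;   /* expected token contents */
};

/*
 * mapped_addr: page-aligned address returned by map_shadow_stack()
 * size:        the size argument passed to the syscall
 *
 * Returns 0 and fills 'tok', or -1 if no valid token fits: the size
 * must be at least 8, and mapped_addr + size must be 8-byte aligned.
 */
static int sketch_token_location(uint64_t mapped_addr, uint64_t size,
				 struct sketch_token *tok)
{
	assert(tok != NULL);

	if (size < 8 || ((mapped_addr + size) & 7))
		return -1;

	tok->addr = mapped_addr + size - 8;
	tok->val = (mapped_addr + size) | 1;  /* bit 0: 64-bit token */
	return 0;
}
```

The token address is the value userspace would hand to RSTORSSP in order
to pivot to the new stack. Note that the token position follows the
requested size, not the page-aligned size of the mapping.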

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Fix addr/mapped_addr (Kees)
 - Switch to EOPNOTSUPP (Kees suggested ENOTSUPP, but checkpatch
   suggests this)
 - Return error for addresses below 4G

v3:
 - Change syscall common -> 64 (Kees)
 - Use bit shift notation instead of 0x1 for uapi header (Kees)
 - Call do_mmap() with MAP_FIXED_NOREPLACE (Kees)
 - Block unsupported flags (Kees)
 - Require size >= 8 to set token (Kees)

v2:
 - Change syscall to take address like mmap() for CRIU's usage

v1:
 - New patch (replaces PROT_SHADOW_STACK).

 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 arch/x86/include/uapi/asm/mman.h       |  3 ++
 arch/x86/kernel/shstk.c                | 59 ++++++++++++++++++++++----
 include/linux/syscalls.h               |  1 +
 include/uapi/asm-generic/unistd.h      |  2 +-
 kernel/sys_ni.c                        |  1 +
 6 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..f65c671ce3b1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	64	map_shadow_stack	sys_map_shadow_stack
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 5a0256e73f1e..8148bdddbd2c 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -13,6 +13,9 @@
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
+/* Flags for map_shadow_stack(2) */
+#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)	/* Set up a restore token in the shadow stack */
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 56e7ca8e42cc..e857083b9e14 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -17,6 +17,7 @@
 #include <linux/compat.h>
 #include <linux/sizes.h>
 #include <linux/user.h>
+#include <linux/syscalls.h>
 #include <asm/msr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/types.h>
@@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
 	return 0;
 }
 
-static unsigned long alloc_shstk(unsigned long size)
+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
+				 unsigned long token_offset, bool set_res_tok)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
 	struct mm_struct *mm = current->mm;
-	unsigned long addr, unused;
+	unsigned long mapped_addr, unused;
 
-	mmap_write_lock(mm);
-	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
-		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+	if (addr)
+		flags |= MAP_FIXED_NOREPLACE;
 
+	mmap_write_lock(mm);
+	mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+			      VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 
-	return addr;
+	if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
+		goto out;
+
+	if (create_rstor_token(mapped_addr + token_offset, NULL)) {
+		vm_munmap(mapped_addr, size);
+		return -EINVAL;
+	}
+
+out:
+	return mapped_addr;
 }
 
 static unsigned long adjust_shstk_size(unsigned long size)
@@ -134,7 +147,7 @@ static int shstk_setup(void)
 		return -EOPNOTSUPP;
 
 	size = adjust_shstk_size(0);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -179,7 +192,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 
 
 	size = adjust_shstk_size(stack_size);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -373,6 +386,36 @@ static int shstk_disable(void)
 	return 0;
 }
 
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+	bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+	unsigned long aligned_size;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	if (flags & ~SHADOW_STACK_SET_TOKEN)
+		return -EINVAL;
+
+	/* If there isn't space for a token */
+	if (set_tok && size < 8)
+		return -EINVAL;
+
+	if (addr && addr <= 0xFFFFFFFF)
+		return -EINVAL;
+
+	/*
+	 * An overflow would result in attempting to write the restore token
+	 * to the wrong location. Not catastrophic, but just return the right
+	 * error code and block it.
+	 */
+	aligned_size = PAGE_ALIGN(size);
+	if (aligned_size < size)
+		return -EOVERFLOW;
+
+	return alloc_shstk(addr, aligned_size, size, set_tok);
+}
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_SHSTK_LOCK) {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33a0ee3bcb2e..392dc11e3556 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1058,6 +1058,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b12940ec5926 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..cb9aebd34646 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -381,6 +381,7 @@ COND_SYSCALL(vm86old);
 COND_SYSCALL(modify_ldt);
 COND_SYSCALL(vm86);
 COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);
 
 /* s390 */
 COND_SYSCALL(s390_pci_mmio_read);
-- 
2.17.1




* [PATCH v5 32/39] x86/shstk: Support WRSS for userspace
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (30 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 31/39] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-20  1:06   ` Kees Cook
  2023-01-19 21:23 ` [PATCH v5 33/39] x86: Expose thread features in /proc/$PID/status Rick Edgecombe
                   ` (9 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

For the current shadow stack implementation, shadow stack contents can't
easily be provisioned with arbitrary data. This property helps apps
protect themselves better, but also restricts any potential apps that may
want to do exotic things at the expense of a little security.

The x86 shadow stack feature introduces a new instruction, WRSS, which
can be enabled to write directly to shadow stack permissioned memory from
userspace. Allow it to get enabled via the prctl interface.

Only enable the userspace WRSS instruction, which allows writes to
userspace shadow stacks from userspace. Do not allow it to be enabled
independently of shadow stack, as HW does not support using WRSS when
shadow stack is disabled.
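
The gating rule described above can be sketched as a small userspace model.
This is only an illustration under stated assumptions: struct thread_model
and the model_* functions are hypothetical names, and the CPU feature check
and the MSR write done by the real wrss_control() are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Feature bits as defined in this patch's asm/prctl.h hunk. */
#define ARCH_SHSTK_SHSTK	(1ULL << 0)
#define ARCH_SHSTK_WRSS		(1ULL << 1)

/* Hypothetical stand-in for the per-thread feature word. */
struct thread_model {
	unsigned long long features;
};

/* Models the checks in wrss_control(), minus the CPU and MSR parts. */
static int model_wrss_control(struct thread_model *t, bool enable)
{
	assert(t != NULL);

	/* WRSS cannot be toggled unless shadow stack is already enabled. */
	if (!(t->features & ARCH_SHSTK_SHSTK))
		return -EPERM;

	/* Already in the requested state: nothing to do. */
	if (!!(t->features & ARCH_SHSTK_WRSS) == enable)
		return 0;

	if (enable)
		t->features |= ARCH_SHSTK_WRSS;
	else
		t->features &= ~ARCH_SHSTK_WRSS;

	return 0;
}

/* Models the tail of shstk_disable(): both bits are cleared together. */
static void model_shstk_disable(struct thread_model *t)
{
	assert(t != NULL);
	t->features &= ~(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS);
}
```

In this model, enabling WRSS on a thread without shadow stack returns -EPERM,
and disabling shadow stack always drops WRSS with it, so WRSS can never be
left enabled on its own.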

From a fault handler perspective, WRSS will behave very similarly to
WRUSS, which is treated like a user access from a #PF err code perspective.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Switch to EOPNOTSUPP
 - Move set_clr_bits_msrl() to patch where it is first used
 - Commit log formatting

v3:
 - Make wrss_control() static
 - Fix verbiage in commit log (Kees)

v2:
 - Add some commit log verbiage from (Dave Hansen)

v1:
 - New patch.
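
For readers following the set_clr_bits_msrl() helper mentioned in the v5
notes: it does a read-modify-write and skips the write when the value would
not change. A minimal userspace sketch of that logic, assuming a fake MSR
(struct fake_msr and model_set_clr_bits() are hypothetical, not kernel
names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fake MSR so the logic can run in userspace. */
struct fake_msr {
	uint64_t val;
	unsigned int writes;	/* counts simulated wrmsrl() calls */
};

/* Same arithmetic as the macro: clear bits first, then set bits. */
static void model_set_clr_bits(struct fake_msr *msr, uint64_t set,
			       uint64_t clear)
{
	uint64_t new_val;

	assert(msr != NULL);
	new_val = (msr->val & ~clear) | set;

	/* Only "write" the register if the value actually changes. */
	if (new_val != msr->val) {
		msr->val = new_val;
		msr->writes++;
	}
}
```

Setting a bit that is already set, or clearing one that is already clear,
leaves the write counter untouched in this model.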

 arch/x86/include/asm/msr.h        | 11 +++++++++++
 arch/x86/include/uapi/asm/prctl.h |  1 +
 arch/x86/kernel/shstk.c           | 31 ++++++++++++++++++++++++++++++-
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..a4b86eb537d6 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -310,6 +310,17 @@ void msrs_free(struct msr *msrs);
 int msr_set_bit(u32 msr, u8 bit);
 int msr_clear_bit(u32 msr, u8 bit);
 
+/* Helper that can never get accidentally un-inlined. */
+#define set_clr_bits_msrl(msr, set, clear)	do {	\
+	u64 __val, __new_val;				\
+							\
+	rdmsrl(msr, __val);				\
+	__new_val = (__val & ~(clear)) | (set);		\
+							\
+	if (__new_val != __val)				\
+		wrmsrl(msr, __new_val);			\
+} while (0)
+
 #ifdef CONFIG_SMP
 int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
 int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 7dfd9dc00509..e31495668056 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -28,5 +28,6 @@
 
 /* ARCH_SHSTK_ features bits */
 #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
+#define ARCH_SHSTK_WRSS			(1ULL <<  1)
 
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index e857083b9e14..71dbb49b93cd 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -364,6 +364,35 @@ void shstk_free(struct task_struct *tsk)
 	unmap_shadow_stack(shstk->base, shstk->size);
 }
 
+static int wrss_control(bool enable)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	/*
+	 * Only enable wrss if shadow stack is enabled. If shadow stack is not
+	 * enabled, wrss will already be disabled, so don't bother clearing it
+	 * when disabling.
+	 */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return -EPERM;
+
+	/* Already enabled/disabled? */
+	if (features_enabled(ARCH_SHSTK_WRSS) == enable)
+		return 0;
+
+	fpregs_lock_and_load();
+	if (enable) {
+		set_clr_bits_msrl(MSR_IA32_U_CET, CET_WRSS_EN, 0);
+		features_set(ARCH_SHSTK_WRSS);
+	} else {
+		set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_WRSS_EN);
+		features_clr(ARCH_SHSTK_WRSS);
+	}
+	fpregs_unlock();
+
+	return 0;
+}
 
 static int shstk_disable(void)
 {
@@ -381,7 +410,7 @@ static int shstk_disable(void)
 	fpregs_unlock();
 
 	shstk_free(current);
-	features_clr(ARCH_SHSTK_SHSTK);
+	features_clr(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS);
 
 	return 0;
 }
-- 
2.17.1




* [PATCH v5 33/39] x86: Expose thread features in /proc/$PID/status
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (31 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 32/39] x86/shstk: Support WRSS for userspace Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 34/39] x86/shstk: Wire in shadow stack interface Rick Edgecombe
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

Applications and loaders can have logic to decide whether to enable
shadow stack. They usually don't report whether shadow stack has been
enabled or not, so there is no way to verify whether an application
actually is protected by shadow stack.

Add two lines in /proc/$PID/status to report enabled and locked features.

Since this involves referring to arch specific defines in asm/prctl.h,
implement an arch breakout to emit the feature lines.
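
As an illustration of how a tool might consume the new lines, here is a
sketch of a parser for a single /proc/$PID/status line. It assumes the
output format produced by the diff in this patch (a key, a tab, then
space-separated "shstk"/"wrss" names); parse_thread_features() is a
hypothetical helper, not part of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Feature bits, matching the ARCH_SHSTK_ defines in asm/prctl.h. */
#define ARCH_SHSTK_SHSTK	(1L << 0)
#define ARCH_SHSTK_WRSS		(1L << 1)

/*
 * Parse one status line. Returns the feature mask if the line starts with
 * the given key (for example "x86_Thread_features:"), or -1 if it does not.
 */
static long parse_thread_features(const char *line, const char *key)
{
	size_t klen;
	long mask = 0;
	const char *p;

	assert(line != NULL && key != NULL);
	klen = strlen(key);

	if (strncmp(line, key, klen) != 0)
		return -1;

	p = line + klen;
	while (*p) {
		size_t n;

		/* Skip separators, then measure the next feature name. */
		p += strspn(p, " \t\n");
		n = strcspn(p, " \t\n");

		if (n == 5 && !strncmp(p, "shstk", 5))
			mask |= ARCH_SHSTK_SHSTK;
		else if (n == 4 && !strncmp(p, "wrss", 4))
			mask |= ARCH_SHSTK_WRSS;

		p += n;
	}

	return mask;
}
```

Because the key includes the trailing colon, "x86_Thread_features:" does not
match the "x86_Thread_features_locked:" line, so the two masks can be read
separately.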

Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[Switched to CET, added to commit log]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v4:
 - Remove "CET" references

v3:
 - Move to /proc/pid/status (Kees)

v2:
 - New patch

 arch/x86/kernel/cpu/proc.c | 23 +++++++++++++++++++++++
 fs/proc/array.c            |  6 ++++++
 include/linux/proc_fs.h    |  2 ++
 3 files changed, 31 insertions(+)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 099b6f0d96bd..31c0e68f6227 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -4,6 +4,8 @@
 #include <linux/string.h>
 #include <linux/seq_file.h>
 #include <linux/cpufreq.h>
+#include <asm/prctl.h>
+#include <linux/proc_fs.h>
 
 #include "cpu.h"
 
@@ -175,3 +177,24 @@ const struct seq_operations cpuinfo_op = {
 	.stop	= c_stop,
 	.show	= show_cpuinfo,
 };
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static void dump_x86_features(struct seq_file *m, unsigned long features)
+{
+	if (features & ARCH_SHSTK_SHSTK)
+		seq_puts(m, "shstk ");
+	if (features & ARCH_SHSTK_WRSS)
+		seq_puts(m, "wrss ");
+}
+
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task)
+{
+	seq_puts(m, "x86_Thread_features:\t");
+	dump_x86_features(m, task->thread.features);
+	seq_putc(m, '\n');
+
+	seq_puts(m, "x86_Thread_features_locked:\t");
+	dump_x86_features(m, task->thread.features_locked);
+	seq_putc(m, '\n');
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49283b8103c7..7ac43ecda1c2 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -428,6 +428,11 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+__weak void arch_proc_pid_thread_features(struct seq_file *m,
+					  struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -451,6 +456,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	arch_proc_pid_thread_features(m, task);
 	return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 0260f5ea98fe..80ff8e533cbd 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -158,6 +158,8 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task);
 #endif /* CONFIG_PROC_PID_ARCH_STATUS */
 
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task);
+
 #else /* CONFIG_PROC_FS */
 
 static inline void proc_root_init(void)
-- 
2.17.1




* [PATCH v5 34/39] x86/shstk: Wire in shadow stack interface
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (32 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 33/39] x86: Expose thread features in /proc/$PID/status Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 35/39] selftests/x86: Add shadow stack test Rick Edgecombe
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

The kernel now has the main shadow stack functionality to support
applications. Wire the WRSS and shadow stack enable/disable functions
into the existing shadow stack API skeleton.
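
The wiring added by the hunk in this patch can be summarized with a small
model of which handler gets picked. This is a sketch only: enum
shstk_action and pick_action() are hypothetical names, and any checks that
shstk_prctl() performs before this point (they are not part of the hunk)
are not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Feature bits, matching the ARCH_SHSTK_ defines in asm/prctl.h. */
#define ARCH_SHSTK_SHSTK	(1ULL << 0)
#define ARCH_SHSTK_WRSS		(1ULL << 1)

/* Hypothetical labels for the handler the prctl would end up calling. */
enum shstk_action {
	ACT_EINVAL,		/* return -EINVAL */
	ACT_SHSTK_SETUP,	/* shstk_setup() */
	ACT_SHSTK_DISABLE,	/* shstk_disable() */
	ACT_WRSS_ENABLE,	/* wrss_control(true) */
	ACT_WRSS_DISABLE,	/* wrss_control(false) */
};

/* Models only the tail of shstk_prctl() that this patch fills in. */
static enum shstk_action pick_action(bool disable, unsigned long long features)
{
	if (disable) {
		if (features & ARCH_SHSTK_WRSS)
			return ACT_WRSS_DISABLE;
		if (features & ARCH_SHSTK_SHSTK)
			return ACT_SHSTK_DISABLE;
		return ACT_EINVAL;
	}

	/* Everything else is the enable path. */
	if (features & ARCH_SHSTK_SHSTK)
		return ACT_SHSTK_SETUP;
	if (features & ARCH_SHSTK_WRSS)
		return ACT_WRSS_ENABLE;
	return ACT_EINVAL;
}
```

A request with no recognized feature bit falls through to -EINVAL on both
paths, which is the behavior the skeleton had before this patch.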

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v4:
 - Remove "CET" references

v2:
 - Split from other patches

 arch/x86/kernel/shstk.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 71dbb49b93cd..07142e6f05f6 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -465,9 +465,17 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 		return -EINVAL;
 
 	if (option == ARCH_SHSTK_DISABLE) {
+		if (features & ARCH_SHSTK_WRSS)
+			return wrss_control(false);
+		if (features & ARCH_SHSTK_SHSTK)
+			return shstk_disable();
 		return -EINVAL;
 	}
 
 	/* Handle ARCH_SHSTK_ENABLE */
+	if (features & ARCH_SHSTK_SHSTK)
+		return shstk_setup();
+	if (features & ARCH_SHSTK_WRSS)
+		return wrss_control(true);
 	return -EINVAL;
 }
-- 
2.17.1




* [PATCH v5 35/39] selftests/x86: Add shadow stack test
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (33 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 34/39] x86/shstk: Wire in shadow stack interface Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 36/39] x86/fpu: Add helper for initing features Rick Edgecombe
                   ` (6 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

Add a simple selftest for exercising some shadow stack behavior:
 - map_shadow_stack syscall and pivot
 - Faulting in shadow stack memory
 - Handling shadow stack violations
 - GUP of shadow stack memory
 - mprotect() of shadow stack memory
 - Userfaultfd on shadow stack memory

Since this test exercises a recently added syscall manually, it needs
to find the automatically created __NR_foo defines. Per the selftest
documentation, KHDR_INCLUDES can be used to help the selftest Makefiles
find the headers from the kernel source. This way the new selftest can
be built inside the kernel source tree without installing the headers
to the system. So also add KHDR_INCLUDES as described in the selftest
docs, to facilitate this.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Update 32 bit signal test with new ABI and better asm

v4:
 - Add test for 32 bit signal ABI blocking

v3:
 - Change "+m" to "=m" in write_shstk() (Andrew Cooper)
 - Fix userfaultfd test with transparent huge pages by doing a
   MADV_DONTNEED, since the token write faults in the whole stack with
   huge pages.

v2:
 - Change print statements to align better with other selftests
 - Add more tests
 - Add KHDR_INCLUDES to Makefile

 tools/testing/selftests/x86/Makefile          |   4 +-
 .../testing/selftests/x86/test_shadow_stack.c | 667 ++++++++++++++++++
 2 files changed, 669 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 0388c4d60af0..cfc8a26ad151 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
 TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
-			corrupt_xstate_header amx
+			corrupt_xstate_header amx test_shadow_stack
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
@@ -34,7 +34,7 @@ BINARIES_64 := $(TARGETS_C_64BIT_ALL:%=%_64)
 BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
 BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64))
 
-CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
+CFLAGS := -O2 -g -std=gnu99 -pthread -Wall $(KHDR_INCLUDES)
 
 # call32_from_64 in thunks.S uses absolute addresses.
 ifeq ($(CAN_BUILD_WITH_NOPIE),1)
diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
new file mode 100644
index 000000000000..5a3b4f6d1a1d
--- /dev/null
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -0,0 +1,667 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program tests basic kernel shadow stack support. It enables shadow
+ * stack manually via arch_prctl(), instead of relying on glibc. Its
+ * Makefile doesn't compile with shadow stack support, so it doesn't rely on
+ * any particular glibc. As a result it can't do any operations that require
+ * special glibc shadow stack support (longjmp(), swapcontext(), etc). Just
+ * stick to the basics and hope the compiler doesn't do anything strange.
+ */
+
+#define _GNU_SOURCE
+
+#include <sys/syscall.h>
+#include <asm/mman.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <x86intrin.h>
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#include <stdint.h>
+#include <signal.h>
+#include <pthread.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <setjmp.h>
+
+#define SS_SIZE 0x200000
+
+#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
+int main(int argc, char *argv[])
+{
+	printf("[SKIP]\tCompiler does not support CET.\n");
+	return 0;
+}
+#else
+void write_shstk(unsigned long *addr, unsigned long val)
+{
+	asm volatile("wrssq %[val], (%[addr])\n"
+		     : "=m" (*addr)
+		     : [addr] "r" (addr), [val] "r" (val));
+}
+
+static inline unsigned long __attribute__((always_inline)) get_ssp(void)
+{
+	unsigned long ret = 0;
+
+	asm volatile("xor %0, %0; rdsspq %0" : "=r" (ret));
+	return ret;
+}
+
+/*
+ * For use in inline enablement of shadow stack.
+ *
+ * The program can't return from the point where shadow stack gets enabled
+ * because there will be no address on the shadow stack. So it can't use
+ * syscall() for enablement, since it is a function.
+ *
+ * Based on code from nolibc.h. Keep a copy here because this can't pull in all
+ * of nolibc.h.
+ */
+#define ARCH_PRCTL(arg1, arg2)					\
+({								\
+	long _ret;						\
+	register long _num  asm("eax") = __NR_arch_prctl;	\
+	register long _arg1 asm("rdi") = (long)(arg1);		\
+	register long _arg2 asm("rsi") = (long)(arg2);		\
+								\
+	asm volatile (						\
+		"syscall\n"					\
+		: "=a"(_ret)					\
+		: "r"(_arg1), "r"(_arg2),			\
+		  "0"(_num)					\
+		: "rcx", "r11", "memory", "cc"			\
+	);							\
+	_ret;							\
+})
+
+void *create_shstk(void *addr)
+{
+	return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
+}
+
+void *create_normal_mem(void *addr)
+{
+	return mmap(addr, SS_SIZE, PROT_READ | PROT_WRITE,
+		    MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+}
+
+void free_shstk(void *shstk)
+{
+	munmap(shstk, SS_SIZE);
+}
+
+int reset_shstk(void *shstk)
+{
+	return madvise(shstk, SS_SIZE, MADV_DONTNEED);
+}
+
+void try_shstk(unsigned long new_ssp)
+{
+	unsigned long ssp;
+
+	printf("[INFO]\tnew_ssp = %lx, *new_ssp = %lx\n",
+		new_ssp, *((unsigned long *)new_ssp));
+
+	ssp = get_ssp();
+	printf("[INFO]\tchanging ssp from %lx to %lx\n", ssp, new_ssp);
+
+	asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
+	asm volatile("saveprevssp");
+	printf("[INFO]\tssp is now %lx\n", get_ssp());
+
+	/* Switch back to original shadow stack */
+	ssp -= 8;
+	asm volatile("rstorssp (%0)\n":: "r" (ssp));
+	asm volatile("saveprevssp");
+}
+
+int test_shstk_pivot(void)
+{
+	void *shstk = create_shstk(0);
+
+	if (shstk == MAP_FAILED) {
+		printf("[FAIL]\tError creating shadow stack: %d\n", errno);
+		return 1;
+	}
+	try_shstk((unsigned long)shstk + SS_SIZE - 8);
+	free_shstk(shstk);
+
+	printf("[OK]\tShadow stack pivot\n");
+	return 0;
+}
+
+int test_shstk_faults(void)
+{
+	unsigned long *shstk = create_shstk(0);
+
+	/* Read shadow stack, test if it's zero to not get read optimized out */
+	if (*shstk != 0)
+		goto err;
+
+	/* Wrss memory that was already read. */
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	/* Page out memory, so we can wrss it again. */
+	if (reset_shstk((void *)shstk))
+		goto err;
+
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	printf("[OK]\tShadow stack faults\n");
+	return 0;
+
+err:
+	return 1;
+}
+
+unsigned long saved_ssp;
+unsigned long saved_ssp_val;
+volatile bool segv_triggered;
+
+void __attribute__((noinline)) violate_ss(void)
+{
+	saved_ssp = get_ssp();
+	saved_ssp_val = *(unsigned long *)saved_ssp;
+
+	/* Corrupt shadow stack */
+	printf("[INFO]\tCorrupting shadow stack\n");
+	write_shstk((void *)saved_ssp, 0);
+}
+
+void segv_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tGenerated shadow stack violation successfully\n");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	write_shstk((void *)saved_ssp, saved_ssp_val);
+}
+
+int test_shstk_violation(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = segv_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* Make sure segv_triggered is set before violate_ss() */
+	asm volatile("" : : : "memory");
+
+	violate_ss();
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow stack violation test\n");
+
+	return !segv_triggered;
+}
+
+/* Gup test state */
+#define MAGIC_VAL 0x12345678
+bool is_shstk_access;
+void *shstk_ptr;
+int fd;
+
+void reset_test_shstk(void *addr)
+{
+	if (shstk_ptr != NULL)
+		free_shstk(shstk_ptr);
+	shstk_ptr = create_shstk(addr);
+}
+
+void test_access_fix_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tViolation from %s\n", is_shstk_access ? "shstk access" : "normal write");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	if (is_shstk_access) {
+		reset_test_shstk(shstk_ptr);
+		return;
+	}
+
+	free_shstk(shstk_ptr);
+	create_normal_mem(shstk_ptr);
+}
+
+bool test_shstk_access(void *ptr)
+{
+	is_shstk_access = true;
+	segv_triggered = false;
+	write_shstk(ptr, MAGIC_VAL);
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool test_write_access(void *ptr)
+{
+	is_shstk_access = false;
+	segv_triggered = false;
+	*(unsigned long *)ptr = MAGIC_VAL;
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool gup_write(void *ptr)
+{
+	unsigned long val = MAGIC_VAL;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (write(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+bool gup_read(void *ptr)
+{
+	unsigned long val;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (read(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+int test_gup(void)
+{
+	struct sigaction sa = {};
+	int status;
+	pid_t pid;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	fd = open("/proc/self/mem", O_RDWR);
+	if (fd == -1)
+		return 1;
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> write access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> write access success\n");
+
+	close(fd);
+
+	/* COW/gup test */
+	reset_test_shstk(0);
+	pid = fork();
+	if (!pid) {
+		fd = open("/proc/self/mem", O_RDWR);
+		if (fd == -1)
+			exit(1);
+
+		if (gup_write(shstk_ptr)) {
+			close(fd);
+			exit(1);
+		}
+		close(fd);
+		exit(0);
+	}
+	waitpid(pid, &status, 0);
+	if (WEXITSTATUS(status)) {
+		printf("[FAIL]\tWrite in child failed\n");
+		return 1;
+	}
+	if (*(unsigned long *)shstk_ptr == MAGIC_VAL) {
+		printf("[FAIL]\tWrite in child wrote through to shared memory\n");
+		return 1;
+	}
+
+	printf("[INFO]\tCow gup write -> write access success\n");
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow gup test\n");
+
+	return 0;
+}
+
+int test_mprotect(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* mprotect a shadow stack as read only */
+	reset_test_shstk(0);
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+		return 1;
+	}
+
+	/* try to wrss it and fail */
+	if (!test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to read-only memory succeeded\n");
+		return 1;
+	}
+
+	/* then back to writable */
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_WRITE | PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_WRITE) failed\n");
+		return 1;
+	}
+
+	/* then pivot to it and succeed */
+	if (test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to mprotect() writable memory failed\n");
+		return 1;
+	}
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tmprotect() test\n");
+
+	return 0;
+}
+
+char zero[4096];
+
+static void *uffd_thread(void *arg)
+{
+	struct uffdio_copy req;
+	int uffd = *(int *)arg;
+	struct uffd_msg msg;
+
+	if (read(uffd, &msg, sizeof(msg)) <= 0)
+		return (void *)1;
+
+	req.dst = msg.arg.pagefault.address;
+	req.src = (__u64)zero;
+	req.len = 4096;
+	req.mode = 0;
+
+	if (ioctl(uffd, UFFDIO_COPY, &req))
+		return (void *)1;
+
+	return (void *)0;
+}
+
+int test_userfaultfd(void)
+{
+	struct uffdio_register uffdio_register;
+	struct uffdio_api uffdio_api;
+	struct sigaction sa = {};
+	pthread_t thread;
+	void *res;
+	int uffd;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0) {
+		printf("[SKIP]\tUserfaultfd unavailable.\n");
+		return 0;
+	}
+
+	reset_test_shstk(0);
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+		goto err;
+
+	uffdio_register.range.start = (__u64)shstk_ptr;
+	uffdio_register.range.len = 4096;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		goto err;
+
+	if (pthread_create(&thread, NULL, &uffd_thread, &uffd))
+		goto err;
+
+	reset_shstk(shstk_ptr);
+	test_shstk_access(shstk_ptr);
+
+	if (pthread_join(thread, &res))
+		goto err;
+
+	if (test_shstk_access(shstk_ptr))
+		goto err;
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	if (!res)
+		printf("[OK]\tUserfaultfd test\n");
+	return !!res;
+err:
+	free_shstk(shstk_ptr);
+	close(uffd);
+	signal(SIGSEGV, SIG_DFL);
+	return 1;
+}
+
+/*
+ * Too complicated to pull it out of the 32 bit header, but also get the
+ * 64 bit one needed above. Just define a copy here.
+ */
+#define __NR_compat_sigaction 67
+
+/*
+ * Call 32 bit signal handler to get 32 bit signals ABI. Make sure
+ * to push the registers that will get clobbered.
+ */
+int sigaction32(int signum, const struct sigaction *restrict act,
+		struct sigaction *restrict oldact)
+{
+	register long syscall_reg asm("eax") = __NR_compat_sigaction;
+	register long signum_reg asm("ebx") = signum;
+	register long act_reg asm("ecx") = (long)act;
+	register long oldact_reg asm("edx") = (long)oldact;
+	int ret = 0;
+
+	asm volatile ("int $0x80;"
+		      : "=a"(ret), "=m"(oldact)
+		      : "r"(syscall_reg), "r"(signum_reg), "r"(act_reg),
+			"r"(oldact_reg)
+		      : "r8", "r9", "r10", "r11"
+		     );
+
+	return ret;
+}
+
+sigjmp_buf jmp_buffer;
+
+void segv_gp_handler(int signum, siginfo_t *si, void *uc)
+{
+	segv_triggered = true;
+
+	/*
+	 * To work with old glibc, this can't rely on siglongjmp working with
+	 * shadow stack enabled, so disable shadow stack before siglongjmp().
+	 */
+	ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
+	siglongjmp(jmp_buffer, -1);
+}
+
+/*
+ * Transition to 32 bit mode and check that a #GP triggers a segfault.
+ */
+int test_32bit(void)
+{
+	struct sigaction sa = {};
+	struct sigaction *sa32;
+
+	/* Create sigaction in 32 bit address range */
+	sa32 = mmap(0, 4096, PROT_READ | PROT_WRITE,
+		   MAP_32BIT | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	sa32->sa_flags = SA_SIGINFO;
+
+	sa.sa_sigaction = segv_gp_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* Make sure segv_triggered is set before triggering the #GP */
+	asm volatile("" : : : "memory");
+
+	/*
+	 * Set handler to somewhere in 32 bit address space
+	 */
+	sa32->sa_handler = (void *)sa32;
+	if (sigaction32(SIGUSR1, sa32, NULL))
+		return 1;
+
+	if (!sigsetjmp(jmp_buffer, 1))
+		raise(SIGUSR1);
+
+	if (segv_triggered)
+		printf("[OK]\t32 bit test\n");
+
+	return !segv_triggered;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = 0;
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+		printf("[SKIP]\tCould not enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+		printf("[SKIP]\tCould not re-enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS)) {
+		printf("[SKIP]\tCould not enable WRSS\n");
+		ret = 1;
+		goto out;
+	}
+
+	/* Should have succeeded if here, but this is a test, so double check. */
+	if (!get_ssp()) {
+		printf("[FAIL]\tShadow stack disabled\n");
+		return 1;
+	}
+
+	if (test_shstk_pivot()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack pivot\n");
+		goto out;
+	}
+
+	if (test_shstk_faults()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack fault test\n");
+		goto out;
+	}
+
+	if (test_shstk_violation()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack violation test\n");
+		goto out;
+	}
+
+	if (test_gup()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack gup test\n");
+		goto out;
+	}
+
+	if (test_mprotect()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack mprotect test\n");
+		goto out;
+	}
+
+	if (test_userfaultfd()) {
+		ret = 1;
+		printf("[FAIL]\tUserfaultfd test\n");
+		goto out;
+	}
+
+	if (test_32bit()) {
+		ret = 1;
+		printf("[FAIL]\t32 bit test\n");
+	}
+
+	return ret;
+
+out:
+	/*
+	 * Disable shadow stack before the function returns, or there will be a
+	 * shadow stack violation.
+	 */
+	if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	return ret;
+}
+#endif
-- 
2.17.1




* [PATCH v5 36/39] x86/fpu: Add helper for initing features
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (34 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 35/39] selftests/x86: Add shadow stack test Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 37/39] x86: Add PTRACE interface for shadow stack Rick Edgecombe
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

If an xfeature is saved in a buffer, the xfeature's bit will be set in
xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
is in its init state. In this case the xfeature buffer address cannot
be retrieved with get_xsave_addr().

Future patches will need to handle the case of writing to an xfeature
that may not be saved. So provide helpers to init an xfeature in an
xsave buffer.

This could of course be done directly by reaching into the xsave buffer,
however this would not be robust against future changes to optimize the
xsave buffer by compacting it. In that case the xsave buffer would need
to be re-arranged as well. So the logic properly belongs encapsulated
in a helper where the logic can be unified.
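As a rough userspace illustration of the pattern described above (not part of the patch), the sketch below models the xsave header as a plain bitmap. The struct, the `model_*` names, and the feature number 11 are all hypothetical stand-ins; only the "check saved, init if needed, then fetch the address and overwrite" sequence is taken from the commit log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FEATURE_NR 11	/* illustrative feature number, an assumption */

/* Hypothetical stand-in for an xsave buffer; not the kernel layout. */
struct model_xsave {
	uint64_t xfeatures;	/* stands in for header.xfeatures */
	uint64_t enabled;	/* stands in for the xfeature_enabled() check */
	uint64_t slot[64];	/* one fake state slot per feature */
};

/* Models xfeature_saved(): 1 if the feature's bit is set in the header. */
static int model_xfeature_saved(const struct model_xsave *x, int nr)
{
	assert(nr >= 0 && nr < 64);
	return (int)((x->xfeatures >> nr) & 1);
}

/* Models get_xsave_addr(): NULL unless the feature is enabled and saved. */
static uint64_t *model_get_xsave_addr(struct model_xsave *x, int nr)
{
	assert(nr >= 0 && nr < 64);
	if (!((x->enabled >> nr) & 1) || !model_xfeature_saved(x, nr))
		return NULL;
	return &x->slot[nr];
}

/* Models init_xfeature(): 1 if the feature is not enabled, 0 on success. */
static int model_init_xfeature(struct model_xsave *x, int nr)
{
	assert(nr >= 0 && nr < 64);
	if (!((x->enabled >> nr) & 1))
		return 1;
	x->xfeatures |= UINT64_C(1) << nr;	/* mark the feature present */
	return 0;
}

/* Caller pattern: init only if not saved, then overwrite the state. */
static int model_write_feature(struct model_xsave *x, int nr, uint64_t value)
{
	uint64_t *p;

	if (!model_xfeature_saved(x, nr) && model_init_xfeature(x, nr))
		return -1;

	p = model_get_xsave_addr(x, nr);
	if (!p)
		return -1;

	*p = value;
	return 0;
}
```

Note that the model, like the helper in the patch, only sets the header bit. The caller is expected to overwrite the state right after initing it.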

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v2:
 - New patch

 arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++-------
 arch/x86/kernel/fpu/xstate.h |  6 ++++
 2 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 13a80521dd51..3ff80be0a441 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -934,6 +934,24 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
 }
 
+static int xsave_buffer_access_checks(int xfeature_nr)
+{
+	/*
+	 * Do we even *have* xsave state?
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVE))
+		return 1;
+
+	/*
+	 * We should not ever be requesting features that we
+	 * have not enabled.
+	 */
+	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+		return 1;
+
+	return 0;
+}
+
 /*
  * Given the xsave area and a state inside, this function returns the
  * address of the state.
@@ -954,17 +972,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
  */
 void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 {
-	/*
-	 * Do we even *have* xsave state?
-	 */
-	if (!boot_cpu_has(X86_FEATURE_XSAVE))
-		return NULL;
-
-	/*
-	 * We should not ever be requesting features that we
-	 * have not enabled.
-	 */
-	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+	if (xsave_buffer_access_checks(xfeature_nr))
 		return NULL;
 
 	/*
@@ -984,6 +992,34 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	return __raw_xsave_addr(xsave, xfeature_nr);
 }
 
+/*
+ * Given the xsave area and a state inside, this function
+ * initializes an xfeature in the buffer.
+ *
+ * get_xsave_addr() will return NULL if the feature bit is
+ * not present in the header. This function will make it so
+ * the xfeature buffer address is ready to be retrieved by
+ * get_xsave_addr().
+ *
+ * Inputs:
+ *	xsave: the thread's storage area for all FPU data
+ *	xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
+ *	XFEATURE_SSE, etc...)
+ * Output:
+ *	1 if the feature cannot be inited, 0 on success
+ */
+int init_xfeature(struct xregs_state *xsave, int xfeature_nr)
+{
+	if (xsave_buffer_access_checks(xfeature_nr))
+		return 1;
+
+	/*
+	 * Mark the feature inited.
+	 */
+	xsave->header.xfeatures |= BIT_ULL(xfeature_nr);
+	return 0;
+}
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 
 /*
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..dc06f63063ee 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -54,6 +54,12 @@ extern void fpu__init_cpu_xstate(void);
 extern void fpu__init_system_xstate(unsigned int legacy_size);
 
 extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+extern int init_xfeature(struct xregs_state *xsave, int xfeature_nr);
+
+static inline int xfeature_saved(struct xregs_state *xsave, int xfeature_nr)
+{
+	return xsave->header.xfeatures & BIT_ULL(xfeature_nr);
+}
 
 static inline u64 xfeatures_mask_supervisor(void)
 {
-- 
2.17.1




* [PATCH v5 37/39] x86: Add PTRACE interface for shadow stack
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (35 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 36/39] x86/fpu: Add helper for initing features Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-20  1:08   ` Kees Cook
  2023-01-19 21:23 ` [PATCH v5 38/39] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
                   ` (4 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Some applications (like GDB) would like to tweak shadow stack state via
ptrace. This allows for existing functionality to continue to work for
seized shadow stack applications. Provide a regset interface for
manipulating the shadow stack pointer (SSP).

There is already ptrace functionality for accessing xstate, but this
does not include supervisor xfeatures. So there is not a completely
clear place for where to put the shadow stack state. Adding it to the
user xfeatures regset would complicate that code, as it currently shares
logic with signals which should not have supervisor features.

Don't add a general supervisor xfeature regset like the user one,
because it is better to maintain flexibility for other supervisor
xfeatures to define their own interface. For example, an xfeature may
decide not to expose all of its state to userspace, as is actually the
case for shadow stack ptrace functionality. A lot of enum values remain
to be used, so just put it in a dedicated shadow stack regset.

The only downside to not having a generic supervisor xfeature regset is
that apps need to be enlightened of any new supervisor xfeature exposed
this way (i.e. they can't try to have generic save/restore logic). But
maybe that is a good thing, because they have to think through each new
xfeature instead of encountering issues when a new supervisor xfeature
is added.

By adding a shadow stack regset, it also has the effect of including the
shadow stack state in a core dump, which could be useful for debugging.

The shadow stack specific xstate includes the SSP, and the shadow stack
and WRSS enablement status. Enabling shadow stack or WRSS in the kernel
involves more than just flipping the bit. The kernel is made aware that
it has to do extra things when cloning or handling signals. That logic
is triggered off of separate feature enablement state kept in the task
struct. So flipping on HW shadow stack enforcement without notifying
the kernel to change its behavior would severely limit what an application
could do without crashing, and the results would depend on kernel
internal implementation details. There is also no known use for controlling
this state via ptrace today. So only expose the SSP, which is something
that userspace already has indirect control over.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---

v5:
 - Check shadow stack enablement status for tracee (rppt)
 - Fix typo in comment

v4:
 - Make shadow stack only. Reduce to only supporting SSP register, and
   remove CET references (peterz)
 - Add comment to not use 0x203, because binutils already looks for it in
   coredumps. (Christina Schimpe)

v3:
 - Drop dependence on thread.shstk.size, and use thread.features bits
 - Drop 32 bit support

v2:
 - Check alignment on ssp.
 - Block IBT bits.
 - Handle init states instead of returning error.
 - Add verbose commit log justifying the design.

 arch/x86/include/asm/fpu/regset.h |  7 +--
 arch/x86/kernel/fpu/regset.c      | 87 +++++++++++++++++++++++++++++++
 arch/x86/kernel/ptrace.c          | 12 +++++
 include/uapi/linux/elf.h          |  2 +
 4 files changed, 105 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/regset.h b/arch/x86/include/asm/fpu/regset.h
index 4f928d6a367b..697b77e96025 100644
--- a/arch/x86/include/asm/fpu/regset.h
+++ b/arch/x86/include/asm/fpu/regset.h
@@ -7,11 +7,12 @@
 
 #include <linux/regset.h>
 
-extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active;
+extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active,
+				ssp_active;
 extern user_regset_get2_fn fpregs_get, xfpregs_get, fpregs_soft_get,
-				 xstateregs_get;
+				 xstateregs_get, ssp_get;
 extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
-				 xstateregs_set;
+				 xstateregs_set, ssp_set;
 
 /*
  * xstateregs_active == regset_fpregs_active. Please refer to the comment
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 6d056b68f4ed..10c092d21809 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -8,6 +8,7 @@
 #include <asm/fpu/api.h>
 #include <asm/fpu/signal.h>
 #include <asm/fpu/regset.h>
+#include <asm/prctl.h>
 
 #include "context.h"
 #include "internal.h"
@@ -174,6 +175,92 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	return ret;
 }
 
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+int ssp_active(struct task_struct *target, const struct user_regset *regset)
+{
+	if (target->thread.features & ARCH_SHSTK_SHSTK)
+		return regset->n;
+
+	return 0;
+}
+
+int ssp_get(struct task_struct *target, const struct user_regset *regset,
+		struct membuf to)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct cet_user_state *cetregs;
+
+	if (!boot_cpu_has(X86_FEATURE_USER_SHSTK))
+		return -ENODEV;
+
+	sync_fpstate(fpu);
+	cetregs = get_xsave_addr(&fpu->fpstate->regs.xsave, XFEATURE_CET_USER);
+	if (!cetregs) {
+		/*
+		 * The registers are in the init state. The init values for
+		 * these regs are zero, so just zero the output buffer.
+		 */
+		membuf_zero(&to, sizeof(cetregs->user_ssp));
+		return 0;
+	}
+
+	return membuf_write(&to, (unsigned long *)&cetregs->user_ssp,
+			    sizeof(cetregs->user_ssp));
+}
+
+int ssp_set(struct task_struct *target, const struct user_regset *regset,
+		  unsigned int pos, unsigned int count,
+		  const void *kbuf, const void __user *ubuf)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct xregs_state *xsave = &fpu->fpstate->regs.xsave;
+	struct cet_user_state *cetregs;
+	unsigned long user_ssp;
+	int r;
+
+	if (!boot_cpu_has(X86_FEATURE_USER_SHSTK) ||
+	    !ssp_active(target, regset))
+		return -ENODEV;
+
+	r = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_ssp, 0, -1);
+	if (r)
+		return r;
+
+	/*
+	 * Some kernel instructions (IRET, etc) can cause exceptions in the case
+	 * of disallowed CET register values. Just prevent invalid values.
+	 */
+	if ((user_ssp >= TASK_SIZE_MAX) || !IS_ALIGNED(user_ssp, 8))
+		return -EINVAL;
+
+	fpu_force_restore(fpu);
+
+	/*
+	 * Don't want to init the xfeature until the kernel will definitely
+	 * overwrite it, otherwise if it inits and then fails out, it would
+	 * end up initing it to random data.
+	 */
+	if (!xfeature_saved(xsave, XFEATURE_CET_USER) &&
+	    WARN_ON(init_xfeature(xsave, XFEATURE_CET_USER)))
+		return -ENODEV;
+
+	cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
+	if (WARN_ON(!cetregs)) {
+		/*
+		 * This shouldn't ever be NULL because it was successfully
+		 * inited above if needed. The only scenario would be if an
+		 * xfeature was somehow saved in a buffer, but not enabled in
+		 * xsave.
+		 */
+		return -ENODEV;
+	}
+
+	cetregs->user_ssp = user_ssp;
+	return 0;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
 
 /*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index dfaa270a7cc9..095f04bdabdc 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -58,6 +58,7 @@ enum x86_regset_64 {
 	REGSET64_FP,
 	REGSET64_IOPERM,
 	REGSET64_XSTATE,
+	REGSET64_SSP,
 };
 
 #define REGSET_GENERAL \
@@ -1267,6 +1268,17 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
 		.active		= ioperm_active,
 		.regset_get	= ioperm_get
 	},
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	[REGSET64_SSP] = {
+		.core_note_type	= NT_X86_SHSTK,
+		.n		= 1,
+		.size		= sizeof(u64),
+		.align		= sizeof(u64),
+		.active		= ssp_active,
+		.regset_get	= ssp_get,
+		.set		= ssp_set
+	},
+#endif
 };
 
 static const struct user_regset_view user_x86_64_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 4c6a8fa5e7ed..413a15c07121 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -406,6 +406,8 @@ typedef struct elf64_shdr {
 #define NT_386_TLS	0x200		/* i386 TLS slots (struct user_desc) */
 #define NT_386_IOPERM	0x201		/* x86 io permission bitmap (1=deny) */
 #define NT_X86_XSTATE	0x202		/* x86 extended state using xsave */
+/* Old binutils treats 0x203 as a CET state */
+#define NT_X86_SHSTK	0x204		/* x86 SHSTK state */
 #define NT_S390_HIGH_GPRS	0x300	/* s390 upper register halves */
 #define NT_S390_TIMER	0x301		/* s390 timer register */
 #define NT_S390_TODCMP	0x302		/* s390 TOD clock comparator register */
-- 
2.17.1




* [PATCH v5 38/39] x86/shstk: Add ARCH_SHSTK_UNLOCK
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (36 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 37/39] x86: Add PTRACE interface for shadow stack Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-19 21:23 ` [PATCH v5 39/39] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe
                   ` (3 subsequent siblings)
  41 siblings, 0 replies; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

Userspace loaders may lock features before a CRIU restore operation has
the chance to set them to whatever state is required by the process
being restored. Allow a way for CRIU to unlock features. Add it as an
arch_prctl() like the other shadow stack operations, but restrict it to
being called via the ptrace arch_prctl() interface.
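The lock/unlock mask handling can be modeled in plain C as below. This is a simplified sketch, not the kernel code: the struct and `model_*` names are hypothetical, the bit for WRSS is an assumption, and the CONFIG_CHECKPOINT_RESTORE dependency and the kernel's actual error codes are left out.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SHSTK	(UINT64_C(1) << 0)	/* same bit as ARCH_SHSTK_SHSTK */
#define MODEL_WRSS	(UINT64_C(1) << 1)	/* assumed second feature bit */

/* Hypothetical stand-in for the per-task feature state. */
struct model_task {
	uint64_t features;		/* currently enabled features */
	uint64_t features_locked;	/* features that may not change */
};

/* Lock: the mask is ORed into the locked set. */
static void model_lock(struct model_task *t, uint64_t mask)
{
	t->features_locked |= mask;
}

/* Unlock: only honored for a tracer; clears the requested bits. */
static int model_unlock(struct model_task *t, uint64_t mask, int via_ptrace)
{
	if (!via_ptrace)
		return -1;
	t->features_locked &= ~mask;
	return 0;
}

/* Enable: refused if any requested bit is locked. */
static int model_enable(struct model_task *t, uint64_t mask)
{
	if (mask & t->features_locked)
		return -1;
	t->features |= mask;
	return 0;
}

/* Walks through the CRIU-style sequence: locked, unlocked, then changed. */
static void model_demo(void)
{
	struct model_task t = { 0, 0 };

	model_lock(&t, MODEL_SHSTK | MODEL_WRSS);
	assert(model_enable(&t, MODEL_SHSTK) == -1);		/* locked */
	assert(model_unlock(&t, MODEL_SHSTK, 0) == -1);		/* not a tracer */
	assert(model_unlock(&t, MODEL_SHSTK, 1) == 0);		/* tracer may unlock */
	assert(model_enable(&t, MODEL_SHSTK) == 0);		/* now allowed */
	assert(model_enable(&t, MODEL_WRSS) == -1);		/* still locked */
}
```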

Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
[Merged into recent API changes, added commit log and docs]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v4:
 - Add to docs that it is ptrace only.
 - Remove "CET" references

v3:
 - Depend on CONFIG_CHECKPOINT_RESTORE (Kees)

 Documentation/x86/shstk.rst       | 4 ++++
 arch/x86/include/uapi/asm/prctl.h | 1 +
 arch/x86/kernel/process_64.c      | 1 +
 arch/x86/kernel/shstk.c           | 9 +++++++--
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
index f2e6f323cf68..e8ed5fc0f7ae 100644
--- a/Documentation/x86/shstk.rst
+++ b/Documentation/x86/shstk.rst
@@ -73,6 +73,10 @@ arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
     are ignored. The mask is ORed with the existing value. So any feature bits
     set here cannot be enabled or disabled afterwards.
 
+arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
+    Unlock features. 'features' is a mask of all features to unlock. All
+    bits set are processed, unset bits are ignored. Only works via ptrace.
+
 The return values are as follows. On success, return 0. On error, errno can
 be::
 
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index e31495668056..200efbbe5809 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -25,6 +25,7 @@
 #define ARCH_SHSTK_ENABLE		0x5001
 #define ARCH_SHSTK_DISABLE		0x5002
 #define ARCH_SHSTK_LOCK			0x5003
+#define ARCH_SHSTK_UNLOCK		0x5004
 
 /* ARCH_SHSTK_ features bits */
 #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 71094c8a305f..d368854fa9c4 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -835,6 +835,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_SHSTK_ENABLE:
 	case ARCH_SHSTK_DISABLE:
 	case ARCH_SHSTK_LOCK:
+	case ARCH_SHSTK_UNLOCK:
 		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 07142e6f05f6..a639119a21c5 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -452,9 +452,14 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 		return 0;
 	}
 
-	/* Don't allow via ptrace */
-	if (task != current)
+	/* Only allow via ptrace */
+	if (task != current) {
+		if (option == ARCH_SHSTK_UNLOCK && IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
+			task->thread.features_locked &= ~features;
+			return 0;
+		}
 		return -EINVAL;
+	}
 
 	/* Do not allow to change locked features */
 	if (features & task->thread.features_locked)
-- 
2.17.1




* [PATCH v5 39/39] x86/shstk: Add ARCH_SHSTK_STATUS
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (37 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 38/39] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
@ 2023-01-19 21:23 ` Rick Edgecombe
  2023-01-20  1:08   ` Kees Cook
  2023-01-19 22:26 ` [PATCH v5 00/39] Shadow stacks for userspace Andrew Morton
                   ` (2 subsequent siblings)
  41 siblings, 1 reply; 120+ messages in thread
From: Rick Edgecombe @ 2023-01-19 21:23 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe

CRIU and GDB need to get the current shadow stack and WRSS enablement
status. This information is already available via /proc/pid/status, but
this is inconvenient for CRIU because it involves parsing the text output
in an area of the code where this is difficult. Provide a status
arch_prctl(), ARCH_SHSTK_STATUS, for retrieving the status. Have arg2 be a
userspace address, and make the new arch_prctl simply copy the features
out to userspace.
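A minimal sketch of how userspace might query the status on itself, assuming the option value 0x5005 and the ARCH_SHSTK_SHSTK bit from this series. The helper names are hypothetical. On kernels without this patch the call is expected to fail, so callers need to handle a -1 return.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values taken from this series; defined here in case headers lack them. */
#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS	0x5005
#endif
#ifndef ARCH_SHSTK_SHSTK
#define ARCH_SHSTK_SHSTK	(1ULL << 0)
#endif

/*
 * Hypothetical helper: copy the calling thread's enabled shadow stack
 * features into *features. Returns 0 on success, -1 with errno set.
 */
static int shstk_status(unsigned long *features)
{
	assert(features != NULL);
#ifdef SYS_arch_prctl
	return (int)syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, features);
#else
	/* arch_prctl() does not exist on this architecture. */
	errno = ENOSYS;
	return -1;
#endif
}

/* Hypothetical helper: test the shadow stack bit in a features word. */
static int shstk_enabled(unsigned long features)
{
	return (features & ARCH_SHSTK_SHSTK) != 0;
}
```

This avoids parsing /proc/pid/status: the caller gets the same feature bits it would pass to the enable and disable operations.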

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Fix typo in commit log

v4:
 - New patch

 Documentation/x86/shstk.rst       | 6 ++++++
 arch/x86/include/asm/shstk.h      | 4 ++--
 arch/x86/include/uapi/asm/prctl.h | 1 +
 arch/x86/kernel/process_64.c      | 1 +
 arch/x86/kernel/shstk.c           | 8 +++++++-
 5 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
index e8ed5fc0f7ae..7f4af798794e 100644
--- a/Documentation/x86/shstk.rst
+++ b/Documentation/x86/shstk.rst
@@ -77,6 +77,11 @@ arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
     Unlock features. 'features' is a mask of all features to unlock. All
     bits set are processed, unset bits are ignored. Only works via ptrace.
 
+arch_prctl(ARCH_SHSTK_STATUS, unsigned long addr)
+    Copy the currently enabled features to the address passed in addr. The
+    features are described using the same bits that the other operations
+    take in 'features'.
+
 The return values are as follows. On success, return 0. On error, errno can
 be::
 
@@ -84,6 +89,7 @@ be::
         -ENOTSUPP if the feature is not supported by the hardware or
          kernel.
         -EINVAL arguments (non existing feature, etc)
+        -EFAULT if could not copy information back to userspace
 
 The feature's bits supported are::
 
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 746c040f7cb6..73de995f55ca 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -14,7 +14,7 @@ struct thread_shstk {
 	u64	size;
 };
 
-long shstk_prctl(struct task_struct *task, int option, unsigned long features);
+long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
 void reset_thread_features(void);
 int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
 			     unsigned long stack_size,
@@ -24,7 +24,7 @@ int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
-			     unsigned long features) { return -EINVAL; }
+			     unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
 static inline int shstk_alloc_thread_stack(struct task_struct *p,
 					   unsigned long clone_flags,
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 200efbbe5809..1b85bc876c2d 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -26,6 +26,7 @@
 #define ARCH_SHSTK_DISABLE		0x5002
 #define ARCH_SHSTK_LOCK			0x5003
 #define ARCH_SHSTK_UNLOCK		0x5004
+#define ARCH_SHSTK_STATUS		0x5005
 
 /* ARCH_SHSTK_ features bits */
 #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d368854fa9c4..dde43caf196e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -836,6 +836,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_SHSTK_DISABLE:
 	case ARCH_SHSTK_LOCK:
 	case ARCH_SHSTK_UNLOCK:
+	case ARCH_SHSTK_STATUS:
 		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index a639119a21c5..3b1433bd63c7 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -445,8 +445,14 @@ SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsi
 	return alloc_shstk(addr, aligned_size, size, set_tok);
 }
 
-long shstk_prctl(struct task_struct *task, int option, unsigned long features)
+long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
 {
+	unsigned long features = arg2;
+
+	if (option == ARCH_SHSTK_STATUS) {
+		return put_user(task->thread.features, (unsigned long __user *)arg2);
+	}
+
 	if (option == ARCH_SHSTK_LOCK) {
 		task->thread.features_locked |= features;
 		return 0;
-- 
2.17.1




* Re: [PATCH v5 00/39] Shadow stacks for userspace
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (38 preceding siblings ...)
  2023-01-19 21:23 ` [PATCH v5 39/39] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe
@ 2023-01-19 22:26 ` Andrew Morton
  2023-01-20 17:27   ` Edgecombe, Rick P
  2023-01-20 17:48 ` John Allen
  2023-01-22  8:20 ` Mike Rapoport
  41 siblings, 1 reply; 120+ messages in thread
From: Andrew Morton @ 2023-01-19 22:26 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, Andrew.Cooper3, christina.schimpe

On Thu, 19 Jan 2023 13:22:38 -0800 Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> SHSTK

Sounds like me trying to swear in Russian while drunk.

Is there any chance of s/shstk/shadow_stack/g?



* Re: [PATCH v5 01/39] Documentation/x86: Add CET shadow stack description
  2023-01-19 21:22 ` [PATCH v5 01/39] Documentation/x86: Add CET shadow stack description Rick Edgecombe
@ 2023-01-20  0:38   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:38 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:39PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Introduce a new document on Control-flow Enforcement Technology (CET).
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook



* Re: [PATCH v5 02/39] x86/shstk: Add Kconfig option for shadow stack
  2023-01-19 21:22 ` [PATCH v5 02/39] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
@ 2023-01-20  0:40   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:40 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:40PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Shadow stack provides protection for applications against function return
> address corruption. It is active when the processor supports it, the
> kernel has CONFIG_X86_SHADOW_STACK enabled, and the application is built
> for the feature. This is only implemented for the 64-bit kernel. When it
> is enabled, legacy non-shadow stack applications continue to work, but
> without protection.
> 
> Since there is another feature that utilizes CET (Kernel IBT) that will
> share implementation with shadow stacks, create CONFIG_CET to signify
> that at least one CET feature is configured.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook



* Re: [PATCH v5 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks
  2023-01-19 21:22 ` [PATCH v5 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
@ 2023-01-20  0:44   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:44 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:41PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The Control-Flow Enforcement Technology contains two related features,
> one of which is Shadow Stacks. Future patches will utilize this feature
> for shadow stack support in KVM, so add a CPU feature flags for Shadow
> Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
> 
> To protect shadow stack state from malicious modification, the registers
> are only accessible in supervisor mode. This implementation
> context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
> on XSAVES.
> 
> The shadow stack feature, enumerated by the CPUID bit described above,
> encompasses both supervisor and userspace support for shadow stack. In
> near future patches, only userspace shadow stack will be enabled. In
> expectation of future supervisor shadow stack support, create a software
> CPU capability to enumerate kernel utilization of userspace shadow stack
> support. This user shadow stack bit should depend on the HW "shstk"
> capability and that logic will be implemented in future patches.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook



* Re: [PATCH v5 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2023-01-19 21:22 ` [PATCH v5 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
@ 2023-01-20  0:46   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:46 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:42PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Setting CR4.CET is a prerequisite for utilizing any CET features, most of
> which also require setting MSRs.
> 
> Kernel IBT already enables the CET CR4 bit when it detects IBT HW support
> and is configured with kernel IBT. However, future patches that enable
> userspace shadow stack support will need the bit set as well. So change
> the logic to enable it in either case.
> 
> Clear MSR_IA32_U_CET in cet_disable() so that it can't live to see
> userspace in a new kexec-ed kernel that has CR4.CET set from kernel IBT.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2023-01-19 21:22 ` [PATCH v5 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
@ 2023-01-20  0:46   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:46 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:43PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Shadow stack register state can be managed with XSAVE. The registers
> can logically be separated into two groups:
>         * Registers controlling user-mode operation
>         * Registers controlling kernel-mode operation
> 
> The architecture has two new XSAVE state components: one for each of
> those groups of registers. This lets an OS manage them separately if
> it chooses. Future patches for host userspace and KVM guests will only
> utilize the user-mode registers, so only configure XSAVE to save
> user-mode registers. This state will add 16 bytes to the xsave buffer
> size.
> 
> Future patches will use the user-mode XSAVE area to save guest user-mode
> CET state. However, VMCS includes new fields for guest CET supervisor
> states. KVM can use these to save and restore guest supervisor state, so
> host supervisor XSAVE support is not required.
> 
> Adding this exacerbates the already unwieldy if statement in
> check_xstate_against_struct() that handles warning about un-implemented
> xfeatures. So refactor these checks by having XCHECK_SZ() set a bool when
> it actually checks the xfeature. This ends up exceeding 80 chars, but was
> better on balance than other options explored. Pass the bool as pointer to
> make it clear that XCHECK_SZ() can change the variable.
> 
> While configuring user-mode XSAVE, clarify that kernel-mode registers are not
> managed by XSAVE by defining the xfeature in
> XFEATURE_MASK_SUPERVISOR_UNSUPPORTED, like is done for XFEATURE_MASK_PT.
> This serves more of a documentation as code purpose, and functionally,
> only enables a few safety checks.
> 
> Both XSAVE state components are supervisor states, even the state
> controlling user-mode operation. This is a departure from earlier features
> like protection keys where the PKRU state is a normal user
> (non-supervisor) state. Having the user state be supervisor-managed
> ensures there is no direct, unprivileged access to it, making it harder
> for an attacker to subvert CET.
> 
> To facilitate this privileged access, define the two user-mode CET MSRs,
> and the bits defined in those MSRs relevant to future shadow stack
> enablement patches.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
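
A rough sketch of the layout described above. The component numbers
(11 for user, 12 for kernel) and the field names are assumptions that
do not appear in the commit message; the 16-byte size and the
supported/unsupported split do.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed XSAVE state component numbers (not stated in the message). */
#define XFEATURE_CET_USER   11
#define XFEATURE_CET_KERNEL 12

#define XFEATURE_MASK_CET_USER   (1ull << XFEATURE_CET_USER)
#define XFEATURE_MASK_CET_KERNEL (1ull << XFEATURE_CET_KERNEL)

/* User-mode CET state: one slot per user-mode CET MSR. */
struct cet_user_state {
	uint64_t user_cet;	/* user control-flow settings */
	uint64_t user_ssp;	/* user shadow stack pointer */
};

/* The message says this state adds 16 bytes to the xsave buffer. */
static_assert(sizeof(struct cet_user_state) == 16,
	      "CET user state must be 16 bytes");

/* Only the user component is saved; the kernel one is listed as unsupported. */
static const uint64_t supervisor_supported   = XFEATURE_MASK_CET_USER;
static const uint64_t supervisor_unsupported = XFEATURE_MASK_CET_KERNEL;
```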

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate
  2023-01-19 21:22 ` [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate Rick Edgecombe
@ 2023-01-20  0:47   ` Kees Cook
  2023-02-01 11:01   ` Borislav Petkov
  1 sibling, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:47 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe

On Thu, Jan 19, 2023 at 01:22:44PM -0800, Rick Edgecombe wrote:
> Just like user xfeatures, supervisor xfeatures can be active in the
> registers or present in the task FPU buffer. If the registers are
> active, the registers can be modified directly. If the registers are
> not active, the modification must be performed on the task FPU buffer.
> 
> When the state is not active, the kernel could perform modifications
> directly to the buffer. But in order for it to do that, it needs
> to know where in the buffer the specific state it wants to modify is
> located. Doing this is not robust against optimizations that compact
> the FPU buffer, as each access would require computing where in the
> buffer it is.
> 
> The easiest way to modify supervisor xfeature data is to force restore
> the registers and write directly to the MSRs. Oftentimes this is just fine
> anyway as the registers need to be restored before returning to userspace.
> Do this for now, leaving buffer writing optimizations for the future.
> 
> Add a new function fpregs_lock_and_load() that can simultaneously call
> fpregs_lock() and do this restore. Also perform some extra sanity
> checks in this function since this will be used in non-fpu focused code.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
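
The lock-and-load pattern can be illustrated with a toy model. This is
only a sketch: struct toy_fpu and every toy_*() function are
hypothetical stand-ins, and the plain assignment to reg_ssp stands in
for an MSR write. It is not the patch's fpregs_lock_and_load().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one supervisor register plus its copy in the task buffer. */
struct toy_fpu {
	uint64_t reg_ssp;	/* "live" register value */
	uint64_t buf_ssp;	/* copy saved in the task FPU buffer */
	bool regs_active;	/* registers currently hold this task's state */
	bool locked;
};

/* Take the lock and, if the state only lives in the buffer, restore it. */
static void toy_fpregs_lock_and_load(struct toy_fpu *fpu)
{
	assert(!fpu->locked);	/* sanity check: no nested locking */
	fpu->locked = true;
	if (!fpu->regs_active) {
		fpu->reg_ssp = fpu->buf_ssp;	/* forced restore */
		fpu->regs_active = true;
	}
}

static void toy_fpregs_unlock(struct toy_fpu *fpu)
{
	assert(fpu->locked);
	fpu->locked = false;
}

/* Modify the state through the registers, never through the buffer. */
static void toy_set_ssp(struct toy_fpu *fpu, uint64_t ssp)
{
	toy_fpregs_lock_and_load(fpu);
	fpu->reg_ssp = ssp;
	toy_fpregs_unlock(fpu);
}
```

The caller never needs to know where the state sits in the buffer,
which is the property the commit message is after.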

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 07/39] x86: Add user control-protection fault handler
  2023-01-19 21:22 ` [PATCH v5 07/39] x86: Add user control-protection fault handler Rick Edgecombe
@ 2023-01-20  0:50   ` Kees Cook
  2023-02-03 19:09   ` Borislav Petkov
  1 sibling, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:50 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu, Michael Kerrisk

On Thu, Jan 19, 2023 at 01:22:45PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> A control-protection fault is triggered when a control-flow transfer
> attempt violates Shadow Stack or Indirect Branch Tracking constraints.
> For example, the return address for a RET instruction differs from the copy
> on the shadow stack.
> 
> There already exists a control-protection fault handler for handling kernel
> IBT faults. Refactor this fault handler into separate user and kernel
> handlers, like the page fault handler. Add a control-protection handler
> for usermode. To avoid ifdeffery, put them both in a new file cet.c, which
> is compiled in the case of either of the two CET features supported in the
> kernel: kernel IBT or user mode shadow stack. Move some static inline
> functions from traps.c into a header so they can be used in cet.c.
> 
> Opportunistically fix a comment in the kernel IBT part of the fault
> handler that is on the end of the line instead of preceding it.
> 
> Keep the same behavior for the kernel side of the fault handler, except for
> converting a BUG to a WARN in the case of a #CP happening when the feature
> is missing. This unifies the behavior with the new shadow stack code, and
> also prevents the kernel from crashing in this situation, which is
> potentially recoverable.
> 
> The control-protection fault handler works in a similar way as the general
> protection fault handler. It provides the si_code SEGV_CPERR to the signal
> handler.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

This diff would have been a bit easier to review if the file move was
separate from the addition of the handler, but regardless:

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2023-01-19 21:22 ` [PATCH v5 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
@ 2023-01-20  0:52   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:52 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu, Christoph Hellwig

On Thu, Jan 19, 2023 at 01:22:46PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
> shadow stack pages.
> 
> In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1
> if HW dirty tracking is not needed, because if the Dirty bit is not already
> set the CPU has to set Dirty=1 when the memory gets written to. This
> creates additional work for the CPU. So traditional wisdom was to simply
> set the Dirty bit whenever you didn't care about it. However, it was never
> really very helpful for read-only kernel memory.
> 
> When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
> such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
> avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
> functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
> any functional benefit either, so to reduce ambiguity between shadow stack
> and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW
  2023-01-19 21:22 ` [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
@ 2023-01-20  0:55   ` Kees Cook
  2023-01-23  9:16   ` David Hildenbrand
  2023-01-23  9:28   ` David Hildenbrand
  2 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:55 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:48PM -0800, Rick Edgecombe wrote:
> Some OSes have a greater dependence on software available bits in PTEs than
> Linux. That left the hardware architects looking for a way to represent a
> new memory type (shadow stack) within the existing bits. They chose to
> repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
> shadow stack memory, Linux should avoid creating memory with this PTE bit
> combination unless it intends for it to be shadow stack.
> 
> The reason it's lightly used is that Dirty=1 is normally set by HW
> _before_ a write. A write with a Write=0 PTE would typically only generate
> a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
> generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
> supports shadow stacks will no longer exhibit this oddity.
> 
> So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
> in places where Linux normally creates Write=0,Dirty=1, it can use the
> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
> words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
> Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1.
> Further differentiated by VMA flags, these PTE bit combinations would be
> set as follows for various types of memory:
> 
> (Write=0,Cow=1,Dirty=0):
>  - A modified, copy-on-write (COW) page. Previously when a typical
>    anonymous writable mapping was made COW via fork(), the kernel would
>    mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
>    happens in copy_present_pte().
>  - A R/O page that has been COW'ed. The user page is in a R/O VMA,
>    and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
>    handler creates a copy of the page and sets the new copy's PTE as
>    Write=0 and Cow=1.
>  - A shared shadow stack PTE. When a shadow stack page is being shared
>    among processes (this happens at fork()), its PTE is made Dirty=0, so
>    the next shadow stack access causes a fault, and the page is
>    duplicated and Dirty=1 is set again. This is the COW equivalent for
>    shadow stack pages, even though it's copy-on-access rather than
>    copy-on-write.
> 
> (Write=0,Cow=0,Dirty=1):
>  - A shadow stack PTE.
>  - A Cow PTE created when a processor without shadow stack support set
>    Dirty=1.
> 
> There are six bits left available to software in the 64-bit PTE after
> consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
> because shadow stacks are not enabled there.
> 
> Implement only the infrastructure for _PAGE_COW. Changes to start
> creating _PAGE_COW PTEs will follow once other pieces are in place.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
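
To make the bit juggling concrete, here is a small user-space sketch of
helpers that keep Write=0,Dirty=1 reserved for shadow stack. The toy_*()
names are hypothetical and the Cow bit position (58) is an assumption;
the runtime feature check that the kernel would apply is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_RW    (1ull << 1)	/* hardware Write bit */
#define PAGE_DIRTY (1ull << 6)	/* hardware Dirty bit */
#define PAGE_COW   (1ull << 58)	/* assumed position of the software Cow bit */

static_assert((PAGE_COW & (PAGE_RW | PAGE_DIRTY)) == 0,
	      "Cow must be a distinct bit");

/* Write=0,Dirty=1 is what the CPU treats as shadow stack. */
static bool toy_pte_shstk(pteval_t pte)
{
	return (pte & (PAGE_RW | PAGE_DIRTY)) == PAGE_DIRTY;
}

/* "Dirty" in the software sense: either hardware Dirty or Cow. */
static bool toy_pte_dirty(pteval_t pte)
{
	return (pte & (PAGE_DIRTY | PAGE_COW)) != 0;
}

/* Mark dirty without ever producing Write=0,Dirty=1 by accident. */
static pteval_t toy_pte_mkdirty(pteval_t pte)
{
	return pte | ((pte & PAGE_RW) ? PAGE_DIRTY : PAGE_COW);
}

/* Removing Write moves Dirty into Cow. */
static pteval_t toy_pte_wrprotect(pteval_t pte)
{
	pte &= ~PAGE_RW;
	if (pte & PAGE_DIRTY)
		pte = (pte & ~PAGE_DIRTY) | PAGE_COW;
	return pte;
}

/* Adding Write moves Cow back into Dirty. */
static pteval_t toy_pte_mkwrite(pteval_t pte)
{
	pte |= PAGE_RW;
	if (pte & PAGE_COW)
		pte = (pte & ~PAGE_COW) | PAGE_DIRTY;
	return pte;
}
```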

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW
  2023-01-19 21:22 ` [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW Rick Edgecombe
@ 2023-01-20  0:57   ` Kees Cook
  2023-02-09 14:08   ` Borislav Petkov
  1 sibling, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:57 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:49PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The Write=0,Dirty=1 PTE has been used to indicate copy-on-write pages.
> However, newer x86 processors also regard a Write=0,Dirty=1 PTE as a
> shadow stack page. In order to separate the two, _PAGE_DIRTY is replaced
> with the software-defined _PAGE_COW for the copy-on-write case, and
> pte_*() are updated to do this.
> 
> pte_modify() takes a "raw" pgprot_t which was not necessarily created
> with any of the existing PTE bit helpers. That means that it can return a
> pte_t with Write=0,Dirty=1, a shadow stack PTE, when it did not intend to
> create one.
> 
> pte_modify() changes a PTE to 'newprot', but it doesn't use the pte_*()
> helpers. Modify it to also move _PAGE_DIRTY to _PAGE_COW. Do this by
> using the pte_mkdirty() helper. Since pte_mkdirty() also sets the soft
> dirty bit, extract a helper that optionally doesn't set
> _PAGE_SOFT_DIRTY. This helper will allow future logic for deciding when to
> move _PAGE_DIRTY to _PAGE_COW to live in one place.
> 
> Apply the same changes to pmd_modify().
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
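
A simplified sketch of the idea: drop the dirty state while applying
'newprot', then re-apply it through a helper that picks Dirty or Cow
based on the resulting Write bit. The preserved-bits mask, the Cow bit
position and all toy_*() names are assumptions made for illustration,
and soft-dirty handling is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_PRESENT  (1ull << 0)
#define PAGE_RW       (1ull << 1)
#define PAGE_ACCESSED (1ull << 5)
#define PAGE_DIRTY    (1ull << 6)
#define PAGE_COW      (1ull << 58)		/* assumed software bit */
#define PFN_MASK      0x000ffffffffff000ull	/* bits 12..51 */

/* Bits carried over from the old PTE (a simplification of the real mask). */
#define KEEP_MASK (PFN_MASK | PAGE_ACCESSED)

/* Mark dirty without creating an unintended Write=0,Dirty=1 PTE. */
static pteval_t toy_mkdirty(pteval_t pte)
{
	return pte | ((pte & PAGE_RW) ? PAGE_DIRTY : PAGE_COW);
}

static pteval_t toy_pte_modify(pteval_t pte, pteval_t newprot)
{
	bool was_dirty = (pte & (PAGE_DIRTY | PAGE_COW)) != 0;
	pteval_t val;

	/* Keep the frame number and Accessed; take protections from newprot. */
	val = pte & KEEP_MASK;
	val |= newprot & ~(KEEP_MASK | PAGE_DIRTY | PAGE_COW);

	/* Dirty state was not preserved above; restore it in the right bit. */
	return was_dirty ? toy_mkdirty(val) : val;
}
```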

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2023-01-19 21:22 ` [PATCH v5 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Rick Edgecombe
@ 2023-01-20  0:58   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:58 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:50PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When shadow stack is in use, Write=0,Dirty=1 PTEs are reserved for
> shadow stack. Copy-on-write PTEs then have Write=0,Cow=1.
> 
> When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
> become a transient shadow stack PTE in two cases:
> 
> 1. Some processors can start a write but end up seeing a Write=0 PTE by
>    the time they get to the Dirty bit, creating a transient shadow stack
>    PTE. However, this will not occur on processors supporting shadow
>    stack, and a TLB flush is not necessary.
> 
> 2. When _PAGE_DIRTY is replaced with _PAGE_COW non-atomically, a transient
>    shadow stack PTE can be created as a result. Thus, prevent that with
>    cmpxchg.
> 
> In the case of pmdp_set_wrprotect(), for nopmd configs the ->pmd operated
> on does not exist and the logic would need to be different. Although the
> extra functionality will normally be optimized out when user shadow
> stacks are not configured, also exclude it in the preprocessor stage so
> that it will still compile. User shadow stack is not supported there by
> Linux anyway. Leave the cpu_feature_enabled() check so that the
> functionality also gets disabled based on runtime detection of the
> feature.
> 
> Similarly, compile it out in ptep_set_wrprotect() due to a clang warning
> on i386. Like above, the code path should get optimized out on i386
> since shadow stack is not supported on 32 bit kernels, but this makes
> the compiler happy.
> 
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue. Jann Horn provided the cmpxchg solution.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
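
Case 2 above can be sketched with C11 atomics: compute the
write-protected value from a snapshot and publish it in one
compare-and-exchange, so no intermediate Write=0,Dirty=1 value is ever
visible in memory. The toy_*() names and the Cow bit position are
assumptions; this is a user-space model, not the patch's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_RW    (1ull << 1)
#define PAGE_DIRTY (1ull << 6)
#define PAGE_COW   (1ull << 58)	/* assumed software bit */

/* Pure function: clear Write and move Dirty into Cow. */
static pteval_t toy_pte_wrprotect(pteval_t pte)
{
	pte &= ~PAGE_RW;
	if (pte & PAGE_DIRTY)
		pte = (pte & ~PAGE_DIRTY) | PAGE_COW;
	return pte;
}

/* Swap the whole PTE at once; retry if it changed underneath us. */
static void toy_ptep_set_wrprotect(_Atomic(pteval_t) *ptep)
{
	pteval_t old = atomic_load(ptep);
	pteval_t new;

	do {
		new = toy_pte_wrprotect(old);
		/* On failure, 'old' is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(ptep, &old, new));
}
```

Clearing Write and Dirty in two separate stores would briefly expose
exactly the bit combination the series is trying to reserve.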

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 16/39] x86/mm: Check shadow stack page fault errors
  2023-01-19 21:22 ` [PATCH v5 16/39] x86/mm: Check shadow stack page fault errors Rick Edgecombe
@ 2023-01-20  0:59   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  0:59 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:54PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The CPU performs "shadow stack accesses" when it expects to encounter
> shadow stack mappings. These accesses can be implicit (via CALL/RET
> instructions) or explicit (instructions like WRSS).
> 
> Shadow stack accesses to shadow-stack mappings can result in faults in
> normal, valid operation just like regular accesses to regular mappings.
> Shadow stacks need some of the same features like delayed allocation, swap
> and copy-on-write. The kernel needs to use faults to implement those
> features.
> 
> The architecture has concepts of both shadow stack reads and shadow stack
> writes. Any shadow stack access to non-shadow stack memory will generate
> a fault with the shadow stack error code bit set.
> 
> This means that, unlike normal write protection, the fault handler needs
> to create a type of memory that can be written to (with instructions that
> generate shadow stack writes), even to fulfill a read access. So in the
> case of COW memory, the COW needs to take place even with a shadow stack
> read. Otherwise the page will be left (shadow stack) writable in
> userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
> for shadow stack accesses, even if the access was a shadow stack read.
> 
> For the purpose of making this clearer, consider the following example.
> If a process has a shadow stack, and forks, the shadow stack PTEs will
> become read-only due to COW. If the CPU in one process performs a shadow
> stack read access to the shadow stack, for example executing a RET and
> causing the CPU to read the shadow stack copy of the return address, then
> in order for the fault to be resolved the PTE will need to be set with
> shadow stack permissions. But then the memory would be changeable from
> userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
> COW, otherwise the shared page would be changeable from both processes.
> 
> Shadow stack accesses can also result in errors, such as when a shadow
> stack overflows, or if a shadow stack access occurs to a non-shadow-stack
> mapping. Also, generate the errors for invalid shadow stack accesses.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
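
The two rules above (treat shadow stack reads as writes, and reject
shadow stack accesses to other mappings) can be sketched as below. The
error code bit positions follow the x86 page fault error code as I
understand it; the flag values and toy_*() names are hypothetical, and
the real access checks cover more cases than this.

```c
#include <assert.h>
#include <stdbool.h>

#define X86_PF_WRITE (1u << 1)	/* fault was a write */
#define X86_PF_SHSTK (1u << 6)	/* fault was a shadow stack access */

/* Toy values; only their distinctness matters here. */
#define TOY_FAULT_FLAG_WRITE (1u << 0)
#define TOY_VM_WRITE         (1u << 1)
#define TOY_VM_SHADOW_STACK  (1u << 2)

/* Any shadow stack access, read or write, must trigger COW handling. */
static unsigned int toy_fault_flags(unsigned int error_code)
{
	unsigned int flags = 0;

	if (error_code & (X86_PF_WRITE | X86_PF_SHSTK))
		flags |= TOY_FAULT_FLAG_WRITE;
	return flags;
}

/* True when the access is invalid for this VMA. */
static bool toy_access_error(unsigned int error_code, unsigned int vm_flags)
{
	if (error_code & X86_PF_SHSTK)
		return !(vm_flags & TOY_VM_SHADOW_STACK);
	if (error_code & X86_PF_WRITE)
		return !(vm_flags & TOY_VM_WRITE);
	return false;
}
```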

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly
  2023-01-19 21:22 ` [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
@ 2023-01-20  1:01   ` Kees Cook
  2023-02-14  0:09   ` Deepak Gupta
  1 sibling, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  1:01 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:57PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which require some core mm changes to function
> properly.
> 
> With the introduction of shadow stack memory there are two ways a pte can
> be writable: regular writable memory and shadow stack memory.
> 
> In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
> or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
> where a PTE is made writable. However, there are places where pte_mkwrite()
> is called directly and the logic should now also create a shadow stack PTE
> in the case of a shadow stack VMA.
> 
> - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
>   directly and call pte_mkwrite(). Teach them about pte_mkwrite_shstk().
> 
> - When userfaultfd is creating a PTE after userspace handles the fault
>   it calls pte_mkwrite() directly. Teach it about pte_mkwrite_shstk().
> 
> To make the code cleaner, introduce is_shstk_write() which simplifies
> checking for VM_WRITE | VM_SHADOW_STACK together.
> 
> In other cases where pte_mkwrite() is called directly, the VMA will not
> be VM_SHADOW_STACK, and so shadow stack memory should not be created.
>  - In the case of pte_savedwrite(), shadow stack VMA's are excluded.
>  - In the case of the "dirty_accountable" optimization in mprotect(),
>    shadow stack VMA's won't be VM_SHARED, so it is not necessary.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
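
A sketch of the selection logic: a VMA-flag check decides which of the
two "make writable" helpers applies. The flag values, the Cow bit
position and the toy_*() names are assumptions; is_shstk_write() is
named in the commit message, but the signature used here is a guess.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_RW    (1ull << 1)
#define PAGE_DIRTY (1ull << 6)
#define PAGE_COW   (1ull << 58)	/* assumed software bit */

/* Toy VMA flag values. */
#define TOY_VM_WRITE        (1ul << 1)
#define TOY_VM_SHADOW_STACK (1ul << 2)

static bool is_shstk_write(unsigned long vm_flags)
{
	unsigned long both = TOY_VM_WRITE | TOY_VM_SHADOW_STACK;

	return (vm_flags & both) == both;
}

/* Regular writable memory: Write=1. */
static pteval_t toy_pte_mkwrite(pteval_t pte)
{
	pte |= PAGE_RW;
	if (pte & PAGE_COW)
		pte = (pte & ~PAGE_COW) | PAGE_DIRTY;
	return pte;
}

/* Shadow stack memory: Write=0,Dirty=1. */
static pteval_t toy_pte_mkwrite_shstk(pteval_t pte)
{
	return (pte & ~(PAGE_RW | PAGE_COW)) | PAGE_DIRTY;
}

static pteval_t toy_maybe_mkwrite(pteval_t pte, unsigned long vm_flags)
{
	if (is_shstk_write(vm_flags))
		return toy_pte_mkwrite_shstk(pte);
	if (vm_flags & TOY_VM_WRITE)
		return toy_pte_mkwrite(pte);
	return pte;
}
```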

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 25/39] mm: Warn on shadow stack memory in wrong vma
  2023-01-19 21:23 ` [PATCH v5 25/39] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
@ 2023-01-20  1:01   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  1:01 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe

On Thu, Jan 19, 2023 at 01:23:03PM -0800, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which require some core mm changes to function
> properly.
> 
> One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
> treated as shadow stack by the CPU, but this combination used to be created by
> the kernel on x86. Previous patches have changed the kernel to now avoid
> creating these PTEs unless they are for shadow stack memory. In case any
> missed corners of the kernel are still creating PTEs like this for
> non-shadow stack memory, and to catch any re-introductions of the logic,
> warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
> stack VMAs when they are being zapped. This won't catch transient cases
> but should have decent coverage. It will be compiled out when shadow
> stack is not configured.
> 
> In order to check if a pte is shadow stack in core mm code, add default
> implementations for pte_shstk() and pmd_shstk().
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 26/39] x86: Introduce userspace API for shadow stack
  2023-01-19 21:23 ` [PATCH v5 26/39] x86: Introduce userspace API for shadow stack Rick Edgecombe
@ 2023-01-20  1:04   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  1:04 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe

On Thu, Jan 19, 2023 at 01:23:04PM -0800, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Add three new arch_prctl() handles:
> 
>  - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified
>    feature. Returns 0 on success or an error.
> 
>  - ARCH_SHSTK_LOCK prevents future disabling or enabling of the
>    specified feature. Returns 0 on success or an error.
> 
> The features are handled per-thread and inherited over fork(2)/clone(2),
> but reset on exec().
> 
> This is a preparation patch. It does not implement any features.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
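
The enable/disable/lock semantics can be modelled as a small state
machine. This is a sketch only: the option enum, struct toy_thread and
the specific error codes are hypothetical, since the commit message
says just "0 on success or an error". Inheritance over fork()/clone()
corresponds to a plain struct copy; exec corresponds to the reset.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical option numbers; the real values are not given above. */
enum toy_shstk_option {
	TOY_SHSTK_ENABLE,
	TOY_SHSTK_DISABLE,
	TOY_SHSTK_LOCK,
};

/* Per-thread feature state. */
struct toy_thread {
	unsigned long features;
	unsigned long features_locked;
};

static long toy_shstk_prctl(struct toy_thread *t, int option,
			    unsigned long features)
{
	if (option == TOY_SHSTK_LOCK) {
		t->features_locked |= features;
		return 0;
	}

	/* Locked features can be neither enabled nor disabled. */
	if (features & t->features_locked)
		return -EPERM;

	switch (option) {
	case TOY_SHSTK_ENABLE:
		t->features |= features;
		return 0;
	case TOY_SHSTK_DISABLE:
		t->features &= ~features;
		return 0;
	default:
		return -EINVAL;
	}
}

/* exec() starts from a clean slate. */
static void toy_exec_reset(struct toy_thread *t)
{
	t->features = 0;
	t->features_locked = 0;
}
```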

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 27/39] x86/shstk: Add user-mode shadow stack support
  2023-01-19 21:23 ` [PATCH v5 27/39] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
@ 2023-01-20  1:05   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  1:05 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:23:05PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Introduce basic shadow stack enabling/disabling/allocation routines.
> A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
> and has a fixed size of min(RLIMIT_STACK, 4GB).
> 
> Keep the task's shadow stack address and size in thread_struct. This will
> be copied when cloning new threads, but needs to be cleared during exec,
> so add a function to do this.
> 
> Do not support IA32 emulation or x32.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
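
As a rough sketch of the sizing rule in the quoted changelog, the fixed
size min(RLIMIT_STACK, 4GB) can be computed from userspace as below. The
helper names shstk_size_for() and shstk_size_current() are hypothetical,
and rounding the result up to a whole page is an assumption that the
changelog does not state.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/resource.h>
#include <unistd.h>

#define SZ_4G (1ULL << 32)

/* min(stack_rlimit, 4GB), rounded up to a page (rounding is assumed). */
static uint64_t shstk_size_for(uint64_t stack_rlimit, uint64_t page_size)
{
	/* page_size must be a non-zero power of two. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	uint64_t size = stack_rlimit < SZ_4G ? stack_rlimit : SZ_4G;

	return (size + page_size - 1) & ~(page_size - 1);
}

/* Same computation using the calling process's current limits. */
static uint64_t shstk_size_current(void)
{
	struct rlimit rl;
	long page = sysconf(_SC_PAGESIZE);

	if (getrlimit(RLIMIT_STACK, &rl) != 0 || page <= 0)
		return 0;

	/* An unlimited stack is a very large value, so the 4GB cap applies. */
	return shstk_size_for((uint64_t)rl.rlim_cur, (uint64_t)page);
}
```

With the common 8MB stack limit this yields an 8MB shadow stack; with an
unlimited stack it yields 4GB.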

-- 
Kees Cook


* Re: [PATCH v5 29/39] x86/shstk: Introduce routines modifying shstk
  2023-01-19 21:23 ` [PATCH v5 29/39] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
@ 2023-01-20  1:05   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  1:05 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:23:07PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Shadow stacks are normally written to via CALL/RET or specific CET
> instructions like RSTORSSP/SAVEPREVSSP. However during some Linux
> operations the kernel will need to write to them directly using the ring-0 only
> WRUSS instruction.
> 
> A shadow stack restore token marks a restore point of the shadow stack, and
> the address in a token must point directly above the token, which is within
> the same shadow stack. This is distinctly different from other pointers
> on the shadow stack, since those pointers point to executable code areas.
> 
> Introduce token setup and verify routines. Also introduce WRUSS, which is
> a kernel-mode instruction but writes directly to user shadow stack.
> 
> In future patches that enable shadow stack to work with signals, the kernel
> will need something to denote the point in the stack where sigreturn may be
> called. This will prevent attackers calling sigreturn at arbitrary places
> in the stack, in order to help prevent SROP attacks.
> 
> To do this, something that can only be written by the kernel needs to be
> placed on the shadow stack. This can be accomplished by setting bit 63 in
> the frame written to the shadow stack. Userspace return addresses can't
> have this bit set as it is in the kernel range. It also can't be a
> valid restore token.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
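
The two value formats described in the quoted changelog can be sketched as
plain arithmetic. This is an illustration, not the patch's code: the helper
names are hypothetical, and treating bit 0 of a restore token as the 64-bit
mode flag is an assumption taken from the hardware token format rather than
from this changelog. It does agree with the selftest output posted in this
thread (new_ssp = 7ff19be0dff8, *new_ssp = 7ff19be0e001).

```c
#include <assert.h>
#include <stdint.h>

#define SHSTK_TOKEN_MODE_64BIT	(1ULL << 0)	/* assumed mode flag */
#define SHSTK_KERNEL_DATA_BIT	(1ULL << 63)	/* bit 63, per the changelog */

/* A restore token stored at token_addr points directly above itself. */
static uint64_t restore_token_for(uint64_t token_addr)
{
	assert((token_addr & 7) == 0);	/* tokens are 8-byte aligned */
	return (token_addr + sizeof(uint64_t)) | SHSTK_TOKEN_MODE_64BIT;
}

/* Verify that the value found at token_addr is a well-formed token. */
static int restore_token_valid(uint64_t token_addr, uint64_t value)
{
	return value == restore_token_for(token_addr);
}

/* Kernel-written marker: userspace return addresses never have bit 63. */
static uint64_t kernel_marker(uint64_t data)
{
	return data | SHSTK_KERNEL_DATA_BIT;
}

static int is_kernel_marker(uint64_t entry)
{
	return (entry & SHSTK_KERNEL_DATA_BIT) != 0;
}
```

Because a restore token holds a userspace address, it never has bit 63 set,
which is why the marker cannot be confused with either a return address or
a token.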

-- 
Kees Cook


* Re: [PATCH v5 32/39] x86/shstk: Support WRSS for userspace
  2023-01-19 21:23 ` [PATCH v5 32/39] x86/shstk: Support WRSS for userspace Rick Edgecombe
@ 2023-01-20  1:06   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  1:06 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe

On Thu, Jan 19, 2023 at 01:23:10PM -0800, Rick Edgecombe wrote:
> For the current shadow stack implementation, shadow stack contents can't
> easily be provisioned with arbitrary data. This property helps apps
> protect themselves better, but also restricts any potential apps that may
> want to do exotic things at the expense of a little security.
> 
> The x86 shadow stack feature introduces a new instruction, WRSS, which
> can be enabled to write directly to shadow stack permissioned memory from
> userspace. Allow it to get enabled via the prctl interface.
> 
> Only enable the userspace WRSS instruction, which allows writes to
> userspace shadow stacks from userspace. Do not allow it to be enabled
> independently of shadow stack, as HW does not support using WRSS when
> shadow stack is disabled.
> 
> From a fault handler perspective, WRSS will behave very similar to WRUSS,
> which is treated like a user access from a #PF err code perspective.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


* Re: [PATCH v5 31/39] x86/shstk: Introduce map_shadow_stack syscall
  2023-01-19 21:23 ` [PATCH v5 31/39] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
@ 2023-01-20  1:07   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  1:07 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe

On Thu, Jan 19, 2023 at 01:23:09PM -0800, Rick Edgecombe wrote:
> When operating with shadow stacks enabled, the kernel will automatically
> allocate shadow stacks for new threads, however in some cases userspace
> will need additional shadow stacks. The main example of this is the
> ucontext family of functions, which require userspace allocating and
> pivoting to userspace managed stacks.
> 
> Unlike most other user memory permissions, shadow stacks need to be
> provisioned with special data in order to be useful. They need to be set up
> with a restore token so that userspace can pivot to them via the RSTORSSP
> instruction. But, the security design of shadow stacks is that they
> should not be written to except in limited circumstances. This presents a
> problem for userspace, as to how userspace can provision this special
> data, without allowing for the shadow stack to be generally writable.
> 
> Previously, a new PROT_SHADOW_STACK was attempted, which could be
> mprotect()ed from RW permissions after the data was provisioned. This was
> found to not be secure enough, as other threads could write to the
> shadow stack during the writable window.
> 
> The kernel can use a special instruction, WRUSS, to write directly to
> userspace shadow stacks. So the solution can be that memory can be mapped
> as shadow stack permissions from the beginning (never generally writable
> in userspace), and the kernel itself can write the restore token.
> 
> First, a new madvise() flag was explored, which could operate on the
> PROT_SHADOW_STACK memory. This had several downsides:
> 1. Extra checks were needed in mprotect() to prevent writable memory from
>    ever becoming PROT_SHADOW_STACK.
> 2. Extra checks/vma state were needed in the new madvise() to prevent
>    restore tokens being written into the middle of pre-used shadow stacks.
>    It is ideal to prevent restore tokens being added at arbitrary
>    locations, so the check was to make sure the shadow stack had never been
>    written to.
> 3. It stood out from the rest of the madvise flags, as more of a direct
>    action than a hint at future desired behavior.
> 
> So rather than repurpose two existing syscalls (mmap, madvise) that don't
> quite fit, just implement a new map_shadow_stack syscall to allow
> userspace to map and set up new shadow stacks in one step. While ucontext
> is the primary motivator, userspace may have other unforeseen reasons to
> set up its own shadow stacks using the WRSS instruction. Towards this,
> provide a flag so that stacks can be optionally set up securely for the
> common case of ucontext without enabling WRSS. Or potentially have the
> common case of ucontext without enabling WRSS. Or potentially have the
> kernel set up the shadow stack in some new way.
> 
> The following example demonstrates how to create a new shadow stack with
> map_shadow_stack:
> void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
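
A minimal sketch of calling the new syscall from userspace through
syscall(2) follows. The wrapper name shstk_map() and the helper
shstk_token_addr() are hypothetical. The syscall number 453 and the flag
value are assumptions taken from later mainline kernels and are only used
when the installed headers do not define them. Placing the token in the
last 8 bytes of the mapping is inferred from the selftest output posted in
this thread, not stated in the changelog.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 453		/* assumed syscall number */
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN (1ULL << 0)	/* assumed flag value */
#endif

/* Returns the new mapping, or MAP_FAILED with errno set. */
static void *shstk_map(void *addr, size_t size, unsigned long flags)
{
#if defined(__linux__)
	long ret = syscall(__NR_map_shadow_stack, addr, size, flags);

	return ret == -1 ? MAP_FAILED : (void *)ret;
#else
	(void)addr; (void)size; (void)flags;
	errno = ENOSYS;
	return MAP_FAILED;
#endif
}

/* Where the restore token is expected: the last 8 bytes (inferred). */
static uint64_t shstk_token_addr(uint64_t base, uint64_t size)
{
	assert(size >= sizeof(uint64_t));
	return base + size - sizeof(uint64_t);
}
```

On a kernel or CPU without shadow stack support the call is expected to
fail, so callers should check for MAP_FAILED and fall back gracefully. On
success, a program would pivot to the address given by shstk_token_addr()
with RSTORSSP.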

-- 
Kees Cook


* Re: [PATCH v5 37/39] x86: Add PTRACE interface for shadow stack
  2023-01-19 21:23 ` [PATCH v5 37/39] x86: Add PTRACE interface for shadow stack Rick Edgecombe
@ 2023-01-20  1:08   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  1:08 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:23:15PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Some applications (like GDB) would like to tweak shadow stack state via
> ptrace. This allows for existing functionality to continue to work for
> seized shadow stack applications. Provide a regset interface for
> manipulating the shadow stack pointer (SSP).
> 
> There is already ptrace functionality for accessing xstate, but this
> does not include supervisor xfeatures. So there is not a completely
> clear place for where to put the shadow stack state. Adding it to the
> user xfeatures regset would complicate that code, as it currently shares
> logic with signals which should not have supervisor features.
> 
> Don't add a general supervisor xfeature regset like the user one,
> because it is better to maintain flexibility for other supervisor
> xfeatures to define their own interface. For example, an xfeature may
> decide not to expose all of its state to userspace, as is actually the
> case for shadow stack ptrace functionality. A lot of enum values remain
> to be used, so just put it in a dedicated shadow stack regset.
> 
> The only downside to not having a generic supervisor xfeature regset,
> is that apps need to be enlightened of any new supervisor xfeature
> exposed this way (i.e. they can't try to have generic save/restore
> logic). But maybe that is a good thing, because they have to think
> through each new xfeature instead of encountering issues when a new
> supervisor xfeature is added.
> 
> By adding a shadow stack regset, it also has the effect of including the
> shadow stack state in a core dump, which could be useful for debugging.
> 
> The shadow stack specific xstate includes the SSP, and the shadow stack
> and WRSS enablement status. Enabling shadow stack or WRSS in the kernel
> involves more than just flipping the bit. The kernel is made aware that
> it has to do extra things when cloning or handling signals. That logic
> is triggered off of separate feature enablement state kept in the task
> struct. So flipping on HW shadow stack enforcement without notifying
> the kernel to change its behavior would severely limit what an application
> could do without crashing, and the results would depend on kernel
> internal implementation details. There is also no known use for controlling
> this state via ptrace today. So only expose the SSP, which is something
> that userspace already has indirect control over.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


* Re: [PATCH v5 39/39] x86/shstk: Add ARCH_SHSTK_STATUS
  2023-01-19 21:23 ` [PATCH v5 39/39] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe
@ 2023-01-20  1:08   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-20  1:08 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe

On Thu, Jan 19, 2023 at 01:23:17PM -0800, Rick Edgecombe wrote:
> CRIU and GDB need to get the current shadow stack and WRSS enablement
> status. This information is already available via /proc/pid/status, but
> this is inconvenient for CRIU because it involves parsing the text output
> in an area of the code where this is difficult. Provide a status
> arch_prctl(), ARCH_SHSTK_STATUS, for retrieving the status. Have arg2 be a
> userspace address, and make the new arch_prctl simply copy the features
> out to userspace.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Suggested-by: Mike Rapoport <rppt@kernel.org>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
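
As a sketch of how a tool like CRIU might use the new call instead of
parsing /proc/pid/status: arg2 is the address the kernel copies the feature
bits to. The function names are hypothetical, and the numeric values of
ARCH_SHSTK_STATUS, ARCH_SHSTK_SHSTK and ARCH_SHSTK_WRSS are assumptions
taken from the later mainline uapi header.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS 0x5005	/* assumed value */
#endif
#ifndef ARCH_SHSTK_SHSTK
#define ARCH_SHSTK_SHSTK (1ULL << 0)	/* assumed value */
#endif
#ifndef ARCH_SHSTK_WRSS
#define ARCH_SHSTK_WRSS  (1ULL << 1)	/* assumed value */
#endif

/* Copies the current thread's feature bits out; 0 on success, -1 on error. */
static int shstk_query_status(uint64_t *features)
{
	assert(features != NULL);
#if defined(__linux__) && defined(__x86_64__) && defined(SYS_arch_prctl)
	unsigned long out = 0;

	/* arg2 is a userspace address that receives the features. */
	if (syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, &out) != 0)
		return -1;
	*features = out;
	return 0;
#else
	errno = ENOSYS;
	return -1;
#endif
}

static int shstk_is_enabled(uint64_t features)
{
	return (features & ARCH_SHSTK_SHSTK) != 0;
}

static int wrss_is_enabled(uint64_t features)
{
	return (features & ARCH_SHSTK_WRSS) != 0;
}
```

On kernels without this series the query is expected to fail with an error,
which a caller can treat as "no shadow stack features enabled".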

-- 
Kees Cook


* Re: [PATCH v5 00/39] Shadow stacks for userspace
  2023-01-19 22:26 ` [PATCH v5 00/39] Shadow stacks for userspace Andrew Morton
@ 2023-01-20 17:27   ` Edgecombe, Rick P
  2023-01-20 19:19     ` Kees Cook
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-20 17:27 UTC (permalink / raw)
  To: akpm
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	bp, oleg, hjl.tools, pavel, Lutomirski, Andy, linux-doc, arnd,
	tglx, Schimpe, Christina, mike.kravetz, x86, Yang, Weijiang,
	jamorris, john.allen, rppt, andrew.cooper3, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, 2023-01-19 at 14:26 -0800, Andrew Morton wrote:
> On Thu, 19 Jan 2023 13:22:38 -0800 Rick Edgecombe <
> rick.p.edgecombe@intel.com> wrote:
> 
> > SHSTK
> 
> Sounds like me trying to swear in Russian while drunk.
> 
> Is there any chance of s/shstk/shadow_stack/g?

I'm fine with the name change. I think shstk got debated and picked
early in the history of the series before I got involved. "shstk" is
nice and short, but it's not completely clear what it is unless you
already know about shadow stack. So there is a tradeoff of clarity and
line length/wrapping. Does anyone else have any strong opinions?

* Re: [PATCH v5 00/39] Shadow stacks for userspace
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (39 preceding siblings ...)
  2023-01-19 22:26 ` [PATCH v5 00/39] Shadow stacks for userspace Andrew Morton
@ 2023-01-20 17:48 ` John Allen
  2023-01-22  8:20 ` Mike Rapoport
  41 siblings, 0 replies; 120+ messages in thread
From: John Allen @ 2023-01-20 17:48 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, kcc, eranian, rppt, jamorris,
	dethoma, akpm, Andrew.Cooper3, christina.schimpe

On 1/19/23 3:22 PM, Rick Edgecombe wrote:
> I left tested-by tags in place per discussion with testers. Testers, please
> retest.

Re-tested on my AMD system (Dell PowerEdge R6515 w/ EPYC 7713) and it looks
like everything is still working properly.

The selftests seem to run cleanly:

[INFO]	new_ssp = 7ff19be0dff8, *new_ssp = 7ff19be0e001
[INFO]	changing ssp from 7ff19c7f1ff0 to 7ff19be0dff8
[INFO]	ssp is now 7ff19be0e000
[OK]	Shadow stack pivot
[OK]	Shadow stack faults
[INFO]	Corrupting shadow stack
[INFO]	Generated shadow stack violation successfully
[OK]	Shadow stack violation test
[INFO]	Gup read -> shstk access success
[INFO]	Gup write -> shstk access success
[INFO]	Violation from normal write
[INFO]	Gup read -> write access success
[INFO]	Violation from normal write
[INFO]	Gup write -> write access success
[INFO]	Cow gup write -> write access success
[OK]	Shadow gup test
[INFO]	Violation from shstk access
[OK]	mprotect() test
[OK]	Userfaultfd test
[OK]	32 bit test

Additionally, I could see the control protection messages in dmesg when
running the shstk violation test from here:
https://gitlab.com/cet-software/cet-smoke-test

ld-linux-x86-64[99764] control protection ip:401139 sp:7fff025507d8 ssp:7f186e017fd8 error:1(near ret) in shstk1[401000+1000]

Tested-by: John Allen <john.allen@amd.com>


* Re: [PATCH v5 00/39] Shadow stacks for userspace
  2023-01-20 17:27   ` Edgecombe, Rick P
@ 2023-01-20 19:19     ` Kees Cook
  2023-01-25 19:46       ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: Kees Cook @ 2023-01-20 19:19 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: akpm, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	dave.hansen, kirill.shutemov, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch, bp, oleg,
	hjl.tools, pavel, Lutomirski, Andy, linux-doc, arnd, tglx,
	Schimpe, Christina, mike.kravetz, x86, Yang, Weijiang, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov

On Fri, Jan 20, 2023 at 05:27:30PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2023-01-19 at 14:26 -0800, Andrew Morton wrote:
> > On Thu, 19 Jan 2023 13:22:38 -0800 Rick Edgecombe <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > > SHSTK
> > 
> > Sounds like me trying to swear in Russian while drunk.
> > 
> > Is there any chance of s/shstk/shadow_stack/g?
> 
> I'm fine with the name change. I think shstk got debated and picked
> early in the history of the series before I got involved. "shstk" is
> nice and short, but it's not completely clear what it is unless you
> already know about shadow stack. So there is a tradeoff of clarity and
> line length/wrapping. Does anyone else have any strong opinions?

I prefer SHSTK because it specifically means x86's hardware shadow
stack from CET. Lots of things can implement (and have implemented) things
called "shadow stack".

-- 
Kees Cook


* Re: [PATCH v5 00/39] Shadow stacks for userspace
  2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
                   ` (40 preceding siblings ...)
  2023-01-20 17:48 ` John Allen
@ 2023-01-22  8:20 ` Mike Rapoport
  41 siblings, 0 replies; 120+ messages in thread
From: Mike Rapoport @ 2023-01-22  8:20 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe

On Thu, Jan 19, 2023 at 01:22:38PM -0800, Rick Edgecombe wrote:
> Hi,
> 
> This series implements Shadow Stacks for userspace using x86's Control-flow 
> Enforcement Technology (CET). CET consists of two related security features: 
> shadow stacks and indirect branch tracking. This series implements just the 
> shadow stack part of this feature, and just for userspace.

Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>

> Previous version [1].
> 
> [0] https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
> [1] https://lore.kernel.org/lkml/20221203003606.6838-1-rick.p.edgecombe@intel.com/
> 
> Kirill A. Shutemov (1):
>   x86: Introduce userspace API for shadow stack
> 
> Mike Rapoport (1):
>   x86/shstk: Add ARCH_SHSTK_UNLOCK
> 
> Rick Edgecombe (14):
>   x86/fpu: Add helper for modifying xstate
>   x86/mm: Introduce _PAGE_COW
>   x86/mm: Start actually marking _PAGE_COW
>   mm: Handle faultless write upgrades for shstk
>   mm: Don't allow write GUPs to shadow stack memory
>   x86/mm: Introduce MAP_ABOVE4G
>   mm: Warn on shadow stack memory in wrong vma
>   x86/shstk: Introduce map_shadow_stack syscall
>   x86/shstk: Support WRSS for userspace
>   x86: Expose thread features in /proc/$PID/status
>   x86/shstk: Wire in shadow stack interface
>   selftests/x86: Add shadow stack test
>   x86/fpu: Add helper for initing features
>   x86/shstk: Add ARCH_SHSTK_STATUS
> 
> Yu-cheng Yu (23):
>   Documentation/x86: Add CET shadow stack description
>   x86/shstk: Add Kconfig option for shadow stack
>   x86/cpufeatures: Add CPU feature flags for shadow stacks
>   x86/cpufeatures: Enable CET CR4 bit for shadow stack
>   x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
>   x86: Add user control-protection fault handler
>   x86/mm: Remove _PAGE_DIRTY from kernel RO pages
>   x86/mm: Move pmd_write(), pud_write() up in the file
>   x86/mm: Update pte_modify for _PAGE_COW
>   x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
>     transition from _PAGE_DIRTY to _PAGE_COW
>   mm: Move VM_UFFD_MINOR_BIT from 37 to 38
>   mm: Introduce VM_SHADOW_STACK for shadow stack memory
>   x86/mm: Check shadow stack page fault errors
>   x86/mm: Update maybe_mkwrite() for shadow stack
>   mm: Fixup places that call pte_mkwrite() directly
>   mm: Add guard pages around a shadow stack.
>   mm/mmap: Add shadow stack pages to memory accounting
>   mm: Re-introduce vm_flags to do_mmap()
>   x86/shstk: Add user-mode shadow stack support
>   x86/shstk: Handle thread shadow stack
>   x86/shstk: Introduce routines modifying shstk
>   x86/shstk: Handle signals for shadow stack
>   x86: Add PTRACE interface for shadow stack
> 
>  Documentation/filesystems/proc.rst            |   1 +
>  Documentation/x86/index.rst                   |   1 +
>  Documentation/x86/shstk.rst                   | 176 +++++
>  arch/arm/kernel/signal.c                      |   2 +-
>  arch/arm64/kernel/signal.c                    |   2 +-
>  arch/arm64/kernel/signal32.c                  |   2 +-
>  arch/sparc/kernel/signal32.c                  |   2 +-
>  arch/sparc/kernel/signal_64.c                 |   2 +-
>  arch/x86/Kconfig                              |  24 +
>  arch/x86/Kconfig.assembler                    |   5 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
>  arch/x86/include/asm/cpufeatures.h            |   2 +
>  arch/x86/include/asm/disabled-features.h      |  16 +-
>  arch/x86/include/asm/fpu/api.h                |   9 +
>  arch/x86/include/asm/fpu/regset.h             |   7 +-
>  arch/x86/include/asm/fpu/sched.h              |   3 +-
>  arch/x86/include/asm/fpu/types.h              |  16 +-
>  arch/x86/include/asm/fpu/xstate.h             |   6 +-
>  arch/x86/include/asm/idtentry.h               |   2 +-
>  arch/x86/include/asm/mmu_context.h            |   2 +
>  arch/x86/include/asm/msr.h                    |  11 +
>  arch/x86/include/asm/pgtable.h                | 338 ++++++++-
>  arch/x86/include/asm/pgtable_types.h          |  65 +-
>  arch/x86/include/asm/processor.h              |   8 +
>  arch/x86/include/asm/shstk.h                  |  40 ++
>  arch/x86/include/asm/special_insns.h          |  13 +
>  arch/x86/include/asm/tlbflush.h               |   3 +-
>  arch/x86/include/asm/trap_pf.h                |   2 +
>  arch/x86/include/asm/traps.h                  |  12 +
>  arch/x86/include/uapi/asm/mman.h              |   4 +
>  arch/x86/include/uapi/asm/prctl.h             |  12 +
>  arch/x86/kernel/Makefile                      |   4 +
>  arch/x86/kernel/cet.c                         | 152 ++++
>  arch/x86/kernel/cpu/common.c                  |  35 +-
>  arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
>  arch/x86/kernel/cpu/proc.c                    |  23 +
>  arch/x86/kernel/fpu/core.c                    |  59 +-
>  arch/x86/kernel/fpu/regset.c                  |  87 +++
>  arch/x86/kernel/fpu/xstate.c                  | 148 ++--
>  arch/x86/kernel/fpu/xstate.h                  |   6 +
>  arch/x86/kernel/idt.c                         |   2 +-
>  arch/x86/kernel/process.c                     |  18 +-
>  arch/x86/kernel/process_64.c                  |   9 +-
>  arch/x86/kernel/ptrace.c                      |  12 +
>  arch/x86/kernel/shstk.c                       | 492 +++++++++++++
>  arch/x86/kernel/signal.c                      |   1 +
>  arch/x86/kernel/signal_32.c                   |   2 +-
>  arch/x86/kernel/signal_64.c                   |   8 +-
>  arch/x86/kernel/sys_x86_64.c                  |   6 +-
>  arch/x86/kernel/traps.c                       |  87 ---
>  arch/x86/mm/fault.c                           |  38 +
>  arch/x86/mm/pat/set_memory.c                  |   2 +-
>  arch/x86/mm/pgtable.c                         |   6 +
>  arch/x86/xen/enlighten_pv.c                   |   2 +-
>  arch/x86/xen/xen-asm.S                        |   2 +-
>  fs/aio.c                                      |   2 +-
>  fs/proc/array.c                               |   6 +
>  fs/proc/task_mmu.c                            |   3 +
>  include/linux/mm.h                            |  59 +-
>  include/linux/mman.h                          |   4 +
>  include/linux/pgtable.h                       |  35 +
>  include/linux/proc_fs.h                       |   2 +
>  include/linux/syscalls.h                      |   1 +
>  include/uapi/asm-generic/siginfo.h            |   3 +-
>  include/uapi/asm-generic/unistd.h             |   2 +-
>  include/uapi/linux/elf.h                      |   2 +
>  ipc/shm.c                                     |   2 +-
>  kernel/sys_ni.c                               |   1 +
>  mm/gup.c                                      |   2 +-
>  mm/huge_memory.c                              |  12 +-
>  mm/memory.c                                   |   7 +-
>  mm/migrate_device.c                           |   4 +-
>  mm/mmap.c                                     |  12 +-
>  mm/nommu.c                                    |   4 +-
>  mm/userfaultfd.c                              |  10 +-
>  mm/util.c                                     |   2 +-
>  tools/testing/selftests/x86/Makefile          |   4 +-
>  .../testing/selftests/x86/test_shadow_stack.c | 667 ++++++++++++++++++
>  78 files changed, 2578 insertions(+), 259 deletions(-)
>  create mode 100644 Documentation/x86/shstk.rst
>  create mode 100644 arch/x86/include/asm/shstk.h
>  create mode 100644 arch/x86/kernel/cet.c
>  create mode 100644 arch/x86/kernel/shstk.c
>  create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c
> 
> -- 
> 2.17.1
> 


* Re: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-19 21:23 ` [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
@ 2023-01-23  9:10   ` David Hildenbrand
  2023-01-23 10:45     ` Florian Weimer
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-23  9:10 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe

On 19.01.23 22:23, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> Shadow stack memory is writable only in very specific, controlled ways.
> However, since it is writable, the kernel treats it as such. As a result
> there remain many ways for userspace to trigger the kernel to write to
> shadow stacks via get_user_pages(, FOLL_WRITE) operations. To make this a
> little less exposed, block writable GUPs for shadow stack VMAs.
> 
> Still allow FOLL_FORCE to write through shadow stack protections, as it
> does for read-only protections.

So an app can simply modify the shadow stack itself by writing to 
/proc/self/mem ?

Is that really intended? Looks like a security hole to me at first sight,
but maybe I am missing something important.
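
For context, the existing behavior that the quoted changelog compares
against ("as it does for read-only protections") can be observed with a
small userspace probe. This sketch does not touch shadow stack memory; it
only checks whether a write through /proc/self/mem lands in an ordinary
read-only private mapping. The function name is hypothetical, and the
result depends on kernel configuration and on /proc being available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a write via /proc/self/mem modified a PROT_READ page,
 * 0 if the write was refused, and -1 if the probe could not be set up.
 */
static int proc_mem_write_readonly(void)
{
	long page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return -1;

	unsigned char *p = mmap(NULL, (size_t)page, PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	int fd = open("/proc/self/mem", O_RDWR);
	if (fd < 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	/* The file offset in /proc/self/mem is the virtual address. */
	unsigned char byte = 0x5a;
	ssize_t n = pwrite(fd, &byte, 1, (off_t)(uintptr_t)p);
	int result = (n == 1 && p[0] == byte) ? 1 : 0;

	close(fd);
	munmap(p, (size_t)page);
	return result;
}
```

A return value of 1 shows the forced write bypassing the page's read-only
protection, which is the behavior the question above asks about extending
to shadow stack memory.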

-- 
Thanks,

David / dhildenb



* Re: [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW
  2023-01-19 21:22 ` [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
  2023-01-20  0:55   ` Kees Cook
@ 2023-01-23  9:16   ` David Hildenbrand
  2023-01-23  9:28   ` David Hildenbrand
  2 siblings, 0 replies; 120+ messages in thread
From: David Hildenbrand @ 2023-01-23  9:16 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: Yu-cheng Yu

On 19.01.23 22:22, Rick Edgecombe wrote:
> Some OSes have a greater dependence on software available bits in PTEs than
> Linux. That left the hardware architects looking for a way to represent a
> new memory type (shadow stack) within the existing bits. They chose to
> repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
> shadow stack memory, Linux should avoid creating memory with this PTE bit
> combination unless it intends for it to be shadow stack.
> 
> The reason it's lightly used is that Dirty=1 is normally set by HW
> _before_ a write. A write with a Write=0 PTE would typically only generate
> a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
> generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
> supports shadow stacks will no longer exhibit this oddity.
> 
> So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
> in places where Linux normally creates Write=0,Dirty=1, it can use the
> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
> words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
> Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1.
> Further differentiated by VMA flags, these PTE bit combinations would be
> set as follows for various types of memory:
> 
> (Write=0,Cow=1,Dirty=0):
>   - A modified, copy-on-write (COW) page. Previously when a typical
>     anonymous writable mapping was made COW via fork(), the kernel would
>     mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
>     happens in copy_present_pte().
>   - A R/O page that has been COW'ed. The user page is in a R/O VMA,
>     and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
>     handler creates a copy of the page and sets the new copy's PTE as
>     Write=0 and Cow=1.
>   - A shared shadow stack PTE. When a shadow stack page is being shared
>     among processes (this happens at fork()), its PTE is made Dirty=0, so
>     the next shadow stack access causes a fault, and the page is
>     duplicated and Dirty=1 is set again. This is the COW equivalent for
>     shadow stack pages, even though it's copy-on-access rather than
>     copy-on-write.
> 
> (Write=0,Cow=0,Dirty=1):
>   - A shadow stack PTE.
>   - A Cow PTE created when a processor without shadow stack support set
>     Dirty=1.
> 
> There are six bits left available to software in the 64-bit PTE after
> consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
> because shadow stacks are not enabled there.
> 
> Implement only the infrastructure for _PAGE_COW. Changes to start
> creating _PAGE_COW PTEs will follow once other pieces are in place.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> 
> v5:
>   - Fix log, comments and whitespace (Boris)
>   - Remove capitalization on shadow stack (Boris)
> 
> v4:
>   - Teach pte_flags_need_flush() about _PAGE_COW bit
>   - Break apart patch for better bisectability
> 
> v3:
>   - Add comment around _PAGE_TABLE in response to comment
>     from (Andrew Cooper)
>   - Check for PSE in pmd_shstk (Andrew Cooper)
>   - Get to the point quicker in commit log (Andrew Cooper)
>   - Clarify and reorder commit log for why the PTE bit examples have
>     multiple entries. Apply same changes for comment. (peterz)
>   - Fix comment that implied dirty bit for COW was a specific x86 thing
>     (peterz)
>   - Fix swapping of Write/Dirty (PeterZ)
> 
> v2:
>   - Update commit log with comments (Dave Hansen)
>   - Add comments in code to explain pte modification code better (Dave)
>   - Clarify info on the meaning of various Write,Cow,Dirty combinations
> 
>   arch/x86/include/asm/pgtable.h       | 78 ++++++++++++++++++++++++++++
>   arch/x86/include/asm/pgtable_types.h | 59 +++++++++++++++++++--
>   arch/x86/include/asm/tlbflush.h      |  3 +-
>   3 files changed, 134 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index b39f16c0d507..6d2f612c04b5 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -301,6 +301,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>   	return native_make_pte(v & ~clear);
>   }
>   
> +/*
> + * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in the case
> + * of X86_FEATURE_USER_SHSTK, the software COW bit is used, since the
> + * Dirty=1,Write=0 will result in the memory being treated as shadow stack
> + * by the HW. So when creating COW memory, a software bit is used
> + * _PAGE_BIT_COW. The following functions pte_mkcow() and pte_clear_cow()
> + * take a PTE marked conventionally COW (Dirty=1) and transition it to the
> + * shadow stack compatible version of COW (Cow=1).
> + */
> +static inline pte_t pte_mkcow(pte_t pte)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pte;
> +
> +	pte = pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_set_flags(pte, _PAGE_COW);
> +}
> +
> +static inline pte_t pte_clear_cow(pte_t pte)
> +{
> +	/*
> +	 * _PAGE_COW is unnecessary on !X86_FEATURE_USER_SHSTK kernels, since
> +	 * the HW dirty bit can be used without creating shadow stack memory.
> +	 * See the _PAGE_COW definition for more details.
> +	 */
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pte;
> +
> +	/*
> +	 * PTE is getting copied-on-write, so it will be dirtied
> +	 * if writable, or made shadow stack if shadow stack and
> +	 * being copied on access. Set the dirty bit for both
> +	 * cases.
> +	 */
> +	pte = pte_set_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_COW);
> +}
> +
>   #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>   static inline int pte_uffd_wp(pte_t pte)
>   {
> @@ -413,6 +451,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
>   	return native_make_pmd(v & ~clear);
>   }
>   
> +/* See comments above pte_mkcow() */
> +static inline pmd_t pmd_mkcow(pmd_t pmd)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pmd;
> +
> +	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
> +	return pmd_set_flags(pmd, _PAGE_COW);
> +}
> +
> +/* See comments above pte_mkcow() */
> +static inline pmd_t pmd_clear_cow(pmd_t pmd)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pmd;
> +
> +	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
> +	return pmd_clear_flags(pmd, _PAGE_COW);
> +}
> +
>   #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>   static inline int pmd_uffd_wp(pmd_t pmd)
>   {
> @@ -484,6 +542,26 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
>   	return native_make_pud(v & ~clear);
>   }
>   
> +/* See comments above pte_mkcow() */
> +static inline pud_t pud_mkcow(pud_t pud)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pud;
> +
> +	pud = pud_clear_flags(pud, _PAGE_DIRTY);
> +	return pud_set_flags(pud, _PAGE_COW);
> +}
> +
> +/* See comments above pte_mkcow() */
> +static inline pud_t pud_clear_cow(pud_t pud)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pud;
> +
> +	pud = pud_set_flags(pud, _PAGE_DIRTY);
> +	return pud_clear_flags(pud, _PAGE_COW);
> +}
> +
>   static inline pud_t pud_mkold(pud_t pud)
>   {
>   	return pud_clear_flags(pud, _PAGE_ACCESSED);
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 0646ad00178b..5c3f942865d9 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -21,7 +21,8 @@
>   #define _PAGE_BIT_SOFTW2	10	/* " */
>   #define _PAGE_BIT_SOFTW3	11	/* " */
>   #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
> -#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
> +#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
> +#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
>   #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
>   #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
>   #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
> @@ -34,6 +35,15 @@
>   #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
>   #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
>   
> +/*
> + * Indicates a copy-on-write page.
> + */
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +#define _PAGE_BIT_COW		_PAGE_BIT_SOFTW5 /* copy-on-write */
> +#else
> +#define _PAGE_BIT_COW		0
> +#endif
> +
>   /* If _PAGE_BIT_PRESENT is clear, we use these: */
>   /* - if the user mapped it with PROT_NONE; pte_present gives true */
>   #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
> @@ -117,6 +127,40 @@
>   #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
>   #endif
>   
> +/*
> + * The hardware requires shadow stack to be read-only and Dirty.
> + * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
> + * from shadow stack PTEs:

Is that really required?

For anon pages, we have PG_anon_exclusive, which can tell you whether the 
page is "certainly exclusive" (no COW necessary) vs. "maybe shared" 
(COW maybe necessary).

Why isn't that sufficient to make the same decisions here?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW
  2023-01-19 21:22 ` [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
  2023-01-20  0:55   ` Kees Cook
  2023-01-23  9:16   ` David Hildenbrand
@ 2023-01-23  9:28   ` David Hildenbrand
  2023-01-23 20:56     ` Edgecombe, Rick P
  2 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-23  9:28 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: Yu-cheng Yu

On 19.01.23 22:22, Rick Edgecombe wrote:
> Some OSes have a greater dependence on software available bits in PTEs than
> Linux. That left the hardware architects looking for a way to represent a
> new memory type (shadow stack) within the existing bits. They chose to
> repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
> shadow stack memory, Linux should avoid creating memory with this PTE bit
> combination unless it intends for it to be shadow stack.
> 
> The reason it's lightly used is that Dirty=1 is normally set by HW
> _before_ a write. A write with a Write=0 PTE would typically only generate
> a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
> generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
> supports shadow stacks will no longer exhibit this oddity.
> 
> So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
> in places where Linux normally creates Write=0,Dirty=1, it can use the
> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
> words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
> Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1.
> Further differentiated by VMA flags, these PTE bit combinations would be
> set as follows for various types of memory:
> 
> (Write=0,Cow=1,Dirty=0):
>   - A modified, copy-on-write (COW) page. Previously when a typical
>     anonymous writable mapping was made COW via fork(), the kernel would
>     mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
>     happens in copy_present_pte().
>   - A R/O page that has been COW'ed. The user page is in a R/O VMA,
>     and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
>     handler creates a copy of the page and sets the new copy's PTE as
>     Write=0 and Cow=1.
>   - A shared shadow stack PTE. When a shadow stack page is being shared
>     among processes (this happens at fork()), its PTE is made Dirty=0, so
>     the next shadow stack access causes a fault, and the page is
>     duplicated and Dirty=1 is set again. This is the COW equivalent for
>     shadow stack pages, even though it's copy-on-access rather than
>     copy-on-write.
> 
> (Write=0,Cow=0,Dirty=1):
>   - A shadow stack PTE.
>   - A Cow PTE created when a processor without shadow stack support set
>     Dirty=1.
> 
> There are six bits left available to software in the 64-bit PTE after
> consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
> because shadow stacks are not enabled there.
> 
> Implement only the infrastructure for _PAGE_COW. Changes to start
> creating _PAGE_COW PTEs will follow once other pieces are in place.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> 
> v5:
>   - Fix log, comments and whitespace (Boris)
>   - Remove capitalization on shadow stack (Boris)
> 
> v4:
>   - Teach pte_flags_need_flush() about _PAGE_COW bit
>   - Break apart patch for better bisectability
> 
> v3:
>   - Add comment around _PAGE_TABLE in response to comment
>     from (Andrew Cooper)
>   - Check for PSE in pmd_shstk (Andrew Cooper)
>   - Get to the point quicker in commit log (Andrew Cooper)
>   - Clarify and reorder commit log for why the PTE bit examples have
>     multiple entries. Apply same changes for comment. (peterz)
>   - Fix comment that implied dirty bit for COW was a specific x86 thing
>     (peterz)
>   - Fix swapping of Write/Dirty (PeterZ)
> 
> v2:
>   - Update commit log with comments (Dave Hansen)
>   - Add comments in code to explain pte modification code better (Dave)
>   - Clarify info on the meaning of various Write,Cow,Dirty combinations
> 
>   arch/x86/include/asm/pgtable.h       | 78 ++++++++++++++++++++++++++++
>   arch/x86/include/asm/pgtable_types.h | 59 +++++++++++++++++++--
>   arch/x86/include/asm/tlbflush.h      |  3 +-
>   3 files changed, 134 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index b39f16c0d507..6d2f612c04b5 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -301,6 +301,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>   	return native_make_pte(v & ~clear);
>   }
>   
> +/*
> + * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in the case
> + * of X86_FEATURE_USER_SHSTK, the software COW bit is used, since the
> + * Dirty=1,Write=0 will result in the memory being treated as shadow stack
> + * by the HW. So when creating COW memory, a software bit is used
> + * _PAGE_BIT_COW. The following functions pte_mkcow() and pte_clear_cow()
> + * take a PTE marked conventionally COW (Dirty=1) and transition it to the
> + * shadow stack compatible version of COW (Cow=1).
> + */

TBH, I find that all highly confusing.

Dirty=1,Write=0 does not indicate a COW page reliably. You could have 
both false negatives and false positives.

False negative: fork() on a clean anon page.

False positive: wrprotect() of a dirty anon page.


I wonder if it really has to be that complicated: what you really want 
to achieve is to disallow "Dirty=1,Write=0" if it's not a shadow stack 
page, correct?
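
The two cases above can be made concrete with a small userspace model. This is only a sketch and not kernel code: every toy_ name is hypothetical, and the bit positions merely mirror the x86 Present (bit 0), Write (bit 1) and Dirty (bit 6) bits. It shows that write-protecting a PTE is the same bit operation whether it comes from fork() or from a plain write-protect, so Dirty=1,Write=0 alone cannot say whether COW is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE bits; positions mirror x86 but this is a userspace model. */
#define TOY_PAGE_PRESENT (UINT64_C(1) << 0)
#define TOY_PAGE_RW      (UINT64_C(1) << 1)
#define TOY_PAGE_DIRTY   (UINT64_C(1) << 6)

/* The heuristic being questioned: Dirty=1,Write=0 means "COW page". */
bool toy_looks_cow(uint64_t pte)
{
	return (pte & TOY_PAGE_DIRTY) && !(pte & TOY_PAGE_RW);
}

/* Write-protecting only clears Write, regardless of who asked for it. */
uint64_t toy_wrprotect(uint64_t pte)
{
	return pte & ~TOY_PAGE_RW;
}

/* Walks through both cases and returns how many the heuristic gets wrong. */
int toy_count_misclassified(void)
{
	int wrong = 0;

	/* fork() on a clean anon page: now shared, so a write must COW. */
	uint64_t forked = toy_wrprotect(TOY_PAGE_PRESENT | TOY_PAGE_RW);
	bool forked_needs_cow = true;

	assert(!(forked & TOY_PAGE_RW));
	if (toy_looks_cow(forked) != forked_needs_cow)
		wrong++;	/* false negative: Dirty=0, heuristic says "not COW" */

	/* Write-protect of a dirty, exclusively owned anon page: no COW needed. */
	uint64_t prot = toy_wrprotect(TOY_PAGE_PRESENT | TOY_PAGE_RW |
				      TOY_PAGE_DIRTY);
	bool prot_needs_cow = false;

	if (toy_looks_cow(prot) != prot_needs_cow)
		wrong++;	/* false positive: Dirty=1,Write=0 without sharing */

	return wrong;
}
```

In this model both cases are misclassified, which is why the sharing state has to come from somewhere other than the PTE's Write and Dirty bits.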

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-19 21:22 ` [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk Rick Edgecombe
@ 2023-01-23  9:50   ` David Hildenbrand
  2023-01-23 20:47     ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-23  9:50 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: Yu-cheng Yu

On 19.01.23 22:22, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> Since shadow stack memory can be changed from userspace, is both
> VM_SHADOW_STACK and VM_WRITE. But it should not be made conventionally
> writable (i.e. pte_mkwrite()). So some code that calls pte_mkwrite() needs
> to be adjusted.
> 
> One such case is when memory is made writable without an actual write
> fault. This happens in some mprotect operations, and also prot_numa faults.
> In both cases code checks whether it should be made (conventionally)
> writable by calling vma_wants_manual_pte_write_upgrade().
> 
> One way to fix this would be have code actually check if memory is also
> VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But since
> most memory won't be shadow stack, just have simpler logic and skip this
> optimization by changing vma_wants_manual_pte_write_upgrade() to not
> return true for VM_SHADOW_STACK_MEMORY. This will simply handle all
> cases of this type.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---

Instead of having these x86-shadow stack details all over the MM space, 
was the option explored to handle this more in arch specific code?

IIUC, one way to get it working would be

1) Have a SW "shadowstack" PTE flag.
2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0".

pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions based 
on the "shadowstack" PTE flag and hide all these details from core-mm.

When mapping a shadowstack page (new page, migration, swapin, ...), 
which can be obtained by looking at the VMA flags, the first thing you'd 
do is set the "shadowstack" PTE flag.
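
A rough userspace sketch of how such arch-side helpers could behave is below. It is an illustration under stated assumptions, not the patch's code: the toy_ names, the two software bits and their positions (57 and 58) are all hypothetical, and only the Write (bit 1) and Dirty (bit 6) positions match x86. The point is that callers use the same helper names for every PTE and the "shadowstack" flag selects the encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_RW       (UINT64_C(1) << 1)	/* hardware Write */
#define TOY_PAGE_DIRTY    (UINT64_C(1) << 6)	/* hardware Dirty */
#define TOY_PAGE_SW_SHSTK (UINT64_C(1) << 57)	/* hypothetical SW "shadowstack" */
#define TOY_PAGE_SW_DIRTY (UINT64_C(1) << 58)	/* hypothetical SW "dirty" */

/* What the CPU would treat as shadow stack: Write=0,Dirty=1. */
bool toy_hw_shstk(uint64_t pte)
{
	return !(pte & TOY_PAGE_RW) && (pte & TOY_PAGE_DIRTY);
}

bool toy_pte_write(uint64_t pte)
{
	if (pte & TOY_PAGE_SW_SHSTK)
		return toy_hw_shstk(pte);
	return (pte & TOY_PAGE_RW) != 0;
}

bool toy_pte_dirty(uint64_t pte)
{
	return (pte & (TOY_PAGE_DIRTY | TOY_PAGE_SW_DIRTY)) != 0;
}

uint64_t toy_pte_wrprotect(uint64_t pte)
{
	pte &= ~TOY_PAGE_RW;
	if (pte & TOY_PAGE_DIRTY) {
		/* Park dirty in software so Write=0,Dirty=1 never appears by accident. */
		pte &= ~TOY_PAGE_DIRTY;
		pte |= TOY_PAGE_SW_DIRTY;
	}
	return pte;
}

uint64_t toy_pte_mkwrite(uint64_t pte)
{
	if (pte & TOY_PAGE_SW_SHSTK) {
		/* Shadow stack "writable" is encoded as Write=0,Dirty=1. */
		pte &= ~(TOY_PAGE_RW | TOY_PAGE_SW_DIRTY);
		return pte | TOY_PAGE_DIRTY;
	}
	if (pte & TOY_PAGE_SW_DIRTY) {
		/* Write=1 makes the hardware Dirty bit safe to use again. */
		pte &= ~TOY_PAGE_SW_DIRTY;
		pte |= TOY_PAGE_DIRTY;
	}
	return pte | TOY_PAGE_RW;
}

/* Invariant of the model: only flagged PTEs may look like shadow stack to HW. */
void toy_check_invariant(uint64_t pte)
{
	assert((pte & TOY_PAGE_SW_SHSTK) || !toy_hw_shstk(pte));
}
```

The trade-off in this sketch is one extra software bit in exchange for core-mm never having to know which encoding "writable" maps to.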

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-23  9:10   ` David Hildenbrand
@ 2023-01-23 10:45     ` Florian Weimer
  2023-01-23 20:46       ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: Florian Weimer @ 2023-01-23 10:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H . J . Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Weijiang Yang, Kirill A . Shutemov,
	John Allen, kcc, eranian, rppt, jamorris, dethoma, akpm,
	Andrew.Cooper3, christina.schimpe

* David Hildenbrand:

> On 19.01.23 22:23, Rick Edgecombe wrote:
>> The x86 Control-flow Enforcement Technology (CET) feature includes a new
>> type of memory called shadow stack. This shadow stack memory has some
>> unusual properties, which requires some core mm changes to function
>> properly.
>> Shadow stack memory is writable only in very specific, controlled
>> ways.
>> However, since it is writable, the kernel treats it as such. As a result
>> there remain many ways for userspace to trigger the kernel to write to
>> shadow stack's via get_user_pages(, FOLL_WRITE) operations. To make this a
>> little less exposed, block writable GUPs for shadow stack VMAs.
>> Still allow FOLL_FORCE to write through shadow stack protections, as
>> it
>> does for read-only protections.
>
> So an app can simply modify the shadow stack itself by writing to
> /proc/self/mem ?
>
> Is that really intended? Looks like security hole to me at first
> sight, but maybe I am missing something important.

Isn't it possible to overwrite GOT pointers using the same vector?
So I think it's merely reflecting the status quo.

Thanks,
Florian



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-23 10:45     ` Florian Weimer
@ 2023-01-23 20:46       ` Edgecombe, Rick P
  2023-01-24 16:26         ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-23 20:46 UTC (permalink / raw)
  To: fweimer, david
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, Eranian, Stephane, kirill.shutemov,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp, oleg,
	hjl.tools, pavel, Lutomirski, Andy, linux-doc, arnd, tglx,
	Schimpe, Christina, mike.kravetz, x86, Yang, Weijiang, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov, akpm

On Mon, 2023-01-23 at 11:45 +0100, Florian Weimer wrote:
> * David Hildenbrand:
> 
> > On 19.01.23 22:23, Rick Edgecombe wrote:
> > > The x86 Control-flow Enforcement Technology (CET) feature
> > > includes a new
> > > type of memory called shadow stack. This shadow stack memory has
> > > some
> > > unusual properties, which requires some core mm changes to
> > > function
> > > properly.
> > > Shadow stack memory is writable only in very specific, controlled
> > > ways.
> > > However, since it is writable, the kernel treats it as such. As a
> > > result
> > > there remain many ways for userspace to trigger the kernel to
> > > write to
> > > shadow stack's via get_user_pages(, FOLL_WRITE) operations. To
> > > make this a
> > > little less exposed, block writable GUPs for shadow stack VMAs.
> > > Still allow FOLL_FORCE to write through shadow stack protections,
> > > as
> > > it
> > > does for read-only protections.
> > 
> > So an app can simply modify the shadow stack itself by writing to
> > /proc/self/mem ?
> > 
> > Is that really intended? Looks like security hole to me at first
> > sight, but maybe I am missing something important.
> 
> Isn't it possible to overwrite GOT pointers using the same vector?
> So I think it's merely reflecting the status quo.

There was some debate on this. /proc/self/mem can currently write
through read-only memory which protects executable code. So should
shadow stack get separate rules? Is ROP a worry when you can overwrite
executable code?

The consensus seemed to lean towards not making special rules for this
case, and there was some discussion that /proc/self/mem should maybe be
hardened generally.
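
For readers who have not seen the existing behaviour, here is a minimal userspace sketch of writing through a read-only private mapping via /proc/self/mem. It demonstrates only the status quo for ordinary read-only memory, not shadow stack memory, and the function name is made up for illustration. A kernel or sandbox that restricts this path may refuse the write, which the function reports rather than treating as an error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to overwrite the first byte of a read-only private page through
 * /proc/self/mem. Returns 1 if the write landed, 0 if the kernel refused
 * it, and -1 if the setup (mmap/mprotect/open) failed.
 */
int toy_poke_readonly_page(void)
{
	int result = -1;
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0)
		return -1;

	char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = 'A';

	/* From here on a direct store to p[0] would fault. */
	if (mprotect(p, (size_t)page, PROT_READ) == 0) {
		int fd = open("/proc/self/mem", O_RDWR);

		if (fd >= 0) {
			char byte = 'B';
			ssize_t n = pwrite(fd, &byte, 1, (off_t)(uintptr_t)p);

			if (n == 1) {
				/* The kernel-side write bypassed the read-only protection. */
				assert(*(volatile char *)p == 'B');
				result = 1;
			} else {
				result = 0;
			}
			close(fd);
		}
	}

	munmap(p, (size_t)page);
	return result;
}
```

A return value of 1 means the page contents changed even though the mapping was PROT_READ at the time, which is the behaviour the paragraph above refers to.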

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-23  9:50   ` David Hildenbrand
@ 2023-01-23 20:47     ` Edgecombe, Rick P
  2023-01-24 16:24       ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-23 20:47 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	pavel, oleg, hjl.tools, bp, Lutomirski, Andy, linux-doc, arnd,
	tglx, Schimpe, Christina, x86, mike.kravetz, Yang, Weijiang,
	jamorris, john.allen, rppt, andrew.cooper3, mingo, corbet,
	linux-kernel, linux-api, gorcunov, akpm
  Cc: Yu, Yu-cheng

On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
> On 19.01.23 22:22, Rick Edgecombe wrote:
> > The x86 Control-flow Enforcement Technology (CET) feature includes
> > a new
> > type of memory called shadow stack. This shadow stack memory has
> > some
> > unusual properties, which requires some core mm changes to function
> > properly.
> > 
> > Since shadow stack memory can be changed from userspace, is both
> > VM_SHADOW_STACK and VM_WRITE. But it should not be made
> > conventionally
> > writable (i.e. pte_mkwrite()). So some code that calls
> > pte_mkwrite() needs
> > to be adjusted.
> > 
> > One such case is when memory is made writable without an actual
> > write
> > fault. This happens in some mprotect operations, and also prot_numa
> > faults.
> > In both cases code checks whether it should be made
> > (conventionally)
> > writable by calling vma_wants_manual_pte_write_upgrade().
> > 
> > One way to fix this would be have code actually check if memory is
> > also
> > VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
> > since
> > most memory won't be shadow stack, just have simpler logic and skip
> > this
> > optimization by changing vma_wants_manual_pte_write_upgrade() to
> > not
> > return true for VM_SHADOW_STACK_MEMORY. This will simply handle all
> > cases of this type.
> > 
> > Cc: David Hildenbrand <david@redhat.com>
> > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> > Tested-by: John Allen <john.allen@amd.com>
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> 
> Instead of having these x86-shadow stack details all over the MM
> space, 
> was the option explored to handle this more in arch specific code?
> 
> IIUC, one way to get it working would be
> 
> 1) Have a SW "shadowstack" PTE flag.
> 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0".

I don't think that idea came up. So vma->vm_page_prot would have the SW
shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do the
Write=0,Dirty=1 part. It seems like it should work.

> 
> pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
> based 
> on the "shadowstack" PTE flag and hide all these details from core-
> mm.
> 
> When mapping a shadowstack page (new page, migration, swapin, ...), 
> which can be obtained by looking at the VMA flags, the first thing
> you'd 
> do is set the "shadowstack" PTE flag.

I guess the downside is that it uses an extra software bit. But the
other positive is that it's less error prone, so that someone writing
core-mm code won't introduce a change that makes shadow stack VMAs
Write=1 if they don't know to also check for VM_SHADOW_STACK.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW
  2023-01-23  9:28   ` David Hildenbrand
@ 2023-01-23 20:56     ` Edgecombe, Rick P
  2023-01-24 16:28       ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-23 20:56 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	pavel, oleg, hjl.tools, bp, Lutomirski, Andy, linux-doc, arnd,
	tglx, Schimpe, Christina, x86, mike.kravetz, Yang, Weijiang,
	jamorris, john.allen, rppt, andrew.cooper3, mingo, corbet,
	linux-kernel, linux-api, gorcunov, akpm
  Cc: Yu, Yu-cheng

Trying to answer both questions to this patch on this one.

On Mon, 2023-01-23 at 10:28 +0100, David Hildenbrand wrote:
> > +/*
> > + * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in
> > the case
> > + * of X86_FEATURE_USER_SHSTK, the software COW bit is used, since
> > the
> > + * Dirty=1,Write=0 will result in the memory being treated as
> > shadow stack
> > + * by the HW. So when creating COW memory, a software bit is used
> > + * _PAGE_BIT_COW. The following functions pte_mkcow() and
> > pte_clear_cow()
> > + * take a PTE marked conventionally COW (Dirty=1) and transition
> > it to the
> > + * shadow stack compatible version of COW (Cow=1).
> > + */
> 
> TBH, I find that all highly confusing.
> 
> Dirty=1,Write=0 does not indicate a COW page reliably. You could
> have 
> both, false negatives and false positives.
> 
> False negative: fork() on a clean anon page.
> 
> False positives: wrpotect() of a dirty anon page.
> 
> 
> I wonder if it really has to be that complicated: what you really
> want 
> to achieve is to disallow "Dirty=1,Write=0" if it's not a shadow
> stack 
> page, correct?

The other thing is to record somewhere that the PTE is/was Dirty=1 (for
non-shadow stack memory). It is a slightly different but related thing.
Losing that information would introduce differences in pte_dirty()
between when shadow stack is enabled and when it is not. GUP/COW doesn't
need this anymore, but there are lots of other places where it gets
checked.

Perhaps following your GUP changes, _PAGE_COW is just now the wrong
name for it. _PAGE_SAVED_DIRTY maybe?
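
To illustrate the pte_dirty() point, here is a small userspace model. It is a sketch with hypothetical toy_ names, not the patch itself; it assumes the dirty check consults both the hardware bit and the saved software bit (bit 58, which the patch uses for _PAGE_COW), and it replaces cpu_feature_enabled(X86_FEATURE_USER_SHSTK) with a plain bool parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_RW          (UINT64_C(1) << 1)		/* hardware Write */
#define TOY_PAGE_DIRTY       (UINT64_C(1) << 6)		/* hardware Dirty */
#define TOY_PAGE_SAVED_DIRTY (UINT64_C(1) << 58)	/* software bit */

/* shstk_enabled stands in for the CPU feature check. */
uint64_t toy_wrprotect(uint64_t pte, bool shstk_enabled)
{
	pte &= ~TOY_PAGE_RW;
	if (shstk_enabled && (pte & TOY_PAGE_DIRTY)) {
		/* Move Dirty to software so HW never sees Write=0,Dirty=1. */
		pte &= ~TOY_PAGE_DIRTY;
		pte |= TOY_PAGE_SAVED_DIRTY;
	}
	assert(!shstk_enabled || !(pte & TOY_PAGE_DIRTY));
	return pte;
}

/* Variant A: only the hardware bit counts, so the saved state is lost. */
bool toy_pte_dirty_hw_only(uint64_t pte)
{
	return (pte & TOY_PAGE_DIRTY) != 0;
}

/* Variant B: either bit counts, so callers see the same answer. */
bool toy_pte_dirty(uint64_t pte)
{
	return (pte & (TOY_PAGE_DIRTY | TOY_PAGE_SAVED_DIRTY)) != 0;
}
```

With variant A, a dirty page that gets write-protected reports dirty only when shadow stack is disabled; with variant B the answer is the same in both configurations, which is the property the saved bit exists to preserve.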

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-23 20:47     ` Edgecombe, Rick P
@ 2023-01-24 16:24       ` David Hildenbrand
  2023-01-24 18:14         ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-24 16:24 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, bp, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, x86, mike.kravetz,
	Yang, Weijiang, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov, akpm
  Cc: Yu, Yu-cheng

On 23.01.23 21:47, Edgecombe, Rick P wrote:
> On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
>> On 19.01.23 22:22, Rick Edgecombe wrote:
>>> The x86 Control-flow Enforcement Technology (CET) feature includes
>>> a new
>>> type of memory called shadow stack. This shadow stack memory has
>>> some
>>> unusual properties, which requires some core mm changes to function
>>> properly.
>>>
>>> Since shadow stack memory can be changed from userspace, is both
>>> VM_SHADOW_STACK and VM_WRITE. But it should not be made
>>> conventionally
>>> writable (i.e. pte_mkwrite()). So some code that calls
>>> pte_mkwrite() needs
>>> to be adjusted.
>>>
>>> One such case is when memory is made writable without an actual
>>> write
>>> fault. This happens in some mprotect operations, and also prot_numa
>>> faults.
>>> In both cases code checks whether it should be made
>>> (conventionally)
>>> writable by calling vma_wants_manual_pte_write_upgrade().
>>>
>>> One way to fix this would be have code actually check if memory is
>>> also
>>> VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
>>> since
>>> most memory won't be shadow stack, just have simpler logic and skip
>>> this
>>> optimization by changing vma_wants_manual_pte_write_upgrade() to
>>> not
>>> return true for VM_SHADOW_STACK_MEMORY. This will simply handle all
>>> cases of this type.
>>>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
>>> Tested-by: John Allen <john.allen@amd.com>
>>> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
>>> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> ---
>>
>> Instead of having these x86-shadow stack details all over the MM
>> space,
>> was the option explored to handle this more in arch specific code?
>>
>> IIUC, one way to get it working would be
>>
>> 1) Have a SW "shadowstack" PTE flag.
>> 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0".
> 
> I don't think that idea came up. So vma->vm_page_prot would have the SW
> shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
> Write=0,Dirty=1 part. It seems like it should work.
> 

Right, if we include it in vma->vm_page_prot, we'd immediately let 
mk_pte() just handle that.

Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma instead 
of the vma->vm_page_prot. Let's see if we can avoid that for now.

>>
>> pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
>> based
>> on the "shadowstack" PTE flag and hide all these details from core-
>> mm.
>>
>> When mapping a shadowstack page (new page, migration, swapin, ...),
>> which can be obtained by looking at the VMA flags, the first thing
>> you'd
>> do is set the "shadowstack" PTE flag.
> 
> I guess the downside is that it uses an extra software bit. But the
> other positive is that it's less error prone, so that someone writing
> core-mm code won't introduce a change that makes shadow stack VMAs
> Write=1 if they don't know to also check for VM_SHADOW_STACK.

Right. And I think this mimics what I would have expected HW to 
provide: a dedicated HW bit, not somehow mangling this into semantics of 
existing bits.

Roughly speaking: if we abstract it that way and get all of the "how to 
set it writable now?" out of core-MM, it not only is cleaner and less 
error prone, it might even allow other architectures that implement 
something comparable (e.g., using a dedicated HW bit) to actually reuse 
some of that work. Otherwise most of that "shstk" is really just x86 
specific ...

I guess the only cases we have to special case would be page pinning 
code where pte_write() would indicate that the PTE is writable (well, it 
is, just not by "ordinary CPU instruction" context directly): but you do 
that already, so ... :)
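
To make the proposal above a bit more concrete, here is a minimal userspace sketch of how the PTE helpers could key off a software "shadowstack" flag. Everything in it is an assumption for illustration: the bit positions, the PTE_SW_SHSTK and PTE_SW_DIRTY names, and the *_model() helpers are hypothetical and are not the x86 layout or code from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE model. Bit positions are illustrative, not the x86 layout. */
typedef uint64_t pte_t;

#define PTE_WRITE    (1ull << 0) /* hardware Write bit */
#define PTE_DIRTY    (1ull << 1) /* hardware Dirty bit */
#define PTE_SW_SHSTK (1ull << 2) /* hypothetical SW "shadowstack" flag */
#define PTE_SW_DIRTY (1ull << 3) /* hypothetical SW-dirty flag */

/* Stand-in for mk_pte(): the SW flag arrives through the page protection. */
static pte_t mk_pte_model(pte_t page_prot)
{
	/* A shadow stack protection must never carry Write=1. */
	assert(!((page_prot & PTE_SW_SHSTK) && (page_prot & PTE_WRITE)));
	return page_prot;
}

/* Write-protect: keep "dirty=1" in the SW bit while Write=0. */
static pte_t pte_wrprotect_model(pte_t pte)
{
	pte &= ~PTE_WRITE;
	if (pte & PTE_DIRTY)
		pte = (pte & ~PTE_DIRTY) | PTE_SW_DIRTY;
	return pte;
}

/* Make writable; the PTE itself says which encoding to use. */
static pte_t pte_mkwrite_model(pte_t pte)
{
	bool was_sw_dirty = (pte & PTE_SW_DIRTY) != 0;

	pte &= ~PTE_SW_DIRTY;
	if (pte & PTE_SW_SHSTK)
		return (pte & ~PTE_WRITE) | PTE_DIRTY; /* Write=0,Dirty=1 */
	if (was_sw_dirty)
		pte |= PTE_DIRTY; /* fold the saved dirty state back in */
	return pte | PTE_WRITE;
}

/* "Writable" means Write=1, or Write=0,Dirty=1 for a shadow stack PTE. */
static bool pte_write_model(pte_t pte)
{
	if (pte & PTE_SW_SHSTK)
		return !(pte & PTE_WRITE) && (pte & PTE_DIRTY);
	return (pte & PTE_WRITE) != 0;
}

static bool pte_dirty_model(pte_t pte)
{
	return (pte & (PTE_DIRTY | PTE_SW_DIRTY)) != 0;
}
```

In this model the callers never look at the VMA: once mk_pte_model() has seen the flag, every later helper can pick the right encoding from the PTE alone, which is the property being argued for.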

Sorry for stumbling over that this late, I only started looking into 
this when you CCed me on that one patch.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-23 20:46       ` Edgecombe, Rick P
@ 2023-01-24 16:26         ` David Hildenbrand
  2023-01-24 18:42           ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-24 16:26 UTC (permalink / raw)
  To: Edgecombe, Rick P, fweimer
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, Eranian, Stephane, kirill.shutemov,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp, oleg,
	hjl.tools, pavel, Lutomirski, Andy, linux-doc, arnd, tglx,
	Schimpe, Christina, mike.kravetz, x86, Yang, Weijiang, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov, akpm

On 23.01.23 21:46, Edgecombe, Rick P wrote:
> On Mon, 2023-01-23 at 11:45 +0100, Florian Weimer wrote:
>> * David Hildenbrand:
>>
>>> On 19.01.23 22:23, Rick Edgecombe wrote:
>>>> The x86 Control-flow Enforcement Technology (CET) feature
>>>> includes a new
>>>> type of memory called shadow stack. This shadow stack memory has
>>>> some
>>>> unusual properties, which require some core mm changes to
>>>> function
>>>> properly.
>>>> Shadow stack memory is writable only in very specific, controlled
>>>> ways.
>>>> However, since it is writable, the kernel treats it as such. As a
>>>> result
>>>> there remain many ways for userspace to trigger the kernel to
>>>> write to
>>>> shadow stacks via get_user_pages(, FOLL_WRITE) operations. To
>>>> make this a
>>>> little less exposed, block writable GUPs for shadow stack VMAs.
>>>> Still allow FOLL_FORCE to write through shadow stack protections,
>>>> as
>>>> it
>>>> does for read-only protections.
>>>
>>> So an app can simply modify the shadow stack itself by writing to
>>> /proc/self/mem ?
>>>
>>> Is that really intended? Looks like a security hole to me at first
>>> sight, but maybe I am missing something important.
>>
>> Isn't it possible to overwrite GOT pointers using the same vector?
>> So I think it's merely reflecting the status quo.
> 
> There was some debate on this. /proc/self/mem can currently write
> through read-only memory which protects executable code. So should
> shadow stack get separate rules? Is ROP a worry when you can overwrite
> executable code?
> 

The question is whether there is a reasonable debugging reason to keep 
it. I assume if a debugger would adjust the ordinary stack, it would 
have to adjust the shadow stack as well (oh my ...). So it sounds 
reasonable to have it in theory at least ... not sure when debuggers 
would support that, but maybe they already do.

> The consensus seemed to lean towards not making special rules for this
> case, and there was some discussion that /proc/self/mem should maybe be
> hardened generally.

I agree with that. It's a debugging mechanism that a process can abuse 
to do nasty stuff to its memory that it maybe shouldn't be able to do ...

-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW
  2023-01-23 20:56     ` Edgecombe, Rick P
@ 2023-01-24 16:28       ` David Hildenbrand
  0 siblings, 0 replies; 120+ messages in thread
From: David Hildenbrand @ 2023-01-24 16:28 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, bp, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, x86, mike.kravetz,
	Yang, Weijiang, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov, akpm
  Cc: Yu, Yu-cheng

On 23.01.23 21:56, Edgecombe, Rick P wrote:
> Trying to answer both questions to this patch on this one.
> 
> On Mon, 2023-01-23 at 10:28 +0100, David Hildenbrand wrote:
>>> +/*
>>> + * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in
>>> the case
>>> + * of X86_FEATURE_USER_SHSTK, the software COW bit is used, since
>>> the
>>> + * Dirty=1,Write=0 will result in the memory being treated as
>>> shadow stack
>>> + * by the HW. So when creating COW memory, a software bit is used
>>> + * _PAGE_BIT_COW. The following functions pte_mkcow() and
>>> pte_clear_cow()
>>> + * take a PTE marked conventionally COW (Dirty=1) and transition
>>> it to the
>>> + * shadow stack compatible version of COW (Cow=1).
>>> + */
>>
>> TBH, I find that all highly confusing.
>>
>> Dirty=1,Write=0 does not indicate a COW page reliably. You could
>> have
>> both, false negatives and false positives.
>>
>> False negative: fork() on a clean anon page.
>>
>> False positives: wrprotect() of a dirty anon page.
>>
>>
>> I wonder if it really has to be that complicated: what you really
>> want
>> to achieve is to disallow "Dirty=1,Write=0" if it's not a shadow
>> stack
>> page, correct?
> 
> The other thing is to save that the PTE is/was Dirty=1 somewhere (for
> non-shadow stack memory). A slightly different but related thing. But
> losing that information would introduce differences for
> pte_dirty() between when shadow stack was enabled or not. GUP/COW
> doesn't need this anymore but there are lots of other places it gets
> checked.
> 
> Perhaps following your GUP changes, _PAGE_COW is just now the wrong
> name for it. _PAGE_SAVED_DIRTY maybe?

It goes into the direction of my other proposal/idea, yes. Not sure if 
_PAGE_SAVED_DIRTY would currently mimic what's happening here ... 
_PAGE_COW is certainly wrong and misleading.
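
The false negative and false positive quoted above can be reproduced with a toy model. This is only a sketch under stated assumptions: struct pte_model, its cow_shared ground-truth field, and the helper names are hypothetical and exist only to show that "Dirty=1,Write=0" and "this page is COW-shared" are independent facts.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PTE state plus a ground-truth flag that real hardware never sees. */
struct pte_model {
	bool write;
	bool dirty;
	bool cow_shared; /* true once the page is shared copy-on-write */
};

/* The heuristic under discussion: Dirty=1,Write=0 "means" COW. */
static bool looks_like_cow(struct pte_model p)
{
	return p.dirty && !p.write;
}

/* fork(): the page becomes COW-shared and is write-protected. */
static struct pte_model fork_share(struct pte_model p)
{
	p.write = false;
	p.cow_shared = true;
	return p;
}

/* Plain write-protect, for example via mprotect(): no sharing involved. */
static struct pte_model wrprotect_model(struct pte_model p)
{
	p.write = false;
	return p;
}
```

A clean anonymous page run through fork_share() is COW-shared yet has Dirty=0, so the heuristic misses it; a dirty anonymous page run through wrprotect_model() has Dirty=1,Write=0 without being shared at all.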

-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-24 16:24       ` David Hildenbrand
@ 2023-01-24 18:14         ` Edgecombe, Rick P
  2023-01-25  9:27           ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-24 18:14 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, bp,
	jamorris, linux-doc, Schimpe, Christina, mike.kravetz, x86, akpm,
	tglx, arnd, john.allen, rppt, andrew.cooper3, mingo, corbet,
	linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Tue, 2023-01-24 at 17:24 +0100, David Hildenbrand wrote:
> On 23.01.23 21:47, Edgecombe, Rick P wrote:
> > On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
> > > On 19.01.23 22:22, Rick Edgecombe wrote:
> > > > The x86 Control-flow Enforcement Technology (CET) feature
> > > > includes
> > > > a new
> > > > type of memory called shadow stack. This shadow stack memory
> > > > has
> > > > some
> > > > unusual properties, which require some core mm changes to
> > > > function
> > > > properly.
> > > > 
> > > > Since shadow stack memory can be changed from userspace, it is
> > > > both
> > > > VM_SHADOW_STACK and VM_WRITE. But it should not be made
> > > > conventionally
> > > > writable (i.e. pte_mkwrite()). So some code that calls
> > > > pte_mkwrite() needs
> > > > to be adjusted.
> > > > 
> > > > One such case is when memory is made writable without an actual
> > > > write
> > > > fault. This happens in some mprotect operations, and also
> > > > prot_numa
> > > > faults.
> > > > In both cases code checks whether it should be made
> > > > (conventionally)
> > > > writable by calling vma_wants_manual_pte_write_upgrade().
> > > > 
> > > > One way to fix this would be to have code actually check if memory
> > > > is
> > > > also
> > > > VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
> > > > since
> > > > most memory won't be shadow stack, just have simpler logic and
> > > > skip
> > > > this
> > > > optimization by changing vma_wants_manual_pte_write_upgrade()
> > > > to
> > > > not
> > > > return true for VM_SHADOW_STACK_MEMORY. This will simply handle
> > > > all
> > > > cases of this type.
> > > > 
> > > > Cc: David Hildenbrand <david@redhat.com>
> > > > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> > > > Tested-by: John Allen <john.allen@amd.com>
> > > > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > > > Reviewed-by: Kirill A. Shutemov <
> > > > kirill.shutemov@linux.intel.com>
> > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > ---
> > > 
> > > Instead of having these x86-shadow stack details all over the MM
> > > space,
> > > was the option explored to handle this more in arch specific
> > > code?
> > > 
> > > IIUC, one way to get it working would be
> > > 
> > > 1) Have a SW "shadowstack" PTE flag.
> > > 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when
> > > "write=0".
> > 
> > I don't think that idea came up. So vma->vm_page_prot would have
> > the SW
> > shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
> > Write=0,Dirty=1 part. It seems like it should work.
> > 
> 
> Right, if we include it in vma->vm_page_prot, we'd immediately let 
> mk_pte() just handle that.
> 
> Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma
> instead 
> of the vma->vm_page_prot. Let's see if we can avoid that for now.
> 
> > > 
> > > pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
> > > based
> > > on the "shadowstack" PTE flag and hide all these details from
> > > core-
> > > mm.
> > > 
> > > When mapping a shadowstack page (new page, migration, swapin,
> > > ...),
> > > which can be obtained by looking at the VMA flags, the first
> > > thing
> > > you'd
> > > do is set the "shadowstack" PTE flag.
> > 
> > I guess the downside is that it uses an extra software bit. But the
> > other positive is that it's less error prone, so that someone
> > writing
> > core-mm code won't introduce a change that makes shadow stack VMAs
> > Write=1 if they don't know to also check for VM_SHADOW_STACK.
> 
> Right. And I think this mimics what I would have expected HW to 
> provide: a dedicated HW bit, not somehow mangling this into semantics
> of 
> existing bits.

Yea.

> 
> Roughly speaking: if we abstract it that way and get all of the "how
> to 
> set it writable now?" out of core-MM, it not only is cleaner and
> less 
> error prone, it might even allow other architectures that implement 
> something comparable (e.g., using a dedicated HW bit) to actually
> reuse 
> some of that work. Otherwise most of that "shstk" is really just x86 
> specific ...
> 
> I guess the only cases we have to special case would be page pinning 
> code where pte_write() would indicate that the PTE is writable (well,
> it 
> is, just not by "ordinary CPU instruction" context directly): but you
> do 
> that already, so ... :)
> 
> Sorry for stumbling over that this late, I only started looking into 
> this when you CCed me on that one patch.

Sorry for not calling more attention to it earlier. Appreciate your
comments.

Previous versions of this series had changed some of these
pte_mkwrite() calls to maybe_mkwrite(), which of course takes a vma.
This way an x86 implementation could use the VM_SHADOW_STACK vma flag
to decide between pte_mkwrite() and pte_mkwrite_shstk(). The feedback
was that in some of these code paths "maybe" isn't really an option, it
*needs* to make it writable. Even though the logic was the same, the
name of the function made it look wrong.

But another option could be to change pte_mkwrite() to take a vma. This
would save using another software bit on x86, but instead requires a
small change to each arch's pte_mkwrite().

x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), but
maybe it could additionally warn if the vma is not writable. It also
seems more aligned with your changes to stop taking hints from PTE bits
and just look at the VMA? (I'm thinking about the dropping of the dirty
check in GUP and dropping pte_saved_write())
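
As a rough illustration of the pte_mkwrite()-takes-a-vma option, the sketch below models the dispatch in userspace. It is an assumption-laden toy: the flag values, struct vma_model, the mkwrite_warnings counter standing in for a kernel warning, and the *_plain/*_model helper names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pte_t;

/* Illustrative bit values, not the real kernel definitions. */
#define PTE_WRITE (1ull << 0)
#define PTE_DIRTY (1ull << 1)

#define VM_WRITE        (1ul << 0)
#define VM_SHADOW_STACK (1ul << 1)

struct vma_model {
	unsigned long vm_flags;
};

/* Stand-in for a kernel warning on a non-writable VMA. */
static unsigned int mkwrite_warnings;

/* Conventional writable: Write=1. */
static pte_t pte_mkwrite_plain(pte_t pte)
{
	return pte | PTE_WRITE;
}

/* Shadow stack writable: Write=0,Dirty=1. */
static pte_t pte_mkwrite_shstk_model(pte_t pte)
{
	return (pte & ~PTE_WRITE) | PTE_DIRTY;
}

/* The VMA, not the PTE, decides which encoding "writable" gets. */
static pte_t pte_mkwrite_vma(pte_t pte, const struct vma_model *vma)
{
	assert(vma != NULL);
	if (!(vma->vm_flags & VM_WRITE))
		mkwrite_warnings++; /* model only: warn, then carry on */
	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk_model(pte);
	return pte_mkwrite_plain(pte);
}
```

Unlike maybe_mkwrite(), this model always produces a writable PTE; the VMA only selects the encoding, and the warning counter marks callers that asked for a writable PTE in a VMA without VM_WRITE.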



* Re: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-24 16:26         ` David Hildenbrand
@ 2023-01-24 18:42           ` Edgecombe, Rick P
  2023-01-24 23:08             ` Kees Cook
  2023-01-25 15:36             ` Schimpe, Christina
  0 siblings, 2 replies; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-24 18:42 UTC (permalink / raw)
  To: fweimer, david
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp,
	andrew.cooper3, oleg, Yang, Weijiang, Lutomirski, Andy,
	hjl.tools, jamorris, arnd, tglx, Schimpe, Christina,
	mike.kravetz, x86, linux-doc, pavel, rppt, john.allen, mingo,
	corbet, linux-kernel, linux-api, gorcunov, akpm

Ping Christina regarding GDB.

Ping Kees regarding /proc/self/mem.

On Tue, 2023-01-24 at 17:26 +0100, David Hildenbrand wrote:
> > > Isn't it possible to overwrite GOT pointers using the same
> > > vector?
> > > So I think it's merely reflecting the status quo.
> > 
> > There was some debate on this. /proc/self/mem can currently write
> > through read-only memory which protects executable code. So should
> > shadow stack get separate rules? Is ROP a worry when you can
> > overwrite
> > executable code?
> > 
> 
> The question is whether there is a reasonable debugging reason to keep
> it. I assume if a debugger would adjust the ordinary stack, it would
> have to adjust the shadow stack as well (oh my ...). So it sounds
> reasonable to have it in theory at least ... not sure when debuggers
> would support that, but maybe they already do.

GDB support for shadow stack is queued up for whenever the kernel
interface settles. I believe it just uses ptrace, and not this proc.
But yea ptrace poke will still need to use FOLL_FORCE and be able to
write through shadow stacks.

> 
> > The consensus seemed to lean towards not making special rules for
> > this
> > case, and there was some discussion that /proc/self/mem should
> > maybe be
> > hardened generally.
> 
> I agree with that. It's a debugging mechanism that a process can
> abuse 
> to do nasty stuff to its memory that it maybe shouldn't be able to do
> ...

Ok.


* Re: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-24 18:42           ` Edgecombe, Rick P
@ 2023-01-24 23:08             ` Kees Cook
  2023-01-24 23:41               ` Edgecombe, Rick P
  2023-01-25 15:36             ` Schimpe, Christina
  1 sibling, 1 reply; 120+ messages in thread
From: Kees Cook @ 2023-01-24 23:08 UTC (permalink / raw)
  To: Edgecombe, Rick P, fweimer, david
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp,
	andrew.cooper3, oleg, Yang, Weijiang, Lutomirski, Andy,
	hjl.tools, jamorris, arnd, tglx, Schimpe, Christina,
	mike.kravetz, x86, linux-doc, pavel, rppt, john.allen, mingo,
	corbet, linux-kernel, linux-api, gorcunov, akpm

On January 24, 2023 10:42:28 AM PST, "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
>Ping Christina regarding GDB.
>
>Ping Kees regarding /proc/self/mem.
>
>On Tue, 2023-01-24 at 17:26 +0100, David Hildenbrand wrote:
>> > > Isn't it possible to overwrite GOT pointers using the same
>> > > vector?
>> > > So I think it's merely reflecting the status quo.
>> > 
>> > There was some debate on this. /proc/self/mem can currently write
>> > through read-only memory which protects executable code. So should
>> > shadow stack get separate rules? Is ROP a worry when you can
>> > overwrite
>> > executable code?
>> > 
>> 
>> The question is whether there is a reasonable debugging reason to keep
>> it. I assume if a debugger would adjust the ordinary stack, it would
>> have to adjust the shadow stack as well (oh my ...). So it sounds
>> reasonable to have it in theory at least ... not sure when debuggers
>> would support that, but maybe they already do.
>
>GDB support for shadow stack is queued up for whenever the kernel
>interface settles. I believe it just uses ptrace, and not this proc.
>But yea ptrace poke will still need to use FOLL_FORCE and be able to
>write through shadow stacks.

I'd prefer to avoid adding more FOLL_FORCE if we can. If gdb can do stack manipulations through a ptrace interface then let's leave off FOLL_FORCE.

-Kees

>
>> 
>> > The consensus seemed to lean towards not making special rules for
>> > this
>> > case, and there was some discussion that /proc/self/mem should
>> > maybe be
>> > hardened generally.
>> 
>> I agree with that. It's a debugging mechanism that a process can
>> abuse 
>> to do nasty stuff to its memory that it maybe shouldn't be able to do
>> ...
>
>Ok.


-- 
Kees Cook



* Re: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-24 23:08             ` Kees Cook
@ 2023-01-24 23:41               ` Edgecombe, Rick P
  2023-01-25  9:29                 ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-24 23:41 UTC (permalink / raw)
  To: fweimer, david, kees
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp,
	andrew.cooper3, oleg, Yang, Weijiang, Lutomirski, Andy,
	hjl.tools, jamorris, arnd, tglx, Schimpe, Christina,
	mike.kravetz, x86, linux-doc, pavel, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov, akpm

On Tue, 2023-01-24 at 15:08 -0800, Kees Cook wrote:
> > GDB support for shadow stack is queued up for whenever the kernel
> > interface settles. I believe it just uses ptrace, and not this
> > proc.
> > But yea ptrace poke will still need to use FOLL_FORCE and be able
> > to
> > write through shadow stacks.
> 
> I'd prefer to avoid adding more FOLL_FORCE if we can. If gdb can do
> stack manipulations through a ptrace interface then let's leave off
> FOLL_FORCE.

Ptrace and /proc/self/mem both use FOLL_FORCE. I think ptrace will
always need it or something like it for debugging.

To jog your memory, this series doesn't change what uses FOLL_FORCE. It
just sets the shadow stack rules to be the same as read-only memory. So
even though shadow stack memory is sort of writable, it's a bit more
locked down and FOLL_FORCE is required to write to it with GUP.

If we just remove FOLL_FORCE from /proc/self/mem, something will
probably break right? How do we do this? Some sort of opt-in?
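
The rule described above (shadow stack memory gets the same GUP treatment as read-only memory) can be sketched as a small userspace model. This is a simplification under stated assumptions: the flag values and gup_write_check_model() are hypothetical, and the real permission check in the kernel has further conditions that are not modelled here.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative flag values, not the real kernel definitions. */
#define VM_WRITE        (1ul << 0)
#define VM_SHADOW_STACK (1ul << 1)

#define FOLL_WRITE (1u << 0)
#define FOLL_FORCE (1u << 1)

/*
 * Returns 0 if a GUP with these flags may proceed on a VMA with these
 * vm_flags, or -EFAULT if it is refused. Only the write path is modelled.
 */
static int gup_write_check_model(unsigned long vm_flags, unsigned int gup_flags)
{
	if (!(gup_flags & FOLL_WRITE))
		return 0; /* read-side checks are out of scope here */

	/* Read-only and shadow stack VMAs both need FOLL_FORCE to write. */
	if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
		if (!(gup_flags & FOLL_FORCE))
			return -EFAULT;
	}
	return 0;
}
```

In this model a shadow stack VMA carries VM_WRITE yet still refuses a plain FOLL_WRITE, while callers that pass FOLL_FORCE, as ptrace and /proc/self/mem are said to do in the thread, get through exactly as they do for read-only memory.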


* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-24 18:14         ` Edgecombe, Rick P
@ 2023-01-25  9:27           ` David Hildenbrand
  2023-01-25 18:43             ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-25  9:27 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, bp, jamorris, linux-doc, Schimpe, Christina, mike.kravetz,
	x86, akpm, tglx, arnd, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On 24.01.23 19:14, Edgecombe, Rick P wrote:
> On Tue, 2023-01-24 at 17:24 +0100, David Hildenbrand wrote:
>> On 23.01.23 21:47, Edgecombe, Rick P wrote:
>>> On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
>>>> On 19.01.23 22:22, Rick Edgecombe wrote:
>>>>> The x86 Control-flow Enforcement Technology (CET) feature
>>>>> includes
>>>>> a new
>>>>> type of memory called shadow stack. This shadow stack memory
>>>>> has
>>>>> some
>>>>> unusual properties, which require some core mm changes to
>>>>> function
>>>>> properly.
>>>>>
>>>>> Since shadow stack memory can be changed from userspace, it is
>>>>> both
>>>>> VM_SHADOW_STACK and VM_WRITE. But it should not be made
>>>>> conventionally
>>>>> writable (i.e. pte_mkwrite()). So some code that calls
>>>>> pte_mkwrite() needs
>>>>> to be adjusted.
>>>>>
>>>>> One such case is when memory is made writable without an actual
>>>>> write
>>>>> fault. This happens in some mprotect operations, and also
>>>>> prot_numa
>>>>> faults.
>>>>> In both cases code checks whether it should be made
>>>>> (conventionally)
>>>>> writable by calling vma_wants_manual_pte_write_upgrade().
>>>>>
>>>>> One way to fix this would be to have code actually check if memory
>>>>> is
>>>>> also
>>>>> VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
>>>>> since
>>>>> most memory won't be shadow stack, just have simpler logic and
>>>>> skip
>>>>> this
>>>>> optimization by changing vma_wants_manual_pte_write_upgrade()
>>>>> to
>>>>> not
>>>>> return true for VM_SHADOW_STACK_MEMORY. This will simply handle
>>>>> all
>>>>> cases of this type.
>>>>>
>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
>>>>> Tested-by: John Allen <john.allen@amd.com>
>>>>> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
>>>>> Reviewed-by: Kirill A. Shutemov <
>>>>> kirill.shutemov@linux.intel.com>
>>>>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>>>> ---
>>>>
>>>> Instead of having these x86-shadow stack details all over the MM
>>>> space,
>>>> was the option explored to handle this more in arch specific
>>>> code?
>>>>
>>>> IIUC, one way to get it working would be
>>>>
>>>> 1) Have a SW "shadowstack" PTE flag.
>>>> 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when
>>>> "write=0".
>>>
>>> I don't think that idea came up. So vma->vm_page_prot would have
>>> the SW
>>> shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
>>> Write=0,Dirty=1 part. It seems like it should work.
>>>
>>
>> Right, if we include it in vma->vm_page_prot, we'd immediately let
>> mk_pte() just handle that.
>>
>> Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma
>> instead
>> of the vma->vm_page_prot. Let's see if we can avoid that for now.
>>
>>>>
>>>> pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
>>>> based
>>>> on the "shadowstack" PTE flag and hide all these details from
>>>> core-
>>>> mm.
>>>>
>>>> When mapping a shadowstack page (new page, migration, swapin,
>>>> ...),
>>>> which can be obtained by looking at the VMA flags, the first
>>>> thing
>>>> you'd
>>>> do is set the "shadowstack" PTE flag.
>>>
>>> I guess the downside is that it uses an extra software bit. But the
>>> other positive is that it's less error prone, so that someone
>>> writing
>>> core-mm code won't introduce a change that makes shadow stack VMAs
>>> Write=1 if they don't know to also check for VM_SHADOW_STACK.
>>
>> Right. And I think this mimics what I would have expected HW to
>> provide: a dedicated HW bit, not somehow mangling this into semantics
>> of
>> existing bits.
> 
> Yea.
> 
>>
>> Roughly speaking: if we abstract it that way and get all of the "how
>> to
>> set it writable now?" out of core-MM, it not only is cleaner and
>> less
>> error prone, it might even allow other architectures that implement
>> something comparable (e.g., using a dedicated HW bit) to actually
>> reuse
>> some of that work. Otherwise most of that "shstk" is really just x86
>> specific ...
>>
>> I guess the only cases we have to special case would be page pinning
>> code where pte_write() would indicate that the PTE is writable (well,
>> it
>> is, just not by "ordinary CPU instruction" context directly): but you
>> do
>> that already, so ... :)
>>
>> Sorry for stumbling over that this late, I only started looking into
>> this when you CCed me on that one patch.
> 
> Sorry for not calling more attention to it earlier. Appreciate your
> comments.
> 
> Previous versions of this series had changed some of these
> pte_mkwrite() calls to maybe_mkwrite(), which of course takes a vma.
> This way an x86 implementation could use the VM_SHADOW_STACK vma flag
> to decide between pte_mkwrite() and pte_mkwrite_shstk(). The feedback
> was that in some of these code paths "maybe" isn't really an option, it
> *needs* to make it writable. Even though the logic was the same, the
> name of the function made it look wrong.
> 
> But another option could be to change pte_mkwrite() to take a vma. This
> would save using another software bit on x86, but instead requires a
> small change to each arch's pte_mkwrite().

I played with that idea briefly as well, but discarded it. I was not 
able to convince myself that it wouldn't be required to pass in the VMA 
as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... 
which would end up fairly ugly (or even impossible in things like GUP-fast).

For example, I wonder how we'd be handling stuff like do_numa_page() 
cleanly and correctly, where we use pte_modify() + pte_mkwrite(), and either 
call might set the PTE writable and maintain dirty bit ...

Having said that, maybe it could work with only a single saved-dirty bit 
and passing in the VMA for pte_mkwrite() only.

pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty bit 
to the saved-dirty bit instead, resulting in 
"writable=0,dirty=0,saved-dirty=1".

pte_dirty() would return dirty==1||saved-dirty==1.

pte_mkdirty() would set either dirty=1 or saved-dirty=1, depending 
on the writable bit.

pte_mkclean() would clean both bits.

pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"

pte_mkwrite() would act according to the VMA, and in addition, merge the 
saved-dirty bit into the dirty bit.

pte_modify() and mk_pte() .... would require more thought ...


Further, ptep_modify_prot_commit() might have to be adjusted to properly 
flush in all relevant cases IIRC.
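
The helper semantics listed above can be written down as a small userspace model to check that they compose. This is only a sketch: the bit positions, the PTE_SAVED_DIRTY name, the *_sd() helpers, and the bool parameter standing in for the VMA are assumptions for illustration, and pte_modify(), mk_pte() and any TLB flushing are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;

/* Illustrative bit positions, not the x86 layout. */
#define PTE_WRITE       (1ull << 0) /* hardware Write bit */
#define PTE_DIRTY       (1ull << 1) /* hardware Dirty bit */
#define PTE_SAVED_DIRTY (1ull << 2) /* hypothetical software saved-dirty bit */

/* Never leave Write=0,Dirty=1 behind: park the dirty state in the SW bit. */
static pte_t pte_wrprotect_sd(pte_t pte)
{
	pte &= ~PTE_WRITE;
	if (pte & PTE_DIRTY)
		pte = (pte & ~PTE_DIRTY) | PTE_SAVED_DIRTY;
	return pte;
}

static bool pte_dirty_sd(pte_t pte)
{
	return (pte & (PTE_DIRTY | PTE_SAVED_DIRTY)) != 0;
}

/* Which dirty bit gets set depends on the Write bit. */
static pte_t pte_mkdirty_sd(pte_t pte)
{
	return pte | ((pte & PTE_WRITE) ? PTE_DIRTY : PTE_SAVED_DIRTY);
}

static pte_t pte_mkclean_sd(pte_t pte)
{
	return pte & ~(PTE_DIRTY | PTE_SAVED_DIRTY);
}

/* Write=1, or the shadow stack encoding Write=0,Dirty=1. */
static bool pte_write_sd(pte_t pte)
{
	return (pte & PTE_WRITE) || (!(pte & PTE_WRITE) && (pte & PTE_DIRTY));
}

/* Acts according to the VMA (modelled as a bool) and merges saved-dirty. */
static pte_t pte_mkwrite_sd(pte_t pte, bool vma_is_shstk)
{
	bool was_dirty = pte_dirty_sd(pte);

	pte &= ~PTE_SAVED_DIRTY;
	if (vma_is_shstk)
		return (pte & ~PTE_WRITE) | PTE_DIRTY;
	pte |= PTE_WRITE;
	if (was_dirty)
		pte |= PTE_DIRTY;
	return pte;
}
```

With these definitions, write-protecting a dirty writable PTE and making it writable again round-trips to the same bits, and only pte_mkwrite_sd() needs to know whether the VMA is a shadow stack.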

> 
> x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), but
> maybe it could additionally warn if the vma is not writable. It also
> seems more aligned with your changes to stop taking hints from PTE bits
> and just look at the VMA? (I'm thinking about the dropping of the dirty
> check in GUP and dropping pte_saved_write())

The soft-shstk bit wouldn't be a hint, it would be logically changing 
the "type" of the PTE such that any other PTE functions can do the right 
thing without having to consume the VMA.


-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-24 23:41               ` Edgecombe, Rick P
@ 2023-01-25  9:29                 ` David Hildenbrand
  2023-01-25 15:23                   ` Kees Cook
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-25  9:29 UTC (permalink / raw)
  To: Edgecombe, Rick P, fweimer, kees
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp,
	andrew.cooper3, oleg, Yang, Weijiang, Lutomirski, Andy,
	hjl.tools, jamorris, arnd, tglx, Schimpe, Christina,
	mike.kravetz, x86, linux-doc, pavel, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov, akpm

On 25.01.23 00:41, Edgecombe, Rick P wrote:
> On Tue, 2023-01-24 at 15:08 -0800, Kees Cook wrote:
>>> GDB support for shadow stack is queued up for whenever the kernel
>>> interface settles. I believe it just uses ptrace, and not this
>>> proc.
>>> But yea ptrace poke will still need to use FOLL_FORCE and be able
>>> to
>>> write through shadow stacks.
>>
>> I'd prefer to avoid adding more FOLL_FORCE if we can. If gdb can do
>> stack manipulations through a ptrace interface then let's leave off
>> FOLL_FORCE.
> 
> Ptrace and /proc/self/mem both use FOLL_FORCE. I think ptrace will
> always need it or something like it for debugging.
> 
> To jog your memory, this series doesn't change what uses FOLL_FORCE. It
> just sets the shadow stack rules to be the same as read-only memory. So
> even though shadow stack memory is sort of writable, it's a bit more
> locked down and FOLL_FORCE is required to write to it with GUP.
> 
> If we just remove FOLL_FORCE from /proc/self/mem, something will
> probably break right? How do we do this? Some sort of opt-in?

I don't think removing that is an option. It's another debug interface 
that has been allowing such access for ever ...

Blocking /proc/self/mem access completely for selected processes might 
be the better alternative.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-25  9:29                 ` David Hildenbrand
@ 2023-01-25 15:23                   ` Kees Cook
  0 siblings, 0 replies; 120+ messages in thread
From: Kees Cook @ 2023-01-25 15:23 UTC (permalink / raw)
  To: David Hildenbrand, Edgecombe, Rick P, fweimer
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp,
	andrew.cooper3, oleg, Yang, Weijiang, Lutomirski, Andy,
	hjl.tools, jamorris, arnd, tglx, Schimpe, Christina,
	mike.kravetz, x86, linux-doc, pavel, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov, akpm

On January 25, 2023 1:29:20 AM PST, David Hildenbrand <david@redhat.com> wrote:
>On 25.01.23 00:41, Edgecombe, Rick P wrote:
>> On Tue, 2023-01-24 at 15:08 -0800, Kees Cook wrote:
>>>> GDB support for shadow stack is queued up for whenever the kernel
>>>> interface settles. I believe it just uses ptrace, and not this
>>>> proc.
>>>> But yea ptrace poke will still need to use FOLL_FORCE and be able
>>>> to
>>>> write through shadow stacks.
>>> 
>>> I'd prefer to avoid adding more FOLL_FORCE if we can. If gdb can do
>>> stack manipulations through a ptrace interface then let's leave off
>>> FOLL_FORCE.
>> 
>> Ptrace and /proc/self/mem both use FOLL_FORCE. I think ptrace will
>> always need it or something like it for debugging.
>> 
>> To jog your memory, this series doesn't change what uses FOLL_FORCE. It
>> just sets the shadow stack rules to be the same as read-only memory. So
>> even though shadow stack memory is sort of writable, it's a bit more
>> locked down and FOLL_FORCE is required to write to it with GUP.
>> 
>> If we just remove FOLL_FORCE from /proc/self/mem, something will
>> probably break right? How do we do this? Some sort of opt-in?
>
>I don't think removing that is an option. It's another debug interface that has been allowing such access for ever ...
>
>Blocking /proc/self/mem access completely for selected processes might be the better alternative.
>

Yeah, this would be nice. Kind of like being undumpable or no_new_privs.



-- 
Kees Cook



* RE: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-24 18:42           ` Edgecombe, Rick P
  2023-01-24 23:08             ` Kees Cook
@ 2023-01-25 15:36             ` Schimpe, Christina
  2023-01-25 16:43               ` Schimpe, Christina
  1 sibling, 1 reply; 120+ messages in thread
From: Schimpe, Christina @ 2023-01-25 15:36 UTC (permalink / raw)
  To: Edgecombe, Rick P, fweimer, david
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp,
	andrew.cooper3, oleg, Yang, Weijiang, Lutomirski, Andy,
	hjl.tools, jamorris, arnd, tglx, mike.kravetz, x86, linux-doc,
	pavel, rppt, john.allen, mingo, corbet, linux-kernel, linux-api,
	gorcunov, akpm

> On Tue, 2023-01-24 at 17:26 +0100, David Hildenbrand wrote:
> > > > Isn't it possible to overwrite GOT pointers using the same vector?
> > > > So I think it's merely reflecting the status quo.
> > >
> > > There was some debate on this. /proc/self/mem can currently write
> > > through read-only memory which protects executable code. So should
> > > shadow stack get separate rules? Is ROP a worry when you can
> > > overwrite executable code?
> > >
> >
> > The question is, if there is reasonable debugging reason to keep it.
> > I
> > assume if a debugger would adjust the ordinary stack, it would have to
> > adjust the shadow stack as well (oh my ...). So it sounds reasonable
> > to have it in theory at least ... not sure when debugger would support
> > that, but maybe they already do.
> 
> GDB support for shadow stack is queued up for whenever the kernel
> interface settles. I believe it just uses ptrace, and not this proc.
> But yea ptrace poke will still need to use FOLL_FORCE and be able to write
> through shadow stacks.

Our patches for GDB use /proc/PID/mem to read/write shadow stack memory.  
However, I think it should be possible to change this to ptrace but GDB normally uses
/proc/PID/mem to read/write target memory.

Regards,
Christina
Intel Deutschland GmbH
Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de <http://www.intel.de>
Managing Directors: Christin Eisenschmid, Sharon Heck, Tiffany Doon Silva  
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928


* RE: [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory
  2023-01-25 15:36             ` Schimpe, Christina
@ 2023-01-25 16:43               ` Schimpe, Christina
  0 siblings, 0 replies; 120+ messages in thread
From: Schimpe, Christina @ 2023-01-25 16:43 UTC (permalink / raw)
  To: Edgecombe, Rick P, fweimer, david
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp,
	andrew.cooper3, oleg, Yang, Weijiang, Lutomirski, Andy,
	hjl.tools, jamorris, arnd, tglx, mike.kravetz, x86, linux-doc,
	pavel, rppt, john.allen, mingo, corbet, linux-kernel, linux-api,
	gorcunov, akpm

> > On Tue, 2023-01-24 at 17:26 +0100, David Hildenbrand wrote:
> > > > > Isn't it possible to overwrite GOT pointers using the same vector?
> > > > > So I think it's merely reflecting the status quo.
> > > >
> > > > There was some debate on this. /proc/self/mem can currently write
> > > > through read-only memory which protects executable code. So should
> > > > shadow stack get separate rules? Is ROP a worry when you can
> > > > overwrite executable code?
> > > >
> > >
> > > The question is, if there is reasonable debugging reason to keep it.
> > > I
> > > assume if a debugger would adjust the ordinary stack, it would have
> > > to adjust the shadow stack as well (oh my ...). So it sounds
> > > reasonable to have it in theory at least ... not sure when debugger
> > > would support that, but maybe they already do.
> >
> > GDB support for shadow stack is queued up for whenever the kernel
> > interface settles. I believe it just uses ptrace, and not this proc.
> > But yea ptrace poke will still need to use FOLL_FORCE and be able to
> > write through shadow stacks.
> 
> Our patches for GDB use /proc/PID/mem to read/write shadow stack
> memory.
> However, I think it should be possible to change this to ptrace but GDB
> normally uses /proc/PID/mem to read/write target memory.
> 
> Regards,
> Christina

I just noticed that GDBSERVER actually uses ptrace, so our patches currently use
both ptrace and /proc/PID/mem to read/write shadow stack memory.

Regards,
Christina


* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-25  9:27           ` David Hildenbrand
@ 2023-01-25 18:43             ` Edgecombe, Rick P
  2023-01-26  0:59               ` Edgecombe, Rick P
  2023-01-26  8:57               ` David Hildenbrand
  0 siblings, 2 replies; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-25 18:43 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Yang, Weijiang, Lutomirski, Andy,
	arnd, jamorris, linux-doc, bp, Schimpe, Christina, x86,
	mike.kravetz, tglx, andrew.cooper3, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
> > > Roughly speaking: if we abstract it that way and get all of the
> > > "how
> > > to
> > > set it writable now?" out of core-MM, it not only is cleaner and
> > > less
> > > error prone, it might even allow other architectures that
> > > implement
> > > something comparable (e.g., using a dedicated HW bit) to actually
> > > reuse
> > > some of that work. Otherwise most of that "shstk" is really just
> > > x86
> > > specific ...
> > > 
> > > I guess the only cases we have to special case would be page
> > > pinning
> > > code where pte_write() would indicate that the PTE is writable
> > > (well,
> > > it
> > > is, just not by "ordinary CPU instruction" context directly): but
> > > you
> > > do
> > > that already, so ... :)
> > > 
> > > Sorry for stumbling over that this late, I only started looking
> > > into
> > > this when you CCed me on that one patch.
> > 
> > Sorry for not calling more attention to it earlier. Appreciate your
> > comments.
> > 
> > Previous versions of this series had changed some of these
> > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
> > vma.
> > This way an x86 implementation could use the VM_SHADOW_STACK vma
> > flag
> > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
> > feedback
> > was that in some of these code paths "maybe" isn't really an
> > option, it
> > *needs* to make it writable. Even though the logic was the same,
> > the
> > name of the function made it look wrong.
> > 
> > But another option could be to change pte_mkwrite() to take a vma.
> > This
> > would save using another software bit on x86, but instead requires
> > a
> > small change to each arch's pte_mkwrite().
> 
> I played with that idea shortly as well, but discarded it. I was not 
> able to convince myself that it wouldn't be required to pass in the
> VMA 
> as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... 
> which would end up fairly ugly (or even impossible in things like
> GUP-fast).
> 
> For example, I wonder how we'd be handling stuff like do_numa_page() 
> cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
> either 
> call might set the PTE writable and maintain dirty bit ...

pte_modify() is handled like this currently:

https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/

There have been a couple of iterations on that. The current solution is to
do the Dirty->SavedDirty fixup if needed after the new prots are added.

Of course pte_modify() can't know whether you are attempting to
create a shadow stack PTE with the prot you are passing in. But the
callers today explicitly call pte_mkwrite() after filling in the other
bits with pte_modify(). Today this patch causes the pte_mkwrite() to be
skipped and another fault may be required in the mprotect() and numa
cases, but if we change pte_mkwrite() to take a VMA we can just make it
shadow stack to start.
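
To make the Dirty->SavedDirty fixup concrete, here is a toy userspace
model of it. It is a sketch under assumptions, not the code from the
linked patch: the bit layout, TOY_PROT_MASK, and toy_pte_modify() are
invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t toy_pte_t;

/* Hypothetical bit layout, for illustration only. */
#define TOY_PAGE_RW          (1u << 0)
#define TOY_PAGE_DIRTY       (1u << 1)
#define TOY_PAGE_SAVED_DIRTY (1u << 2)
#define TOY_PROT_MASK        TOY_PAGE_RW	/* bits taken from newprot */

static toy_pte_t toy_pte_modify(toy_pte_t pte, toy_pte_t newprot)
{
	/* Keep the non-protection bits, then add the new protection bits. */
	toy_pte_t val = (pte & ~TOY_PROT_MASK) | (newprot & TOY_PROT_MASK);

	/*
	 * Fixup after the new prots are applied: the result must not be
	 * left as Write=0,Dirty=1, so move Dirty to SavedDirty.
	 */
	if (!(val & TOY_PAGE_RW) && (val & TOY_PAGE_DIRTY)) {
		val &= ~TOY_PAGE_DIRTY;
		val |= TOY_PAGE_SAVED_DIRTY;
	}

	return val;
}
```

Note that in this model a PTE that was already Write=0,Dirty=1 also
comes out as SavedDirty, since toy_pte_modify() has no way to know a
shadow stack PTE was intended. The caller has to make it shadow stack
again afterwards.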

It might be worth mentioning, there was a suggestion in the past to try
to have the shadow stack bits come out of vm_get_page_prot(), but MM
code would then try to map the zero page as (shadow stack) writable
when there was a normal (non-shadow stack) read access. So I had to
abandon that approach and rely on explicit calls to pte_mkwrite/shstk()
to make it shadow stack.

> 
> Having that said, maybe it could work with only a single saved-dirty
> bit 
> and passing in the VMA for pte_mkwrite() only.
> 
> pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty
> bit 
> to the soft-dirty bit instead, resulting in 
> "writable=0,dirty=0,saved-dirty=1",
> 
> pte_dirty() would return dirty==1||saved-dirty==1.
> 
> pte_mkdirty() would set either dirty=1 or saved-dirty=1,
> depending 
> on the writable bit.
> 
> pte_mkclean() would clean both bits.
> 
> pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"
> 
> pte_mkwrite() would act according to the VMA, and in addition, merge
> the 
> saved-dirty bit into the dirty bit.
> 
> pte_modify() and mk_pte() .... would require more thought ...
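
The helper semantics listed above can be sketched as a small userspace
model. This is only one possible reading of the proposal: the three-bit
layout and every toy_* name are assumptions, and the VMA is reduced to
a single "is this a shadow stack VMA" boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t toy_pte_t;

/* Hypothetical bit layout, for illustration only. */
#define TOY_RW          (1u << 0)
#define TOY_DIRTY       (1u << 1)
#define TOY_SAVED_DIRTY (1u << 2)

/* Clear Write; if that leaves Write=0,Dirty=1, move Dirty to SavedDirty. */
static toy_pte_t toy_pte_wrprotect(toy_pte_t pte)
{
	pte &= ~TOY_RW;
	if (pte & TOY_DIRTY) {
		pte &= ~TOY_DIRTY;
		pte |= TOY_SAVED_DIRTY;
	}
	return pte;
}

/* Dirty if either the hardware Dirty bit or SavedDirty is set. */
static bool toy_pte_dirty(toy_pte_t pte)
{
	return (pte & (TOY_DIRTY | TOY_SAVED_DIRTY)) != 0;
}

/* Set Dirty on writable PTEs, SavedDirty otherwise. */
static toy_pte_t toy_pte_mkdirty(toy_pte_t pte)
{
	return pte | ((pte & TOY_RW) ? TOY_DIRTY : TOY_SAVED_DIRTY);
}

/* Clear both dirty bits. */
static toy_pte_t toy_pte_mkclean(toy_pte_t pte)
{
	return pte & ~(TOY_DIRTY | TOY_SAVED_DIRTY);
}

/* Writable: Write=1, or the shadow stack encoding Write=0,Dirty=1. */
static bool toy_pte_write(toy_pte_t pte)
{
	return (pte & (TOY_RW | TOY_DIRTY)) != 0;
}

/* Act according to the VMA type and merge SavedDirty into Dirty. */
static toy_pte_t toy_pte_mkwrite(toy_pte_t pte, bool shstk_vma)
{
	if (shstk_vma) {
		pte &= ~(TOY_RW | TOY_SAVED_DIRTY);
		return pte | TOY_DIRTY;	/* shadow stack: Write=0,Dirty=1 */
	}
	if (pte & TOY_SAVED_DIRTY) {
		pte &= ~TOY_SAVED_DIRTY;
		pte |= TOY_DIRTY;
	}
	return pte | TOY_RW;
}
```

pte_modify() and mk_pte() are deliberately left out of the model, as
the list itself leaves them open.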

Not sure I'm following what the mk_pte() problem would be. You mean if
Write=0,Dirty=1 is manually added to the prot?

Shouldn't people generally use the pte_mkwrite() helpers unless they
are drawing from a prot that was already created with the helpers or
vm_get_page_prot()? I think they can't manually create prot's from bits
in core mm code, right? And x86 arch code already has to be aware of
shadow stack. It's a bit of an assumption I guess, but I think maybe
not too crazy of one?

> 
> 
> Further, ptep_modify_prot_commit() might have to be adjusted to
> properly 
> flush in all relevant cases IIRC.

Sorry, I'm not following. Can you elaborate? There is an adjustment
made in pte_flags_need_flush().

> 
> > 
> > x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
> > but
> > maybe it could additionally warn if the vma is not writable. It
> > also
> > seems more aligned with your changes to stop taking hints from PTE
> > bits
> > and just look at the VMA? (I'm thinking about the dropping of the
> > dirty
> > check in GUP and dropping pte_saved_write())
> 
> The soft-shstk bit wouldn't be a hint, it would be logically
> changing 
> the "type" of the PTE such that any other PTE functions can do the
> right 
> thing without having to consume the VMA.

Yea, true.

Thanks for your comments and ideas here, I'll give the:
pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
...solution a try.


* Re: [PATCH v5 00/39] Shadow stacks for userspace
  2023-01-20 19:19     ` Kees Cook
@ 2023-01-25 19:46       ` Edgecombe, Rick P
  0 siblings, 0 replies; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-25 19:46 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	Eranian, Stephane, kirill.shutemov, dave.hansen, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch, bp, oleg,
	hjl.tools, akpm, Lutomirski, Andy, jamorris, arnd, tglx, Schimpe,
	Christina, mike.kravetz, x86, linux-doc, pavel, andrew.cooper3,
	john.allen, rppt, Yang, Weijiang, mingo, corbet, linux-kernel,
	linux-api, gorcunov

On Fri, 2023-01-20 at 11:19 -0800, Kees Cook wrote:
> On Fri, Jan 20, 2023 at 05:27:30PM +0000, Edgecombe, Rick P wrote:
> > On Thu, 2023-01-19 at 14:26 -0800, Andrew Morton wrote:
> > > On Thu, 19 Jan 2023 13:22:38 -0800 Rick Edgecombe <
> > > rick.p.edgecombe@intel.com> wrote:
> > > 
> > > > SHSTK
> > > 
> > > Sounds like me trying to swear in Russian while drunk.
> > > 
> > > Is there any chance of s/shstk/shadow_stack/g?
> > 
> > I'm fine with the name change. I think shstk got debated and picked
> > early in the history of the series before I got involved. "shstk"
> > is
> > nice and short, but it's not completely clear what it is unless you
> > already know about shadow stack. So there is a tradeoff of clarity
> > and
> > line length/wrapping. Does anyone else have any strong opinions?
> 
> I prefer SHSTK because it specifically means x86's hardware shadow
> stack from CET. Lots of things can (and have) implemented things
> called
> "shadow stack".

This makes sense too, especially if we can hide it more from the core-mm
code per David Hildenbrand's suggestion. I guess I'll leave it for now
unless anyone else has a stronger opinion.


* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-25 18:43             ` Edgecombe, Rick P
@ 2023-01-26  0:59               ` Edgecombe, Rick P
  2023-01-26  8:46                 ` David Hildenbrand
  2023-01-26  8:57               ` David Hildenbrand
  1 sibling, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-26  0:59 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Yang, Weijiang, Lutomirski, Andy,
	arnd, jamorris, linux-doc, bp, Schimpe, Christina, x86,
	mike.kravetz, tglx, andrew.cooper3, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
> Thanks for your comments and ideas here, I'll give the:
> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> ...solution a try.

Well, it turns out there are some pte_mkwrite() callers in other arches
that operate on kernel memory and don't have a VMA. So it needed a new 
function that can be overridden in arch code. I ended up with x86
versions of these, like this:
pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_WRITE))
		return pte;

	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk(pte);

	return pte_mkwrite(pte);
}

pte_t pte_mkwrite_vma(pte_t pte, struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk(pte);

	return pte_mkwrite(pte);
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_WRITE))
		return pmd;

	if (vma->vm_flags & VM_SHADOW_STACK)
		return pmd_mkwrite_shstk(pmd);

	return pmd_mkwrite(pmd);
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

All the other pte_mkdirty()s, etc remain the same.

Previously, there was a suggestion to not override the
maybe_mkwrite()'s and put the logic in core MM by having a generic
version of pte_mkwrite_shstk() that does nothing. But given what we are
trying to do with pte_mkwrite_vma() it seemed better to hide all the
shadow stack PTE changes in arch code again.

After the changes, the only shadow stack specific bits in core mm are
the bit in GUP to require FOLL_FORCE, the memory accounting, and these
warnings:

https://lore.kernel.org/lkml/20230119212317.8324-26-rick.p.edgecombe@intel.com/



* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-26  0:59               ` Edgecombe, Rick P
@ 2023-01-26  8:46                 ` David Hildenbrand
  2023-01-26 20:19                   ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-26  8:46 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, akpm, Yang, Weijiang,
	Lutomirski, Andy, arnd, jamorris, linux-doc, bp, Schimpe,
	Christina, x86, mike.kravetz, tglx, andrew.cooper3, john.allen,
	rppt, mingo, corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On 26.01.23 01:59, Edgecombe, Rick P wrote:
> On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
>> Thanks for your comments and ideas here, I'll give the:
>> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
>> ...solution a try.
> 
> Well, it turns out there are some pte_mkwrite() callers in other arch's
> that operate on kernel memory and don't have a VMA. So it needed a new

Why not pass in NULL as VMA then and document the semantics? The less 
similarly named but slightly different functions, the better :)
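
A minimal sketch of what "pass NULL and document the semantics" could
look like, as a userspace toy rather than kernel code. The struct
toy_vma type, the TOY_* values, and this toy_pte_mkwrite() are
assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_pte_t;

/* Hypothetical values for this sketch. */
#define TOY_VM_SHADOW_STACK 0x1u
#define TOY_RW              (1u << 0)
#define TOY_DIRTY           (1u << 1)

struct toy_vma {
	unsigned int vm_flags;
};

/*
 * vma == NULL means "no VMA available" (e.g. kernel memory): make the
 * PTE conventionally writable. Otherwise decide based on the VMA flags.
 */
static toy_pte_t toy_pte_mkwrite(toy_pte_t pte, const struct toy_vma *vma)
{
	if (vma && (vma->vm_flags & TOY_VM_SHADOW_STACK))
		return (pte & ~TOY_RW) | TOY_DIRTY;	/* Write=0,Dirty=1 */

	return pte | TOY_RW;
}
```

With this shape there is a single entry point, and callers that operate
on kernel memory simply pass NULL.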

-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-25 18:43             ` Edgecombe, Rick P
  2023-01-26  0:59               ` Edgecombe, Rick P
@ 2023-01-26  8:57               ` David Hildenbrand
  2023-01-26 20:16                 ` Edgecombe, Rick P
  1 sibling, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-26  8:57 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, akpm, Yang, Weijiang,
	Lutomirski, Andy, arnd, jamorris, linux-doc, bp, Schimpe,
	Christina, x86, mike.kravetz, tglx, andrew.cooper3, john.allen,
	rppt, mingo, corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On 25.01.23 19:43, Edgecombe, Rick P wrote:
> On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
>>>> Roughly speaking: if we abstract it that way and get all of the
>>>> "how
>>>> to
>>>> set it writable now?" out of core-MM, it not only is cleaner and
>>>> less
>>>> error prone, it might even allow other architectures that
>>>> implement
>>>> something comparable (e.g., using a dedicated HW bit) to actually
>>>> reuse
>>>> some of that work. Otherwise most of that "shstk" is really just
>>>> x86
>>>> specific ...
>>>>
>>>> I guess the only cases we have to special case would be page
>>>> pinning
>>>> code where pte_write() would indicate that the PTE is writable
>>>> (well,
>>>> it
>>>> is, just not by "ordinary CPU instruction" context directly): but
>>>> you
>>>> do
>>>> that already, so ... :)
>>>>
>>>> Sorry for stumbling over that this late, I only started looking
>>>> into
>>>> this when you CCed me on that one patch.
>>>
>>> Sorry for not calling more attention to it earlier. Appreciate your
>>> comments.
>>>
>>> Previous versions of this series had changed some of these
>>> pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
>>> vma.
>>> This way an x86 implementation could use the VM_SHADOW_STACK vma
>>> flag
>>> to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
>>> feedback
>>> was that in some of these code paths "maybe" isn't really an
>>> option, it
>>> *needs* to make it writable. Even though the logic was the same,
>>> the
>>> name of the function made it look wrong.
>>>
>>> But another option could be to change pte_mkwrite() to take a vma.
>>> This
>>> would save using another software bit on x86, but instead requires
>>> a
>>> small change to each arch's pte_mkwrite().
>>
>> I played with that idea shortly as well, but discarded it. I was not
>> able to convince myself that it wouldn't be required to pass in the
>> VMA
>> as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ...
>> which would end up fairly ugly (or even impossible in things like
>> GUP-fast).
>>
>> For example, I wonder how we'd be handling stuff like do_numa_page()
>> cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
>> either
>> call might set the PTE writable and maintain dirty bit ...
> 
> pte_modify() is handled like this currently:
> 
> https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/
> 
> There has been a couple iterations on that. The current solution is to
> do the Dirty->SavedDirty fixup if needed after the new prots are added.
> 
> Of course pte_modify() can't know whether you are attempting to
> create a shadow stack PTE with the prot you are passing in. But the
> callers today explicitly call pte_mkwrite() after filling in the other
> bits with pte_modify().

See below on my MAP_PRIVATE vs. MAP_SHARED comment.

> Today this patch causes the pte_mkwrite() to be
> skipped and another fault may be required in the mprotect() and numa
> cases, but if we change pte_mkwrite() to take a VMA we can just make it
> shadow stack to start.
> 
> It might be worth mentioning, there was a suggestion in the past to try
> to have the shadow stack bits come out of vm_get_page_prot(), but MM
> code would then try to map the zero page as (shadow stack) writable
> when there was a normal (non-shadow stack) read access. So I had to
> abandon that approach and rely on explicit calls to pte_mkwrite/shstk()
> to make it shadow stack.

Thanks, do you have a pointer?

> 
>>
>> Having that said, maybe it could work with only a single saved-dirty
>> bit
>> and passing in the VMA for pte_mkwrite() only.
>>
>> pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty
>> bit
>> to the soft-dirty bit instead, resulting in
>> "writable=0,dirty=0,saved-dirty=1",
>>
>> pte_dirty() would return dirty==1||saved-dirty==1.
>>
>> pte_mkdirty() would set either dirty=1 or saved-dirty=1,
>> depending
>> on the writable bit.
>>
>> pte_mkclean() would clean both bits.
>>
>> pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"
>>
>> pte_mkwrite() would act according to the VMA, and in addition, merge
>> the
>> saved-dirty bit into the dirty bit.
>>
>> pte_modify() and mk_pte() .... would require more thought ...
> 
> Not sure I'm following what the mk_pte() problem would be. You mean if
> Write=0,Dirty=1 is manually added to the prot?
> 
> Shouldn't people generally use the pte_mkwrite() helpers unless they
> are drawing from a prot that was already created with the helpers or
> vm_get_page_prot()?

pte_mkwrite() is mostly only used (except for writenotify ...) for 
MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory, 
vma->vm_page_prot in a VM_WRITE mapping already contains the write 
permissions. pte_mkwrite() is not necessary (again, unless writenotify 
is active).
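
As a rough illustration of that distinction, here is a toy userspace
model of when the default page protection already carries write
permission and when an explicit pte_mkwrite() is still needed. It is a
simplification under assumptions: the TOY_* flags and both helper
functions are hypothetical, and writenotify is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this sketch. */
#define TOY_VM_WRITE  0x1u
#define TOY_VM_SHARED 0x2u

/* Does the VMA's default page protection already allow writes? */
static bool toy_page_prot_writable(unsigned int vm_flags, bool writenotify)
{
	if (!(vm_flags & TOY_VM_WRITE))
		return false;
	if (!(vm_flags & TOY_VM_SHARED))
		return false;	/* private: read-only so writes fault and COW */

	return !writenotify;	/* shared: writable unless writes must fault */
}

/* Is an explicit pte_mkwrite()-style upgrade needed to map it writable? */
static bool toy_needs_pte_mkwrite(unsigned int vm_flags, bool writenotify)
{
	return (vm_flags & TOY_VM_WRITE) &&
	       !toy_page_prot_writable(vm_flags, writenotify);
}
```

In this model only private writable mappings, and shared ones with
writenotify active, go through the explicit upgrade.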

I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you 
didn't stumble over that issue yet? Because I don't see how it could 
work with MAP_SHARED VMAs.


The other thing I had in mind was that we have to make sure that we're 
not accidentally setting "Write=0,Dirty=1" in mk_pte() / pte_modify().

Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect 
using pte_modify(), we have to make sure to move the dirty bit to the 
saved_dirty bit.

> I think they can't manually create prot's from bits
> in core mm code, right? And x86 arch code already has to be aware of
> shadow stack. It's a bit of an assumption I guess, but I think maybe
> not too crazy of one?

I think that's true. Arch code is supposed to deal with that IIRC.

> 
>>
>>
>> Further, ptep_modify_prot_commit() might have to be adjusted to
>> properly
>> flush in all relevant cases IIRC.
> 
> Sorry, I'm not following. Can you elaborate? There is an adjustment
> made in pte_flags_need_flush().

Note that I did not fully review all bits of this patch set, just 
throwing out what was on my mind. If already handled, great.

> 
>>
>>>
>>> x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
>>> but
>>> maybe it could additionally warn if the vma is not writable. It
>>> also
>>> seems more aligned with your changes to stop taking hints from PTE
>>> bits
>>> and just look at the VMA? (I'm thinking about the dropping of the
>>> dirty
>>> check in GUP and dropping pte_saved_write())
>>
>> The soft-shstk bit wouldn't be a hint, it would be logically
>> changing
>> the "type" of the PTE such that any other PTE functions can do the
>> right
>> thing without having to consume the VMA.
> 
> Yea, true.
> 
> Thanks for your comments and ideas here, I'll give the:
> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> ...solution a try.

Good!

-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-26  8:57               ` David Hildenbrand
@ 2023-01-26 20:16                 ` Edgecombe, Rick P
  2023-01-27 16:19                   ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-26 20:16 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Lutomirski, Andy, bp, jamorris,
	Yang, Weijiang, Schimpe, Christina, mike.kravetz, arnd,
	linux-doc, x86, tglx, andrew.cooper3, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Thu, 2023-01-26 at 09:57 +0100, David Hildenbrand wrote:
> On 25.01.23 19:43, Edgecombe, Rick P wrote:
> > On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
> > > > > Roughly speaking: if we abstract it that way and get all of
> > > > > the
> > > > > "how
> > > > > to
> > > > > set it writable now?" out of core-MM, it not only is cleaner
> > > > > and
> > > > > less
> > > > > error prone, it might even allow other architectures that
> > > > > implement
> > > > > something comparable (e.g., using a dedicated HW bit) to
> > > > > actually
> > > > > reuse
> > > > > some of that work. Otherwise most of that "shstk" is really
> > > > > just
> > > > > x86
> > > > > specific ...
> > > > > 
> > > > > I guess the only cases we have to special case would be page
> > > > > pinning
> > > > > code where pte_write() would indicate that the PTE is
> > > > > writable
> > > > > (well,
> > > > > it
> > > > > is, just not by "ordinary CPU instruction" context directly):
> > > > > but
> > > > > you
> > > > > do
> > > > > that already, so ... :)
> > > > > 
> > > > > Sorry for stumbling over that this late, I only started
> > > > > looking
> > > > > into
> > > > > this when you CCed me on that one patch.
> > > > 
> > > > Sorry for not calling more attention to it earlier. Appreciate
> > > > your
> > > > comments.
> > > > 
> > > > Previous versions of this series had changed some of these
> > > > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
> > > > vma.
> > > > This way an x86 implementation could use the VM_SHADOW_STACK
> > > > vma
> > > > flag
> > > > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
> > > > feedback
> > > > was that in some of these code paths "maybe" isn't really an
> > > > option, it
> > > > *needs* to make it writable. Even though the logic was the
> > > > same,
> > > > the
> > > > name of the function made it look wrong.
> > > > 
> > > > But another option could be to change pte_mkwrite() to take a
> > > > vma.
> > > > This
> > > > would save using another software bit on x86, but instead
> > > > requires
> > > > a
> > > > small change to each arch's pte_mkwrite().
> > > 
> > > I played with that idea shortly as well, but discarded it. I was
> > > not
> > > able to convince myself that it wouldn't be required to pass in
> > > the
> > > VMA
> > > as well for things like pte_dirty(), pte_mkdirty(), pte_write(),
> > > ...
> > > which would end up fairly ugly (or even impossible in things like
> > > GUP-fast).
> > > 
> > > For example, I wonder how we'd be handling stuff like
> > > do_numa_page()
> > > cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
> > > either
> > > call might set the PTE writable and maintain dirty bit ...
> > 
> > pte_modify() is handled like this currently:
> > 
> > https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/
> > 
> > There has been a couple iterations on that. The current solution is
> > to
> > do the Dirty->SavedDirty fixup if needed after the new prots are
> > added.
> > 
> > Of course pte_modify() can't know whether you are attempting to
> > create a shadow stack PTE with the prot you are passing in. But the
> > callers today explicitly call pte_mkwrite() after filling in the
> > other
> > bits with pte_modify().
> 
> See below on my MAP_PRIVATE vs. MAP_SHARED comment.

Yep, MAP_SHARED support was dropped with the reboot of the series. It
did have some problems IIRC.

Now shadow stack memory creation is tightly controlled: it is either created
via a special syscall or automatically with a new thread.

> 
> > Today this patch causes the pte_mkwrite() to be
> > skipped and another fault may be required in the mprotect() and
> > numa
> > cases, but if we change pte_mkwrite() to take a VMA we can just
> > make it
> > shadow stack to start.
> > 
> > It might be worth mentioning, there was a suggestion in the past to
> > try
> > to have the shadow stack bits come out of vm_get_page_prot(), but
> > MM
> > code would then try to map the zero page as (shadow stack) writable
> > when there was a normal (non-shadow stack) read access. So I had to
> > abandon that approach and rely on explicit calls to
> > pte_mkwrite/shstk()
> > to make it shadow stack.
> 
> Thanks, do you have a pointer?

I never posted it because it didn't work out. This was the comment that
prompted the exploration in that direction:

https://lore.kernel.org/lkml/8065c333-0911-04a2-f91e-7c2e0cc7ec51@intel.com/

Shadow stack memory also used to not be VM_WRITE (VM_SHADOW_STACK
only), but this was changed for other reasons. In v2 there were some
updates to how shadow stack memory was handled, and the cover letter
had a writeup of the reasons and general design:

https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

> 
> > 
> > > 
> > > Having that said, maybe it could work with only a single saved-
> > > dirty
> > > bit
> > > and passing in the VMA for pte_mkwrite() only.
> > > 
> > > pte_wrprotect() would detect "writable=0,dirty=1" and move the
> > > dirty
> > > bit
> > > to the saved-dirty bit instead, resulting in
> > > "writable=0,dirty=0,saved-dirty=1",
> > > 
> > > pte_dirty() would return dirty==1||saved-dirty==1.
> > > 
> > > pte_mkdirty() would set either dirty=1 or saved-dirty=1,
> > > depending
> > > on the writable bit.
> > > 
> > > pte_mkclean() would clean both bits.
> > > 
> > > pte_write() would detect "writable == 1 || (writable==0 &&
> > > dirty==1)"
> > > 
> > > pte_mkwrite() would act according to the VMA, and in addition,
> > > merge
> > > the
> > > saved-dirty bit into the dirty bit.
> > > 
> > > pte_modify() and mk_pte() .... would require more thought ...
> > 
> > Not sure I'm following what the mk_pte() problem would be. You mean
> > if
> > Write=0,Dirty=1 is manually added to the prot?
> > 
> > Shouldn't people generally use the pte_mkwrite() helpers unless
> > they
> > are drawing from a prot that was already created with the helpers
> > or
> > vm_get_page_prot()?
> 
> pte_mkwrite() is mostly only used (except for writenotify ...) for 
> MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory, 
> vma->vm_page_prot in a VM_WRITE mapping already contains the write 
> permissions. pte_mkwrite() is not necessary (again, unless
> writenotify 
> is active).

Oh, interesting.

> 
> I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you 
> didn't stumble over that issue yet? Because I don't see how it could 
> work with MAP_SHARED VMAs.

Yep, it doesn't support MAP_SHARED.

> 
> 
> The other thing I had in mind was that we have to make sure that
> we're 
> not accidentally setting "Write=0,Dirty=1" in mk_pte() /
> pte_modify().
> 
> Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect 
> using pte_modify(), we have to make sure to move the dirty bit to
> the 
> saved_dirty bit.

For the mk_pte() case, I don't think a Write=0,Dirty=1 prot could come
from anywhere. I guess the MAP_SHARED case is a little less bounded. We
could maybe add a warning for this case.

For the pte_modify() case, this does happen. There are two scenarios
considered:
1. A Write=0,Dirty=0 PTE is made dirty. This can't happen today as
Dirty is filtered via _PAGE_CHG_MASK. Basically pte_modify() doesn't
support it.
2. A Write=1,Dirty=1 PTE gets write protected. This does happen because
the Write=0 prot comes from protection_map, and pte_modify() would
leave the Dirty=1 bit alone. The main case I know of is mprotect(). It
is handled by changes to pte_modify() by doing the Dirty->SavedDirty
fixup if needed.

So pte_modify()'s job should not be too tricky. What you can't do with
it though, is create shadow stack PTEs. But it is ok for our uses
because of the explicit mkwrite().
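
To make the two pte_modify() scenarios above concrete, here is a minimal
userspace sketch. It is a toy model, not the code from the series: the
bit positions, the PTE_SAVED_DIRTY name and the pte_modify_model() helper
are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy PTE model: bit positions are illustrative, not the real x86 layout. */
typedef uint64_t pte_t;

#define PTE_WRITE       (1ULL << 1)
#define PTE_DIRTY       (1ULL << 6)
#define PTE_SAVED_DIRTY (1ULL << 58)	/* hypothetical software bit */
#define PTE_PFN_MASK    0x000ffffffffff000ULL
/* Bits kept from the old PTE, playing the role of _PAGE_CHG_MASK. */
#define PTE_CHG_MASK    (PTE_PFN_MASK | PTE_DIRTY | PTE_SAVED_DIRTY)

/* Write=0,Dirty=1 is the combination that must not appear by accident. */
static int pte_is_shstk_encoded(pte_t pte)
{
	return (pte & (PTE_WRITE | PTE_DIRTY)) == PTE_DIRTY;
}

static pte_t pte_modify_model(pte_t pte, pte_t newprot)
{
	/*
	 * Scenario 1: Dirty is part of the change mask, so a Dirty bit in
	 * newprot is filtered out and a clean PTE cannot be made dirty here.
	 */
	pte_t val = (pte & PTE_CHG_MASK) | (newprot & ~PTE_CHG_MASK);

	/*
	 * Scenario 2: the old PTE was Write=1,Dirty=1 and newprot has
	 * Write=0. Move Dirty to SavedDirty so the result is not
	 * Write=0,Dirty=1.
	 */
	if (pte_is_shstk_encoded(val)) {
		val &= ~PTE_DIRTY;
		val |= PTE_SAVED_DIRTY;
	}

	/* This helper never produces a shadow stack encoded PTE. */
	assert(!pte_is_shstk_encoded(val));
	return val;
}
```

In this model the only way to end up with Write=0,Dirty=1 is an explicit
helper called after pte_modify_model(), which matches the reliance on the
explicit mkwrite() described above.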

> 
> > I think they can't manually create prot's from bits
> > in core mm code, right? And x86 arch code already has to be aware
> > of
> > shadow stack. It's a bit of an assumption I guess, but I think
> > maybe
> > not too crazy of one?
> 
> I think that's true. Arch code is supposed to deal with that IIRC.
> 
> > 
> > > 
> > > 
> > > Further, ptep_modify_prot_commit() might have to be adjusted to
> > > properly
> > > flush in all relevant cases IIRC.
> > 
> > Sorry, I'm not following. Can you elaborate? There is an adjustment
> > made in pte_flags_need_flush().
> 
> Note that I did not fully review all bits of this patch set, just 
> throwing out what was on my mind. If already handled, great.
> 
> > 
> > > 
> > > > 
> > > > x86's pte_mkwrite() would then be pretty close to
> > > > maybe_mkwrite(),
> > > > but
> > > > maybe it could additionally warn if the vma is not writable. It
> > > > also
> > > > seems more aligned with your changes to stop taking hints from
> > > > PTE
> > > > bits
> > > > and just look at the VMA? (I'm thinking about the dropping of
> > > > the
> > > > dirty
> > > > check in GUP and dropping pte_saved_write())
> > > 
> > > The soft-shstk bit wouldn't be a hint, it would be logically
> > > changing
> > > the "type" of the PTE such that any other PTE functions can do
> > > the
> > > right
> > > thing without having to consume the VMA.
> > 
> > Yea, true.
> > 
> > Thanks for your comments and ideas here, I'll give the:
> > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> > ...solution a try.
> 
> Good!
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-26  8:46                 ` David Hildenbrand
@ 2023-01-26 20:19                   ` Edgecombe, Rick P
  2023-01-27 16:12                     ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-26 20:19 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Lutomirski, Andy, bp, jamorris,
	Yang, Weijiang, Schimpe, Christina, mike.kravetz, arnd,
	linux-doc, x86, tglx, andrew.cooper3, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote:
> On 26.01.23 01:59, Edgecombe, Rick P wrote:
> > On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
> > > Thanks for your comments and ideas here, I'll give the:
> > > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> > > ...solution a try.
> > 
> > Well, it turns out there are some pte_mkwrite() callers in other
> > arch's
> > that operate on kernel memory and don't have a VMA. So it needed a
> > new
> 
> Why not pass in NULL as VMA then and document the semantics? The
> less 
> similarly named but slightly different functions, the better :)

Hmm. The x86 and generic versions should probably have the same
semantics, so then if you pass a NULL, it would do a regular
pte_mkwrite() I guess?

I see another benefit of requiring the vma argument, such that raw
pte_mkwrite()s are less likely to appear in core MM code. But I think
the NULL is awkward because it's not obvious, to me at least, what the
implications of that should be.

So it will be confusing to read in the NULL cases for the other archs.
We also have some warnings to catch missed cases in the PTE tear down
code, so the scenario of new code accidentally marking shadow stack
PTEs as writable is not totally unchecked.

The three functions that do slightly different things are:

pte_mkwrite():
Makes a PTE conventionally writable, only takes a PTE. Very clear that
it is a low level helper and what it does.

maybe_mkwrite():
Might make a PTE writable if the VMA allows it.

pte_mkwrite_vma():
Makes a PTE writable in a specific way depending on the VMA

I wonder if the name pte_mkwrite_vma() is maybe just not clear enough.
It takes a VMA, yes, but what does it do with it?

What if it was called pte_mkwrite_type() instead? Some arch's have
additional types of writable memory and this function creates them. Of
course they also have the normal type of writable memory, and
pte_mkwrite() creates that like usual. Doesn't it seem more readable?
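
As a rough illustration of how the three helpers listed above relate to
each other, here is a toy userspace model. The struct vma_model type, the
flag values and the pte_mkwrite_shstk() name are assumptions made up for
the sketch, and the argument order of pte_mkwrite_vma() is a guess.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pte_t;

/* Illustrative bit and flag values, not the real kernel definitions. */
#define PTE_WRITE       (1ULL << 1)
#define PTE_DIRTY       (1ULL << 6)
#define VM_WRITE        0x2UL
#define VM_SHADOW_STACK 0x100UL

struct vma_model {
	unsigned long vm_flags;
};

/* Low level helper: conventionally writable, only takes a PTE. */
static pte_t pte_mkwrite(pte_t pte)
{
	return pte | PTE_WRITE;
}

/* Hypothetical helper: shadow stack writable is Write=0,Dirty=1. */
static pte_t pte_mkwrite_shstk(pte_t pte)
{
	return (pte & ~PTE_WRITE) | PTE_DIRTY;
}

/* Picks the type of writable memory based on the VMA. */
static pte_t pte_mkwrite_vma(const struct vma_model *vma, pte_t pte)
{
	/* Stand-in for a warning when the VMA is not writable at all. */
	assert(vma->vm_flags & VM_WRITE);

	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk(pte);
	return pte_mkwrite(pte);
}

/* Might make the PTE writable, if the VMA allows it. */
static pte_t maybe_mkwrite(pte_t pte, const struct vma_model *vma)
{
	if (vma->vm_flags & VM_WRITE)
		pte = pte_mkwrite_vma(vma, pte);
	return pte;
}
```

With this layering the shadow stack decision lives in one place, and
maybe_mkwrite() only has to decide whether to upgrade at all.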

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-26 20:19                   ` Edgecombe, Rick P
@ 2023-01-27 16:12                     ` David Hildenbrand
  2023-01-28  0:51                       ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-27 16:12 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, akpm, Lutomirski, Andy, bp,
	jamorris, Yang, Weijiang, Schimpe, Christina, mike.kravetz, arnd,
	linux-doc, x86, tglx, andrew.cooper3, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On 26.01.23 21:19, Edgecombe, Rick P wrote:
> On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote:
>> On 26.01.23 01:59, Edgecombe, Rick P wrote:
>>> On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
>>>> Thanks for your comments and ideas here, I'll give the:
>>>> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
>>>> ...solution a try.
>>>
>>> Well, it turns out there are some pte_mkwrite() callers in other
>>> arch's
>>> that operate on kernel memory and don't have a VMA. So it needed a
>>> new
>>
>> Why not pass in NULL as VMA then and document the semantics? The
>> less
>> similarly named but slightly different functions, the better :)
> 
> Hmm. The x86 and generic versions should probably have the same
> semantics, so then if you pass a NULL, it would do a regular
> pte_mkwrite() I guess?
> 
> I see another benefit of requiring the vma argument, such that raw
> pte_mkwrite()s are less likely to appear in core MM code. But I think
> the NULL is awkward because it's not obvious, to me at least, what the
> implications of that should be.
> 
> So it will be confusing to read in the NULL cases for the other archs.
> We also have some warnings to catch missed cases in the PTE tear down
> code, so the scenario of new code accidentally marking shadow stack
> PTEs as writable is not totally unchecked.
> 
> The three functions that do slightly different things are:
> 
> pte_mkwrite():
> Makes a PTE conventionally writable, only takes a PTE. Very clear that
> it is a low level helper and what it does.
> 
> maybe_mkwrite():
> Might make a PTE writable if the VMA allows it.
> 
> pte_mkwrite_vma():
> Makes a PTE writable in a specific way depending on the VMA
> 
> I wonder if the name pte_mkwrite_vma() is maybe just not clear enough.
> It takes a VMA, yes, but what does it do with it?
> 
> What if it was called pte_mkwrite_type() instead? Some arch's have
> additional types of writable memory and this function creates them. Of
> course they also have the normal type of writable memory, and
> pte_mkwrite() creates that like usual. Doesn't it seem more readable?

The issue is, the more variants we provide the easier it is to make 
mistakes and introduce new buggy code.

It's tempting to simply use pte_mkwrite() and call it a day, where 
people actually should use pte_mkwrite_vma().

Then, they at least have to investigate what to do about the second VMA 
parameter.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-26 20:16                 ` Edgecombe, Rick P
@ 2023-01-27 16:19                   ` David Hildenbrand
  0 siblings, 0 replies; 120+ messages in thread
From: David Hildenbrand @ 2023-01-27 16:19 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, akpm, Lutomirski, Andy, bp,
	jamorris, Yang, Weijiang, Schimpe, Christina, mike.kravetz, arnd,
	linux-doc, x86, tglx, andrew.cooper3, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

> 
> Now shadow stack memory creation is tightly controlled. Either created
> via special syscall or automatically with a new thread.

Good, it would be valuable to document that somewhere ("Never applies to 
VM_SHARED|VM_MAYSHARE VMAs").

[...]

>>
>> The other thing I had in mind was that we have to make sure that
>> we're
>> not accidentally setting "Write=0,Dirty=1" in mk_pte() /
>> pte_modify().
>>
>> Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect
>> using pte_modify(), we have to make sure to move the dirty bit to
>> the
>> saved_dirty bit.
> 
> For the mk_pte() case, I don't think a Write=0,Dirty=1 prot could come
> from anywhere. I guess the MAP_SHARED case is a little less bounded. We
> could maybe add a warning for this case.

Right, Write=0,Dirty=1 shouldn't apply at that point if shstk are 
always wrprotected as default.

> 
> For the pte_modify() case, this does happen. There are two scenarios
> considered:
> 1. A Write=0,Dirty=0 PTE is made dirty. This can't happen today as
> Dirty is filtered via _PAGE_CHG_MASK. Basically pte_modify() doesn't
> support it.

It should simply set the saved_dirty bit I guess. But I don't think 
pte_modify() is actually supposed to set PTEs dirty (primary goal is to 
change protection IIRC).

> 2. A Write=1,Dirty=1 PTE gets write protected. This does happen because
> the Write=0 prot comes from protection_map, and pte_modify() would
> leave the Dirty=1 bit alone. The main case I know of is mprotect(). It
> is handled by changes to pte_modify() by doing the Dirty->SavedDirty
> fixup if needed.

Right, we'd have to move the dirty bit to the saved_dirty bit. (we have 
to handle soft-dirty, too, whenever setting the PTE dirty -- either via 
the dirty bit or via the saved_dirty bit)
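
A small userspace model of that dirty-bit handling might look like the
following. It is only a sketch: the bit positions and the *_model()
helper names are made up for illustration, and shadow stack PTEs
themselves are left out of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;

/* Illustrative bit positions, not the real x86 layout. */
#define PTE_WRITE       (1ULL << 1)
#define PTE_DIRTY       (1ULL << 6)
#define PTE_SOFT_DIRTY  (1ULL << 11)
#define PTE_SAVED_DIRTY (1ULL << 58)

/* Dirty if either the hardware bit or the saved bit is set. */
static bool pte_dirty_model(pte_t pte)
{
	return (pte & (PTE_DIRTY | PTE_SAVED_DIRTY)) != 0;
}

/*
 * Choose the dirty bit based on the writable bit, and always record
 * soft-dirty as well, whichever dirty bit ends up being used.
 */
static pte_t pte_mkdirty_model(pte_t pte)
{
	pte_t dirty = (pte & PTE_WRITE) ? PTE_DIRTY : PTE_SAVED_DIRTY;

	return pte | dirty | PTE_SOFT_DIRTY;
}

/* Cleaning clears both dirty bits. */
static pte_t pte_mkclean_model(pte_t pte)
{
	return pte & ~(PTE_DIRTY | PTE_SAVED_DIRTY);
}

/* Write protecting moves Dirty to SavedDirty. */
static pte_t pte_wrprotect_model(pte_t pte)
{
	pte &= ~PTE_WRITE;
	if (pte & PTE_DIRTY)
		pte = (pte & ~PTE_DIRTY) | PTE_SAVED_DIRTY;

	/* A write protected PTE never keeps the hardware Dirty bit here. */
	assert(!(pte & PTE_DIRTY));
	return pte;
}
```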

> 
> So pte_modify()'s job should not be too tricky. What you can't do with
> it though, is create shadow stack PTEs. But it is ok for our uses
> because of the explicit mkwrite().

I think you are correct.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-27 16:12                     ` David Hildenbrand
@ 2023-01-28  0:51                       ` Edgecombe, Rick P
  2023-01-31  8:46                         ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-28  0:51 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Lutomirski, Andy, bp, jamorris,
	tglx, Schimpe, Christina, mike.kravetz, arnd, Yang, Weijiang,
	x86, andrew.cooper3, john.allen, linux-doc, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Fri, 2023-01-27 at 17:12 +0100, David Hildenbrand wrote:
> On 26.01.23 21:19, Edgecombe, Rick P wrote:
> > On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote:
> > > On 26.01.23 01:59, Edgecombe, Rick P wrote:
> > > > On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
> > > > > Thanks for your comments and ideas here, I'll give the:
> > > > > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> > > > > ...solution a try.
> > > > 
> > > > Well, it turns out there are some pte_mkwrite() callers in
> > > > other
> > > > arch's
> > > > that operate on kernel memory and don't have a VMA. So it
> > > > needed a
> > > > new
> > > 
> > > Why not pass in NULL as VMA then and document the semantics? The
> > > less
> > > similarly named but slightly different functions, the better :)
> > 
> > Hmm. The x86 and generic versions should probably have the same
> > semantics, so then if you pass a NULL, it would do a regular
> > pte_mkwrite() I guess?
> > 
> > I see another benefit of requiring the vma argument, such that raw
> > pte_mkwrite()s are less likely to appear in core MM code. But I
> > think
> > the NULL is awkward because it's not obvious, to me at least, what
> > the
> > implications of that should be.
> > 
> > So it will be confusing to read in the NULL cases for the other
> > archs.
> > > We also have some warnings to catch missed cases in the PTE tear down
> > code, so the scenario of new code accidentally marking shadow stack
> > PTEs as writable is not totally unchecked.
> > 
> > The three functions that do slightly different things are:
> > 
> > pte_mkwrite():
> > Makes a PTE conventionally writable, only takes a PTE. Very clear
> > that
> > it is a low level helper and what it does.
> > 
> > maybe_mkwrite():
> > Might make a PTE writable if the VMA allows it.
> > 
> > pte_mkwrite_vma():
> > Makes a PTE writable in a specific way depending on the VMA
> > 
> > I wonder if the name pte_mkwrite_vma() is maybe just not clear
> > enough.
> > It takes a VMA, yes, but what does it do with it?
> > 
> > What if it was called pte_mkwrite_type() instead? Some arch's have
> > additional types of writable memory and this function creates them.
> > Of
> > course they also have the normal type of writable memory, and
> > pte_mkwrite() creates that like usual. Doesn't it seem more
> > readable?
> 
> The issue is, the more variants we provide the easier it is to make 
> mistakes and introduce new buggy code.
> 
> It's tempting to simply use pte_mkwrite() and call it a day, where 
> people actually should use pte_mkwrite_vma().
> 
> Then, they at least have to investigate what to do about the second
> VMA 
> parameter.

Ok, I'll give it a spin. So far it looks ok. The downside is the giant
tree-wide pte_mkwrite() signature change, but once that is over with
there are other advantages. Like getting rid of maybe_mkwrite()'s
awareness of shadow stack so the logic is more centralized. Please let
me know if you don't feel comfortable with a suggested-by credit tag.

Thanks,
Rick

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-28  0:51                       ` Edgecombe, Rick P
@ 2023-01-31  8:46                         ` David Hildenbrand
  2023-01-31 23:33                           ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-01-31  8:46 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, akpm, Lutomirski, Andy, bp,
	jamorris, tglx, Schimpe, Christina, mike.kravetz, arnd, Yang,
	Weijiang, x86, andrew.cooper3, john.allen, linux-doc, rppt,
	mingo, corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On 28.01.23 01:51, Edgecombe, Rick P wrote:
> On Fri, 2023-01-27 at 17:12 +0100, David Hildenbrand wrote:
>> On 26.01.23 21:19, Edgecombe, Rick P wrote:
>>> On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote:
>>>> On 26.01.23 01:59, Edgecombe, Rick P wrote:
>>>>> On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
>>>>>> Thanks for your comments and ideas here, I'll give the:
>>>>>> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
>>>>>> ...solution a try.
>>>>>
>>>>> Well, it turns out there are some pte_mkwrite() callers in
>>>>> other
>>>>> arch's
>>>>> that operate on kernel memory and don't have a VMA. So it
>>>>> needed a
>>>>> new
>>>>
>>>> Why not pass in NULL as VMA then and document the semantics? The
>>>> less
>>>> similarly named but slightly different functions, the better :)
>>>
>>> Hmm. The x86 and generic versions should probably have the same
>>> semantics, so then if you pass a NULL, it would do a regular
>>> pte_mkwrite() I guess?
>>>
>>> I see another benefit of requiring the vma argument, such that raw
>>> pte_mkwrite()s are less likely to appear in core MM code. But I
>>> think
>>> the NULL is awkward because it's not obvious, to me at least, what
>>> the
>>> implications of that should be.
>>>
>>> So it will be confusing to read in the NULL cases for the other
>>> archs.
>>> We also have some warnings to catch missed cases in the PTE tear down
>>> code, so the scenario of new code accidentally marking shadow stack
>>> PTEs as writable is not totally unchecked.
>>>
>>> The three functions that do slightly different things are:
>>>
>>> pte_mkwrite():
>>> Makes a PTE conventionally writable, only takes a PTE. Very clear
>>> that
>>> it is a low level helper and what it does.
>>>
>>> maybe_mkwrite():
>>> Might make a PTE writable if the VMA allows it.
>>>
>>> pte_mkwrite_vma():
>>> Makes a PTE writable in a specific way depending on the VMA
>>>
>>> I wonder if the name pte_mkwrite_vma() is maybe just not clear
>>> enough.
>>> It takes a VMA, yes, but what does it do with it?
>>>
>>> What if it was called pte_mkwrite_type() instead? Some arch's have
>>> additional types of writable memory and this function creates them.
>>> Of
>>> course they also have the normal type of writable memory, and
>>> pte_mkwrite() creates that like usual. Doesn't it seem more
>>> readable?
>>
>> The issue is, the more variants we provide the easier it is to make
>> mistakes and introduce new buggy code.
>>
>> It's tempting to simply use pte_mkwrite() and call it a day, where
>> people actually should use pte_mkwrite_vma().
>>
>> Then, they at least have to investigate what to do about the second
>> VMA
>> parameter.
> 
> Ok, I'll give it a spin. So far it looks ok. The downside is the giant
> tree-wide pte_mkwrite() signature change, but once that is over with
> there are other advantages. Like getting rid of maybe_mkwrite()'s
> awareness of shadow stack so the logic is more centralized. Please let
> me know if you don't feel comfortable with a suggested-by credit tag.

Sure ...

but I reconsidered :)

Maybe there is a cleaner way to do it and avoid the "NULL" argument.

What about having (while you're going over everything already):

pte_mkwrite(pte, vma)
pte_mkwrite_kernel(pte)

The latter would only be used in that arch code where we're working on 
kernel pgtables. We already have pte_offset_kernel() and 
pte_alloc_kernel_track(), so it's not too weird.
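
Very roughly, and only as a userspace toy model, the split could look
like this. The struct vma_model type and the bit and flag values are
assumptions made up for the sketch, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pte_t;

/* Illustrative bit and flag values. */
#define PTE_WRITE       (1ULL << 1)
#define PTE_DIRTY       (1ULL << 6)
#define VM_SHADOW_STACK 0x100UL

struct vma_model {
	unsigned long vm_flags;
};

/* Kernel page tables: there is no VMA, always conventionally writable. */
static pte_t pte_mkwrite_kernel(pte_t pte)
{
	return pte | PTE_WRITE;
}

/* User mappings: the VMA is mandatory and selects the writable type. */
static pte_t pte_mkwrite(pte_t pte, const struct vma_model *vma)
{
	/* No NULL semantics to document: a missing VMA is a caller bug. */
	assert(vma != NULL);

	if (vma->vm_flags & VM_SHADOW_STACK)
		return (pte & ~PTE_WRITE) | PTE_DIRTY;
	return pte | PTE_WRITE;
}
```

Callers without a VMA then have to pick pte_mkwrite_kernel() explicitly
instead of passing NULL.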

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-31  8:46                         ` David Hildenbrand
@ 2023-01-31 23:33                           ` Edgecombe, Rick P
  2023-02-01  9:03                             ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-01-31 23:33 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, andrew.cooper3, oleg, Yang, Weijiang, akpm, Lutomirski,
	Andy, bp, jamorris, hjl.tools, tglx, Schimpe, Christina, x86,
	mike.kravetz, linux-doc, arnd, john.allen, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Tue, 2023-01-31 at 09:46 +0100, David Hildenbrand wrote:
> Sure ...
> 
> but I reconsidered :)
> 
> Maybe there is a cleaner way to do it and avoid the "NULL" argument.
> 
> What about having (while you're going over everything already):
> 
> pte_mkwrite(pte, vma)
> pte_mkwrite_kernel(pte)
> 
> The latter would only be used in that arch code where we're working
> on 
> kernel pgtables. We already have pte_offset_kernel() and 
> pte_alloc_kernel_track(), so it's not too weird.

Hmm, one downside is the "mk" part might lead people to guess
pte_mkwrite_kernel() would make it writable AND a kernel page (like
U/S=0 on x86). Instead of being a mkwrite() that's useful for setting
on kernel PTEs.

The other problem is that one of the NULL passers is not for kernel memory.
huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't be
created with MAP_HUGETLB, so it is not needed. Using
pte_mkwrite_kernel() would look weird in this case, but making
huge_pte_mkwrite() take a VMA would be for no reason. Maybe making
huge_pte_mkwrite() take a VMA is the better of those two options. Or
keep the NULL semantics...  Any thoughts?





^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-01-31 23:33                           ` Edgecombe, Rick P
@ 2023-02-01  9:03                             ` David Hildenbrand
  2023-02-01 17:32                               ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: David Hildenbrand @ 2023-02-01  9:03 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, andrew.cooper3, oleg, Yang, Weijiang, akpm,
	Lutomirski, Andy, bp, jamorris, hjl.tools, tglx, Schimpe,
	Christina, x86, mike.kravetz, linux-doc, arnd, john.allen, rppt,
	mingo, corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On 01.02.23 00:33, Edgecombe, Rick P wrote:
> On Tue, 2023-01-31 at 09:46 +0100, David Hildenbrand wrote:
>> Sure ...
>>
>> but I reconsidered :)
>>
>> Maybe there is a cleaner way to do it and avoid the "NULL" argument.
>>
>> What about having (while you're going over everything already):
>>
>> pte_mkwrite(pte, vma)
>> pte_mkwrite_kernel(pte)
>>
>> The latter would only be used in that arch code where we're working
>> on
>> kernel pgtables. We already have pte_offset_kernel() and
>> pte_alloc_kernel_track(), so it's not too weird.
> 
> Hmm, one downside is the "mk" part might lead people to guess
> pte_mkwrite_kernel() would make it writable AND a kernel page (like
> U/S=0 on x86). Instead of being a mkwrite() that's useful for setting
> on kernel PTEs.

At least I wouldn't worry about that too much. We handle nowhere in 
common code user vs. supervisor access that way explicitly (e.g., 
mkkernel), and it wouldn't even apply on architectures where we cannot 
make such a decision on a per-PTE basis.

> 
> The other problem is that one of the NULL passers is not for kernel memory.
> huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't be
> created with MAP_HUGETLB, so it is not needed. Using
> pte_mkwrite_kernel() would look weird in this case, but making
> huge_pte_mkwrite() take a VMA would be for no reason. Maybe making
> huge_pte_mkwrite() take a VMA is the better of those two options. Or
> keep the NULL semantics...  Any thoughts?

Well, the reason would be consistency. From a core-mm point of view it 
makes sense to handle this all consistently, even if the single user 
(x86) wouldn't strictly require it right now.

I'd just pass in the VMA and call it a day :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate
  2023-01-19 21:22 ` [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate Rick Edgecombe
  2023-01-20  0:47   ` Kees Cook
@ 2023-02-01 11:01   ` Borislav Petkov
  2023-02-01 17:31     ` Edgecombe, Rick P
  1 sibling, 1 reply; 120+ messages in thread
From: Borislav Petkov @ 2023-02-01 11:01 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe

On Thu, Jan 19, 2023 at 01:22:44PM -0800, Rick Edgecombe wrote:
> +void fpregs_lock_and_load(void)
> +{
> +	/*
> +	 * fpregs_lock() only disables preemption (mostly). So modifying state
> +	 * in an interrupt could screw up some in progress fpregs operation,
> +	 * but appear to work. Warn about it.

I don't like comments where it sounds like we don't know what we're
doing. "Appear to work"?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate
  2023-02-01 11:01   ` Borislav Petkov
@ 2023-02-01 17:31     ` Edgecombe, Rick P
  2023-02-01 18:18       ` Borislav Petkov
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-02-01 17:31 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, jamorris, john.allen, rppt, andrew.cooper3, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Wed, 2023-02-01 at 12:01 +0100, Borislav Petkov wrote:
> On Thu, Jan 19, 2023 at 01:22:44PM -0800, Rick Edgecombe wrote:
> > +void fpregs_lock_and_load(void)
> > +{
> > +     /*
> > +      * fpregs_lock() only disables preemption (mostly). So
> > modifying state
> > +      * in an interrupt could screw up some in progress fpregs
> > operation,
> > +      * but appear to work. Warn about it.
> 
> I don't like comments where it sounds like we don't know what we're
> doing. "Appear to work"?

I can change it. This patch started with the observation that modifying
xstate from the kernel had been gotten wrong a couple times in the
past, so that is what this is referencing. Since then, the fancy
automatic solution got boiled down to this helper and a couple
warnings.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-02-01  9:03                             ` David Hildenbrand
@ 2023-02-01 17:32                               ` Edgecombe, Rick P
  2023-02-01 18:03                                 ` David Hildenbrand
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-02-01 17:32 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, andrew.cooper3, hjl.tools, Yang, Weijiang, oleg,
	Lutomirski, Andy, bp, jamorris, akpm, Schimpe, Christina, x86,
	tglx, linux-doc, mike.kravetz, arnd, john.allen, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Wed, 2023-02-01 at 10:03 +0100, David Hildenbrand wrote:
> > 
> > The other problem is that one of the NULL passers is not for kernel
> > memory.
> > huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't
> > be
> > created with MAP_HUGETLB, so it is not needed. Using
> > pte_mkwrite_kernel() would look weird in this case, but making
> > huge_pte_mkwrite() take a VMA would be for no reason. Maybe making
> > huge_pte_mkwrite() take a VMA is the better of those two options.
> > Or
> > keep the NULL semantics...  Any thoughts?
> 
> Well, the reason would be consistency. From a core-mm point of view
> it 
> makes sense to handle this all consistently, even if the single user 
> (x86) wouldn't strictly require it right now.
> 
> I'd just pass in the VMA and call it a day :)

Ok, I'll give it a spin.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
  2023-02-01 17:32                               ` Edgecombe, Rick P
@ 2023-02-01 18:03                                 ` David Hildenbrand
  0 siblings, 0 replies; 120+ messages in thread
From: David Hildenbrand @ 2023-02-01 18:03 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, andrew.cooper3, hjl.tools, Yang, Weijiang,
	oleg, Lutomirski, Andy, bp, jamorris, akpm, Schimpe, Christina,
	x86, tglx, linux-doc, mike.kravetz, arnd, john.allen, rppt,
	mingo, corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On 01.02.23 18:32, Edgecombe, Rick P wrote:
> On Wed, 2023-02-01 at 10:03 +0100, David Hildenbrand wrote:
>>>
>>> The other problem is that one of the NULL passers is not for kernel
>>> memory.
>>> huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't
>>> be
>>> created with MAP_HUGETLB, so it is not needed. Using
>>> pte_mkwrite_kernel() would look weird in this case, but making
>>> huge_pte_mkwrite() take a VMA would be for no reason. Maybe making
>>> huge_pte_mkwrite() take a VMA is the better of those two options.
>>> Or
>>> keep the NULL semantics...  Any thoughts?
>>
>> Well, the reason would be consistency. From a core-mm point of view
>> it
>> makes sense to handle this all consistently, even if the single user
>> (x86) wouldn't strictly require it right now.
>>
>> I'd just pass in the VMA and call it a day :)
> 
> Ok, I'll give it a spin.

It would be good to get more opinions on that, but I'm afraid we won't 
get more deep down in this thread :)

-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate
  2023-02-01 17:31     ` Edgecombe, Rick P
@ 2023-02-01 18:18       ` Borislav Petkov
  0 siblings, 0 replies; 120+ messages in thread
From: Borislav Petkov @ 2023-02-01 18:18 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, jamorris, john.allen, rppt, andrew.cooper3, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Wed, Feb 01, 2023 at 05:31:50PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2023-02-01 at 12:01 +0100, Borislav Petkov wrote:
> > On Thu, Jan 19, 2023 at 01:22:44PM -0800, Rick Edgecombe wrote:
> > > +void fpregs_lock_and_load(void)
> > > +{
> > > +     /*
> > > +      * fpregs_lock() only disables preemption (mostly). So
> > > modifying state
> > > +      * in an interrupt could screw up some in progress fpregs
> > > operation,
> > > +      * but appear to work. Warn about it.
> > 
> > I don't like comments where it sounds like we don't know what we're
> > doing. "Appear to work"?
> 
> I can change it. This patch started with the observation that modifying
> xstate from the kernel had been gotten wrong a couple times in the
> past, so that is what this is referencing. Since then, the fancy
> automatic solution got boiled down to this helper and a couple
> warnings.

Yeah, but that comment right now reads like: modifying in interrupt
context can corrupt fpregs and you should not do it but it kinda works,
by chance. Thus encouraging people to keep doing that.

I guess "but appear to work" can go and then it is fine.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v5 07/39] x86: Add user control-protection fault handler
  2023-01-19 21:22 ` [PATCH v5 07/39] x86: Add user control-protection fault handler Rick Edgecombe
  2023-01-20  0:50   ` Kees Cook
@ 2023-02-03 19:09   ` Borislav Petkov
  2023-02-03 19:24     ` Edgecombe, Rick P
  1 sibling, 1 reply; 120+ messages in thread
From: Borislav Petkov @ 2023-02-03 19:09 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu, Michael Kerrisk

On Thu, Jan 19, 2023 at 01:22:45PM -0800, Rick Edgecombe wrote:
> Subject: Re: [PATCH v5 07/39] x86: Add user control-protection fault handler

Subject: x86/shstk: Add...

> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> A control-protection fault is triggered when a control-flow transfer
> attempt violates Shadow Stack or Indirect Branch Tracking constraints.
> For example, the return address for a RET instruction differs from the copy
> on the shadow stack.

...

> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> new file mode 100644
> index 000000000000..33d7d119be26
> --- /dev/null
> +++ b/arch/x86/kernel/cet.c
> @@ -0,0 +1,152 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/ptrace.h>
> +#include <asm/bugs.h>
> +#include <asm/traps.h>
> +
> +enum cp_error_code {
> +	CP_EC        = (1 << 15) - 1,

That looks like a mask, so

	CP_EC_MASK

I guess.

> +
> +	CP_RET       = 1,
> +	CP_IRET      = 2,
> +	CP_ENDBR     = 3,
> +	CP_RSTRORSSP = 4,
> +	CP_SETSSBSY  = 5,
> +
> +	CP_ENCL	     = 1 << 15,
> +};

...

> +static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
> +{
> +	struct task_struct *tsk;
> +	unsigned long ssp;
> +
> +	/*
> +	 * An exception was just taken from userspace. Since interrupts are disabled
> +	 * here, no scheduling should have messed with the registers yet and they
> +	 * will be whatever is live in userspace. So read the SSP before enabling
> +	 * interrupts so locking the fpregs to do it later is not required.
> +	 */
> +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +
> +	cond_local_irq_enable(regs);
> +
> +	tsk = current;

Hmm, should you read current before you enable interrupts? Not that it
changes from under us...

> +	tsk->thread.error_code = error_code;
> +	tsk->thread.trap_nr = X86_TRAP_CP;
> +
> +	/* Ratelimit to prevent log spamming. */
> +	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> +	    __ratelimit(&cpf_rate)) {
> +		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
> +			 tsk->comm, task_pid_nr(tsk),
> +			 regs->ip, regs->sp, ssp, error_code,
> +			 cp_err_string(error_code),
> +			 error_code & CP_ENCL ? " in enclave" : "");
> +		print_vma_addr(KERN_CONT " in ", regs->ip);
> +		pr_cont("\n");
> +	}
> +
> +	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> +	cond_local_irq_disable(regs);
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v5 07/39] x86: Add user control-protection fault handler
  2023-02-03 19:09   ` Borislav Petkov
@ 2023-02-03 19:24     ` Edgecombe, Rick P
  2023-02-03 19:44       ` Borislav Petkov
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-02-03 19:24 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	mtk.manpages, corbet, linux-kernel, linux-api, gorcunov

On Fri, 2023-02-03 at 20:09 +0100, Borislav Petkov wrote:
> On Thu, Jan 19, 2023 at 01:22:45PM -0800, Rick Edgecombe wrote:
> > Subject: Re: [PATCH v5 07/39] x86: Add user control-protection
> > fault handler
> 
> Subject: x86/shstk: Add...

Sure.

> 
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > A control-protection fault is triggered when a control-flow
> > transfer
> > attempt violates Shadow Stack or Indirect Branch Tracking
> > constraints.
> > For example, the return address for a RET instruction differs from
> > the copy
> > on the shadow stack.
> 
> ...
> 
> > diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> > new file mode 100644
> > index 000000000000..33d7d119be26
> > --- /dev/null
> > +++ b/arch/x86/kernel/cet.c
> > @@ -0,0 +1,152 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/ptrace.h>
> > +#include <asm/bugs.h>
> > +#include <asm/traps.h>
> > +
> > +enum cp_error_code {
> > +	CP_EC        = (1 << 15) - 1,
> 
> That looks like a mask, so
> 
> 	CP_EC_MASK
> 
> I guess.

The name seems better, but this is actually from the existing kernel
IBT control protection exception code. So it seems like a separate
change. Would you like to see it snuck into the user shadow stack
handler, or could we leave this for future cleanups?

Kees pointed out that adding to the handler and moving it in the same
patch makes it difficult to see where the changes are. I'm splitting
this one into two patches for the next version.
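
For reference, a minimal userspace sketch of the mask reading discussed
above. It assumes, as the enum layout suggests, that the low 15 bits of
the #CP error code carry the fault type and bit 15 flags an enclave.
CP_EC_MASK is the name proposed in review, and the two helper functions
are hypothetical, not part of the patch:

```c
#include <stdbool.h>

/* Values copied from the quoted enum; CP_EC_MASK is the suggested rename. */
enum cp_error_code {
	CP_EC_MASK   = (1 << 15) - 1,

	CP_RET       = 1,
	CP_IRET      = 2,
	CP_ENDBR     = 3,
	CP_RSTRORSSP = 4,
	CP_SETSSBSY  = 5,

	CP_ENCL      = 1 << 15,
};

/* Hypothetical helper: strip the enclave bit, leaving the fault type. */
static unsigned long cp_fault_type(unsigned long error_code)
{
	return error_code & CP_EC_MASK;
}

/* Hypothetical helper: report whether the enclave bit is set. */
static bool cp_in_enclave(unsigned long error_code)
{
	return (error_code & CP_ENCL) != 0;
}
```

With the mask spelled out, a call such as cp_fault_type(CP_ENCL | CP_RET)
yields CP_RET, which is why the _MASK suffix reads more naturally than
plain CP_EC.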

> 
> > +
> > +	CP_RET       = 1,
> > +	CP_IRET      = 2,
> > +	CP_ENDBR     = 3,
> > +	CP_RSTRORSSP = 4,
> > +	CP_SETSSBSY  = 5,
> > +
> > +	CP_ENCL	     = 1 << 15,
> > +};
> 
> ...
> 
> > +static void do_user_cp_fault(struct pt_regs *regs, unsigned long
> > error_code)
> > +{
> > +	struct task_struct *tsk;
> > +	unsigned long ssp;
> > +
> > +	/*
> > +	 * An exception was just taken from userspace. Since interrupts
> > are disabled
> > +	 * here, no scheduling should have messed with the registers
> > yet and they
> > +	 * will be whatever is live in userspace. So read the SSP
> > before enabling
> > +	 * interrupts so locking the fpregs to do it later is not
> > required.
> > +	 */
> > +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
> > +
> > +	cond_local_irq_enable(regs);
> > +
> > +	tsk = current;
> 
> Hmm, should you read current before you enable interrupts? Not that
> it
> changes from under us...

I think we have to read it before we enable interrupts or use
fpregs_lock(). So reading it before saves disabling preemption later.

> 
> > +	tsk->thread.error_code = error_code;
> > +	tsk->thread.trap_nr = X86_TRAP_CP;
> > +
> > +	/* Ratelimit to prevent log spamming. */
> > +	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> > +	    __ratelimit(&cpf_rate)) {
> > +		pr_emerg("%s[%d] control protection ip:%lx sp:%lx
> > ssp:%lx error:%lx(%s)%s",
> > +			 tsk->comm, task_pid_nr(tsk),
> > +			 regs->ip, regs->sp, ssp, error_code,
> > +			 cp_err_string(error_code),
> > +			 error_code & CP_ENCL ? " in enclave" : "");
> > +		print_vma_addr(KERN_CONT " in ", regs->ip);
> > +		pr_cont("\n");
> > +	}
> > +
> > +	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> > +	cond_local_irq_disable(regs);
> > +}
> 
> 


* Re: [PATCH v5 07/39] x86: Add user control-protection fault handler
  2023-02-03 19:24     ` Edgecombe, Rick P
@ 2023-02-03 19:44       ` Borislav Petkov
  2023-02-03 23:01         ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: Borislav Petkov @ 2023-02-03 19:44 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	mtk.manpages, corbet, linux-kernel, linux-api, gorcunov

On Fri, Feb 03, 2023 at 07:24:08PM +0000, Edgecombe, Rick P wrote:
> The name seems better, but this is actually from the existing kernel
> IBT control protection exception code. So it seems like a separate
> change. Would you like to see it snuck into the user shadow stack
> handler, or could we leave this for future cleanups?
> 
> Kees pointed out that adding to the handler and moving it in the same
> patch makes it difficult to see where the changes are. I'm splitting
> this one into two patches for the next version.

Yap, that's the right way to do it.

> I think we have to read it before we enable interrupts or use
> fpregs_lock(). So reading it before saves disabling preemption later.

So I'm a bit confused - there's that cond_local_irq_enable() which will
enable interrupts if they were enabled before.

So if they were enabled before and you reenable them here, then that
current could be the wrong one if we schedule in between, right?

IOW, shouldn't those two lines be swapped so that it says:

        tsk = current;

        cond_local_irq_enable(regs);

and you can be sure that tsk is always the right current which caused
the #CP? Or am I way off again?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v5 07/39] x86: Add user control-protection fault handler
  2023-02-03 19:44       ` Borislav Petkov
@ 2023-02-03 23:01         ` Edgecombe, Rick P
  2023-02-04 10:37           ` Borislav Petkov
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-02-03 23:01 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, andrew.cooper3, john.allen, linux-doc, rppt, mingo,
	mtk.manpages, corbet, linux-kernel, linux-api, gorcunov

On Fri, 2023-02-03 at 20:44 +0100, Borislav Petkov wrote:
> > I think we have to read it before we enable interrupts or use
> > fpregs_lock(). So reading it before saves disabling preemption
> > later.
> 
> So I'm a bit confused - there's that cond_local_irq_enable() which
> will
> enable interrupts if they were enabled before.
> 
> So if they were enabled before and you reenable them here, then that
> current could be the wrong one if we schedule in between, right?
> 
> IOW, shouldn't those two lines be swapped so that it says:
> 
>         tsk = current;
> 
>         cond_local_irq_enable(regs);
> 
> and you can be sure that tsk is always the right current which caused
> the #CP? Or am I way off again?

Since this path is only for exceptions coming from userspace, I think
it should be valid either way. It can't be during a task switch.
I can swap the lines if it looks odd, but unless I'm wrong about the
'current' validity I think it's negligibly better as is because it is
preemptible for as long as possible.


* Re: [PATCH v5 07/39] x86: Add user control-protection fault handler
  2023-02-03 23:01         ` Edgecombe, Rick P
@ 2023-02-04 10:37           ` Borislav Petkov
  0 siblings, 0 replies; 120+ messages in thread
From: Borislav Petkov @ 2023-02-04 10:37 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, andrew.cooper3, john.allen, linux-doc, rppt, mingo,
	mtk.manpages, corbet, linux-kernel, linux-api, gorcunov

On Fri, Feb 03, 2023 at 11:01:42PM +0000, Edgecombe, Rick P wrote:
> Since this path is only for exceptions coming from userspace, I think
> it should be valid either way. It can't be during a task switch.
> I can swap the lines if it looks odd, but unless I'm wrong about the
> 'current' validity I think it's negligibly better as is because it is
> preemptible for as long as possible.

Nah, all good. I was confused here. Sorry for the noise.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW
  2023-01-19 21:22 ` [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW Rick Edgecombe
  2023-01-20  0:57   ` Kees Cook
@ 2023-02-09 14:08   ` Borislav Petkov
  2023-02-09 17:09     ` Edgecombe, Rick P
  1 sibling, 1 reply; 120+ messages in thread
From: Borislav Petkov @ 2023-02-09 14:08 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

On Thu, Jan 19, 2023 at 01:22:49PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The Write=0,Dirty=1 PTE has been used to indicate copy-on-write pages.
> However, newer x86 processors also regard a Write=0,Dirty=1 PTE as a
> shadow stack page. In order to separate the two, the software-defined
> _PAGE_DIRTY is changed to _PAGE_COW for the copy-on-write case, and
> pte_*() are updated to do this.

"In order to separate the two, change the software-defined ..."

From section "2) Describe your changes" in
Documentation/process/submitting-patches.rst:

"Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
to do frotz", as if you are giving orders to the codebase to change
its behaviour."

> +static inline pte_t __pte_mkdirty(pte_t pte, bool soft)
> +{
> +	pteval_t dirty = _PAGE_DIRTY;
> +
> +	if (soft)
> +		dirty |= _PAGE_SOFT_DIRTY;
> +
> +	return pte_set_flags(pte, dirty);
> +}

Dunno, do you even need that __pte_mkdirty() helper?

AFAIU, pte_mkdirty() will always set _PAGE_SOFT_DIRTY too so whatever
the __pte_mkdirty() thing needs to do, you can simply do it by foot in
the two callsites.

And this way you won't have the confusion: should I use pte_mkdirty() or
__pte_mkdirty()?

Ditto for the pmd variants.

Otherwise, this is starting to make more sense now.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW
  2023-02-09 14:08   ` Borislav Petkov
@ 2023-02-09 17:09     ` Edgecombe, Rick P
  2023-02-10 13:57       ` Borislav Petkov
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-02-09 17:09 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Thu, 2023-02-09 at 15:08 +0100, Borislav Petkov wrote:
> On Thu, Jan 19, 2023 at 01:22:49PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > The Write=0,Dirty=1 PTE has been used to indicate copy-on-write
> > pages.
> > However, newer x86 processors also regard a Write=0,Dirty=1 PTE as
> > a
> > shadow stack page. In order to separate the two, the software-
> > defined
> > _PAGE_DIRTY is changed to _PAGE_COW for the copy-on-write case, and
> > pte_*() are updated to do this.
> 
> "In order to separate the two, change the software-defined ..."
> 
> From section "2) Describe your changes" in
> Documentation/process/submitting-patches.rst:
> 
> "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
> instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
> to do frotz", as if you are giving orders to the codebase to change
> its behaviour."

Yea, this is ambiguous. It's actually trying to say that "the software-
defined..." *were* changed in previous patches. I'll change it to make
that clear.

> 
> > +static inline pte_t __pte_mkdirty(pte_t pte, bool soft)
> > +{
> > +     pteval_t dirty = _PAGE_DIRTY;
> > +
> > +     if (soft)
> > +             dirty |= _PAGE_SOFT_DIRTY;
> > +
> > +     return pte_set_flags(pte, dirty);
> > +}
> 
> Dunno, do you even need that __pte_mkdirty() helper?
> 
> AFAIU, pte_mkdirty() will always set _PAGE_SOFT_DIRTY too so whatever
> the __pte_mkdirty() thing needs to do, you can simply do it by foot
> in
> the two callsites.
> 
> And this way you won't have the confusion: should I use pte_mkdirty()
> or
> __pte_mkdirty()?
> 
> Ditto for the pmd variants.
> 
> Otherwise, this is starting to make more sense now.

The thing is, it would need to duplicate the pte_write() and shadow
stack enablement check and know when to set the Cow (soon to be
SavedDirty) bit.

I see that having a similar helper is not ideal, but isn't it nice that
this special critical logic for setting the Cow bit is all in one
place? I actually tried it the other way, but thought that it was nicer
to have a helper that might drive future people to not miss the Cow bit
part.

What do you think, can we leave it or give it a new name? Maybe
pte_set_dirty() to be more like the x86-only pte_set_flags() family of
functions? Then we have:
static inline pte_t pte_mkdirty(pte_t pte)
{
	pte = pte_set_flags(pte, _PAGE_SOFT_DIRTY);

	return pte_set_dirty(pte);
}

And...
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
...
	/*
	 * Dirty bit is not preserved above so it can be done
	 * in a special way for the shadow stack case, where it
	 * may need to set _PAGE_SAVED_DIRTY. pte_set_dirty() will do
	 * this in the case of shadow stack.
	 */
	if (oldval & _PAGE_DIRTY)
		pte_result = pte_set_dirty(pte_result);

	return pte_result;
}


* Re: [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW
  2023-02-09 17:09     ` Edgecombe, Rick P
@ 2023-02-10 13:57       ` Borislav Petkov
  2023-02-10 17:00         ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: Borislav Petkov @ 2023-02-10 13:57 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Thu, Feb 09, 2023 at 05:09:15PM +0000, Edgecombe, Rick P wrote:
> What do you think, can we leave it or give it a new name? Maybe
> pte_set_dirty() to be more like the x86-only pte_set_flags() family of
> functions?

I'd do this (ontop of yours, not built, not tested, etc). It is short
and sweet:

pte_mkdirty() sets both dirty flags
pte_modify() sets only _PAGE_DIRTY

No special helpers to lookup what they do, no nothing. Plain and simple.

---
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7942eff2af50..8ba37380966c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -392,19 +392,10 @@ static inline pte_t pte_mkexec(pte_t pte)
 	return pte_clear_flags(pte, _PAGE_NX);
 }
 
-static inline pte_t __pte_mkdirty(pte_t pte, bool soft)
-{
-	pteval_t dirty = _PAGE_DIRTY;
-
-	if (soft)
-		dirty |= _PAGE_SOFT_DIRTY;
-
-	return pte_set_flags(pte, dirty);
-}
-
+/* Set _PAGE_SOFT_DIRTY for shadow stack pages. */
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return __pte_mkdirty(pte, true);
+	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -749,14 +740,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 
 	pte_result = __pte(val);
 
-	/*
-	 * Dirty bit is not preserved above so it can be done
-	 * in a special way for the shadow stack case, where it
-	 * may need to set _PAGE_COW. __pte_mkdirty() will do this in
-	 * the case of shadow stack.
-	 */
 	if (pte_dirty(pte))
-		pte_result = __pte_mkdirty(pte_result, false);
+		pte_result = pte_set_flags(pte_result, _PAGE_DIRTY);
 
 	return pte_result;
 }



-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW
  2023-02-10 13:57       ` Borislav Petkov
@ 2023-02-10 17:00         ` Edgecombe, Rick P
  2023-02-17 16:11           ` Borislav Petkov
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-02-10 17:00 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, andrew.cooper3, john.allen, linux-doc, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Fri, 2023-02-10 at 14:57 +0100, Borislav Petkov wrote:
> @@ -749,14 +740,8 @@ static inline pte_t pte_modify(pte_t pte,
> pgprot_t newprot)
>  
>         pte_result = __pte(val);
>  
> -       /*
> -        * Dirty bit is not preserved above so it can be done
> -        * in a special way for the shadow stack case, where it
> -        * may need to set _PAGE_COW. __pte_mkdirty() will do this in
> -        * the case of shadow stack.
> -        */
>         if (pte_dirty(pte))
> -               pte_result = __pte_mkdirty(pte_result, false);
> +               pte_result = pte_set_flags(pte_result, _PAGE_DIRTY);
>  
>         return pte_result;
>  }
> 

Oh, I see what you are seeing now. Did you notice that the
__pte_mkdirty() logic got expanded in "x86/mm: Start actually marking
_PAGE_COW"? So if we don't put that logic in a usable helper, it ends
up open coded with pte_modify() looking something like this:
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
	pteval_t val = pte_val(pte), oldval = val;
	pte_t pte_result;

	/*
	 * Chop off the NX bit (if present), and add the NX portion of
	 * the newprot (if present):
	 */
	val &= (_PAGE_CHG_MASK & ~_PAGE_DIRTY);
	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);

	pte_result = __pte(val);

	/*
	 * Dirty bit is not preserved above so it can be done
	 * in a special way for the shadow stack case, where it
	 * may need to set _PAGE_SAVED_DIRTY. __pte_mkdirty() will do
	 * this in the case of shadow stack.
	 */
	if (oldval & _PAGE_DIRTY) {
		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) &&
		    !pte_write(pte_result))
			pte_result = pte_set_flags(pte_result,
						   _PAGE_SAVED_DIRTY);
		else
			pte_result = pte_set_flags(pte_result, _PAGE_DIRTY);
	}

	return pte_result;
}

So the later logic of doing the _PAGE_SAVED_DIRTY (_PAGE_COW) part is
not centralized. Is that OK?
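
To make the trade-off concrete, here is a small userspace model of the
centralized variant. It is only a sketch: the bit positions are
illustrative placeholders rather than the kernel's definitions, the
model_* names are hypothetical, and user_shstk_enabled stands in for
cpu_feature_enabled(X86_FEATURE_USER_SHSTK). The dirty check mirrors the
"oldval & _PAGE_DIRTY" test in the snippet above:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

/* Placeholder bit positions, chosen for illustration only. */
#define PAGE_RW          (1ULL << 1)
#define PAGE_DIRTY       (1ULL << 6)
#define PAGE_SOFT_DIRTY  (1ULL << 11)
#define PAGE_SAVED_DIRTY (1ULL << 58)

/* Stand-in for cpu_feature_enabled(X86_FEATURE_USER_SHSTK). */
static bool user_shstk_enabled = true;

/*
 * The one place that picks between Dirty and SavedDirty: a
 * non-writable entry must not end up Write=0,Dirty=1 when shadow
 * stack is enabled, so it gets the software bit instead.
 */
static pteval_t model_set_dirty(pteval_t val)
{
	if (user_shstk_enabled && !(val & PAGE_RW))
		return val | PAGE_SAVED_DIRTY;
	return val | PAGE_DIRTY;
}

/* pte_mkdirty() analogue: soft-dirty plus the shared helper. */
static pteval_t model_mkdirty(pteval_t val)
{
	return model_set_dirty(val | PAGE_SOFT_DIRTY);
}

/* pte_modify() analogue: carry dirtiness over, leave soft-dirty alone. */
static pteval_t model_modify_dirty(pteval_t oldval, pteval_t newval)
{
	if (oldval & PAGE_DIRTY)
		newval = model_set_dirty(newval);
	return newval;
}
```

Both callers go through model_set_dirty(), so the write/shadow-stack
check exists once. Open coding it would repeat that check in each
caller.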



* Re: [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly
  2023-01-19 21:22 ` [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
  2023-01-20  1:01   ` Kees Cook
@ 2023-02-14  0:09   ` Deepak Gupta
  2023-02-14  1:07     ` Edgecombe, Rick P
  1 sibling, 1 reply; 120+ messages in thread
From: Deepak Gupta @ 2023-02-14  0:09 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	Yu-cheng Yu

I have a general question about the outcome of the discussion on how to
handle `pte_mkwrite`, so I am top-posting.

I posted patches yesterday targeting the riscv zisslpcfi extension.
https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/

Since the extensions are similar, the patches are similar too.
One of the similarities was updating `maybe_mkwrite`. I was asked (by
dhildenb on my patch #11) to look at how x86 approaches this, so that the
core-mm approach fits multiple architectures, along with the need to
update `pte_mkwrite` to consume vma flags.
In the x86 CET patch series, I see that the locations where `pte_mkwrite`
is invoked are updated to check for a shadow stack vma, and that
`pte_mkwrite` itself is not necessarily updated to consume vma flags. Let
me know if my understanding is correct and whether that's the current
direction (updating the call sites where `pte_mkwrite` is invoked to
check the vma).

That being said, as I mentioned in my patch series, there are
similarities between x86, arm and now riscv in implementing shadow stack
and indirect branch tracking, so overall it would be a good thing if we
could collaborate and come up with common bits.


Rest inline.


On Thu, Jan 19, 2023 at 01:22:57PM -0800, Rick Edgecombe wrote:
>From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
>The x86 Control-flow Enforcement Technology (CET) feature includes a new
>type of memory called shadow stack. This shadow stack memory has some
>unusual properties, which requires some core mm changes to function
>properly.
>
>With the introduction of shadow stack memory there are two ways a pte can
>be writable: regular writable memory and shadow stack memory.
>
>In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
>or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
>where a PTE is made writable. However, there are places where pte_mkwrite()
>is called directly and the logic should now also create a shadow stack PTE
>in the case of a shadow stack VMA.
>
>- do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
>  directly and call pte_mkwrite(). Teach it about pte_mkwrite_shstk()
>
>- When userfaultfd is creating a PTE after userspace handles the fault
>  it calls pte_mkwrite() directly. Teach it about pte_mkwrite_shstk()
>
>To make the code cleaner, introduce is_shstk_write() which simplifies
>checking for VM_WRITE | VM_SHADOW_STACK together.
>
>In other cases where pte_mkwrite() is called directly, the VMA will not
>be VM_SHADOW_STACK, and so shadow stack memory should not be created.
> - In the case of pte_savedwrite(), shadow stack VMA's are excluded.
> - In the case of the "dirty_accountable" optimization in mprotect(),
>   shadow stack VMA's won't be VM_SHARED, so it is not necessary.
>
>Tested-by: Pengfei Xu <pengfei.xu@intel.com>
>Tested-by: John Allen <john.allen@amd.com>
>Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
>Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>Cc: Kees Cook <keescook@chromium.org>
>---
>
>v5:
> - Fix typo in commit log
>
>v3:
> - Restore do_anonymous_page() that accidentally moved commits (Kirill)
> - Open code maybe_mkwrite() cases from v2, so the behavior doesn't change
>   to mark that non-writable PTEs dirty. (Nadav)
>
>v2:
> - Updated commit log with comments from Dave Hansen
> - Dave also suggested (I understood) to maybe tweak vm_get_page_prot()
>   to avoid having to call maybe_mkwrite(). After playing around with
>   this I opted to *not* do this. Shadow stack memory is
>   effectively writable, so having the default permissions be writable
>   ended up mapping the zero page as writable and other surprises. So
>   creating shadow stack memory needs to be done with manual logic
>   like pte_mkwrite().
> - Drop change in change_pte_range() because it couldn't actually trigger
>   for shadow stack VMAs.
> - Clarify reasoning for skipped cases of pte_mkwrite().
>
>Yu-cheng v25:
> - Apply same changes to do_huge_pmd_numa_page() as to do_numa_page().
>
> arch/x86/include/asm/pgtable.h |  3 +++
> arch/x86/mm/pgtable.c          |  6 ++++++
> include/linux/pgtable.h        |  7 +++++++
> mm/memory.c                    |  5 ++++-
> mm/migrate_device.c            |  4 +++-
> mm/userfaultfd.c               | 10 +++++++---
> 6 files changed, 30 insertions(+), 5 deletions(-)
>
>diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>index 45b1a8f058fe..87d3068734ec 100644
>--- a/arch/x86/include/asm/pgtable.h
>+++ b/arch/x86/include/asm/pgtable.h
>@@ -951,6 +951,9 @@ static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
> }
> #endif  /* CONFIG_PAGE_TABLE_ISOLATION */
>
>+#define is_shstk_write is_shstk_write
>+extern bool is_shstk_write(unsigned long vm_flags);
>+
> #endif	/* __ASSEMBLY__ */
>
>
>diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>index e4f499eb0f29..d103945ba502 100644
>--- a/arch/x86/mm/pgtable.c
>+++ b/arch/x86/mm/pgtable.c
>@@ -880,3 +880,9 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
>
> #endif /* CONFIG_X86_64 */
> #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
>+
>+bool is_shstk_write(unsigned long vm_flags)
>+{
>+	return (vm_flags & (VM_SHADOW_STACK | VM_WRITE)) ==
>+	       (VM_SHADOW_STACK | VM_WRITE);
>+}

Can we call this function something along the lines of `is_shadow_stack_vma`?
The reason being that we're actually checking for a vma property here.

Also, can we move this into common code? Common code can then further call
`arch_is_shadow_stack_vma`, and each arch can implement its own shadow
stack encoding. I see that x86 is using one of the arch bits. The current riscv
implementation uses the presence of only `VM_WRITE` as the shadow stack encoding.

Please see patch #11 and #12 in the series I posted (URL at the top of
this e-mail).
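
A rough userspace sketch of that layering, just to make the suggestion
concrete. This is not code from either series: the flag value used for
VM_SHADOW_STACK is a placeholder, and `arch_is_shadow_stack_vma` /
`is_shadow_stack_vma` are only the names proposed above. The arch side
mirrors the is_shstk_write() check from the patch, and the fallback uses
the same "#define name name" override pattern the patch already uses.

```c
#include <stdbool.h>

/* VM_WRITE matches the kernel value; VM_SHADOW_STACK is a placeholder. */
#define VM_WRITE        0x2UL
#define VM_SHADOW_STACK 0x100UL	/* stands in for an arch-specific bit */

/*
 * Arch header (x86-like): announce the override with a same-name macro,
 * then require both bits, as is_shstk_write() does in the patch.
 */
#define arch_is_shadow_stack_vma arch_is_shadow_stack_vma
static inline bool arch_is_shadow_stack_vma(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHADOW_STACK | VM_WRITE)) ==
	       (VM_SHADOW_STACK | VM_WRITE);
}

/* Common header: fallback for arches without shadow stack support. */
#ifndef arch_is_shadow_stack_vma
static inline bool arch_is_shadow_stack_vma(unsigned long vm_flags)
{
	return false;
}
#endif

/* Common-code helper that call sites would use. */
static inline bool is_shadow_stack_vma(unsigned long vm_flags)
{
	return arch_is_shadow_stack_vma(vm_flags);
}
```

An arch with a different encoding would only replace the body of
`arch_is_shadow_stack_vma`; the call sites in core mm would stay the same.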


>diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>index 14a820a45a37..49ce1f055242 100644
>--- a/include/linux/pgtable.h
>+++ b/include/linux/pgtable.h
>@@ -1578,6 +1578,13 @@ static inline bool arch_has_pfn_modify_check(void)
> }
> #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
>
>+#ifndef is_shstk_write
>+static inline bool is_shstk_write(unsigned long vm_flags)
>+{
>+	return false;
>+}
>+#endif
>+
> /*
>  * Architecture PAGE_KERNEL_* fallbacks
>  *
>diff --git a/mm/memory.c b/mm/memory.c
>index aad226daf41b..5e5107232a26 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -4088,7 +4088,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>
> 	entry = mk_pte(page, vma->vm_page_prot);
> 	entry = pte_sw_mkyoung(entry);
>-	if (vma->vm_flags & VM_WRITE)
>+
>+	if (is_shstk_write(vma->vm_flags))
>+		entry = pte_mkwrite_shstk(pte_mkdirty(entry));
>+	else if (vma->vm_flags & VM_WRITE)
> 		entry = pte_mkwrite(pte_mkdirty(entry));
>
> 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>index 721b2365dbca..53d417683e01 100644
>--- a/mm/migrate_device.c
>+++ b/mm/migrate_device.c
>@@ -645,7 +645,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> 			goto abort;
> 		}
> 		entry = mk_pte(page, vma->vm_page_prot);
>-		if (vma->vm_flags & VM_WRITE)
>+		if (is_shstk_write(vma->vm_flags))
>+			entry = pte_mkwrite_shstk(pte_mkdirty(entry));
>+		else if (vma->vm_flags & VM_WRITE)
> 			entry = pte_mkwrite(pte_mkdirty(entry));
> 	}
>
>diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>index 0499907b6f1a..832f0250ca61 100644
>--- a/mm/userfaultfd.c
>+++ b/mm/userfaultfd.c
>@@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> 	int ret;
> 	pte_t _dst_pte, *dst_pte;
> 	bool writable = dst_vma->vm_flags & VM_WRITE;
>+	bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
> 	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> 	bool page_in_cache = page_mapping(page);
> 	spinlock_t *ptl;
>@@ -84,9 +85,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> 		writable = false;
> 	}
>
>-	if (writable)
>-		_dst_pte = pte_mkwrite(_dst_pte);
>-	else
>+	if (writable) {
>+		if (shstk)
>+			_dst_pte = pte_mkwrite_shstk(_dst_pte);
>+		else
>+			_dst_pte = pte_mkwrite(_dst_pte);
>+	} else
> 		/*
> 		 * We need this to make sure write bit removed; as mk_pte()
> 		 * could return a pte with write bit set.
>-- 
>2.17.1
>



* Re: [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly
  2023-02-14  0:09   ` Deepak Gupta
@ 2023-02-14  1:07     ` Edgecombe, Rick P
  2023-02-14  6:10       ` Deepak Gupta
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-02-14  1:07 UTC (permalink / raw)
  To: debug
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, pavel, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	Yang, Weijiang, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov, akpm

On Mon, 2023-02-13 at 16:09 -0800, Deepak Gupta wrote:
> Since I've a general question on outcome of discussion of how to
> handle
> `pte_mkwrite`, so I am top posting.
> 
> I have posted patches yesterday targeting riscv zisslpcfi extension.
> 
https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
> 
> Since there're similarities in extension(s), patches have similarity
> too.
> One of the similarity was updating `maybe_mkwrite`. I was asked (by
> dhildenb
> on my patch #11) to look at x86 approach on how to approach this so
> that
> core-mm approach fits multiple architectures along with the need to
> update `pte_mkwrite` to consume vma flags.
> In x86 CET patch series, I see that locations where `pte_mkwrite` is
> invoked are updated to check for shadow stack vma and not necessarily
> `pte_mkwrite` itself is updated to consume vma flags. Let me know if
> my
> understanding is correct and that's the current direction (to update
> call sites for vma check where `pte_mkwrite` is invoked)
> 
> Being said that as I've mentioned in my patch series that there're
> similarities between x86, arm and now riscv for implementing shadow
> stack
> and indirect branch tracking, overall it'll be a good thing if we can
> collaborate and come up with common bits.

Oh interesting. I've made the changes to have pte_mkwrite() take a VMA.
It seems to work pretty well with the core MM code, but I'm letting
0-day chew on it for a bit because it touched so many arches. I'll
include you when I send it out, hopefully later this week.
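
As a rough illustration of the idea (a userspace model, not the actual
change): the types below are simplified stand-ins, `pte_mkwrite_plain()`
is a made-up name for the conventional-write helper, and the
VM_SHADOW_STACK value is a placeholder. The bit positions are the x86
Write (bit 1) and Dirty (bit 6) bits, and the shadow stack encoding is
assumed to be Write=0,Dirty=1.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel types, for illustration only. */
typedef struct { uint64_t val; } pte_t;
struct vm_area_struct { unsigned long vm_flags; };

#define VM_WRITE        0x2UL
#define VM_SHADOW_STACK 0x100UL		/* placeholder value */
#define PTE_RW          (1ULL << 1)	/* x86 Write bit */
#define PTE_DIRTY       (1ULL << 6)	/* x86 Dirty bit */

/* Conventionally writable: set the Write bit (hypothetical helper name). */
static inline pte_t pte_mkwrite_plain(pte_t pte)
{
	pte.val |= PTE_RW;
	return pte;
}

/* Shadow stack writable: Write=0, Dirty=1. */
static inline pte_t pte_mkwrite_shstk(pte_t pte)
{
	pte.val &= ~PTE_RW;
	pte.val |= PTE_DIRTY;
	return pte;
}

/* The arch picks the encoding from the VMA, so callers need no check. */
static inline pte_t pte_mkwrite(pte_t pte, const struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk(pte);
	return pte_mkwrite_plain(pte);
}
```

With something like this, a call site such as do_anonymous_page() keeps
its single VM_WRITE check and passes the VMA along, instead of growing
an extra shadow stack branch.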

From just a quick look, I see some design aspects that have been
problematic on the x86 implementation.

There was something like PROT_SHADOW_STACK before, but there were two
problems:
1. Writable windows while provisioning restore tokens (maybe this is
just an x86 thing)
2. Adding guard pages when a shadow stack was mprotect()ed to change it
from writable to shadow stack. Again this might be an x86 need, since
it needed to have it writable to add a restore token, and the guard
pages help with security.

So instead this series creates a map_shadow_stack syscall that maps a
shadow stack and writes the token from the kernel side. Then mprotect()
is prevented from making shadow stacks conventionally writable.

Another difference is enabling shadow stack based on elf header bits
instead of the arch_prctl()s. See the history and reasoning here
(section "Switch Enabling Interface"):

https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/

Not sure if those two issues would be problems on riscv or not.

As for sharing the prctl() interface: the other thing is that x86 also has
this "wrss" instruction that can be enabled with shadow stack. The
current arch_prctl() interface supports both. I'm thinking it's
probably a pretty arch-specific thing.

ABI-wise, are you planning to automatically allocate shadow stacks for
new tasks? If the ABI is completely different it might be best to not
share user interfaces. But also, I wonder why it is different.

> 
> 
> Rest inline.
> 
> 
> On Thu, Jan 19, 2023 at 01:22:57PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > The x86 Control-flow Enforcement Technology (CET) feature includes
> > a new
> > type of memory called shadow stack. This shadow stack memory has
> > some
> > unusual properties, which requires some core mm changes to function
> > properly.
> > 
> > With the introduction of shadow stack memory there are two ways a
> > pte can
> > be writable: regular writable memory and shadow stack memory.
> > 
> > In past patches, maybe_mkwrite() has been updated to apply
> > pte_mkwrite()
> > or pte_mkwrite_shstk() depending on the VMA flag. This covers most
> > cases
> > where a PTE is made writable. However, there are places where
> > pte_mkwrite()
> > is called directly and the logic should now also create a shadow
> > stack PTE
> > in the case of a shadow stack VMA.
> > 
> > - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
> >  directly and call pte_mkwrite(). Teach it about
> > pte_mkwrite_shstk()
> > 
> > - When userfaultfd is creating a PTE after userspace handles the
> > fault
> >  it calls pte_mkwrite() directly. Teach it about
> > pte_mkwrite_shstk()
> > 
> > To make the code cleaner, introduce is_shstk_write() which
> > simplifies
> > checking for VM_WRITE | VM_SHADOW_STACK together.
> > 
> > In other cases where pte_mkwrite() is called directly, the VMA will
> > not
> > be VM_SHADOW_STACK, and so shadow stack memory should not be
> > created.
> > - In the case of pte_savedwrite(), shadow stack VMA's are excluded.
> > - In the case of the "dirty_accountable" optimization in
> > mprotect(),
> >   shadow stack VMA's won't be VM_SHARED, so it is not necessary.
> > 
> > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> > Tested-by: John Allen <john.allen@amd.com>
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > ---
> > 
> > v5:
> > - Fix typo in commit log
> > 
> > v3:
> > - Restore do_anonymous_page() that accidetally moved commits
> > (Kirill)
> > - Open code maybe_mkwrite() cases from v2, so the behavior doesn't
> > change
> >   to mark that non-writable PTEs dirty. (Nadav)
> > 
> > v2:
> > - Updated commit log with comment's from Dave Hansen
> > - Dave also suggested (I understood) to maybe tweak
> > vm_get_page_prot()
> >   to avoid having to call maybe_mkwrite(). After playing around
> > with
> >   this I opted to *not* do this. Shadow stack memory memory is
> >   effectively writable, so having the default permissions be
> > writable
> >   ended up mapping the zero page as writable and other surprises.
> > So
> >   creating shadow stack memory needs to be done with manual logic
> >   like pte_mkwrite().
> > - Drop change in change_pte_range() because it couldn't actually
> > trigger
> >   for shadow stack VMAs.
> > - Clarify reasoning for skipped cases of pte_mkwrite().
> > 
> > Yu-cheng v25:
> > - Apply same changes to do_huge_pmd_numa_page() as to
> > do_numa_page().
> > 
> > arch/x86/include/asm/pgtable.h |  3 +++
> > arch/x86/mm/pgtable.c          |  6 ++++++
> > include/linux/pgtable.h        |  7 +++++++
> > mm/memory.c                    |  5 ++++-
> > mm/migrate_device.c            |  4 +++-
> > mm/userfaultfd.c               | 10 +++++++---
> > 6 files changed, 30 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/pgtable.h
> > b/arch/x86/include/asm/pgtable.h
> > index 45b1a8f058fe..87d3068734ec 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -951,6 +951,9 @@ static inline pgd_t pti_set_user_pgtbl(pgd_t
> > *pgdp, pgd_t pgd)
> > }
> > #endif  /* CONFIG_PAGE_TABLE_ISOLATION */
> > 
> > +#define is_shstk_write is_shstk_write
> > +extern bool is_shstk_write(unsigned long vm_flags);
> > +
> > #endif	/* __ASSEMBLY__ */
> > 
> > 
> > diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> > index e4f499eb0f29..d103945ba502 100644
> > --- a/arch/x86/mm/pgtable.c
> > +++ b/arch/x86/mm/pgtable.c
> > @@ -880,3 +880,9 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long
> > addr)
> > 
> > #endif /* CONFIG_X86_64 */
> > #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
> > +
> > +bool is_shstk_write(unsigned long vm_flags)
> > +{
> > +	return (vm_flags & (VM_SHADOW_STACK | VM_WRITE)) ==
> > +	       (VM_SHADOW_STACK | VM_WRITE);
> > +}
> 
> Can we call this function something along the lines
> `is_shadow_stack_vma`?
> Reason being, we're actually checking for vma property here.
> 
> Also can we move this into common code? Common code can then further
> call  
> `arch_is_shadow_stack_vma`. Respective arch can implement their own
> shadow
> stack encoding. I see that x86 is using one of the arch bit. Current
> riscv
> implementation uses presence of only `VM_WRITE` as shadow stack
> encoding.

In the next version I've successfully moved all of the shadow stack
bits out of core MM. It doesn't need is_shstk_write() after the
pte_mkwrite() change, and changing this other one:

https://lore.kernel.org/lkml/20230119212317.8324-26-rick.p.edgecombe@intel.com/

For that I added an arch_check_zapped_pte() which an arch can use to
add warnings.

So I wonder if riscv won't need anything either?

> 
> Please see patch #11 and #12 in the series I posted (URL at the top
> of
> this e-mail).
> 
> 
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 14a820a45a37..49ce1f055242 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1578,6 +1578,13 @@ static inline bool
> > arch_has_pfn_modify_check(void)
> > }
> > #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
> > 
> > +#ifndef is_shstk_write
> > +static inline bool is_shstk_write(unsigned long vm_flags)
> > +{
> > +	return false;
> > +}
> > +#endif
> > +
> > /*
> >  * Architecture PAGE_KERNEL_* fallbacks
> >  *
> > diff --git a/mm/memory.c b/mm/memory.c
> > index aad226daf41b..5e5107232a26 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4088,7 +4088,10 @@ static vm_fault_t do_anonymous_page(struct
> > vm_fault *vmf)
> > 
> > 	entry = mk_pte(page, vma->vm_page_prot);
> > 	entry = pte_sw_mkyoung(entry);
> > -	if (vma->vm_flags & VM_WRITE)
> > +
> > +	if (is_shstk_write(vma->vm_flags))
> > +		entry = pte_mkwrite_shstk(pte_mkdirty(entry));
> > +	else if (vma->vm_flags & VM_WRITE)
> > 		entry = pte_mkwrite(pte_mkdirty(entry));
> > 
> > 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf-
> > >address,
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index 721b2365dbca..53d417683e01 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -645,7 +645,9 @@ static void migrate_vma_insert_page(struct
> > migrate_vma *migrate,
> > 			goto abort;
> > 		}
> > 		entry = mk_pte(page, vma->vm_page_prot);
> > -		if (vma->vm_flags & VM_WRITE)
> > +		if (is_shstk_write(vma->vm_flags))
> > +			entry = pte_mkwrite_shstk(pte_mkdirty(entry));
> > +		else if (vma->vm_flags & VM_WRITE)
> > 			entry = pte_mkwrite(pte_mkdirty(entry));
> > 	}
> > 
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 0499907b6f1a..832f0250ca61 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct
> > *dst_mm, pmd_t *dst_pmd,
> > 	int ret;
> > 	pte_t _dst_pte, *dst_pte;
> > 	bool writable = dst_vma->vm_flags & VM_WRITE;
> > +	bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
> > 	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> > 	bool page_in_cache = page_mapping(page);
> > 	spinlock_t *ptl;
> > @@ -84,9 +85,12 @@ int mfill_atomic_install_pte(struct mm_struct
> > *dst_mm, pmd_t *dst_pmd,
> > 		writable = false;
> > 	}
> > 
> > -	if (writable)
> > -		_dst_pte = pte_mkwrite(_dst_pte);
> > -	else
> > +	if (writable) {
> > +		if (shstk)
> > +			_dst_pte = pte_mkwrite_shstk(_dst_pte);
> > +		else
> > +			_dst_pte = pte_mkwrite(_dst_pte);
> > +	} else
> > 		/*
> > 		 * We need this to make sure write bit removed; as
> > mk_pte()
> > 		 * could return a pte with write bit set.
> > -- 
> > 2.17.1
> > 


* Re: [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly
  2023-02-14  1:07     ` Edgecombe, Rick P
@ 2023-02-14  6:10       ` Deepak Gupta
  2023-02-14 18:24         ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: Deepak Gupta @ 2023-02-14  6:10 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, pavel, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	Yang, Weijiang, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov, akpm

On Tue, Feb 14, 2023 at 01:07:24AM +0000, Edgecombe, Rick P wrote:
>On Mon, 2023-02-13 at 16:09 -0800, Deepak Gupta wrote:
>> Since I've a general question on outcome of discussion of how to
>> handle
>> `pte_mkwrite`, so I am top posting.
>>
>> I have posted patches yesterday targeting riscv zisslpcfi extension.
>>
>https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
>>
>> Since there're similarities in extension(s), patches have similarity
>> too.
>> One of the similarity was updating `maybe_mkwrite`. I was asked (by
>> dhildenb
>> on my patch #11) to look at x86 approach on how to approach this so
>> that
>> core-mm approach fits multiple architectures along with the need to
>> update `pte_mkwrite` to consume vma flags.
>> In x86 CET patch series, I see that locations where `pte_mkwrite` is
>> invoked are updated to check for shadow stack vma and not necessarily
>> `pte_mkwrite` itself is updated to consume vma flags. Let me know if
>> my
>> understanding is correct and that's the current direction (to update
>> call sites for vma check where `pte_mkwrite` is invoked)
>>
>> Being said that as I've mentioned in my patch series that there're
>> similarities between x86, arm and now riscv for implementing shadow
>> stack
>> and indirect branch tracking, overall it'll be a good thing if we can
>> collaborate and come up with common bits.
>
>Oh interesting. I've made the changes to have pte_mkwrite() take a VMA.
>It seems to work pretty well with the core MM code, but I'm letting 0-
>day chew on it for a bit because it touched so many arch's. I'll
>include you when I send it out, hopefully later this week.

Thanks.
>
>From just a quick look, I see some design aspects that have been
>problematic on the x86 implementation.
>
>There was something like PROT_SHADOW_STACK before, but there were two
>problems:
>1. Writable windows while provisioning restore tokens (maybe this is
>just an x86 thing)
>2. Adding guard pages when a shadow stack was mprotect()ed to change it
>from writable to shadow stack. Again this might be an x86 need, since
>it needed to have it writable to add a restore token, and the guard
>pages help with security.

I've not seen your earlier patch, but I am assuming that when you say window
you mean that the shadow stack was open to regular stores (or I may be missing
something here).

I am wondering whether mapping it as shadow stack (instead of having a
temporary writeable mapping) and using `wruss` to put the token was an
option, or whether you wanted to avoid it?

And yes, on riscv the architecture itself doesn't define a token or its format.
Since it's RISC, software can define the token format and thus can use
either `sspush` or `ssamoswap` to put a token on `shadow stack` virtual
memory.

>
>So instead this series creates a map_shadow_stack syscall that maps a
>shadow stack and writes the token from the kernel side. Then mprotect()
>is prevented from making shadow stack's conventionally writable.
>
>another difference is enabling shadow stack based on elf header bits
>instead of the arch_prctl()s. See the history and reasoning here
>(section "Switch Enabling Interface"):
>
>https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
>
>Not sure if those two issues would be problems on riscv or not.

Apart from the mapping and window issues that you mentioned, I couldn't
understand why the elf header bit is an issue only in this case for x86
shadow stack and not an issue for, let's say, aarch64. I can see that
aarch64 pretty much uses the elf header bit for BTI. Eventually indirect
branch tracking also needs to be enabled, which is analogous to BTI.

BTW, eventually riscv binaries plan to use the `.riscv.attributes` section
in the riscv elf binary instead of `.note.gnu.property`. So I am hoping that
part will go into the arch-specific elf parsing code for riscv and will be
contained there.

>
>For sharing the prctl() interface. The other thing is that x86 also has
>this "wrss" instruction that can be enabled with shadow stack. The
>current arch_prctl() interface supports both. I'm thinking it's
>probably a pretty arch-specific thing.

Yes, the ability to perform writes on the shadow stack is absolutely prevented
on x86. So enabling that should be an arch-specific prctl.

>
>ABI-wise, are you planning to automatically allocate shadow stacks for
>new tasks? If the ABI is completely different it might be best to not
>share user interfaces. But also, I wonder why is it different.

Yes, as of now I'm planning both:
- allocate shadow stack for new task based on elf header
- task can create them using `prctls` (from glibc)

And yes, `fork` will get all the cfi properties (shadow stack and branch
tracking) from the parent.
>
>>
>>
>> Rest inline.
>>
>>
>> On Thu, Jan 19, 2023 at 01:22:57PM -0800, Rick Edgecombe wrote:
>> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>> >
>> > The x86 Control-flow Enforcement Technology (CET) feature includes
>> > a new
>> > type of memory called shadow stack. This shadow stack memory has
>> > some
>> > unusual properties, which requires some core mm changes to function
>> > properly.
>> >
>> > With the introduction of shadow stack memory there are two ways a
>> > pte can
>> > be writable: regular writable memory and shadow stack memory.
>> >
>> > In past patches, maybe_mkwrite() has been updated to apply
>> > pte_mkwrite()
>> > or pte_mkwrite_shstk() depending on the VMA flag. This covers most
>> > cases
>> > where a PTE is made writable. However, there are places where
>> > pte_mkwrite()
>> > is called directly and the logic should now also create a shadow
>> > stack PTE
>> > in the case of a shadow stack VMA.
>> >
>> > - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
>> >  directly and call pte_mkwrite(). Teach it about
>> > pte_mkwrite_shstk()
>> >
>> > - When userfaultfd is creating a PTE after userspace handles the
>> > fault
>> >  it calls pte_mkwrite() directly. Teach it about
>> > pte_mkwrite_shstk()
>> >
>> > To make the code cleaner, introduce is_shstk_write() which
>> > simplifies
>> > checking for VM_WRITE | VM_SHADOW_STACK together.
>> >
>> > In other cases where pte_mkwrite() is called directly, the VMA will
>> > not
>> > be VM_SHADOW_STACK, and so shadow stack memory should not be
>> > created.
>> > - In the case of pte_savedwrite(), shadow stack VMA's are excluded.
>> > - In the case of the "dirty_accountable" optimization in
>> > mprotect(),
>> >   shadow stack VMA's won't be VM_SHARED, so it is not necessary.
>> >
>> > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
>> > Tested-by: John Allen <john.allen@amd.com>
>> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
>> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>> > Cc: Kees Cook <keescook@chromium.org>
>> > ---
>> >
>> > v5:
>> > - Fix typo in commit log
>> >
>> > v3:
>> > - Restore do_anonymous_page() that accidetally moved commits
>> > (Kirill)
>> > - Open code maybe_mkwrite() cases from v2, so the behavior doesn't
>> > change
>> >   to mark that non-writable PTEs dirty. (Nadav)
>> >
>> > v2:
>> > - Updated commit log with comment's from Dave Hansen
>> > - Dave also suggested (I understood) to maybe tweak
>> > vm_get_page_prot()
>> >   to avoid having to call maybe_mkwrite(). After playing around
>> > with
>> >   this I opted to *not* do this. Shadow stack memory memory is
>> >   effectively writable, so having the default permissions be
>> > writable
>> >   ended up mapping the zero page as writable and other surprises.
>> > So
>> >   creating shadow stack memory needs to be done with manual logic
>> >   like pte_mkwrite().
>> > - Drop change in change_pte_range() because it couldn't actually
>> > trigger
>> >   for shadow stack VMAs.
>> > - Clarify reasoning for skipped cases of pte_mkwrite().
>> >
>> > Yu-cheng v25:
>> > - Apply same changes to do_huge_pmd_numa_page() as to
>> > do_numa_page().
>> >
>> > arch/x86/include/asm/pgtable.h |  3 +++
>> > arch/x86/mm/pgtable.c          |  6 ++++++
>> > include/linux/pgtable.h        |  7 +++++++
>> > mm/memory.c                    |  5 ++++-
>> > mm/migrate_device.c            |  4 +++-
>> > mm/userfaultfd.c               | 10 +++++++---
>> > 6 files changed, 30 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/arch/x86/include/asm/pgtable.h
>> > b/arch/x86/include/asm/pgtable.h
>> > index 45b1a8f058fe..87d3068734ec 100644
>> > --- a/arch/x86/include/asm/pgtable.h
>> > +++ b/arch/x86/include/asm/pgtable.h
>> > @@ -951,6 +951,9 @@ static inline pgd_t pti_set_user_pgtbl(pgd_t
>> > *pgdp, pgd_t pgd)
>> > }
>> > #endif  /* CONFIG_PAGE_TABLE_ISOLATION */
>> >
>> > +#define is_shstk_write is_shstk_write
>> > +extern bool is_shstk_write(unsigned long vm_flags);
>> > +
>> > #endif	/* __ASSEMBLY__ */
>> >
>> >
>> > diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>> > index e4f499eb0f29..d103945ba502 100644
>> > --- a/arch/x86/mm/pgtable.c
>> > +++ b/arch/x86/mm/pgtable.c
>> > @@ -880,3 +880,9 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long
>> > addr)
>> >
>> > #endif /* CONFIG_X86_64 */
>> > #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
>> > +
>> > +bool is_shstk_write(unsigned long vm_flags)
>> > +{
>> > +	return (vm_flags & (VM_SHADOW_STACK | VM_WRITE)) ==
>> > +	       (VM_SHADOW_STACK | VM_WRITE);
>> > +}
>>
>> Can we call this function something along the lines
>> `is_shadow_stack_vma`?
>> Reason being, we're actually checking for vma property here.
>>
>> Also can we move this into common code? Common code can then further
>> call
>> `arch_is_shadow_stack_vma`. Respective arch can implement their own
>> shadow
>> stack encoding. I see that x86 is using one of the arch bit. Current
>> riscv
>> implementation uses presence of only `VM_WRITE` as shadow stack
>> encoding.
>
>In the next version I've successfully moved all of the shadow stack
>bits out of core MM. It doesn't need is_shstk_write() after the
>pte_mkwrite() change, and changing this other one:
>
>https://lore.kernel.org/lkml/20230119212317.8324-26-rick.p.edgecombe@intel.com/
>For that I added an arch_check_zapped_pte() which an arch can use to
>add warnings.
>
>So I wonder if riscv won't need anything either?
>
>>
>> Please see patch #11 and #12 in the series I posted (URL at the top
>> of
>> this e-mail).
>>
>>
>> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> > index 14a820a45a37..49ce1f055242 100644
>> > --- a/include/linux/pgtable.h
>> > +++ b/include/linux/pgtable.h
>> > @@ -1578,6 +1578,13 @@ static inline bool
>> > arch_has_pfn_modify_check(void)
>> > }
>> > #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
>> >
>> > +#ifndef is_shstk_write
>> > +static inline bool is_shstk_write(unsigned long vm_flags)
>> > +{
>> > +	return false;
>> > +}
>> > +#endif
>> > +
>> > /*
>> >  * Architecture PAGE_KERNEL_* fallbacks
>> >  *
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index aad226daf41b..5e5107232a26 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -4088,7 +4088,10 @@ static vm_fault_t do_anonymous_page(struct
>> > vm_fault *vmf)
>> >
>> > 	entry = mk_pte(page, vma->vm_page_prot);
>> > 	entry = pte_sw_mkyoung(entry);
>> > -	if (vma->vm_flags & VM_WRITE)
>> > +
>> > +	if (is_shstk_write(vma->vm_flags))
>> > +		entry = pte_mkwrite_shstk(pte_mkdirty(entry));
>> > +	else if (vma->vm_flags & VM_WRITE)
>> > 		entry = pte_mkwrite(pte_mkdirty(entry));
>> >
>> > 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf-
>> > >address,
>> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> > index 721b2365dbca..53d417683e01 100644
>> > --- a/mm/migrate_device.c
>> > +++ b/mm/migrate_device.c
>> > @@ -645,7 +645,9 @@ static void migrate_vma_insert_page(struct
>> > migrate_vma *migrate,
>> > 			goto abort;
>> > 		}
>> > 		entry = mk_pte(page, vma->vm_page_prot);
>> > -		if (vma->vm_flags & VM_WRITE)
>> > +		if (is_shstk_write(vma->vm_flags))
>> > +			entry = pte_mkwrite_shstk(pte_mkdirty(entry));
>> > +		else if (vma->vm_flags & VM_WRITE)
>> > 			entry = pte_mkwrite(pte_mkdirty(entry));
>> > 	}
>> >
>> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>> > index 0499907b6f1a..832f0250ca61 100644
>> > --- a/mm/userfaultfd.c
>> > +++ b/mm/userfaultfd.c
>> > @@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct
>> > *dst_mm, pmd_t *dst_pmd,
>> > 	int ret;
>> > 	pte_t _dst_pte, *dst_pte;
>> > 	bool writable = dst_vma->vm_flags & VM_WRITE;
>> > +	bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
>> > 	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
>> > 	bool page_in_cache = page_mapping(page);
>> > 	spinlock_t *ptl;
>> > @@ -84,9 +85,12 @@ int mfill_atomic_install_pte(struct mm_struct
>> > *dst_mm, pmd_t *dst_pmd,
>> > 		writable = false;
>> > 	}
>> >
>> > -	if (writable)
>> > -		_dst_pte = pte_mkwrite(_dst_pte);
>> > -	else
>> > +	if (writable) {
>> > +		if (shstk)
>> > +			_dst_pte = pte_mkwrite_shstk(_dst_pte);
>> > +		else
>> > +			_dst_pte = pte_mkwrite(_dst_pte);
>> > +	} else
>> > 		/*
>> > 		 * We need this to make sure write bit removed; as
>> > mk_pte()
>> > 		 * could return a pte with write bit set.
>> > --
>> > 2.17.1
>> >



* Re: [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly
  2023-02-14  6:10       ` Deepak Gupta
@ 2023-02-14 18:24         ` Edgecombe, Rick P
  2023-02-15  6:37           ` Deepak Gupta
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-02-14 18:24 UTC (permalink / raw)
  To: debug
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, linux-doc, Schimpe, Christina,
	mike.kravetz, x86, akpm, pavel, andrew.cooper3, john.allen, rppt,
	tglx, mingo, corbet, linux-kernel, linux-api, gorcunov

On Mon, 2023-02-13 at 22:10 -0800, Deepak Gupta wrote:
> On Tue, Feb 14, 2023 at 01:07:24AM +0000, Edgecombe, Rick P wrote:
> > On Mon, 2023-02-13 at 16:09 -0800, Deepak Gupta wrote:
> > > Since I've a general question on outcome of discussion of how to
> > > handle
> > > `pte_mkwrite`, so I am top posting.
> > > 
> > > I have posted patches yesterday targeting riscv zisslpcfi
> > > extension.
> > > 
> > 
> > 
https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
> > > 
> > > Since there're similarities in extension(s), patches have
> > > similarity
> > > too.
> > > One of the similarity was updating `maybe_mkwrite`. I was asked (by
> > > dhildenb on my patch #11) to look at x86 approach on how to approach
> > > this so that core-mm approach fits multiple architectures along with
> > > the need to update `pte_mkwrite` to consume vma flags.
> > > In x86 CET patch series, I see that locations where `pte_mkwrite` is
> > > invoked are updated to check for shadow stack vma and not necessarily
> > > `pte_mkwrite` itself is updated to consume vma flags. Let me know if
> > > my understanding is correct and that's the current direction (to
> > > update call sites for vma check where `pte_mkwrite` is invoked)
> > > 
> > > Being said that as I've mentioned in my patch series that there're
> > > similarities between x86, arm and now riscv for implementing shadow
> > > stack and indirect branch tracking, overall it'll be a good thing if
> > > we can collaborate and come up with common bits.
> > 
> > Oh interesting. I've made the changes to have pte_mkwrite() take a VMA.
> > It seems to work pretty well with the core MM code, but I'm letting
> > 0-day chew on it for a bit because it touched so many arch's. I'll
> > include you when I send it out, hopefully later this week.
> 
> Thanks.
> > 
> > From just a quick look, I see some design aspects that have been
> > problematic on the x86 implementation.
> > 
> > There was something like PROT_SHADOW_STACK before, but there were two
> > problems:
> > 1. Writable windows while provisioning restore tokens (maybe this is
> > just an x86 thing)
> > 2. Adding guard pages when a shadow stack was mprotect()ed to change it
> > from writable to shadow stack. Again this might be an x86 need, since
> > it needed to have it writable to add a restore token, and the guard
> > pages help with security.
> 
> I've not seen your earlier patch but I am assuming when you say window
> you mean that shadow stack was open to regular stores (or I may be
> missing something here)
>
> I am wondering if mapping it as shadow stack (instead of having
> temporary writeable mapping) and using `wruss` was an option to put the
> token or you wanted to avoid it?
>
> And yes on riscv, architecture itself doesn't define token or its
> format. Since it's RISC, software can define the token format and thus
> can use either `sspush` or `ssamoswap` to put a token on `shadow stack`
> virtual memory.

With WRSS a token could be created via software, but x86 shadow stack
includes instructions to create and switch to tokens in limited ways
(RSTORSSP, SAVEPREVSSP), where WRSS lets you write anything. These
other instructions are enough for glibc, except for writing a restore
token on a brand new shadow stack.

So WRSS is made optional since it weakens the protection of the shadow
stack. Some apps may prefer to use it to do exotic things, but the
glibc implementation didn't require it.
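
To make the token discussion concrete, below is a minimal sketch of how a
restore token for a brand new 64-bit shadow stack could be computed. It is
an illustration only, not code from this series: the shadow stack is
modeled as a plain array, and the names rstor_token_value() and
place_rstor_token() are hypothetical. The layout assumed here (token in
the top 8-byte slot, holding the address directly above itself with bit 0
set to mark 64-bit mode) should be checked against the SDM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SS_FRAME_SIZE    8u      /* one 64-bit shadow stack slot */
#define TOKEN_MODE_64BIT 0x1ull  /* assumption: bit 0 marks a 64-bit token */

/*
 * Hypothetical helper: compute the token value that would restore the
 * shadow stack pointer to @ssp. Fails if @ssp is not 8-byte aligned,
 * since the low bits are needed for the mode/reserved bits.
 */
static int rstor_token_value(uint64_t ssp, uint64_t *token)
{
	if (ssp % 8 != 0)
		return -1;

	*token = ssp | TOKEN_MODE_64BIT;
	return 0;
}

/*
 * Hypothetical helper: place a restore token in the top slot of a modeled
 * shadow stack. @base is the simulated address of slots[0]. On success,
 * *token_addr is the simulated address where the token was written.
 */
static int place_rstor_token(uint64_t *slots, size_t nslots, uint64_t base,
			     uint64_t *token_addr)
{
	uint64_t top, token;

	if (nslots == 0)
		return -1;

	top = base + nslots * SS_FRAME_SIZE;
	if (rstor_token_value(top, &token))
		return -1;

	slots[nslots - 1] = token;
	*token_addr = top - SS_FRAME_SIZE;
	return 0;
}
```

In this model anything able to store to slots[] can forge a token, which
is the property that unrestricted WRSS would give userspace; RSTORSSP and
SAVEPREVSSP only produce tokens in these fixed forms.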

> 
> > 
> > So instead this series creates a map_shadow_stack syscall that maps a
> > shadow stack and writes the token from the kernel side. Then mprotect()
> > is prevented from making shadow stack's conventionally writable.
> >
> > another difference is enabling shadow stack based on elf header bits
> > instead of the arch_prctl()s. See the history and reasoning here
> > (section "Switch Enabling Interface"):
> >
> > https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
> >
> > Not sure if those two issues would be problems on riscv or not.
> 
> Apart from mapping and window issue that you mentioned, I couldn't
> understand on why elf header bit is an issue only in this case for x86
> shadow stack and not an issue for let's say aarch64. I can see that
> aarch64 pretty much uses elf header bit for BTI. Eventually indirect
> branch tracking also needs to be enabled which is analogous to BTI.

Well for one, we had to deal with those old glibc's. But doesn't BTI
text need to be mapped with a special PROT as well? So it doesn't just
turn on enforcement automatically if it detects the elf bit.

> 
> BTW eventually riscv binaries plan to use `.riscv.attributes` section
> in riscv elf binary instead of `.gnu.note.property`. So I am hoping that
> part will go into arch specific code of elf parsing for riscv and will
> be contained.
> 
> > 
> > For sharing the prctl() interface. The other thing is that x86 also has
> > this "wrss" instruction that can be enabled with shadow stack. The
> > current arch_prctl() interface supports both. I'm thinking it's
> > probably a pretty arch-specific thing.
>
> yes ability to perform writes on shadow stack absolutely are prevented
> on x86. So enabling that should be a arch specific prctl.
> 
> > 
> > ABI-wise, are you planning to automatically allocate shadow stacks for
> > new tasks? If the ABI is completely different it might be best to not
> > share user interfaces. But also, I wonder why is it different.
>
> Yes as of now planning both:
> - allocate shadow stack for new task based on elf header
> - task can create them using `prctls` (from glibc)
>
> And yes `fork` will get the all cfi properties (shadow stack and branch
> tracking) from parent.

Have you looked at a riscv libc implementation yet? For unifying the ABI,
I think that might be the best interface to target, for app developers.
Then each arch can implement enough kernel functionality to support libc
(for example map_shadow_stack).
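
As an illustration of the kind of kernel interface a libc could build on,
here is a minimal userspace sketch that asks the kernel for a shadow stack
with a restore token already written. Treat the details as assumptions:
the syscall number 453 and the SHADOW_STACK_SET_TOKEN value are taken from
later upstream x86-64 kernels, not from this v5 posting, and the wrapper
names alloc_shadow_stack() and free_shadow_stack() are hypothetical. On
kernels or CPUs without shadow stack support the call simply fails and
the wrapper returns NULL with errno set.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values, see the note above; older headers do not define them. */
#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 453
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
#endif

/*
 * Hypothetical wrapper: map a new shadow stack of @size bytes and let the
 * kernel write the restore token, so userspace never needs a writable
 * window. Returns NULL on failure with errno set by syscall().
 */
static void *alloc_shadow_stack(size_t size)
{
	long ret = syscall(__NR_map_shadow_stack, 0UL, (unsigned long)size,
			   (unsigned int)SHADOW_STACK_SET_TOKEN);

	if (ret == -1)
		return NULL;

	return (void *)ret;
}

/* Hypothetical wrapper: release a shadow stack from alloc_shadow_stack(). */
static void free_shadow_stack(void *stack, size_t size)
{
	if (stack)
		munmap(stack, size);
}
```

A libc would typically call something like this when setting up a new
thread or a ucontext stack, then switch to the new stack via the token.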




* Re: [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly
  2023-02-14 18:24         ` Edgecombe, Rick P
@ 2023-02-15  6:37           ` Deepak Gupta
  0 siblings, 0 replies; 120+ messages in thread
From: Deepak Gupta @ 2023-02-15  6:37 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, linux-doc, Schimpe, Christina,
	mike.kravetz, x86, akpm, pavel, andrew.cooper3, john.allen, rppt,
	tglx, mingo, corbet, linux-kernel, linux-api, gorcunov

On Tue, Feb 14, 2023 at 10:24 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2023-02-13 at 22:10 -0800, Deepak Gupta wrote:
> > On Tue, Feb 14, 2023 at 01:07:24AM +0000, Edgecombe, Rick P wrote:
> > > On Mon, 2023-02-13 at 16:09 -0800, Deepak Gupta wrote:
> > > > Since I've a general question on outcome of discussion of how to
> > > > handle `pte_mkwrite`, so I am top posting.
> > > >
> > > > I have posted patches yesterday targeting riscv zisslpcfi extension.
> > > >
> > > > https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
> > > >
> > > > Since there're similarities in extension(s), patches have similarity
> > > > too.
> > > > One of the similarity was updating `maybe_mkwrite`. I was asked (by
> > > > dhildenb on my patch #11) to look at x86 approach on how to approach
> > > > this so that core-mm approach fits multiple architectures along with
> > > > the need to update `pte_mkwrite` to consume vma flags.
> > > > In x86 CET patch series, I see that locations where `pte_mkwrite` is
> > > > invoked are updated to check for shadow stack vma and not necessarily
> > > > `pte_mkwrite` itself is updated to consume vma flags. Let me know if
> > > > my understanding is correct and that's the current direction (to
> > > > update call sites for vma check where `pte_mkwrite` is invoked)
> > > >
> > > > Being said that as I've mentioned in my patch series that there're
> > > > similarities between x86, arm and now riscv for implementing shadow
> > > > stack and indirect branch tracking, overall it'll be a good thing if
> > > > we can collaborate and come up with common bits.
> > >
> > > Oh interesting. I've made the changes to have pte_mkwrite() take a VMA.
> > > It seems to work pretty well with the core MM code, but I'm letting
> > > 0-day chew on it for a bit because it touched so many arch's. I'll
> > > include you when I send it out, hopefully later this week.
> >
> > Thanks.
> > >
> > > From just a quick look, I see some design aspects that have been
> > > problematic on the x86 implementation.
> > >
> > > There was something like PROT_SHADOW_STACK before, but there were two
> > > problems:
> > > 1. Writable windows while provisioning restore tokens (maybe this is
> > > just an x86 thing)
> > > 2. Adding guard pages when a shadow stack was mprotect()ed to change it
> > > from writable to shadow stack. Again this might be an x86 need, since
> > > it needed to have it writable to add a restore token, and the guard
> > > pages help with security.
> >
> > I've not seen your earlier patch but I am assuming when you say window
> > you mean that shadow stack was open to regular stores (or I may be
> > missing something here)
> >
> > I am wondering if mapping it as shadow stack (instead of having
> > temporary writeable mapping) and using `wruss` was an option to put the
> > token or you wanted to avoid it?
> >
> > And yes on riscv, architecture itself doesn't define token or its
> > format. Since it's RISC, software can define the token format and thus
> > can use either `sspush` or `ssamoswap` to put a token on `shadow stack`
> > virtual memory.
>
> With WRSS a token could be created via software, but x86 shadow stack
> includes instructions to create and switch to tokens in limited ways
> (RSTORSSP, SAVEPREVSSP), where WRSS lets you write anything. These
> other instructions are enough for glibc, except for writing a restore
> token on a brand new shadow stack.
>
> So WRSS is made optional since it weakens the protection of the shadow
> stack. Some apps may prefer to use it to do exotic things, but the
> glibc implementation didn't require it.
>

Yes, I understand WRSS in user mode is not safe and defeats the purpose
as well.

I actually had meant why WRUSS couldn't be used in the kernel to
manufacture the token when the kernel creates the shadow stack while
parsing elf bits. But then I went through your earlier patch series now
and I've a little bit of context now. There is a lot of history and
context (and mess) here.

> >
> > >
> > > So instead this series creates a map_shadow_stack syscall that maps a
> > > shadow stack and writes the token from the kernel side. Then mprotect()
> > > is prevented from making shadow stack's conventionally writable.
> > >
> > > another difference is enabling shadow stack based on elf header bits
> > > instead of the arch_prctl()s. See the history and reasoning here
> > > (section "Switch Enabling Interface"):
> > >
> > > https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
> > >
> > > Not sure if those two issues would be problems on riscv or not.
> >
> > Apart from mapping and window issue that you mentioned, I couldn't
> > understand on why elf header bit is an issue only in this case for x86
> > shadow stack and not an issue for let's say aarch64. I can see that
> > aarch64 pretty much uses elf header bit for BTI. Eventually indirect
> > branch tracking also needs to be enabled which is analogous to BTI.
>
> Well for one, we had to deal with those old glibc's. But doesn't BTI
> text need to be mapped with a special PROT as well? So it doesn't just
> turn on enforcement automatically if it detects the elf bit.
>
> >
> > BTW eventually riscv binaries plan to use `.riscv.attributes` section
> > in riscv elf binary instead of `.gnu.note.property`. So I am hoping that
> > part will go into arch specific code of elf parsing for riscv and will
> > be contained.
> >
> > >
> > > For sharing the prctl() interface. The other thing is that x86 also has
> > > this "wrss" instruction that can be enabled with shadow stack. The
> > > current arch_prctl() interface supports both. I'm thinking it's
> > > probably a pretty arch-specific thing.
> >
> > yes ability to perform writes on shadow stack absolutely are prevented
> > on x86. So enabling that should be a arch specific prctl.
> >
> > >
> > > ABI-wise, are you planning to automatically allocate shadow stacks for
> > > new tasks? If the ABI is completely different it might be best to not
> > > share user interfaces. But also, I wonder why is it different.
> >
> > Yes as of now planning both:
> > - allocate shadow stack for new task based on elf header
> > - task can create them using `prctls` (from glibc)
> >
> > And yes `fork` will get the all cfi properties (shadow stack and branch
> > tracking) from parent.
>
> Have you looked at a riscv libc implementation yet? For unifying ABI I
> think that might be best interface to target, for app developers. Then
> each arch can implement enough kernel functionality to support libc
> (for example map_shadow_stack).
>
>



* Re: [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW
  2023-02-10 17:00         ` Edgecombe, Rick P
@ 2023-02-17 16:11           ` Borislav Petkov
  2023-02-17 16:53             ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: Borislav Petkov @ 2023-02-17 16:11 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, andrew.cooper3, john.allen, linux-doc, rppt, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Fri, Feb 10, 2023 at 05:00:05PM +0000, Edgecombe, Rick P wrote:
> 	/*
> 	 * Dirty bit is not preserved above so it can be done
> 	 * in a special way for the shadow stack case, where it
> 	 * may need to set _PAGE_SAVED_DIRTY. __pte_mkdirty() will do
> 	 * this in the case of shadow stack.
> 	 */
> 	if (oldval & _PAGE_DIRTY) {
> 		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) &&
> 		    !pte_write(pte_result))
> 			pte_result = pte_set_flags(pte_result, _PAGE_SAVED_DIRTY);
> 		else
> 			pte_result = pte_set_flags(pte_result, _PAGE_DIRTY);
> 	}
> 
> 	return pte_result;
> }
> 
> So the later logic of doing the _PAGE_SAVED_DIRTY (_PAGE_COW) part is
> not centralized. It's ok?

I think so.

1. If you have a single pte_mkdirty() and not also a __ helper, then
   there's less confusion for callers as to which interface they should be
   using

2. The not centralized part is a single conditional so it's not like
   you're saving on gazillion code lines

So I'd prefer that.

If we end up needing this in more places then we can carve it out into
a proper helper which is not in a header file such that anyone can use
it but move the whole functionality into cet.c or so where we can
control its visibility to the rest of the kernel.

I'd say.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW
  2023-02-17 16:11           ` Borislav Petkov
@ 2023-02-17 16:53             ` Edgecombe, Rick P
  0 siblings, 0 replies; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-02-17 16:53 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, andrew.cooper3, oleg, Yang, Weijiang,
	Lutomirski, Andy, hjl.tools, jamorris, arnd, tglx, Schimpe,
	Christina, x86, mike.kravetz, akpm, john.allen, linux-doc, rppt,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Fri, 2023-02-17 at 17:11 +0100, Borislav Petkov wrote:
> On Fri, Feb 10, 2023 at 05:00:05PM +0000, Edgecombe, Rick P wrote:
> >        /*
> >         * Dirty bit is not preserved above so it can be done
> >         * in a special way for the shadow stack case, where it
> >         * may need to set _PAGE_SAVED_DIRTY. __pte_mkdirty() will do
> >         * this in the case of shadow stack.
> >         */
> >        if (oldval & _PAGE_DIRTY) {
> >                if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) &&
> >                    !pte_write(pte_result))
> >                        pte_result = pte_set_flags(pte_result,
> >                                                   _PAGE_SAVED_DIRTY);
> >                else
> >                        pte_result = pte_set_flags(pte_result, _PAGE_DIRTY);
> >        }
> > 
> >        return pte_result;
> > }
> > 
> > So the later logic of doing the _PAGE_SAVED_DIRTY (_PAGE_COW) part is
> > not centralized. It's ok?
> 
> I think so.
> 
> 1. If you have a single pte_mkdirty() and not also a __ helper, then
>    there's less confusion for callers as to which interface they should
>    be using
> 
> 2. The not centralized part is a single conditional so it's not like
>    you're saving on gazillion code lines
> 
> So I'd prefer that.
> 
> 
Fair enough, I'll adjust it. Thanks!
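
For readers of the archive, the open-coded conditional being agreed on
here can be modeled in plain C as below. This is a standalone sketch, not
the patch that was eventually posted: pte_model_t, the M_PAGE_* bit
positions and the model_* function names are hypothetical stand-ins for
the kernel's pte_t, _PAGE_* flags and pte helpers. The point it shows is
that a previously dirty PTE that ends up non-writable gets the software
saved-dirty bit instead of the hardware Dirty bit when user shadow stack
is enabled, and that the flag helper returns a new value that must be
assigned back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pte_t; bit positions are illustrative only. */
typedef struct { uint64_t val; } pte_model_t;

#define M_PAGE_RW          (1ull << 1)
#define M_PAGE_DIRTY       (1ull << 6)
#define M_PAGE_SAVED_DIRTY (1ull << 58)

/* Returns a copy of @pte with @set bits added; the caller must assign it. */
static pte_model_t model_set_flags(pte_model_t pte, uint64_t set)
{
	pte.val |= set;
	return pte;
}

static bool model_pte_write(pte_model_t pte)
{
	return (pte.val & M_PAGE_RW) != 0;
}

/*
 * Re-apply dirty state to @pte after its flags were rebuilt, based on
 * whether the old value (@oldval) had the Dirty bit set. @user_shstk
 * models cpu_feature_enabled(X86_FEATURE_USER_SHSTK).
 */
static pte_model_t model_restore_dirty(pte_model_t pte, uint64_t oldval,
				       bool user_shstk)
{
	if (oldval & M_PAGE_DIRTY) {
		if (user_shstk && !model_pte_write(pte))
			pte = model_set_flags(pte, M_PAGE_SAVED_DIRTY);
		else
			pte = model_set_flags(pte, M_PAGE_DIRTY);
	}

	return pte;
}
```

If the same check were ever needed at more call sites, this is the piece
that would move into a non-header helper as suggested above.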



Thread overview: 120+ messages
2023-01-19 21:22 [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
2023-01-19 21:22 ` [PATCH v5 01/39] Documentation/x86: Add CET shadow stack description Rick Edgecombe
2023-01-20  0:38   ` Kees Cook
2023-01-19 21:22 ` [PATCH v5 02/39] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
2023-01-20  0:40   ` Kees Cook
2023-01-19 21:22 ` [PATCH v5 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
2023-01-20  0:44   ` Kees Cook
2023-01-19 21:22 ` [PATCH v5 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
2023-01-20  0:46   ` Kees Cook
2023-01-19 21:22 ` [PATCH v5 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
2023-01-20  0:46   ` Kees Cook
2023-01-19 21:22 ` [PATCH v5 06/39] x86/fpu: Add helper for modifying xstate Rick Edgecombe
2023-01-20  0:47   ` Kees Cook
2023-02-01 11:01   ` Borislav Petkov
2023-02-01 17:31     ` Edgecombe, Rick P
2023-02-01 18:18       ` Borislav Petkov
2023-01-19 21:22 ` [PATCH v5 07/39] x86: Add user control-protection fault handler Rick Edgecombe
2023-01-20  0:50   ` Kees Cook
2023-02-03 19:09   ` Borislav Petkov
2023-02-03 19:24     ` Edgecombe, Rick P
2023-02-03 19:44       ` Borislav Petkov
2023-02-03 23:01         ` Edgecombe, Rick P
2023-02-04 10:37           ` Borislav Petkov
2023-01-19 21:22 ` [PATCH v5 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
2023-01-20  0:52   ` Kees Cook
2023-01-19 21:22 ` [PATCH v5 09/39] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
2023-01-19 21:22 ` [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
2023-01-20  0:55   ` Kees Cook
2023-01-23  9:16   ` David Hildenbrand
2023-01-23  9:28   ` David Hildenbrand
2023-01-23 20:56     ` Edgecombe, Rick P
2023-01-24 16:28       ` David Hildenbrand
2023-01-19 21:22 ` [PATCH v5 11/39] x86/mm: Update pte_modify for _PAGE_COW Rick Edgecombe
2023-01-20  0:57   ` Kees Cook
2023-02-09 14:08   ` Borislav Petkov
2023-02-09 17:09     ` Edgecombe, Rick P
2023-02-10 13:57       ` Borislav Petkov
2023-02-10 17:00         ` Edgecombe, Rick P
2023-02-17 16:11           ` Borislav Petkov
2023-02-17 16:53             ` Edgecombe, Rick P
2023-01-19 21:22 ` [PATCH v5 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Rick Edgecombe
2023-01-20  0:58   ` Kees Cook
2023-01-19 21:22 ` [PATCH v5 13/39] x86/mm: Start actually marking _PAGE_COW Rick Edgecombe
2023-01-19 21:22 ` [PATCH v5 14/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
2023-01-19 21:22 ` [PATCH v5 15/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
2023-01-19 21:22 ` [PATCH v5 16/39] x86/mm: Check shadow stack page fault errors Rick Edgecombe
2023-01-20  0:59   ` Kees Cook
2023-01-19 21:22 ` [PATCH v5 17/39] x86/mm: Update maybe_mkwrite() for shadow stack Rick Edgecombe
2023-01-19 21:22 ` [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk Rick Edgecombe
2023-01-23  9:50   ` David Hildenbrand
2023-01-23 20:47     ` Edgecombe, Rick P
2023-01-24 16:24       ` David Hildenbrand
2023-01-24 18:14         ` Edgecombe, Rick P
2023-01-25  9:27           ` David Hildenbrand
2023-01-25 18:43             ` Edgecombe, Rick P
2023-01-26  0:59               ` Edgecombe, Rick P
2023-01-26  8:46                 ` David Hildenbrand
2023-01-26 20:19                   ` Edgecombe, Rick P
2023-01-27 16:12                     ` David Hildenbrand
2023-01-28  0:51                       ` Edgecombe, Rick P
2023-01-31  8:46                         ` David Hildenbrand
2023-01-31 23:33                           ` Edgecombe, Rick P
2023-02-01  9:03                             ` David Hildenbrand
2023-02-01 17:32                               ` Edgecombe, Rick P
2023-02-01 18:03                                 ` David Hildenbrand
2023-01-26  8:57               ` David Hildenbrand
2023-01-26 20:16                 ` Edgecombe, Rick P
2023-01-27 16:19                   ` David Hildenbrand
2023-01-19 21:22 ` [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
2023-01-20  1:01   ` Kees Cook
2023-02-14  0:09   ` Deepak Gupta
2023-02-14  1:07     ` Edgecombe, Rick P
2023-02-14  6:10       ` Deepak Gupta
2023-02-14 18:24         ` Edgecombe, Rick P
2023-02-15  6:37           ` Deepak Gupta
2023-01-19 21:22 ` [PATCH v5 20/39] mm: Add guard pages around a shadow stack Rick Edgecombe
2023-01-19 21:22 ` [PATCH v5 21/39] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 22/39] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 23/39] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
2023-01-23  9:10   ` David Hildenbrand
2023-01-23 10:45     ` Florian Weimer
2023-01-23 20:46       ` Edgecombe, Rick P
2023-01-24 16:26         ` David Hildenbrand
2023-01-24 18:42           ` Edgecombe, Rick P
2023-01-24 23:08             ` Kees Cook
2023-01-24 23:41               ` Edgecombe, Rick P
2023-01-25  9:29                 ` David Hildenbrand
2023-01-25 15:23                   ` Kees Cook
2023-01-25 15:36             ` Schimpe, Christina
2023-01-25 16:43               ` Schimpe, Christina
2023-01-19 21:23 ` [PATCH v5 24/39] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 25/39] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
2023-01-20  1:01   ` Kees Cook
2023-01-19 21:23 ` [PATCH v5 26/39] x86: Introduce userspace API for shadow stack Rick Edgecombe
2023-01-20  1:04   ` Kees Cook
2023-01-19 21:23 ` [PATCH v5 27/39] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
2023-01-20  1:05   ` Kees Cook
2023-01-19 21:23 ` [PATCH v5 28/39] x86/shstk: Handle thread shadow stack Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 29/39] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
2023-01-20  1:05   ` Kees Cook
2023-01-19 21:23 ` [PATCH v5 30/39] x86/shstk: Handle signals for shadow stack Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 31/39] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
2023-01-20  1:07   ` Kees Cook
2023-01-19 21:23 ` [PATCH v5 32/39] x86/shstk: Support WRSS for userspace Rick Edgecombe
2023-01-20  1:06   ` Kees Cook
2023-01-19 21:23 ` [PATCH v5 33/39] x86: Expose thread features in /proc/$PID/status Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 34/39] x86/shstk: Wire in shadow stack interface Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 35/39] selftests/x86: Add shadow stack test Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 36/39] x86/fpu: Add helper for initing features Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 37/39] x86: Add PTRACE interface for shadow stack Rick Edgecombe
2023-01-20  1:08   ` Kees Cook
2023-01-19 21:23 ` [PATCH v5 38/39] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
2023-01-19 21:23 ` [PATCH v5 39/39] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe
2023-01-20  1:08   ` Kees Cook
2023-01-19 22:26 ` [PATCH v5 00/39] Shadow stacks for userspace Andrew Morton
2023-01-20 17:27   ` Edgecombe, Rick P
2023-01-20 19:19     ` Kees Cook
2023-01-25 19:46       ` Edgecombe, Rick P
2023-01-20 17:48 ` John Allen
2023-01-22  8:20 ` Mike Rapoport
