linux-arch.vger.kernel.org archive mirror
* [PATCH v7 00/41] Shadow stacks for userspace
@ 2023-02-27 22:29 Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description Rick Edgecombe
                   ` (40 more replies)
  0 siblings, 41 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

Hi,

This series implements Shadow Stacks for userspace using x86's Control-flow 
Enforcement Technology (CET). CET consists of two related security features: 
shadow stacks and indirect branch tracking. This series implements just the 
shadow stack part of this feature, and just for userspace.

The main use case for shadow stack is providing protection against
return-oriented programming (ROP) attacks. It works by maintaining a
secondary (shadow) stack using a special memory type that has protections
against modification. When executing a CALL instruction, the processor
pushes the return address to both the normal stack and to the special
permission shadow stack. Upon RET, the processor pops the shadow stack copy
and compares it to the normal stack copy. For more details, see the cover
letter from v1 [0].
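
As a rough sketch of those semantics, here is a toy userspace model (not
kernel code; plain arrays stand in for the HW-protected shadow stack, and
exit() stands in for the #CP fault the CPU would raise):

  /* Toy model of the CALL/RET shadow stack check described above. */
  #include <stdio.h>
  #include <stdlib.h>

  static unsigned long normal_stack[64], shadow_stack[64];
  static int nsp, ssp;

  static void emulate_call(unsigned long ret_addr)
  {
          normal_stack[nsp++] = ret_addr; /* writable by the program */
          shadow_stack[ssp++] = ret_addr; /* HW-protected in reality */
  }

  static void emulate_ret(void)
  {
          unsigned long normal = normal_stack[--nsp];
          unsigned long shadow = shadow_stack[--ssp];

          if (normal != shadow) {
                  /* HW raises #CP; the kernel turns it into a signal */
                  fprintf(stderr, "#CP: %#lx != %#lx\n", normal, shadow);
                  exit(1);
          }
  }

  int main(void)
  {
          emulate_call(0x401000);
          normal_stack[nsp - 1] = 0xdeadbeef;  /* simulated ROP overwrite */
          emulate_ret();                       /* mismatch -> "fault" */
          return 0;
  }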

The changes for this version are some more cleanup of comment and commit log
verbiage, and a small refactor in the memory accounting patch. There was also
some feedback from David Hildenbrand about adding GUP tests for the
!FOLL_FORCE case. This is currently planned as a fast follow-on patch.

Previous version [1].

Thanks,
Rick


[0] https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/lkml/20230218211433.26859-1-rick.p.edgecombe@intel.com/

Kirill A. Shutemov (1):
  x86: Introduce userspace API for shadow stack

Mike Rapoport (1):
  x86/shstk: Add ARCH_SHSTK_UNLOCK

Rick Edgecombe (19):
  x86/fpu: Add helper for modifying xstate
  x86: Move control protection handler to separate file
  mm: Introduce pte_mkwrite_kernel()
  s390/mm: Introduce pmd_mkwrite_kernel()
  mm: Make pte_mkwrite() take a VMA
  x86/mm: Introduce _PAGE_SAVED_DIRTY
  x86/mm: Start actually marking _PAGE_SAVED_DIRTY
  x86/mm: Teach pte_mkwrite() about stack memory
  mm: Don't allow write GUPs to shadow stack memory
  x86/mm: Introduce MAP_ABOVE4G
  mm: Warn on shadow stack memory in wrong vma
  x86/mm: Warn if create Write=0,Dirty=1 with raw prot
  x86/shstk: Introduce map_shadow_stack syscall
  x86/shstk: Support WRSS for userspace
  x86: Expose thread features in /proc/$PID/status
  x86/shstk: Wire in shadow stack interface
  selftests/x86: Add shadow stack test
  x86/fpu: Add helper for initing features
  x86/shstk: Add ARCH_SHSTK_STATUS

Yu-cheng Yu (20):
  Documentation/x86: Add CET shadow stack description
  x86/shstk: Add Kconfig option for shadow stack
  x86/cpufeatures: Add CPU feature flags for shadow stacks
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86/shstk: Add user control-protection fault handler
  x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  x86/mm: Move pmd_write(), pud_write() up in the file
  x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
  mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  mm: Introduce VM_SHADOW_STACK for shadow stack memory
  x86/mm: Check shadow stack page fault errors
  mm: Add guard pages around a shadow stack.
  mm/mmap: Add shadow stack pages to memory accounting
  mm: Re-introduce vm_flags to do_mmap()
  x86/shstk: Add user-mode shadow stack support
  x86/shstk: Handle thread shadow stack
  x86/shstk: Introduce routines modifying shstk
  x86/shstk: Handle signals for shadow stack
  x86: Add PTRACE interface for shadow stack

 Documentation/filesystems/proc.rst            |   1 +
 Documentation/mm/arch_pgtable_helpers.rst     |   9 +-
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/shstk.rst                   | 176 +++++
 arch/alpha/include/asm/pgtable.h              |   6 +-
 arch/arc/include/asm/hugepage.h               |   2 +-
 arch/arc/include/asm/pgtable-bits-arcv2.h     |   7 +-
 arch/arm/include/asm/pgtable-3level.h         |   7 +-
 arch/arm/include/asm/pgtable.h                |   2 +-
 arch/arm/kernel/signal.c                      |   2 +-
 arch/arm64/include/asm/pgtable.h              |   9 +-
 arch/arm64/kernel/signal.c                    |   2 +-
 arch/arm64/kernel/signal32.c                  |   2 +-
 arch/arm64/mm/trans_pgd.c                     |   4 +-
 arch/csky/include/asm/pgtable.h               |   2 +-
 arch/hexagon/include/asm/pgtable.h            |   2 +-
 arch/ia64/include/asm/pgtable.h               |   2 +-
 arch/loongarch/include/asm/pgtable.h          |   4 +-
 arch/m68k/include/asm/mcf_pgtable.h           |   2 +-
 arch/m68k/include/asm/motorola_pgtable.h      |   6 +-
 arch/m68k/include/asm/sun3_pgtable.h          |   6 +-
 arch/microblaze/include/asm/pgtable.h         |   2 +-
 arch/mips/include/asm/pgtable.h               |   6 +-
 arch/nios2/include/asm/pgtable.h              |   2 +-
 arch/openrisc/include/asm/pgtable.h           |   2 +-
 arch/parisc/include/asm/pgtable.h             |   6 +-
 arch/powerpc/include/asm/book3s/32/pgtable.h  |   2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   4 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h  |   2 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  |   2 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h  |   2 +-
 arch/riscv/include/asm/pgtable.h              |   6 +-
 arch/s390/include/asm/hugetlb.h               |   4 +-
 arch/s390/include/asm/pgtable.h               |  14 +-
 arch/s390/mm/pageattr.c                       |   4 +-
 arch/sh/include/asm/pgtable_32.h              |  10 +-
 arch/sparc/include/asm/pgtable_32.h           |   2 +-
 arch/sparc/include/asm/pgtable_64.h           |   6 +-
 arch/sparc/kernel/signal32.c                  |   2 +-
 arch/sparc/kernel/signal_64.c                 |   2 +-
 arch/um/include/asm/pgtable.h                 |   2 +-
 arch/x86/Kconfig                              |  24 +
 arch/x86/Kconfig.assembler                    |   5 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/include/asm/disabled-features.h      |  16 +-
 arch/x86/include/asm/fpu/api.h                |   9 +
 arch/x86/include/asm/fpu/regset.h             |   7 +-
 arch/x86/include/asm/fpu/sched.h              |   3 +-
 arch/x86/include/asm/fpu/types.h              |  16 +-
 arch/x86/include/asm/fpu/xstate.h             |   6 +-
 arch/x86/include/asm/idtentry.h               |   2 +-
 arch/x86/include/asm/mmu_context.h            |   2 +
 arch/x86/include/asm/msr.h                    |  11 +
 arch/x86/include/asm/pgtable.h                | 322 +++++++-
 arch/x86/include/asm/pgtable_types.h          |  56 +-
 arch/x86/include/asm/processor.h              |   8 +
 arch/x86/include/asm/shstk.h                  |  40 +
 arch/x86/include/asm/special_insns.h          |  13 +
 arch/x86/include/asm/tlbflush.h               |   3 +-
 arch/x86/include/asm/trap_pf.h                |   2 +
 arch/x86/include/asm/traps.h                  |  12 +
 arch/x86/include/uapi/asm/mman.h              |   4 +
 arch/x86/include/uapi/asm/prctl.h             |  12 +
 arch/x86/kernel/Makefile                      |   4 +
 arch/x86/kernel/cet.c                         | 152 ++++
 arch/x86/kernel/cpu/common.c                  |  35 +-
 arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
 arch/x86/kernel/cpu/proc.c                    |  23 +
 arch/x86/kernel/fpu/core.c                    |  59 +-
 arch/x86/kernel/fpu/regset.c                  |  86 +++
 arch/x86/kernel/fpu/xstate.c                  | 148 ++--
 arch/x86/kernel/fpu/xstate.h                  |   6 +
 arch/x86/kernel/idt.c                         |   2 +-
 arch/x86/kernel/process.c                     |  18 +-
 arch/x86/kernel/process_64.c                  |   9 +-
 arch/x86/kernel/ptrace.c                      |  12 +
 arch/x86/kernel/shstk.c                       | 491 +++++++++++++
 arch/x86/kernel/signal.c                      |   1 +
 arch/x86/kernel/signal_32.c                   |   2 +-
 arch/x86/kernel/signal_64.c                   |   8 +-
 arch/x86/kernel/sys_x86_64.c                  |   6 +-
 arch/x86/kernel/traps.c                       |  87 ---
 arch/x86/mm/fault.c                           |  31 +
 arch/x86/mm/pat/set_memory.c                  |   4 +-
 arch/x86/mm/pgtable.c                         |  38 +
 arch/x86/xen/enlighten_pv.c                   |   2 +-
 arch/x86/xen/mmu_pv.c                         |   2 +-
 arch/x86/xen/xen-asm.S                        |   2 +-
 arch/xtensa/include/asm/pgtable.h             |   2 +-
 fs/aio.c                                      |   2 +-
 fs/proc/array.c                               |   6 +
 fs/proc/task_mmu.c                            |   3 +
 include/asm-generic/hugetlb.h                 |   4 +-
 include/linux/mm.h                            |  46 +-
 include/linux/mman.h                          |   4 +
 include/linux/pgtable.h                       |  14 +
 include/linux/proc_fs.h                       |   2 +
 include/linux/syscalls.h                      |   1 +
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/asm-generic/unistd.h             |   2 +-
 include/uapi/linux/elf.h                      |   2 +
 ipc/shm.c                                     |   2 +-
 kernel/sys_ni.c                               |   1 +
 mm/debug_vm_pgtable.c                         |  16 +-
 mm/gup.c                                      |   2 +-
 mm/huge_memory.c                              |   7 +-
 mm/hugetlb.c                                  |   4 +-
 mm/internal.h                                 |   8 +-
 mm/memory.c                                   |   5 +-
 mm/migrate_device.c                           |   2 +-
 mm/mmap.c                                     |  10 +-
 mm/mprotect.c                                 |   2 +-
 mm/nommu.c                                    |   4 +-
 mm/userfaultfd.c                              |   2 +-
 mm/util.c                                     |   2 +-
 tools/testing/selftests/x86/Makefile          |   2 +-
 .../testing/selftests/x86/test_shadow_stack.c | 695 ++++++++++++++++++
 118 files changed, 2669 insertions(+), 327 deletions(-)
 create mode 100644 Documentation/x86/shstk.rst
 create mode 100644 arch/x86/include/asm/shstk.h
 create mode 100644 arch/x86/kernel/cet.c
 create mode 100644 arch/x86/kernel/shstk.c
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

-- 
2.17.1



* [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-01 14:21   ` Szabolcs Nagy
  2023-02-27 22:29 ` [PATCH v7 02/41] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
                   ` (39 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Introduce a new document on Control-flow Enforcement Technology (CET).

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v5:
 - Literal format tweaks (Bagas Sanjaya)
 - Update EOPNOTSUPP text due to unification, after a comment from Kees
 - Update 32 bit signal support with new behavior
 - Remove capitalization on shadow stack (Boris)
 - Fix typo

v4:
 - Drop clearcpuid piece (Boris)
 - Add some info about 32 bit

v3:
 - Clarify kernel IBT is supported by the kernel. (Kees, Andrew Cooper)
 - Clarify which arch_prctl's can take multiple bits. (Kees)
 - Describe ASLR characteristics of thread shadow stacks. (Kees)
 - Add exec section. (Andrew Cooper)
 - Fix some capitalization (Bagas Sanjaya)
 - Update new location of enablement status proc.
 - Add info about new user_shstk software capability.
 - Add more info about what the kernel pushes to the shadow stack on
   signal.

v2:
 - Updated to new arch_prctl() API
 - Add bit about new proc status
---
 Documentation/x86/index.rst |   1 +
 Documentation/x86/shstk.rst | 166 ++++++++++++++++++++++++++++++++++++
 2 files changed, 167 insertions(+)
 create mode 100644 Documentation/x86/shstk.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index c73d133fd37c..8ac64d7de4dc 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -22,6 +22,7 @@ x86-specific Documentation
    mtrr
    pat
    intel-hfi
+   shstk
    iommu
    intel_txt
    amd-memory-encryption
diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
new file mode 100644
index 000000000000..f2e6f323cf68
--- /dev/null
+++ b/Documentation/x86/shstk.rst
@@ -0,0 +1,166 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Control-flow Enforcement Technology (CET) Shadow Stack
+======================================================
+
+CET Background
+==============
+
+Control-flow Enforcement Technology (CET) is a term referring to several
+related x86 processor features that provide protection against
+control-flow hijacking attacks. The HW feature itself can be set up to
+protect both applications and the kernel.
+
+CET introduces shadow stack and indirect branch tracking (IBT). Shadow stack
+is a secondary stack allocated from memory and cannot be directly modified by
+applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. IBT verifies that indirect CALL/JMP targets are
+intended, as marked by the compiler with 'ENDBR' opcodes. Not all CPUs have
+both shadow stack and indirect branch tracking. Today in the 64-bit kernel,
+only userspace shadow stack and kernel IBT are supported.
+
+Requirements to use Shadow Stack
+================================
+
+To use userspace shadow stack you need HW that supports it, a kernel
+configured with it, and userspace libraries compiled with it.
+
+The kernel Kconfig option is X86_USER_SHADOW_STACK, and it can be disabled
+with the kernel parameter: nousershstk.
+
+To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or later
+are required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET. "user_shstk" means that userspace shadow stack is supported on the current
+kernel and HW.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF note and can be verified
+from readelf/llvm-readelf output::
+
+    readelf -n <application> | grep -a SHSTK
+        properties: x86 feature: SHSTK
+
+The kernel does not process these application markers directly. Applications
+or loaders must enable CET features using the interface described in section 4.
+Typically this would be done in the dynamic loader or static runtime objects,
+as is the case in glibc.
+
+Enabling arch_prctl()'s
+=======================
+
+ELF features should be enabled by the loader using the below arch_prctl()s.
+They are only supported in 64-bit user applications.
+
+arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
+    Enable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
+    Disable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
+    Lock in features at their current enabled or disabled status. 'features'
+    is a mask of all features to lock. All bits set are processed, unset bits
+    are ignored. The mask is ORed with the existing value. So any feature bits
+    set here cannot be enabled or disabled afterwards.
+
+The return values are as follows. On success, 0 is returned. On error, errno
+can be::
+
+        -EPERM if any of the passed features are locked.
+        -ENOTSUPP if the feature is not supported by the hardware or
+         kernel.
+        -EINVAL for invalid arguments (non-existent feature, etc.)
+
+The supported feature bits are::
+
+    ARCH_SHSTK_SHSTK - Shadow stack
+    ARCH_SHSTK_WRSS  - WRSS
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc Status
+===========
+To check if an application is actually running with shadow stack, the
+user can read /proc/$PID/status. It will report "wrss" or "shstk"
+depending on what is enabled. The lines look like this::
+
+    x86_Thread_features: shstk wrss
+    x86_Thread_features_locked: shstk wrss
+
+Implementation of the Shadow Stack
+==================================
+
+Shadow Stack Size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. However,
+a compat-mode application's address space is smaller, so each of its
+threads' shadow stacks is MIN(1/4 RLIMIT_STACK, 4 GB) in size.
+
+Signal
+------
+
+By default, the main program and its signal handlers use the same shadow
+stack. Because the shadow stack stores only return addresses, a large
+shadow stack covers the case where both the program stack and the
+signal alternate stack run out.
+
+When a signal happens, the old pre-signal state is pushed on the stack. When
+shadow stack is enabled, the shadow stack specific state is pushed onto the
+shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
+in a special format with bit 63 set. On sigreturn this old SSP token is
+verified and restored by the kernel. The kernel will also push the normal
+restorer address to the shadow stack to help userspace avoid a shadow stack
+violation on the sigreturn path that goes through the restorer.
+
+So the shadow stack signal frame format is as follows::
+
+    |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
+                    (bit 63 set to 1)
+    |        ...| - Other state may be added in the future
+
+
+32-bit ABI signals are not supported in shadow stack processes. Linux prevents
+32-bit execution while shadow stack is enabled by allocating shadow stacks
+outside of the 32-bit address space. When execution enters 32-bit mode, either
+via far call or returning to userspace, a #GP is generated by the hardware,
+which will be delivered to the process as a segfault. When transitioning to
+userspace, the register state will be as if the userspace ip being returned to
+caused the segfault.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread. New shadow stacks behave like mmap() with respect to
+ASLR behavior.
+
+Exec
+----
+
+On exec, shadow stack features are disabled by the kernel, at which point
+userspace can choose to re-enable or lock them.
-- 
2.17.1
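
A minimal userspace probe of the interface documented above might look like
the following sketch. The ARCH_SHSTK_* constant values are assumptions taken
from the series' uapi header (arch/x86/include/uapi/asm/prctl.h), and
ARCH_SHSTK_STATUS is only added later in the series; double-check the header
before relying on them:

  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define ARCH_SHSTK_STATUS     0x5005          /* assumed value */
  #define ARCH_SHSTK_SHSTK      (1ULL << 0)     /* assumed value */
  #define ARCH_SHSTK_WRSS       (1ULL << 1)     /* assumed value */

  int main(void)
  {
          unsigned long features = 0;

          /* ARCH_SHSTK_STATUS copies the enabled-features mask to *addr */
          if (syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, &features)) {
                  perror("arch_prctl(ARCH_SHSTK_STATUS)");
                  return 1;
          }

          printf("shstk: %s, wrss: %s\n",
                 (features & ARCH_SHSTK_SHSTK) ? "on" : "off",
                 (features & ARCH_SHSTK_WRSS) ? "on" : "off");
          return 0;
  }

Reading /proc/$PID/status, as described in the "Proc Status" section, gives
the same information without any new constants.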



* [PATCH v7 02/41] x86/shstk: Add Kconfig option for shadow stack
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 03/41] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
                   ` (38 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Shadow stack provides protection for applications against function return
address corruption. It is active when the processor supports it, the
kernel has CONFIG_X86_USER_SHADOW_STACK enabled, and the application is
built for the feature. This is only implemented for the 64-bit kernel. When
it is enabled, legacy non-shadow stack applications continue to work, but
without protection.

Since there is another feature that utilizes CET (Kernel IBT) that will
share implementation with shadow stacks, create CONFIG_X86_CET to signify
that at least one CET feature is configured.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v5:
 - Remove capitalization of shadow stack (Boris)

v3:
 - Add X86_CET (Kees)
 - Add back WRUSS dependency (Kees)
 - Fix verbiage (Dave)
 - Change from prompt to bool (Kirill)
 - Add more to commit log

v2:
 - Remove already wrong kernel size increase info (tglx)
 - Change prompt to remove "Intel" (tglx)
 - Update line about what CPUs are supported (Dave)

Yu-cheng v25:
 - Remove X86_CET and use X86_SHADOW_STACK directly.
---
 arch/x86/Kconfig           | 24 ++++++++++++++++++++++++
 arch/x86/Kconfig.assembler |  5 +++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a825bf031f49..f03791b73f9f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1851,6 +1851,11 @@ config CC_HAS_IBT
 		  (CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
 		  $(as-instr,endbr64)
 
+config X86_CET
+	def_bool n
+	help
+	  CET features configured (Shadow stack or IBT)
+
 config X86_KERNEL_IBT
 	prompt "Indirect Branch Tracking"
 	def_bool y
@@ -1858,6 +1863,7 @@ config X86_KERNEL_IBT
 	# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
 	depends on !LD_IS_LLD || LLD_VERSION >= 140000
 	select OBJTOOL
+	select X86_CET
 	help
 	  Build the kernel with support for Indirect Branch Tracking, a
 	  hardware support course-grain forward-edge Control Flow Integrity
@@ -1952,6 +1958,24 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config X86_USER_SHADOW_STACK
+	bool "X86 userspace shadow stack"
+	depends on AS_WRUSS
+	depends on X86_64
+	select ARCH_USES_HIGH_VMA_FLAGS
+	select X86_CET
+	help
+	  Shadow stack protection is a hardware feature that detects function
+	  return address corruption.  This helps mitigate ROP attacks.
+	  Applications must be enabled to use it, and old userspace does not
+	  get protection "for free".
+
+	  CPUs supporting shadow stacks were first released in 2020.
+
+	  See Documentation/x86/shstk.rst for more information.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index b88f784cb02e..8ad41da301e5 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -24,3 +24,8 @@ config AS_GFNI
 	def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2)
 	help
 	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
+config AS_WRUSS
+	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
+	help
+	  Supported by binutils >= 2.31 and LLVM integrated assembler
-- 
2.17.1
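
The AS_WRUSS option above is just an assembler capability probe; the same
check can be done by hand (assuming GNU as is on PATH; an assembler that is
too old reports "no such instruction"):

  $ echo 'wrussq %rax,(%rbx)' | as --64 -o /dev/null -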



* [PATCH v7 03/41] x86/cpufeatures: Add CPU feature flags for shadow stacks
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 02/41] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 04/41] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
                   ` (37 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The Control-Flow Enforcement Technology contains two related features,
one of which is shadow stacks. Future patches will utilize this feature
for shadow stack support in KVM, so add CPU feature flags for shadow
stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).

To protect shadow stack state from malicious modification, the registers
are only accessible in supervisor mode. This implementation
context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
on XSAVES.

The shadow stack feature, enumerated by the CPUID bit described above,
encompasses both supervisor and userspace support for shadow stack. In
near future patches, only userspace shadow stack will be enabled. In
expectation of future supervisor shadow stack support, create a software
CPU capability to enumerate kernel utilization of userspace shadow stack
support. This user shadow stack bit should depend on the HW "shstk"
capability, and that logic will be implemented in future patches.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v5:
 - Drop "shstk" from cpuinfo (Boris)
 - Remove capitalization on shadow stack (Boris)

v3:
 - Add user specific shadow stack cpu cap (Andrew Cooper)
 - Drop reviewed-bys from Boris and Kees due to the above change.

v2:
 - Remove IBT reference in commit log (Kees)
 - Describe xsaves dependency using text from Dave

v1:
 - Remove IBT, can be added in a follow on IBT series.
---
 arch/x86/include/asm/cpufeatures.h       | 2 ++
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 arch/x86/kernel/cpu/cpuid-deps.c         | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 389ea336258f..d01afabcf03e 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -309,6 +309,7 @@
 #define X86_FEATURE_MSR_TSX_CTRL	(11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
 #define X86_FEATURE_SMBA		(11*32+21) /* "" Slow Memory Bandwidth Allocation */
 #define X86_FEATURE_BMEC		(11*32+22) /* "" Bandwidth Monitoring Event Configuration */
+#define X86_FEATURE_USER_SHSTK		(11*32+23) /* Shadow stack support for user mode applications */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
@@ -375,6 +376,7 @@
 #define X86_FEATURE_OSPKE		(16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_WAITPKG		(16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
 #define X86_FEATURE_AVX512_VBMI2	(16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK		(16*32+ 7) /* "" Shadow stack */
 #define X86_FEATURE_GFNI		(16*32+ 8) /* Galois Field New Instructions */
 #define X86_FEATURE_VAES		(16*32+ 9) /* Vector AES */
 #define X86_FEATURE_VPCLMULQDQ		(16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dfa4fb76f4b..505f78ddca82 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -99,6 +99,12 @@
 # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
 #endif
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define DISABLE_USER_SHSTK	0
+#else
+#define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -114,7 +120,7 @@
 #define DISABLED_MASK9	(DISABLE_SGX)
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
-			 DISABLE_CALL_DEPTH_TRACKING)
+			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 #define DISABLED_MASK12	0
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index f6748c8bd647..e462c1d3800a 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -81,6 +81,7 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVES    },
 	{ X86_FEATURE_XFD,			X86_FEATURE_XGETBV1   },
 	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XFD       },
+	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
 	{}
 };
 
-- 
2.17.1
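
The CPUID bit named above can be checked from userspace with the compiler's
<cpuid.h> helpers; a small sketch (the leaf and bit position come from the
commit message, CPUID.(EAX=7,ECX=0):ECX[bit 7]):

  #include <cpuid.h>
  #include <stdio.h>

  int main(void)
  {
          unsigned int eax, ebx, ecx, edx;

          if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                  return 1;       /* CPUID leaf 7 not supported */

          printf("CET shadow stack (HW): %s\n",
                 (ecx & (1u << 7)) ? "yes" : "no");
          return 0;
  }

Note this reflects only HW support; the user_shstk software capability
defined in this patch is only set when the kernel supports the feature too.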



* [PATCH v7 04/41] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (2 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 03/41] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 05/41] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
                   ` (36 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Setting CR4.CET is a prerequisite for utilizing any CET features, most of
which also require setting MSRs.

Kernel IBT already enables the CET CR4 bit when it detects IBT HW support
and is configured with kernel IBT. However, future patches that enable
userspace shadow stack support will need the bit set as well. So change
the logic to enable it in either case.

Clear MSR_IA32_U_CET in cet_disable() so that it can't live to see
userspace in a new kexec-ed kernel that has CR4.CET set from kernel IBT.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v5:
 - Remove #ifdeffery (Boris)

v4:
 - Add back dedicated command line disable: "nousershstk" (Boris)

v3:
 - Remove stray new line (Boris)
 - Simplify commit log (Andrew Cooper)

v2:
 - In the shadow stack case, go back to only setting CR4.CET if the
   kernel is compiled with user shadow stack support.
 - Clear MSR_IA32_U_CET as well. (PeterZ)
---
 arch/x86/kernel/cpu/common.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8cd4126d8253..cc686e5039be 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -600,27 +600,43 @@ __noendbr void ibt_restore(u64 save)
 
 static __always_inline void setup_cet(struct cpuinfo_x86 *c)
 {
-	u64 msr = CET_ENDBR_EN;
+	bool user_shstk, kernel_ibt;
 
-	if (!HAS_KERNEL_IBT ||
-	    !cpu_feature_enabled(X86_FEATURE_IBT))
+	if (!IS_ENABLED(CONFIG_X86_CET))
 		return;
 
-	wrmsrl(MSR_IA32_S_CET, msr);
+	kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
+	user_shstk = cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+		     IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK);
+
+	if (!kernel_ibt && !user_shstk)
+		return;
+
+	if (user_shstk)
+		set_cpu_cap(c, X86_FEATURE_USER_SHSTK);
+
+	if (kernel_ibt)
+		wrmsrl(MSR_IA32_S_CET, CET_ENDBR_EN);
+	else
+		wrmsrl(MSR_IA32_S_CET, 0);
+
 	cr4_set_bits(X86_CR4_CET);
 
-	if (!ibt_selftest()) {
+	if (kernel_ibt && !ibt_selftest()) {
 		pr_err("IBT selftest: Failed!\n");
 		wrmsrl(MSR_IA32_S_CET, 0);
 		setup_clear_cpu_cap(X86_FEATURE_IBT);
-		return;
 	}
 }
 
 __noendbr void cet_disable(void)
 {
-	if (cpu_feature_enabled(X86_FEATURE_IBT))
-		wrmsrl(MSR_IA32_S_CET, 0);
+	if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
+	      cpu_feature_enabled(X86_FEATURE_SHSTK)))
+		return;
+
+	wrmsrl(MSR_IA32_S_CET, 0);
+	wrmsrl(MSR_IA32_U_CET, 0);
 }
 
 /*
@@ -1482,6 +1498,9 @@ static void __init cpu_parse_early_param(void)
 	if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
 		setup_clear_cpu_cap(X86_FEATURE_XSAVES);
 
+	if (cmdline_find_option_bool(boot_command_line, "nousershstk"))
+		setup_clear_cpu_cap(X86_FEATURE_USER_SHSTK);
+
 	arglen = cmdline_find_option(boot_command_line, "clearcpuid", arg, sizeof(arg));
 	if (arglen <= 0)
 		return;
-- 
2.17.1
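
With this applied, whether the kernel actually enabled the feature can be
checked from the shell; two quick probes:

  $ grep -c user_shstk /proc/cpuinfo   # nonzero once kernel + HW support it
  $ cat /proc/cmdline                  # check for a "nousershstk" override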



* [PATCH v7 05/41] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (3 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 04/41] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 06/41] x86/fpu: Add helper for modifying xstate Rick Edgecombe
                   ` (35 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Shadow stack register state can be managed with XSAVE. The registers
can logically be separated into two groups:
        * Registers controlling user-mode operation
        * Registers controlling kernel-mode operation

The architecture has two new XSAVE state components: one for each of
those groups of registers. This lets an OS manage them separately if
it chooses. Future patches for host userspace and KVM guests will only
utilize the user-mode registers, so only configure XSAVE to save
user-mode registers. This state will add 16 bytes to the xsave buffer
size.

Future patches will use the user-mode XSAVE area to save guest user-mode
CET state. However, VMCS includes new fields for guest CET supervisor
states. KVM can use these to save and restore guest supervisor state, so
host supervisor XSAVE support is not required.

Adding this exacerbates the already unwieldy if statement in
check_xstate_against_struct() that handles warning about unimplemented
xfeatures. So refactor these checks by having XCHECK_SZ() evaluate to a
bool when it actually checks the xfeature, and return that result directly
from the case statements. This ends up exceeding 80 chars, but was better
on balance than other options explored.

While configuring user-mode XSAVE, clarify kernel-mode registers are not
managed by XSAVE by defining the xfeature in
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED, like is done for XFEATURE_MASK_PT.
This serves more of a documentation-as-code purpose, and functionally,
only enables a few safety checks.

Both XSAVE state components are supervisor states, even the state
controlling user-mode operation. This is a departure from earlier features
like protection keys where the PKRU state is a normal user
(non-supervisor) state. Having the user state be supervisor-managed
ensures there is no direct, unprivileged access to it, making it harder
for an attacker to subvert CET.

To facilitate this privileged access, define the two user-mode CET MSRs,
and the bits defined in those MSRs relevant to future shadow stack
enablement patches.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v5:
 - Move comments from end of lines in cet_user_state struct (Boris)

v3:
 - Add missing "is" in commit log (Boris)
 - Change to case statement for struct size checking (Boris)
 - Adjust commas on xfeature_names (Kees, Boris)

v2:
 - Change name to XFEATURE_CET_KERNEL_UNUSED (peterz)

KVM refresh:
 - Reword commit log using some verbiage posted by Dave Hansen
 - Remove unlikely to be used supervisor cet xsave struct
 - Clarify that supervisor cet state is not saved by xsave
 - Remove unused supervisor MSRs
---
 arch/x86/include/asm/fpu/types.h  | 16 +++++-
 arch/x86/include/asm/fpu/xstate.h |  6 ++-
 arch/x86/kernel/fpu/xstate.c      | 90 +++++++++++++++----------------
 3 files changed, 61 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 7f6d858ff47a..eb810074f1e7 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -115,8 +115,8 @@ enum xfeature {
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 	XFEATURE_PKRU,
 	XFEATURE_PASID,
-	XFEATURE_RSRVD_COMP_11,
-	XFEATURE_RSRVD_COMP_12,
+	XFEATURE_CET_USER,
+	XFEATURE_CET_KERNEL_UNUSED,
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
@@ -138,6 +138,8 @@ enum xfeature {
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
+#define XFEATURE_MASK_CET_USER		(1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL_UNUSED)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
 #define XFEATURE_MASK_XTILE_CFG		(1 << XFEATURE_XTILE_CFG)
 #define XFEATURE_MASK_XTILE_DATA	(1 << XFEATURE_XTILE_DATA)
@@ -252,6 +254,16 @@ struct pkru_state {
 	u32				pad;
 } __packed;
 
+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+	/* user control-flow settings */
+	u64 user_cet;
+	/* user shadow stack pointer */
+	u64 user_ssp;
+};
+
 /*
  * State component 15: Architectural LBR configuration state.
  * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cd3dd170e23a..d4427b88ee12 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -50,7 +50,8 @@
 #define XFEATURE_MASK_USER_DYNAMIC	XFEATURE_MASK_XTILE_DATA
 
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+					    XFEATURE_MASK_CET_USER)
 
 /*
  * A supervisor state component may not always contain valuable information,
@@ -77,7 +78,8 @@
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+					      XFEATURE_MASK_CET_KERNEL)
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 714166cc25f2..13a80521dd51 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -39,26 +39,26 @@
  */
 static const char *xfeature_names[] =
 {
-	"x87 floating point registers"	,
-	"SSE registers"			,
-	"AVX registers"			,
-	"MPX bounds registers"		,
-	"MPX CSR"			,
-	"AVX-512 opmask"		,
-	"AVX-512 Hi256"			,
-	"AVX-512 ZMM_Hi256"		,
-	"Processor Trace (unused)"	,
+	"x87 floating point registers",
+	"SSE registers",
+	"AVX registers",
+	"MPX bounds registers",
+	"MPX CSR",
+	"AVX-512 opmask",
+	"AVX-512 Hi256",
+	"AVX-512 ZMM_Hi256",
+	"Processor Trace (unused)",
 	"Protection Keys User registers",
 	"PASID state",
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"AMX Tile config"		,
-	"AMX Tile data"			,
-	"unknown xstate feature"	,
+	"Control-flow User registers",
+	"Control-flow Kernel registers (unused)",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"AMX Tile config",
+	"AMX Tile data",
+	"unknown xstate feature",
 };
 
 static unsigned short xsave_cpuid_features[] __initdata = {
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
 	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
 	[XFEATURE_PKRU]				= X86_FEATURE_PKU,
 	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
+	[XFEATURE_CET_USER]			= X86_FEATURE_SHSTK,
 	[XFEATURE_XTILE_CFG]			= X86_FEATURE_AMX_TILE,
 	[XFEATURE_XTILE_DATA]			= X86_FEATURE_AMX_TILE,
 };
@@ -276,6 +277,7 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
+	print_xstate_feature(XFEATURE_MASK_CET_USER);
 	print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
 	print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
@@ -344,6 +346,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
 	 XFEATURE_MASK_BNDREGS |		\
 	 XFEATURE_MASK_BNDCSR |			\
 	 XFEATURE_MASK_PASID |			\
+	 XFEATURE_MASK_CET_USER |		\
 	 XFEATURE_MASK_XTILE)
 
 /*
@@ -446,14 +449,15 @@ static void __init __xstate_dump_leaves(void)
 	}									\
 } while (0)
 
-#define XCHECK_SZ(sz, nr, nr_macro, __struct) do {			\
-	if ((nr == nr_macro) &&						\
-	    WARN_ONCE(sz != sizeof(__struct),				\
-		"%s: struct is %zu bytes, cpu state %d bytes\n",	\
-		__stringify(nr_macro), sizeof(__struct), sz)) {		\
+#define XCHECK_SZ(sz, nr, __struct) ({					\
+	if (WARN_ONCE(sz != sizeof(__struct),				\
+	    "[%s]: struct is %zu bytes, cpu state %d bytes\n",		\
+	    xfeature_names[nr], sizeof(__struct), sz)) {		\
 		__xstate_dump_leaves();					\
 	}								\
-} while (0)
+	true;								\
+})
+
 
 /**
  * check_xtile_data_against_struct - Check tile data state size.
@@ -527,36 +531,28 @@ static bool __init check_xstate_against_struct(int nr)
 	 * Ask the CPU for the size of the state.
 	 */
 	int sz = xfeature_size(nr);
+
 	/*
 	 * Match each CPU state with the corresponding software
 	 * structure.
 	 */
-	XCHECK_SZ(sz, nr, XFEATURE_YMM,       struct ymmh_struct);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDREGS,   struct mpx_bndreg_state);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDCSR,    struct mpx_bndcsr_state);
-	XCHECK_SZ(sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
-	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
-	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
-	XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
-
-	/* The tile data size varies between implementations. */
-	if (nr == XFEATURE_XTILE_DATA)
-		check_xtile_data_against_struct(sz);
-
-	/*
-	 * Make *SURE* to add any feature numbers in below if
-	 * there are "holes" in the xsave state component
-	 * numbers.
-	 */
-	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX) ||
-	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
-	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
+	switch (nr) {
+	case XFEATURE_YMM:	  return XCHECK_SZ(sz, nr, struct ymmh_struct);
+	case XFEATURE_BNDREGS:	  return XCHECK_SZ(sz, nr, struct mpx_bndreg_state);
+	case XFEATURE_BNDCSR:	  return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state);
+	case XFEATURE_OPMASK:	  return XCHECK_SZ(sz, nr, struct avx_512_opmask_state);
+	case XFEATURE_ZMM_Hi256:  return XCHECK_SZ(sz, nr, struct avx_512_zmm_uppers_state);
+	case XFEATURE_Hi16_ZMM:	  return XCHECK_SZ(sz, nr, struct avx_512_hi16_state);
+	case XFEATURE_PKRU:	  return XCHECK_SZ(sz, nr, struct pkru_state);
+	case XFEATURE_PASID:	  return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
+	case XFEATURE_XTILE_CFG:  return XCHECK_SZ(sz, nr, struct xtile_cfg);
+	case XFEATURE_CET_USER:	  return XCHECK_SZ(sz, nr, struct cet_user_state);
+	case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
+	default:
 		XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
 		return false;
 	}
+
 	return true;
 }
 
-- 
2.17.1
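
The "16 bytes" claim above follows directly from the new cet_user_state
layout (two u64s); a standalone sanity check of that arithmetic. The backing
MSR names in the comments (MSR_IA32_U_CET, MSR_IA32_PL3_SSP) are assumptions
from the wider series, not part of this patch:

  #include <assert.h>
  #include <stdint.h>

  /* Userspace mirror of the series' struct cet_user_state */
  struct cet_user_state {
          uint64_t user_cet;  /* user control-flow settings (MSR_IA32_U_CET) */
          uint64_t user_ssp;  /* user shadow stack ptr (MSR_IA32_PL3_SSP) */
  };

  /* XSAVE component 11 should grow the buffer by exactly 16 bytes */
  static_assert(sizeof(struct cet_user_state) == 16, "expected 16 bytes");

  int main(void) { return 0; }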



* [PATCH v7 06/41] x86/fpu: Add helper for modifying xstate
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (4 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 05/41] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 07/41] x86: Move control protection handler to separate file Rick Edgecombe
                   ` (34 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

Just like user xfeatures, supervisor xfeatures can be active in the
registers or present in the task FPU buffer. If the registers are
active, the registers can be modified directly. If the registers are
not active, the modification must be performed on the task FPU buffer.

When the state is not active, the kernel could perform modifications
directly to the buffer. But in order for it to do that, it needs
to know where in the buffer the specific state it wants to modify is
located. Doing this is not robust against optimizations that compact
the FPU buffer, as each access would require computing where in the
buffer it is.

The easiest way to modify supervisor xfeature data is to force restore
the registers and write directly to the MSRs. Often this is just fine
anyway, as the registers need to be restored before returning to userspace.
Do this for now, leaving buffer writing optimizations for the future.

Add a new function fpregs_lock_and_load() that can simultaneously call
fpregs_lock() and do this restore. Also perform some extra sanity
checks in this function since this will be used in non-fpu focused code.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v6:
 - Drop "but appear to work" (Boris)

v5:
 - Fix spelling error (Boris)
 - Don't export fpregs_lock_and_load() (Boris)

v3:
 - Rename to fpregs_lock_and_load() to match the unlocking
   fpregs_unlock(). (Kees)
 - Elaborate in comment about helper. (Dave)

v2:
 - Drop optimization of writing directly the buffer, and change API
   accordingly.
 - fpregs_lock_and_load() suggested by tglx
 - Some commit log verbiage from dhansen
---
 arch/x86/include/asm/fpu/api.h |  9 +++++++++
 arch/x86/kernel/fpu/core.c     | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 503a577814b2..aadc6893dcaa 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -82,6 +82,15 @@ static inline void fpregs_unlock(void)
 		preempt_enable();
 }
 
+/*
+ * FPU state gets lazily restored before returning to userspace. So when in the
+ * kernel, the valid FPU state may be kept in the buffer. This function will force
+ * restore all the fpu state to the registers early if needed, and lock them from
+ * being automatically saved/restored. Then FPU state can be modified safely in the
+ * registers, before unlocking with fpregs_unlock().
+ */
+void fpregs_lock_and_load(void);
+
 #ifdef CONFIG_X86_DEBUG_FPU
 extern void fpregs_assert_state_consistent(void);
 #else
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index caf33486dc5e..f851558b673f 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -753,6 +753,24 @@ void switch_fpu_return(void)
 }
 EXPORT_SYMBOL_GPL(switch_fpu_return);
 
+void fpregs_lock_and_load(void)
+{
+	/*
+	 * fpregs_lock() only disables preemption (mostly). So modifying state
+	 * in an interrupt could screw up some in progress fpregs operation.
+	 * Warn about it.
+	 */
+	WARN_ON_ONCE(!irq_fpu_usable());
+	WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+	fpregs_lock();
+
+	fpregs_assert_state_consistent();
+
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		fpregs_restore_userregs();
+}
+
 #ifdef CONFIG_X86_DEBUG_FPU
 /*
  * If current FPU state according to its tracking (loaded FPU context on this
-- 
2.17.1
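
As a sketch of how later patches in this series are expected to use the
helper from kernel code (the function name is made up, and the MSR and bit
are illustrative, borrowed from the CET patches rather than this one):

  /* Modify a supervisor xfeature by forcing the registers live and
   * writing the MSR directly, per the approach described above.
   */
  static void example_set_u_cet_bit(u64 bit)
  {
          u64 msr_val;

          fpregs_lock_and_load();         /* restore + pin FPU regs */
          rdmsrl(MSR_IA32_U_CET, msr_val);
          wrmsrl(MSR_IA32_U_CET, msr_val | bit);
          fpregs_unlock();                /* regs may be lazily managed again */
  }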



* [PATCH v7 07/41] x86: Move control protection handler to separate file
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (5 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 06/41] x86/fpu: Add helper for modifying xstate Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-01 15:38   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler Rick Edgecombe
                   ` (33 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

Today the control protection handler is defined in traps.c and used only
for the kernel IBT feature. To reduce ifdeffery, move it to its own file.
In future patches, functionality will be added to make this handler also
handle user shadow stack faults. So name the file cet.c.

No functional change.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v6:
 - Split move to cet.c and shadow stack enhancements to fault handler to
   separate files. (Kees)
---
 arch/x86/kernel/Makefile |  2 ++
 arch/x86/kernel/cet.c    | 76 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c  | 75 ---------------------------------------
 3 files changed, 78 insertions(+), 75 deletions(-)
 create mode 100644 arch/x86/kernel/cet.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index dd61752f4c96..92446f1dedd7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -144,6 +144,8 @@ obj-$(CONFIG_CFI_CLANG)			+= cfi.o
 
 obj-$(CONFIG_CALL_THUNKS)		+= callthunks.o
 
+obj-$(CONFIG_X86_CET)			+= cet.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..7ad22b705b64
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/ptrace.h>
+#include <asm/bugs.h>
+#include <asm/traps.h>
+
+static __ro_after_init bool ibt_fatal = true;
+
+extern void ibt_selftest_ip(void); /* code label defined in asm below */
+
+enum cp_error_code {
+	CP_EC        = (1 << 15) - 1,
+
+	CP_RET       = 1,
+	CP_IRET      = 2,
+	CP_ENDBR     = 3,
+	CP_RSTRORSSP = 4,
+	CP_SETSSBSY  = 5,
+
+	CP_ENCL	     = 1 << 15,
+};
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
+		pr_err("Unexpected #CP\n");
+		BUG();
+	}
+
+	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+		return;
+
+	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
+		regs->ax = 0;
+		return;
+	}
+
+	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
+	if (!ibt_fatal) {
+		printk(KERN_DEFAULT CUT_HERE);
+		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+		return;
+	}
+	BUG();
+}
+
+/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
+noinline bool ibt_selftest(void)
+{
+	unsigned long ret;
+
+	asm ("	lea ibt_selftest_ip(%%rip), %%rax\n\t"
+	     ANNOTATE_RETPOLINE_SAFE
+	     "	jmp *%%rax\n\t"
+	     "ibt_selftest_ip:\n\t"
+	     UNWIND_HINT_FUNC
+	     ANNOTATE_NOENDBR
+	     "	nop\n\t"
+
+	     : "=a" (ret) : : "memory");
+
+	return !ret;
+}
+
+static int __init ibt_setup(char *str)
+{
+	if (!strcmp(str, "off"))
+		setup_clear_cpu_cap(X86_FEATURE_IBT);
+
+	if (!strcmp(str, "warn"))
+		ibt_fatal = false;
+
+	return 1;
+}
+
+__setup("ibt=", ibt_setup);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d317dc3d06a3..cc223e60aba2 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -213,81 +213,6 @@ DEFINE_IDTENTRY(exc_overflow)
 	do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
 }
 
-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
-enum cp_error_code {
-	CP_EC        = (1 << 15) - 1,
-
-	CP_RET       = 1,
-	CP_IRET      = 2,
-	CP_ENDBR     = 3,
-	CP_RSTRORSSP = 4,
-	CP_SETSSBSY  = 5,
-
-	CP_ENCL	     = 1 << 15,
-};
-
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
-{
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
-	}
-
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
-		return;
-
-	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
-		regs->ax = 0;
-		return;
-	}
-
-	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
-	if (!ibt_fatal) {
-		printk(KERN_DEFAULT CUT_HERE);
-		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
-		return;
-	}
-	BUG();
-}
-
-/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
-noinline bool ibt_selftest(void)
-{
-	unsigned long ret;
-
-	asm ("	lea ibt_selftest_ip(%%rip), %%rax\n\t"
-	     ANNOTATE_RETPOLINE_SAFE
-	     "	jmp *%%rax\n\t"
-	     "ibt_selftest_ip:\n\t"
-	     UNWIND_HINT_FUNC
-	     ANNOTATE_NOENDBR
-	     "	nop\n\t"
-
-	     : "=a" (ret) : : "memory");
-
-	return !ret;
-}
-
-static int __init ibt_setup(char *str)
-{
-	if (!strcmp(str, "off"))
-		setup_clear_cpu_cap(X86_FEATURE_IBT);
-
-	if (!strcmp(str, "warn"))
-		ibt_fatal = false;
-
-	return 1;
-}
-
-__setup("ibt=", ibt_setup);
-
-#endif /* CONFIG_X86_KERNEL_IBT */
-
 #ifdef CONFIG_X86_F00F_BUG
 void handle_invalid_op(struct pt_regs *regs)
 #else
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (6 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 07/41] x86: Move control protection handler to separate file Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-01 18:06   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 09/41] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
                   ` (32 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, when the return address for a RET instruction differs from the
copy on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT faults. Refactor this fault handler into separate user and kernel
handlers, like the page fault handler. Add a control-protection handler
for usermode. To avoid ifdeffery, put them both in a new file cet.c, which
is compiled in the case of either of the two CET features supported in the
kernel: kernel IBT or user mode shadow stack. Move some static inline
functions from traps.c into a header so they can be used in cet.c.

Opportunistically fix a comment in the kernel IBT part of the fault
handler that sits at the end of a line instead of preceding the code.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when the feature
is missing. This unifies the behavior with the new shadow stack code, and
also prevents the kernel from crashing in this situation, which is
potentially recoverable.

The control-protection fault handler works in a similar way as the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.
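
As a rough illustration of what this gives userspace, a process can
install a SIGSEGV handler with SA_SIGINFO and distinguish a
control-protection fault from other segfaults via si_code. A minimal
sketch (SEGV_CPERR is the value added by this patch; older headers do
not define it, so the fallback define below assumes this patch's value):

  #include <signal.h>
  #include <string.h>
  #include <unistd.h>

  #ifndef SEGV_CPERR
  #define SEGV_CPERR 10	/* value added by this patch */
  #endif

  static void segv_handler(int sig, siginfo_t *info, void *ucontext)
  {
  	if (info->si_code == SEGV_CPERR) {
  		static const char msg[] = "control-protection fault\n";

  		/* write() is async-signal-safe, unlike printf() */
  		write(STDERR_FILENO, msg, sizeof(msg) - 1);
  	}
  	_exit(1);
  }

  int main(void)
  {
  	struct sigaction sa;

  	memset(&sa, 0, sizeof(sa));
  	sa.sa_sigaction = segv_handler;
  	sa.sa_flags = SA_SIGINFO;
  	sigaction(SIGSEGV, &sa, NULL);

  	/* ... run code that might violate shadow stack rules ... */
  	return 0;
  }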

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>

---
v7:
 - Adjust alignment of WARN statement

v6:
 - Split into separate patches (Kees)
 - Change to "x86/shstk" in commit log (Boris)

v5:
 - Move to separate file to avoid ifdeffery (Boris)
 - Improvements to commit log (Boris)
 - Rename control_protection_err (Boris)
 - Move comment from end of line in IBT fault handler (Boris)

v3:
 - Shorten user/kernel #CP handler function names (peterz)
 - Restore CP_ENDBR check to kernel handler (peterz)
 - Utilize CONFIG_X86_CET (Kees)
 - Unify "unexpected" warnings (Andrew Cooper)
 - Use 2d array for error code chars (Andrew Cooper)
 - Add comment about why to read SSP MSR before enabling interrupts

v2:
 - Integrate with kernel IBT fault handler
 - Update printed messages. (Dave)
 - Remove array_index_nospec() usage. (Dave)
 - Remove IBT messages. (Dave)
 - Add enclave error code bit processing in case it can get triggered
   somehow.
 - Add extra "unknown" in control_protection_err.
---
 arch/arm/kernel/signal.c                 |  2 +-
 arch/arm64/kernel/signal.c               |  2 +-
 arch/arm64/kernel/signal32.c             |  2 +-
 arch/sparc/kernel/signal32.c             |  2 +-
 arch/sparc/kernel/signal_64.c            |  2 +-
 arch/x86/include/asm/disabled-features.h |  8 +-
 arch/x86/include/asm/idtentry.h          |  2 +-
 arch/x86/include/asm/traps.h             | 12 +++
 arch/x86/kernel/cet.c                    | 94 +++++++++++++++++++++---
 arch/x86/kernel/idt.c                    |  2 +-
 arch/x86/kernel/signal_32.c              |  2 +-
 arch/x86/kernel/signal_64.c              |  2 +-
 arch/x86/kernel/traps.c                  | 12 ---
 arch/x86/xen/enlighten_pv.c              |  2 +-
 arch/x86/xen/xen-asm.S                   |  2 +-
 include/uapi/asm-generic/siginfo.h       |  3 +-
 16 files changed, 117 insertions(+), 34 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 06a02707f488..19b6b292892c 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1341,7 +1341,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 505f78ddca82..652e366b68a0 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
 #define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
 #endif
 
+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT	0
+#else
+#define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -128,7 +134,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK18	0
+#define DISABLED_MASK18	(DISABLE_IBT)
 #define DISABLED_MASK19	0
 #define DISABLED_MASK20	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..69e26f48d027 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..75e0dabf0c45 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,16 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 				      struct stack_info *info);
 #endif
 
+static inline void cond_local_irq_enable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_enable();
+}
+
+static inline void cond_local_irq_disable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_disable();
+}
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 7ad22b705b64..cc10d8be9d74 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -4,10 +4,6 @@
 #include <asm/bugs.h>
 #include <asm/traps.h>
 
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
 enum cp_error_code {
 	CP_EC        = (1 << 15) - 1,
 
@@ -20,15 +16,80 @@ enum cp_error_code {
 	CP_ENCL	     = 1 << 15,
 };
 
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+static const char cp_err[][10] = {
+	[0] = "unknown",
+	[1] = "near ret",
+	[2] = "far/iret",
+	[3] = "endbranch",
+	[4] = "rstorssp",
+	[5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+	unsigned int cpec = error_code & CP_EC;
+
+	if (cpec >= ARRAY_SIZE(cp_err))
+		cpec = 0;
+	return cp_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+	WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+		  user_mode(regs) ? "user mode" : "kernel mode",
+		  cp_err_string(error_code));
+}
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/*
+	 * An exception was just taken from userspace. Since interrupts are disabled
+	 * here, no scheduling should have messed with the registers yet and they
+	 * will be whatever is live in userspace. So read the SSP before enabling
+	 * interrupts, so that locking the fpregs to do it later is not required.
+	 */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 cp_err_string(error_code),
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
 	}
 
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+
+static __ro_after_init bool ibt_fatal = true;
+
+/* code label defined in asm below */
+extern void ibt_selftest_ip(void);
+
+static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	if ((error_code & CP_EC) != CP_ENDBR) {
+		do_unexpected_cp(regs, error_code);
 		return;
+	}
 
 	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
 		regs->ax = 0;
@@ -74,3 +135,18 @@ static int __init ibt_setup(char *str)
 }
 
 __setup("ibt=", ibt_setup);
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (user_mode(regs)) {
+		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+			do_user_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	} else {
+		if (cpu_feature_enabled(X86_FEATURE_IBT))
+			do_kernel_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	}
+}
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 9027fc088f97..c12624bc82a3 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -402,7 +402,7 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 13a1e6083837..0e808c72bf7e 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -403,7 +403,7 @@ void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cc223e60aba2..18fb9d620824 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -77,18 +77,6 @@
 
 DECLARE_BITMAP(system_vectors, NR_VECTORS);
 
-static inline void cond_local_irq_enable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_enable();
-}
-
-static inline void cond_local_irq_disable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_disable();
-}
-
 __always_inline int is_valid_bugaddr(unsigned long addr)
 {
 	if (addr < TASK_SIZE_MAX)
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index bb59cc6ddb2d..9c29cd5393cc 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -640,7 +640,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 4a184f6e4e4d..7cdcb4ce6976 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 09/41] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (7 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 10/41] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
                   ` (31 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu, Christoph Hellwig

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages.

In normal cases, it can be helpful to create Write=1 PTEs with Dirty=1
already set when HW dirty tracking is not needed, because otherwise the
CPU has to set Dirty=1 itself on the first write to the memory, which is
extra work. So the traditional wisdom was to simply set the Dirty bit
whenever its state didn't matter. However, this was never really very
helpful for read-only kernel memory.

When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
any functional benefit either, so to reduce ambiguity between shadow stack
and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.
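
For reference, the PTE encodings at play on a shadow stack capable CPU
can be summarized like this (a simplification of the rules described
above, not the full architectural definition):

  Write=1, Dirty=0 or 1  ->  conventionally writable memory
  Write=0, Dirty=0       ->  read-only memory
  Write=0, Dirty=1       ->  shadow stack memory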

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>

---
v6:
 - Also remove dirty from newly added set_memory_rox()

v5:
 - Spelling and grammar in commit log (Boris)

v3:
 - Update commit log (Andrew Cooper, Peterz)

v2:
 - Normalize PTE bit descriptions between patches
---
 arch/x86/include/asm/pgtable_types.h | 6 +++---
 arch/x86/mm/pat/set_memory.c         | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 447d4bee25c4..0646ad00178b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -192,10 +192,10 @@ enum page_cache_mode {
 #define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
-#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
+#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
+#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
-#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_LARGE	 (__PP|__RW|   0|___A|__NX|___D|_PSE|___G)
 #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
 #define __PAGE_KERNEL_WP	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 356758b7d4b4..1b5c0dc9f32b 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2073,12 +2073,12 @@ int set_memory_nx(unsigned long addr, int numpages)
 
 int set_memory_ro(unsigned long addr, int numpages)
 {
-	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
 }
 
 int set_memory_rox(unsigned long addr, int numpages)
 {
-	pgprot_t clr = __pgprot(_PAGE_RW);
+	pgprot_t clr = __pgprot(_PAGE_RW | _PAGE_DIRTY);
 
 	if (__supported_pte_mask & _PAGE_NX)
 		clr.pgprot |= _PAGE_NX;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 10/41] x86/mm: Move pmd_write(), pud_write() up in the file
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (8 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 09/41] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 11/41] mm: Introduce pte_mkwrite_kernel() Rick Edgecombe
                   ` (30 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

To prepare the introduction of _PAGE_SAVED_DIRTY, move pmd_write() and
pud_write() up in the file, so that they can be used by other
helpers below.  No functional changes.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7425f32e5293..56eea96502c6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -160,6 +160,18 @@ static inline int pte_write(pte_t pte)
 	return pte_flags(pte) & _PAGE_RW;
 }
 
+#define pmd_write pmd_write
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define pud_write pud_write
+static inline int pud_write(pud_t pud)
+{
+	return pud_flags(pud) & _PAGE_RW;
+}
+
 static inline int pte_huge(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_PSE;
@@ -1120,12 +1132,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
 
-#define pmd_write pmd_write
-static inline int pmd_write(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_RW;
-}
-
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long addr,
 				       pmd_t *pmdp)
@@ -1155,12 +1161,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-#define pud_write pud_write
-static inline int pud_write(pud_t pud)
-{
-	return pud_flags(pud) & _PAGE_RW;
-}
-
 #ifndef pmdp_establish
 #define pmdp_establish pmdp_establish
 static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 11/41] mm: Introduce pte_mkwrite_kernel()
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (9 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 10/41] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 12/41] s390/mm: Introduce pmd_mkwrite_kernel() Rick Edgecombe
                   ` (29 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, linux-arm-kernel, linux-s390, xen-devel

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

One of these changes is to allow for pte_mkwrite() to create different
types of writable memory (the existing conventionally writable type and
also the new shadow stack type). Future patches will convert pte_mkwrite()
to take a VMA in order to facilitate this, however there are places in the
kernel where pte_mkwrite() is called outside of the context of a VMA.
These are for kernel memory. So create a new variant called
pte_mkwrite_kernel() and switch the kernel users over to it. Have
pte_mkwrite() and pte_mkwrite_kernel() be the same for now. Future patches
will introduce changes to make pte_mkwrite() take a VMA.

Only do this for architectures that need it, i.e. those that call
pte_mkwrite() in arch code without an associated VMA. Since it will
currently only be used in arch code, do not include it in
arch_pgtable_helpers.rst.
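
To make the intended split concrete, the two flavors of call site look
like this after the patch (a sketch; pte_mkwrite() only grows a VMA
argument in a later patch of this series):

  /* Kernel mapping, no VMA in sight (e.g. the s390 SET_MEMORY_RW path): */
  new = pte_mkwrite_kernel(pte_mkdirty(new));

  /* Userspace mapping with a VMA available; unchanged for now, converted
   * to pte_mkwrite(pte, vma) later in the series: */
  pte = pte_mkwrite(pte);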

Cc: linux-doc@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
Hi Non-x86 Archs,

x86 has a feature that allows for the creation of a special type of
writable memory (shadow stack) that is only writable in limited specific
ways. Previously, changes were proposed to core MM code to teach it to
decide when to create normally writable memory or the special shadow stack
writable memory, but David Hildenbrand suggested[0] to change
pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
moved into x86 code.

Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
changes. So that is why you are seeing some patches out of a big x86
series pop up in your arch mailing list. There is no functional change.
After this refactor, the shadow stack series goes on to use the arch
helpers to push shadow stack memory details inside arch/x86.

Testing was just 0-day build testing.

Hopefully that is enough context. Thanks!

[0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redhat.com/#t

v6:
 - New patch
---
 arch/arm64/include/asm/pgtable.h | 7 ++++++-
 arch/arm64/mm/trans_pgd.c        | 4 ++--
 arch/s390/include/asm/pgtable.h  | 7 ++++++-
 arch/s390/mm/pageattr.c          | 2 +-
 arch/x86/include/asm/pgtable.h   | 7 ++++++-
 arch/x86/xen/mmu_pv.c            | 2 +-
 6 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b6ba466e2e8a..cccf8885792e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -180,13 +180,18 @@ static inline pmd_t set_pmd_bit(pmd_t pmd, pgprot_t prot)
 	return pmd;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_kernel(pte_t pte)
 {
 	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
 	pte = clear_pte_bit(pte, __pgprot(PTE_RDONLY));
 	return pte;
 }
 
+static inline pte_t pte_mkwrite(pte_t pte)
+{
+	return pte_mkwrite_kernel(pte);
+}
+
 static inline pte_t pte_mkclean(pte_t pte)
 {
 	pte = clear_pte_bit(pte, __pgprot(PTE_DIRTY));
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 4ea2eefbc053..5c07e68d80ea 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -40,7 +40,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 * read only (code, rodata). Clear the RDONLY bit from
 		 * the temporary mappings we use during restore.
 		 */
-		set_pte(dst_ptep, pte_mkwrite(pte));
+		set_pte(dst_ptep, pte_mkwrite_kernel(pte));
 	} else if (debug_pagealloc_enabled() && !pte_none(pte)) {
 		/*
 		 * debug_pagealloc will removed the PTE_VALID bit if
@@ -53,7 +53,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 */
 		BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
+		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_kernel(pte)));
 	}
 }
 
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2c70b4d1263d..d4943f2d3f00 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1005,7 +1005,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
 	return set_pte_bit(pte, __pgprot(_PAGE_PROTECT));
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_kernel(pte_t pte)
 {
 	pte = set_pte_bit(pte, __pgprot(_PAGE_WRITE));
 	if (pte_val(pte) & _PAGE_DIRTY)
@@ -1013,6 +1013,11 @@ static inline pte_t pte_mkwrite(pte_t pte)
 	return pte;
 }
 
+static inline pte_t pte_mkwrite(pte_t pte)
+{
+	return pte_mkwrite_kernel(pte);
+}
+
 static inline pte_t pte_mkclean(pte_t pte)
 {
 	pte = clear_pte_bit(pte, __pgprot(_PAGE_DIRTY));
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 85195c18b2e8..4ee5fe5caa23 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -96,7 +96,7 @@ static int walk_pte_level(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		if (flags & SET_MEMORY_RO)
 			new = pte_wrprotect(new);
 		else if (flags & SET_MEMORY_RW)
-			new = pte_mkwrite(pte_mkdirty(new));
+			new = pte_mkwrite_kernel(pte_mkdirty(new));
 		if (flags & SET_MEMORY_NX)
 			new = set_pte_bit(new, __pgprot(_PAGE_NOEXEC));
 		else if (flags & SET_MEMORY_X)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 56eea96502c6..3607f2572f9e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -364,11 +364,16 @@ static inline pte_t pte_mkyoung(pte_t pte)
 	return pte_set_flags(pte, _PAGE_ACCESSED);
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_kernel(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_RW);
 }
 
+static inline pte_t pte_mkwrite(pte_t pte)
+{
+	return pte_mkwrite_kernel(pte);
+}
+
 static inline pte_t pte_mkhuge(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_PSE);
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index ee29fb558f2e..a23f04243c19 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -150,7 +150,7 @@ void make_lowmem_page_readwrite(void *vaddr)
 	if (pte == NULL)
 		return;		/* vaddr missing */
 
-	ptev = pte_mkwrite(*pte);
+	ptev = pte_mkwrite_kernel(*pte);
 
 	if (HYPERVISOR_update_va_mapping(address, ptev, 0))
 		BUG();
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 12/41] s390/mm: Introduce pmd_mkwrite_kernel()
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (10 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 11/41] mm: Introduce pte_mkwrite_kernel() Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 13/41] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
                   ` (28 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, linux-s390

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

One of these changes is to allow for pmd_mkwrite() to create different
types of writable memory (the existing conventionally writable type and
also the new shadow stack type). Future patches will convert pmd_mkwrite()
to take a VMA in order to facilitate this, however there are places in the
kernel where pmd_mkwrite() is called outside of the context of a VMA.
These are for kernel memory. So create a new variant called
pmd_mkwrite_kernel() and switch the kernel users over to it. Have
pmd_mkwrite() and pmd_mkwrite_kernel() be the same for now. Future patches
will introduce changes to make pmd_mkwrite() take a VMA.

Only do this for architectures that need it, i.e. those that call
pmd_mkwrite() in arch code without an associated VMA. Since it will
currently only be used in arch code, do not include it in
arch_pgtable_helpers.rst.

Cc: linux-kernel@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
Hi Non-x86 Archs,

x86 has a feature that allows for the creation of a special type of
writable memory (shadow stack) that is only writable in limited specific
ways. Previously, changes were proposed to core MM code to teach it to
decide when to create normally writable memory or the special shadow stack
writable memory, but David Hildenbrand suggested[0] to change
pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
moved into x86 code.

Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
changes. So that is why you are seeing some patches out of a big x86
series pop up in your arch mailing list. There is no functional change.
After this refactor, the shadow stack series goes on to use the arch
helpers to push shadow stack memory details inside arch/x86.

Testing was just 0-day build testing.

Hopefully that is enough context. Thanks!

[0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redhat.com/#t

v6:
 - New patch
---
 arch/s390/include/asm/pgtable.h | 7 ++++++-
 arch/s390/mm/pageattr.c         | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index d4943f2d3f00..deeb918cae1d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1491,7 +1491,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 	return set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_PROTECT));
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_kernel(pmd_t pmd)
 {
 	pmd = set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_WRITE));
 	if (pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY)
@@ -1499,6 +1499,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
 	return pmd;
 }
 
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_mkwrite_kernel(pmd);
+}
+
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
 	pmd = clear_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_DIRTY));
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 4ee5fe5caa23..7b6967dfacd0 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -146,7 +146,7 @@ static void modify_pmd_page(pmd_t *pmdp, unsigned long addr,
 	if (flags & SET_MEMORY_RO)
 		new = pmd_wrprotect(new);
 	else if (flags & SET_MEMORY_RW)
-		new = pmd_mkwrite(pmd_mkdirty(new));
+		new = pmd_mkwrite_kernel(pmd_mkdirty(new));
 	if (flags & SET_MEMORY_NX)
 		new = set_pmd_bit(new, __pgprot(_SEGMENT_ENTRY_NOEXEC));
 	else if (flags & SET_MEMORY_X)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 13/41] mm: Make pte_mkwrite() take a VMA
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (11 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 12/41] s390/mm: Introduce pmd_mkwrite_kernel() Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-01  7:03   ` Christophe Leroy
  2023-03-02 12:19   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY Rick Edgecombe
                   ` (27 subsequent siblings)
  40 siblings, 2 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, linux-ia64, loongarch, linux-m68k,
	Michal Simek, Dinh Nguyen, linux-mips, linux-openrisc,
	linux-parisc, linuxppc-dev, linux-riscv, linux-s390, linux-sh,
	sparclinux, linux-um, xen-devel

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

One of these unusual properties is that shadow stack memory is writable,
but only in limited ways. These limits are applied via a specific PTE
bit combination. Nevertheless, the memory is writable, and core mm code
will need to apply the writable permissions in the typical paths that
call pte_mkwrite().

In addition to VM_WRITE, the shadow stack VMAs will have a flag denoting
that they are the special shadow stack flavor of writable memory. So make
pte_mkwrite() take a VMA, so that the x86 implementation of it can know to
create regular writable memory or shadow stack memory.

Apply the same changes for pmd_mkwrite() and huge_pte_mkwrite().

No functional change.
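
For context, the direction this enables on x86 looks roughly like the
following (a sketch based on later patches in this series, which add
VM_SHADOW_STACK and pte_mkwrite_shstk(); neither exists at this point):

  static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
  {
  	if (vma->vm_flags & VM_SHADOW_STACK)
  		return pte_mkwrite_shstk(pte);	/* Write=0,Dirty=1 */

  	return pte_mkwrite_kernel(pte);		/* Write=1 */
  }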

Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-csky@vger.kernel.org
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linux-m68k@lists.linux-m68k.org
Cc: Michal Simek <monstr@monstr.eu>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: linux-mips@vger.kernel.org
Cc: linux-openrisc@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
Hi Non-x86 Archs,

x86 has a feature that allows for the creation of a special type of
writable memory (shadow stack) that is only writable in limited specific
ways. Previously, changes were proposed to core MM code to teach it to
decide when to create normally writable memory or the special shadow stack
writable memory, but David Hildenbrand suggested[0] to change
pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
moved into x86 code.

Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
changes. So that is why you are seeing some patches out of a big x86
series pop up in your arch mailing list. There is no functional change.
After this refactor, the shadow stack series goes on to use the arch
helpers to push shadow stack memory details inside arch/x86.

Testing was just 0-day build testing.

Hopefully that is enough context. Thanks!

[0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redhat.com/#t

v6:
 - New patch
---
 Documentation/mm/arch_pgtable_helpers.rst    |  9 ++++++---
 arch/alpha/include/asm/pgtable.h             |  6 +++++-
 arch/arc/include/asm/hugepage.h              |  2 +-
 arch/arc/include/asm/pgtable-bits-arcv2.h    |  7 ++++++-
 arch/arm/include/asm/pgtable-3level.h        |  7 ++++++-
 arch/arm/include/asm/pgtable.h               |  2 +-
 arch/arm64/include/asm/pgtable.h             |  4 ++--
 arch/csky/include/asm/pgtable.h              |  2 +-
 arch/hexagon/include/asm/pgtable.h           |  2 +-
 arch/ia64/include/asm/pgtable.h              |  2 +-
 arch/loongarch/include/asm/pgtable.h         |  4 ++--
 arch/m68k/include/asm/mcf_pgtable.h          |  2 +-
 arch/m68k/include/asm/motorola_pgtable.h     |  6 +++++-
 arch/m68k/include/asm/sun3_pgtable.h         |  6 +++++-
 arch/microblaze/include/asm/pgtable.h        |  2 +-
 arch/mips/include/asm/pgtable.h              |  6 +++---
 arch/nios2/include/asm/pgtable.h             |  2 +-
 arch/openrisc/include/asm/pgtable.h          |  2 +-
 arch/parisc/include/asm/pgtable.h            |  6 +++++-
 arch/powerpc/include/asm/book3s/32/pgtable.h |  2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h |  4 ++--
 arch/powerpc/include/asm/nohash/32/pgtable.h |  2 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |  2 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h |  2 +-
 arch/riscv/include/asm/pgtable.h             |  6 +++---
 arch/s390/include/asm/hugetlb.h              |  4 ++--
 arch/s390/include/asm/pgtable.h              |  4 ++--
 arch/sh/include/asm/pgtable_32.h             | 10 ++++++++--
 arch/sparc/include/asm/pgtable_32.h          |  2 +-
 arch/sparc/include/asm/pgtable_64.h          |  6 +++---
 arch/um/include/asm/pgtable.h                |  2 +-
 arch/x86/include/asm/pgtable.h               |  6 ++++--
 arch/xtensa/include/asm/pgtable.h            |  2 +-
 include/asm-generic/hugetlb.h                |  4 ++--
 include/linux/mm.h                           |  2 +-
 mm/debug_vm_pgtable.c                        | 16 ++++++++--------
 mm/huge_memory.c                             |  6 +++---
 mm/hugetlb.c                                 |  4 ++--
 mm/memory.c                                  |  4 ++--
 mm/migrate_device.c                          |  2 +-
 mm/mprotect.c                                |  2 +-
 mm/userfaultfd.c                             |  2 +-
 42 files changed, 106 insertions(+), 69 deletions(-)

diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index 30d9a09f01f4..78ac3ff2fe1d 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -46,7 +46,8 @@ PTE Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pte_mkclean               | Creates a clean PTE                              |
 +---------------------------+--------------------------------------------------+
-| pte_mkwrite               | Creates a writable PTE                           |
+| pte_mkwrite               | Creates a writable PTE of the type specified by  |
+|                           | the VMA.                                         |
 +---------------------------+--------------------------------------------------+
 | pte_wrprotect             | Creates a write protected PTE                    |
 +---------------------------+--------------------------------------------------+
@@ -118,7 +119,8 @@ PMD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pmd_mkclean               | Creates a clean PMD                              |
 +---------------------------+--------------------------------------------------+
-| pmd_mkwrite               | Creates a writable PMD                           |
+| pmd_mkwrite               | Creates a writable PMD of the type specified by  |
+|                           | the VMA.                                         |
 +---------------------------+--------------------------------------------------+
 | pmd_wrprotect             | Creates a write protected PMD                    |
 +---------------------------+--------------------------------------------------+
@@ -222,7 +224,8 @@ HugeTLB Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | huge_pte_mkdirty          | Creates a dirty HugeTLB                          |
 +---------------------------+--------------------------------------------------+
-| huge_pte_mkwrite          | Creates a writable HugeTLB                       |
+| huge_pte_mkwrite          | Creates a writable HugeTLB of the type specified |
+|                           | by the VMA.                                      |
 +---------------------------+--------------------------------------------------+
 | huge_pte_wrprotect        | Creates a write protected HugeTLB                |
 +---------------------------+--------------------------------------------------+
diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index ba43cb841d19..fb5d207c2a89 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -256,9 +256,13 @@ extern inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED;
 extern inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) |= _PAGE_FOW; return pte; }
 extern inline pte_t pte_mkclean(pte_t pte)	{ pte_val(pte) &= ~(__DIRTY_BITS); return pte; }
 extern inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~(__ACCESS_BITS); return pte; }
-extern inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) &= ~_PAGE_FOW; return pte; }
 extern inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= __DIRTY_BITS; return pte; }
 extern inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= __ACCESS_BITS; return pte; }
+extern inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	pte_val(pte) &= ~_PAGE_FOW;
+	return pte;
+}
 
 /*
  * The smp_rmb() in the following functions are required to order the load of
diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
index 5001b796fb8d..223a96967188 100644
--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -21,7 +21,7 @@ static inline pmd_t pte_pmd(pte_t pte)
 }
 
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite(pmd, vma)	pte_pmd(pte_mkwrite(pmd_pte(pmd), (vma)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
diff --git a/arch/arc/include/asm/pgtable-bits-arcv2.h b/arch/arc/include/asm/pgtable-bits-arcv2.h
index 6e9f8ca6d6a1..a5b8bc955015 100644
--- a/arch/arc/include/asm/pgtable-bits-arcv2.h
+++ b/arch/arc/include/asm/pgtable-bits-arcv2.h
@@ -87,7 +87,6 @@
 
 PTE_BIT_FUNC(mknotpresent,     &= ~(_PAGE_PRESENT));
 PTE_BIT_FUNC(wrprotect,	&= ~(_PAGE_WRITE));
-PTE_BIT_FUNC(mkwrite,	|= (_PAGE_WRITE));
 PTE_BIT_FUNC(mkclean,	&= ~(_PAGE_DIRTY));
 PTE_BIT_FUNC(mkdirty,	|= (_PAGE_DIRTY));
 PTE_BIT_FUNC(mkold,	&= ~(_PAGE_ACCESSED));
@@ -95,6 +94,12 @@ PTE_BIT_FUNC(mkyoung,	|= (_PAGE_ACCESSED));
 PTE_BIT_FUNC(mkspecial,	|= (_PAGE_SPECIAL));
 PTE_BIT_FUNC(mkhuge,	|= (_PAGE_HW_SZ));
 
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	pte_val(pte) |= (_PAGE_WRITE);
+	return pte;
+}
+
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 106049791500..df071a807610 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -202,11 +202,16 @@ static inline pmd_t pmd_##fn(pmd_t pmd) { pmd_val(pmd) op; return pmd; }
 
 PMD_BIT_FUNC(wrprotect,	|= L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkold,	&= ~PMD_SECT_AF);
-PMD_BIT_FUNC(mkwrite,   &= ~L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkdirty,   |= L_PMD_SECT_DIRTY);
 PMD_BIT_FUNC(mkclean,   &= ~L_PMD_SECT_DIRTY);
 PMD_BIT_FUNC(mkyoung,   |= PMD_SECT_AF);
 
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	pmd_val(pmd) &= ~L_PMD_SECT_RDONLY;
+	return pmd;
+}
+
 #define pmd_mkhuge(pmd)		(__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
 
 #define pmd_pfn(pmd)		(((pmd_val(pmd) & PMD_MASK) & PHYS_MASK) >> PAGE_SHIFT)
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index a58ccbb406ad..39ad1ae1308d 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -227,7 +227,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
 	return set_pte_bit(pte, __pgprot(L_PTE_RDONLY));
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return clear_pte_bit(pte, __pgprot(L_PTE_RDONLY));
 }
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index cccf8885792e..913bf370f74a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -187,7 +187,7 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return pte_mkwrite_kernel(pte);
 }
@@ -492,7 +492,7 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #define pmd_cont(pmd)		pte_cont(pmd_pte(pmd))
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite(pmd, vma)	pte_pmd(pte_mkwrite(pmd_pte(pmd), (vma)))
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index d4042495febc..c2f92c991e37 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -176,7 +176,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	if (pte_val(pte) & _PAGE_MODIFIED)
diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h
index 59393613d086..14ab9c789c0e 100644
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -300,7 +300,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
 }
 
 /* pte_mkwrite - mark page as writable */
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	return pte;
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 21c97e31a28a..f879dd626da6 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -268,7 +268,7 @@ ia64_phys_addr_valid (unsigned long addr)
  * access rights:
  */
 #define pte_wrprotect(pte)	(__pte(pte_val(pte) & ~_PAGE_AR_RW))
-#define pte_mkwrite(pte)	(__pte(pte_val(pte) | _PAGE_AR_RW))
+#define pte_mkwrite(pte, vma)	(__pte(pte_val(pte) | _PAGE_AR_RW))
 #define pte_mkold(pte)		(__pte(pte_val(pte) & ~_PAGE_A))
 #define pte_mkyoung(pte)	(__pte(pte_val(pte) | _PAGE_A))
 #define pte_mkclean(pte)	(__pte(pte_val(pte) & ~_PAGE_D))
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index d28fb9dbec59..ebf645f40298 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -390,7 +390,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	if (pte_val(pte) & _PAGE_MODIFIED)
@@ -490,7 +490,7 @@ static inline int pmd_write(pmd_t pmd)
 	return !!(pmd_val(pmd) & _PAGE_WRITE);
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	pmd_val(pmd) |= _PAGE_WRITE;
 	if (pmd_val(pmd) & _PAGE_MODIFIED)
diff --git a/arch/m68k/include/asm/mcf_pgtable.h b/arch/m68k/include/asm/mcf_pgtable.h
index 13741c1245e1..37d77e055016 100644
--- a/arch/m68k/include/asm/mcf_pgtable.h
+++ b/arch/m68k/include/asm/mcf_pgtable.h
@@ -211,7 +211,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	pte_val(pte) |= CF_PAGE_WRITABLE;
 	return pte;
diff --git a/arch/m68k/include/asm/motorola_pgtable.h b/arch/m68k/include/asm/motorola_pgtable.h
index ec0dc19ab834..c4e8eb76286d 100644
--- a/arch/m68k/include/asm/motorola_pgtable.h
+++ b/arch/m68k/include/asm/motorola_pgtable.h
@@ -155,7 +155,6 @@ static inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED;
 static inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) |= _PAGE_RONLY; return pte; }
 static inline pte_t pte_mkclean(pte_t pte)	{ pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~_PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) &= ~_PAGE_RONLY; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mknocache(pte_t pte)
@@ -168,6 +167,11 @@ static inline pte_t pte_mkcache(pte_t pte)
 	pte_val(pte) = (pte_val(pte) & _CACHEMASK040) | m68k_supervisor_cachemode;
 	return pte;
 }
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	pte_val(pte) &= ~_PAGE_RONLY;
+	return pte;
+}
 
 #define swapper_pg_dir kernel_pg_dir
 extern pgd_t kernel_pg_dir[128];
diff --git a/arch/m68k/include/asm/sun3_pgtable.h b/arch/m68k/include/asm/sun3_pgtable.h
index e582b0484a55..2a06bea51a1e 100644
--- a/arch/m68k/include/asm/sun3_pgtable.h
+++ b/arch/m68k/include/asm/sun3_pgtable.h
@@ -143,10 +143,14 @@ static inline int pte_young(pte_t pte)		{ return pte_val(pte) & SUN3_PAGE_ACCESS
 static inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) &= ~SUN3_PAGE_WRITEABLE; return pte; }
 static inline pte_t pte_mkclean(pte_t pte)	{ pte_val(pte) &= ~SUN3_PAGE_MODIFIED; return pte; }
 static inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~SUN3_PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_WRITEABLE; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_MODIFIED; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mknocache(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_NOCACHE; return pte; }
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	pte_val(pte) |= SUN3_PAGE_WRITEABLE;
+	return pte;
+}
 // use this version when caches work...
 //static inline pte_t pte_mkcache(pte_t pte)	{ pte_val(pte) &= SUN3_PAGE_NOCACHE; return pte; }
 // until then, use:
diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/include/asm/pgtable.h
index d1b8272abcd9..5b83e82f8d7e 100644
--- a/arch/microblaze/include/asm/pgtable.h
+++ b/arch/microblaze/include/asm/pgtable.h
@@ -266,7 +266,7 @@ static inline pte_t pte_mkread(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_USER; return pte; }
 static inline pte_t pte_mkexec(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_USER | _PAGE_EXEC; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) \
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma) \
 	{ pte_val(pte) |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 791389bf3c12..06efd567144a 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -309,7 +309,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	pte.pte_low |= _PAGE_WRITE;
 	if (pte.pte_low & _PAGE_MODIFIED) {
@@ -364,7 +364,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	if (pte_val(pte) & _PAGE_MODIFIED)
@@ -626,7 +626,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 	return pmd;
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	pmd_val(pmd) |= _PAGE_WRITE;
 	if (pmd_val(pmd) & _PAGE_MODIFIED)
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 0f5c2564e9f5..edd458518e0e 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -129,7 +129,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	return pte;
diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h
index 3eb9b9555d0d..fd40aec189d1 100644
--- a/arch/openrisc/include/asm/pgtable.h
+++ b/arch/openrisc/include/asm/pgtable.h
@@ -250,7 +250,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	return pte;
diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
index e2950f5db7c9..89f62137e67f 100644
--- a/arch/parisc/include/asm/pgtable.h
+++ b/arch/parisc/include/asm/pgtable.h
@@ -331,8 +331,12 @@ static inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~_PAGE_ACCESSED; retu
 static inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) &= ~_PAGE_WRITE; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) |= _PAGE_WRITE; return pte; }
 static inline pte_t pte_mkspecial(pte_t pte)	{ pte_val(pte) |= _PAGE_SPECIAL; return pte; }
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	pte_val(pte) |= _PAGE_WRITE;
+	return pte;
+}
 
 /*
  * Huge pte definitions.
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 7bf1fe7297c6..10d9a1d2aca9 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -498,7 +498,7 @@ static inline pte_t pte_mkpte(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return __pte(pte_val(pte) | _PAGE_RW);
 }
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 4acc9690f599..be0636522d36 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -600,7 +600,7 @@ static inline pte_t pte_mkexec(pte_t pte)
 	return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC));
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	/*
 	 * write implies read, hence set both
@@ -1071,7 +1071,7 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite(pmd, vma)	pte_pmd(pte_mkwrite(pmd_pte(pmd), (vma)))
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #define pmd_soft_dirty(pmd)    pte_soft_dirty(pmd_pte(pmd))
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index fec56d965f00..7bfbcb9ba55b 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -171,7 +171,7 @@ void unmap_kernel_page(unsigned long va);
 	do { pte_update(mm, addr, ptep, ~0, 0, 0); } while (0)
 
 #ifndef pte_mkwrite
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return __pte(pte_val(pte) | _PAGE_RW);
 }
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 1a89ebdc3acc..f32450eb270a 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -101,7 +101,7 @@ static inline int pte_write(pte_t pte)
 
 #define pte_write pte_write
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return __pte(pte_val(pte) & ~_PAGE_RO);
 }
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 287e25864ffa..589009555877 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -85,7 +85,7 @@
 #ifndef __ASSEMBLY__
 /* pte_clear moved to later in this file */
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return __pte(pte_val(pte) | _PAGE_RW);
 }
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index d8d8de0ded99..fed1b81fbe07 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -338,7 +338,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
 
 /* static inline pte_t pte_mkread(pte_t pte) */
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return __pte(pte_val(pte) | _PAGE_WRITE);
 }
@@ -624,9 +624,9 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 	return pte_pmd(pte_mkyoung(pmd_pte(pmd)));
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
-	return pte_pmd(pte_mkwrite(pmd_pte(pmd)));
+	return pte_pmd(pte_mkwrite(pmd_pte(pmd), vma));
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index ccdbccfde148..558f7eef9c4d 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -102,9 +102,9 @@ static inline int huge_pte_dirty(pte_t pte)
 	return pte_dirty(pte);
 }
 
-static inline pte_t huge_pte_mkwrite(pte_t pte)
+static inline pte_t huge_pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
-	return pte_mkwrite(pte);
+	return pte_mkwrite(pte, vma);
 }
 
 static inline pte_t huge_pte_mkdirty(pte_t pte)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index deeb918cae1d..8f2c743da0eb 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1013,7 +1013,7 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return pte_mkwrite_kernel(pte);
 }
@@ -1499,7 +1499,7 @@ static inline pmd_t pmd_mkwrite_kernel(pmd_t pmd)
 	return pmd;
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	return pmd_mkwrite_kernel(pmd);
 }
diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h
index 21952b094650..9f2dcb9eafc8 100644
--- a/arch/sh/include/asm/pgtable_32.h
+++ b/arch/sh/include/asm/pgtable_32.h
@@ -351,6 +351,12 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 
 #define PTE_BIT_FUNC(h,fn,op) \
 static inline pte_t pte_##fn(pte_t pte) { pte.pte_##h op; return pte; }
+#define PTE_BIT_FUNC_VMA(h,fn,op) \
+static inline pte_t pte_##fn(pte_t pte, struct vm_area_struct *vma) \
+{ \
+	pte.pte_##h op; \
+	return pte; \
+}
 
 #ifdef CONFIG_X2TLB
 /*
@@ -359,11 +365,11 @@ static inline pte_t pte_##fn(pte_t pte) { pte.pte_##h op; return pte; }
  * kernel permissions), we attempt to couple them a bit more sanely here.
  */
 PTE_BIT_FUNC(high, wrprotect, &= ~(_PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE));
-PTE_BIT_FUNC(high, mkwrite, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
+PTE_BIT_FUNC_VMA(high, mkwrite, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
 PTE_BIT_FUNC(high, mkhuge, |= _PAGE_SZHUGE);
 #else
 PTE_BIT_FUNC(low, wrprotect, &= ~_PAGE_RW);
-PTE_BIT_FUNC(low, mkwrite, |= _PAGE_RW);
+PTE_BIT_FUNC_VMA(low, mkwrite, |= _PAGE_RW);
 PTE_BIT_FUNC(low, mkhuge, |= _PAGE_SZHUGE);
 #endif
 
diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index d4330e3c57a6..3e8836179456 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -241,7 +241,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return __pte(pte_val(pte) & ~SRMMU_REF);
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return __pte(pte_val(pte) | SRMMU_WRITE);
 }
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 2dc8d4641734..c5cd5c03f557 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -466,7 +466,7 @@ static inline pte_t pte_mkclean(pte_t pte)
 	return __pte(val);
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	unsigned long val = pte_val(pte), mask;
 
@@ -756,11 +756,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 	return __pmd(pte_val(pte));
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	pte_t pte = __pte(pmd_val(pmd));
 
-	pte = pte_mkwrite(pte);
+	pte = pte_mkwrite(pte, vma);
 
 	return __pmd(pte_val(pte));
 }
diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
index a70d1618eb35..963479c133b7 100644
--- a/arch/um/include/asm/pgtable.h
+++ b/arch/um/include/asm/pgtable.h
@@ -207,7 +207,7 @@ static inline pte_t pte_mkyoung(pte_t pte)
 	return(pte);
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	if (unlikely(pte_get_bits(pte,  _PAGE_RW)))
 		return pte;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3607f2572f9e..66c514808276 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -369,7 +369,9 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
 	return pte_set_flags(pte, _PAGE_RW);
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+struct vm_area_struct;
+
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return pte_mkwrite_kernel(pte);
 }
@@ -470,7 +472,7 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_ACCESSED);
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
index fc7a14884c6c..d72632d9c53c 100644
--- a/arch/xtensa/include/asm/pgtable.h
+++ b/arch/xtensa/include/asm/pgtable.h
@@ -262,7 +262,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
 	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)
 	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 	{ pte_val(pte) |= _PAGE_WRITABLE; return pte; }
 
 #define pgprot_noncached(prot) \
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index d7f6335d3999..e86c830728de 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -20,9 +20,9 @@ static inline unsigned long huge_pte_dirty(pte_t pte)
 	return pte_dirty(pte);
 }
 
-static inline pte_t huge_pte_mkwrite(pte_t pte)
+static inline pte_t huge_pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
-	return pte_mkwrite(pte);
+	return pte_mkwrite(pte, vma);
 }
 
 #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..af652444fbba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1163,7 +1163,7 @@ void free_compound_page(struct page *page);
 static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
+		pte = pte_mkwrite(pte, vma);
 	return pte;
 }
 
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index af59cc7bd307..7bc5592900bc 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -109,10 +109,10 @@ static void __init pte_basic_tests(struct pgtable_debug_args *args, int idx)
 	WARN_ON(!pte_same(pte, pte));
 	WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte))));
 	WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte))));
-	WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte))));
+	WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte), args->vma)));
 	WARN_ON(pte_young(pte_mkold(pte_mkyoung(pte))));
 	WARN_ON(pte_dirty(pte_mkclean(pte_mkdirty(pte))));
-	WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte))));
+	WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte, args->vma))));
 	WARN_ON(pte_dirty(pte_wrprotect(pte_mkclean(pte))));
 	WARN_ON(!pte_dirty(pte_wrprotect(pte_mkdirty(pte))));
 }
@@ -153,7 +153,7 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
 	pte = pte_mkclean(pte);
 	set_pte_at(args->mm, args->vaddr, args->ptep, pte);
 	flush_dcache_page(page);
-	pte = pte_mkwrite(pte);
+	pte = pte_mkwrite(pte, args->vma);
 	pte = pte_mkdirty(pte);
 	ptep_set_access_flags(args->vma, args->vaddr, args->ptep, pte, 1);
 	pte = ptep_get(args->ptep);
@@ -199,10 +199,10 @@ static void __init pmd_basic_tests(struct pgtable_debug_args *args, int idx)
 	WARN_ON(!pmd_same(pmd, pmd));
 	WARN_ON(!pmd_young(pmd_mkyoung(pmd_mkold(pmd))));
 	WARN_ON(!pmd_dirty(pmd_mkdirty(pmd_mkclean(pmd))));
-	WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd))));
+	WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd), args->vma)));
 	WARN_ON(pmd_young(pmd_mkold(pmd_mkyoung(pmd))));
 	WARN_ON(pmd_dirty(pmd_mkclean(pmd_mkdirty(pmd))));
-	WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd))));
+	WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd, args->vma))));
 	WARN_ON(pmd_dirty(pmd_wrprotect(pmd_mkclean(pmd))));
 	WARN_ON(!pmd_dirty(pmd_wrprotect(pmd_mkdirty(pmd))));
 	/*
@@ -253,7 +253,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
 	pmd = pmd_mkclean(pmd);
 	set_pmd_at(args->mm, vaddr, args->pmdp, pmd);
 	flush_dcache_page(page);
-	pmd = pmd_mkwrite(pmd);
+	pmd = pmd_mkwrite(pmd, args->vma);
 	pmd = pmd_mkdirty(pmd);
 	pmdp_set_access_flags(args->vma, vaddr, args->pmdp, pmd, 1);
 	pmd = READ_ONCE(*args->pmdp);
@@ -928,8 +928,8 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args)
 	pte = mk_huge_pte(page, args->page_prot);
 
 	WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte)));
-	WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte))));
-	WARN_ON(huge_pte_write(huge_pte_wrprotect(huge_pte_mkwrite(pte))));
+	WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte), args->vma)));
+	WARN_ON(huge_pte_write(huge_pte_wrprotect(huge_pte_mkwrite(pte, args->vma))));
 
 #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
 	pte = pfn_pte(args->fixed_pmd_pfn, args->page_prot);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4fc43859e59a..aaf815838144 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,7 +555,7 @@ __setup("transparent_hugepage=", setup_transparent_hugepage);
 pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
-		pmd = pmd_mkwrite(pmd);
+		pmd = pmd_mkwrite(pmd, vma);
 	return pmd;
 }
 
@@ -1580,7 +1580,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
 	pmd = pmd_mkyoung(pmd);
 	if (writable)
-		pmd = pmd_mkwrite(pmd);
+		pmd = pmd_mkwrite(pmd, vma);
 	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
 	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 	spin_unlock(vmf->ptl);
@@ -1926,7 +1926,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	/* See change_pte_range(). */
 	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
 	    can_change_pmd_writable(vma, addr, entry))
-		entry = pmd_mkwrite(entry);
+		entry = pmd_mkwrite(entry, vma);
 
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 07abcb6eb203..6af471bdcff8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4900,7 +4900,7 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
 
 	if (writable) {
 		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
-					 vma->vm_page_prot)));
+					 vma->vm_page_prot)), vma);
 	} else {
 		entry = huge_pte_wrprotect(mk_huge_pte(page,
 					   vma->vm_page_prot));
@@ -4916,7 +4916,7 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma,
 {
 	pte_t entry;
 
-	entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(ptep)));
+	entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(ptep)), vma);
 	if (huge_ptep_set_access_flags(vma, address, ptep, entry, 1))
 		update_mmu_cache(vma, address, ptep);
 }
diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..d0972d2d6f36 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4067,7 +4067,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	entry = mk_pte(&folio->page, vma->vm_page_prot);
 	entry = pte_sw_mkyoung(entry);
 	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
+		entry = pte_mkwrite(pte_mkdirty(entry), vma);
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
@@ -4755,7 +4755,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
 	if (writable)
-		pte = pte_mkwrite(pte);
+		pte = pte_mkwrite(pte, vma);
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d30c9de60b0d..df3f5e9d5f76 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -646,7 +646,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		}
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (vma->vm_flags & VM_WRITE)
-			entry = pte_mkwrite(pte_mkdirty(entry));
+			entry = pte_mkwrite(pte_mkdirty(entry), vma);
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1d4843c97c2a..381163a41e88 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -198,7 +198,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
 			    !pte_write(ptent) &&
 			    can_change_pte_writable(vma, addr, ptent))
-				ptent = pte_mkwrite(ptent);
+				ptent = pte_mkwrite(ptent, vma);
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			if (pte_needs_flush(oldpte, ptent))
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 53c3d916ff66..3db6f87c0aca 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -75,7 +75,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	if (page_in_cache && !vm_shared)
 		writable = false;
 	if (writable)
-		_dst_pte = pte_mkwrite(_dst_pte);
+		_dst_pte = pte_mkwrite(_dst_pte, dst_vma);
 	if (wp_copy)
 		_dst_pte = pte_mkuffd_wp(_dst_pte);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (12 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 13/41] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-02 12:48   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 15/41] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY Rick Edgecombe
                   ` (26 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

Some OSes have a greater dependence on software available bits in PTEs than
Linux. That left the hardware architects looking for a way to represent a
new memory type (shadow stack) within the existing bits. They chose to
repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
shadow stack memory, Linux should avoid creating memory with this PTE bit
combination unless it intends for it to be shadow stack.

The reason it's lightly used is that Dirty=1 is normally set by HW
_before_ a write. A write with a Write=0 PTE would typically only generate
a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
supports shadow stacks will no longer exhibit this oddity.

So that leaves Write=0,Dirty=1 PTEs created in software. To avoid
inadvertently creating shadow stack memory, in places where Linux normally
creates Write=0,Dirty=1, it can use the software-defined _PAGE_SAVED_DIRTY
in place of the hardware _PAGE_DIRTY. In other words, whenever Linux needs
to create Write=0,Dirty=1, it instead creates Write=0,SavedDirty=1 except
for shadow stack, which is Write=0,Dirty=1.

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_SAVED_DIRTY. No space is consumed in 32-bit
kernels because shadow stacks are not enabled there.

Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to
actually begin creating _PAGE_SAVED_DIRTY PTEs will follow once other
pieces are in place.
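
As an illustration only (this snippet is not part of the patch), the two
helpers round-trip a dirty, write-protected PTE between the hardware and
software dirty encodings:

	/* Sketch: Write=0,Dirty=1 would read as shadow stack... */
	pte = pte_mksaveddirty(pte);	 /* -> Write=0,SavedDirty=1,Dirty=0 */
	/* ...and back when making the PTE writable or shadow stack: */
	pte = pte_clear_saveddirty(pte); /* -> Write=0,Dirty=1,SavedDirty=0 */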

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v7:
 - Use lightly edited comment verbiage from (David Hildenbrand)
 - Update commit log to reduce verbosity (David Hildenbrand)

v6:
 - Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
 - Add _PAGE_SAVED_DIRTY to _PAGE_CHG_MASK

v5:
 - Fix log, comments and whitespace (Boris)
 - Remove capitalization on shadow stack (Boris)

v4:
 - Teach pte_flags_need_flush() about _PAGE_COW bit
 - Break apart patch for better bisectability

v3:
 - Add comment around _PAGE_TABLE in response to comment
   from (Andrew Cooper)
 - Check for PSE in pmd_shstk (Andrew Cooper)
 - Get to the point quicker in commit log (Andrew Cooper)
 - Clarify and reorder commit log for why the PTE bit examples have
   multiple entries. Apply same changes for comment. (peterz)
 - Fix comment that implied dirty bit for COW was a specific x86 thing
   (peterz)
 - Fix swapping of Write/Dirty (PeterZ)
---
 arch/x86/include/asm/pgtable.h       | 79 ++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_types.h | 50 +++++++++++++++---
 arch/x86/include/asm/tlbflush.h      |  3 +-
 3 files changed, 123 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 66c514808276..7360783f2140 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -301,6 +301,45 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+/*
+ * Write protection operations can result in Dirty=1,Write=0 PTEs. But in the
+ * case of X86_FEATURE_USER_SHSTK, the software SavedDirty bit is used, since
+ * a Dirty=1,Write=0 PTE would result in the memory being treated as shadow
+ * stack by the HW. So when creating dirty, write-protected memory, a software
+ * bit, _PAGE_BIT_SAVED_DIRTY, is used instead. pte_mksaveddirty() takes a
+ * conventional dirty, write-protected PTE (Write=0,Dirty=1) and transitions
+ * it to the shadow stack compatible version (Write=0,SavedDirty=1);
+ * pte_clear_saveddirty() reverses the transition.
+ */
+static inline pte_t pte_mksaveddirty(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pte;
+
+	pte = pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_set_flags(pte, _PAGE_SAVED_DIRTY);
+}
+
+static inline pte_t pte_clear_saveddirty(pte_t pte)
+{
+	/*
+	 * _PAGE_SAVED_DIRTY is unnecessary on !X86_FEATURE_USER_SHSTK kernels,
+	 * since the HW dirty bit can be used without creating shadow stack
+	 * memory. See the _PAGE_SAVED_DIRTY definition for more details.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pte;
+
+	/*
+	 * The PTE is getting copied-on-write, so it will be dirtied
+	 * if it is writable, or made shadow stack if it is shadow
+	 * stack memory being copied on access. Set the Dirty bit
+	 * for both cases.
+	 */
+	pte = pte_set_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_SAVED_DIRTY);
+}
+
 static inline pte_t pte_wrprotect(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_RW);
@@ -420,6 +459,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+/* See comments above pte_mksaveddirty() */
+static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pmd;
+
+	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_SAVED_DIRTY);
+}
+
+/* See comments above pte_mksaveddirty() */
+static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pmd;
+
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_SAVED_DIRTY);
+}
+
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_RW);
@@ -491,6 +550,26 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
 	return native_make_pud(v & ~clear);
 }
 
+/* See comments above pte_mksaveddirty() */
+static inline pud_t pud_mksaveddirty(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pud;
+
+	pud = pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_set_flags(pud, _PAGE_SAVED_DIRTY);
+}
+
+/* See comments above pte_mksaveddirty() */
+static inline pud_t pud_clear_saveddirty(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pud;
+
+	pud = pud_set_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_SAVED_DIRTY);
+}
+
 static inline pud_t pud_mkold(pud_t pud)
 {
 	return pud_clear_flags(pud, _PAGE_ACCESSED);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0646ad00178b..56b374d1bffb 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * Indicates a Saved Dirty bit page.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_BIT_SAVED_DIRTY		_PAGE_BIT_SOFTW5 /* Saved Dirty bit */
+#else
+#define _PAGE_BIT_SAVED_DIRTY		0
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -117,6 +127,25 @@
 #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The hardware requires shadow stack to be Write=0,Dirty=1. However,
+ * there are valid cases where the kernel might create read-only PTEs that
+ * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty tracking). In
+ * this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-dirty bit,
+ * to avoid creating wrong "shadow stack" PTEs. Such PTEs have
+ * (Write=0,SavedDirty=1,Dirty=0) set.
+ *
+ * Note that on processors without shadow stack support,
+ * _PAGE_SAVED_DIRTY remains unused.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_SAVED_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_SAVED_DIRTY)
+#else
+#define _PAGE_SAVED_DIRTY	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_SAVED_DIRTY)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*
@@ -125,9 +154,9 @@
  * instance, and is *not* included in this mask since
  * pte_modify() does modify it.
  */
-#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
-			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
-			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC |  \
+#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		     \
+			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_BITS | \
+			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC |	     \
 			 _PAGE_UFFD_WP)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
@@ -186,12 +215,17 @@ enum page_cache_mode {
 #define PAGE_READONLY	     __pg(__PP|   0|_USR|___A|__NX|   0|   0|   0)
 #define PAGE_READONLY_EXEC   __pg(__PP|   0|_USR|___A|   0|   0|   0|   0)
 
-#define __PAGE_KERNEL		 (__PP|__RW|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_EXEC	 (__PP|__RW|   0|___A|   0|___D|   0|___G)
-#define _KERNPG_TABLE_NOENC	 (__PP|__RW|   0|___A|   0|___D|   0|   0)
-#define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
+/*
+ * Page tables need to have Write=1 in order for any lower PTEs to be
+ * writable. This includes shadow stack memory (Write=0, Dirty=1).
+ */
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
+#define _KERNPG_TABLE_NOENC	 (__PP|__RW|   0|___A|   0|___D|   0|   0)
+#define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
+
+#define __PAGE_KERNEL		 (__PP|__RW|   0|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_EXEC	 (__PP|__RW|   0|___A|   0|___D|   0|___G)
 #define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..6c5ef14060a8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -273,7 +273,8 @@ static inline bool pte_flags_need_flush(unsigned long oldflags,
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
 					_PAGE_ACCESSED;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
-					_PAGE_SOFTW3 | _PAGE_SOFTW4;
+					_PAGE_SOFTW3 | _PAGE_SOFTW4 |
+					_PAGE_SAVED_DIRTY;
 	const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
 			  _PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
 			  _PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 15/41] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (13 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 16/41] x86/mm: Start actually marking _PAGE_SAVED_DIRTY Rick Edgecombe
                   ` (25 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When shadow stack is in use, Write=0,Dirty=1 PTEs are preserved for
shadow stack. Copy-on-write PTEs then have Write=0,SavedDirty=1.

When a PTE goes from Write=1,Dirty=1 to Write=0,SavedDirty=1, it could
become a transient shadow stack PTE in two cases:

1. Some processors can start a write but end up seeing a Write=0 PTE by
   the time they get to the Dirty bit, creating a transient shadow stack
   PTE. However, this will not occur on processors supporting shadow
   stack, and a TLB flush is not necessary.

2. When _PAGE_DIRTY is replaced with _PAGE_SAVED_DIRTY non-atomically, a
   transient shadow stack PTE can be created as a result. Thus, prevent
   that with cmpxchg (see the sketch below).
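
   For illustration (not from the patch), one way the transient state
   could arise if the bits were flipped in place one at a time:

	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
	/* window: *ptep reads Write=0,Dirty=1 -- transient shadow stack */
	if (test_and_clear_bit(_PAGE_BIT_DIRTY, (unsigned long *)&ptep->pte))
		set_bit(_PAGE_BIT_SAVED_DIRTY, (unsigned long *)&ptep->pte);

   The try_cmpxchg() loop instead computes the fully write-protected
   value and installs it in a single atomic transition.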

In the case of pmdp_set_wrprotect(), the ->pmd operated on does not
exist in nopmd configs and the logic would need to be different. Although
the extra functionality will normally be optimized out when user shadow
stacks are not configured, also exclude it at the preprocessor stage so
that nopmd configs will still compile. User shadow stack is not supported
there by Linux anyway. Leave the cpu_feature_enabled() check so that the
functionality also gets disabled based on runtime detection of the
feature.

Similarly, compile it out in ptep_set_wrprotect() due to a clang warning
on i386. Like above, the code path should get optimized out on i386
since shadow stack is not supported on 32 bit kernels, but this makes
the compiler happy.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights into the issue. Jann Horn provided the cmpxchg solution.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v6:
 - Fix comment and log to update for _PAGE_COW being replaced with
   _PAGE_SAVED_DIRTY.

v5:
 - Commit log verbiage and formatting (Boris)
 - Remove capitalization on shadow stack (Boris)
 - Fix i386 warning on recent clang

v3:
 - Remove unnecessary #ifdef (Dave Hansen)

v2:
 - Compile out some code due to clang build error
 - Clarify commit log (dhansen)
 - Normalize PTE bit descriptions between patches (dhansen)
 - Update comment with text from (dhansen)
---
 arch/x86/include/asm/pgtable.h | 35 ++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7360783f2140..349fcab0405a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1192,6 +1192,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	/*
+	 * Avoid accidentally creating shadow stack PTEs
+	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
+	 * the hardware setting Dirty=1.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+		pte_t old_pte, new_pte;
+
+		old_pte = READ_ONCE(*ptep);
+		do {
+			new_pte = pte_wrprotect(old_pte);
+		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+		return;
+	}
+#endif
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
@@ -1244,6 +1261,24 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	/*
+	 * Avoid accidentally creating shadow stack PTEs
+	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
+	 * the hardware setting Dirty=1.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+		pmd_t old_pmd, new_pmd;
+
+		old_pmd = READ_ONCE(*pmdp);
+		do {
+			new_pmd = pmd_wrprotect(old_pmd);
+		} while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
+
+		return;
+	}
+#endif
+
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 16/41] x86/mm: Start actually marking _PAGE_SAVED_DIRTY
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (14 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 15/41] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 17/41] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
                   ` (24 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

The recently introduced _PAGE_SAVED_DIRTY should be used instead of the
HW Dirty bit whenever a PTE is Write=0, in order to not inadvertently
create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply
the same changes to pmd and pud.

For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which
was not necessarily created with any of the existing PTE bit helpers.
That means that it can return a pte_t with Write=0,Dirty=1, a shadow
stack PTE, when it did not intend to create one.

Modify it to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To avoid
creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid:
1. Marking Write=0 PTEs Dirty=1
2. Marking Dirty=1 PTEs Write=0

The first case cannot happen as the existing behavior of pte_modify() is to
filter out any Dirty bit passed in newprot. Handle the second case by
shifting _PAGE_DIRTY=1 to _PAGE_SAVED_DIRTY=1 if the PTE was write
protected by the pte_modify() call. Apply the same changes to
pmd_modify().
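
For illustration (not from the patch), the second case can be hit by an
mprotect(PROT_READ) of a dirty, writable page:

	/* entry PTE: Write=1,Dirty=1 */
	pte = pte_modify(pte, PAGE_READONLY);
	/* without the fixup: Write=0,Dirty=1 -- a shadow stack PTE */
	/* with the fixup:    Write=0,Dirty=0,SavedDirty=1          */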

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v6:
 - Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
 - Open code _PAGE_SAVED_DIRTY part in pte_modify() (Boris)
 - Change the logic so the open coded part is not too ugly
 - Merge pte_modify() patch with this one because of the above

v4:
 - Break part patch for better bisectability
---
 arch/x86/include/asm/pgtable.h | 168 ++++++++++++++++++++++++++++-----
 1 file changed, 145 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 349fcab0405a..05dfdbdf96b4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -124,9 +124,17 @@ extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return false;
+
+	return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pte_young(pte_t pte)
@@ -134,9 +142,18 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return false;
+
+	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
+	       (_PAGE_DIRTY | _PAGE_PSE);
 }
 
 #define pmd_young pmd_young
@@ -145,9 +162,9 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -157,13 +174,21 @@ static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
 }
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
 }
 
 #define pud_write pud_write
@@ -342,7 +367,16 @@ static inline pte_t pte_clear_saveddirty(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_RW);
+	pte = pte_clear_flags(pte, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pte_dirty(pte))
+		pte = pte_mksaveddirty(pte);
+	return pte;
 }
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
@@ -380,7 +414,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -395,7 +429,19 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pteval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating Dirty=1,Write=0 PTEs */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_SAVED_DIRTY;
+
+	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	/* pte_clear_saveddirty() also sets Dirty=1 */
+	return pte_clear_saveddirty(pte);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -412,7 +458,12 @@ struct vm_area_struct;
 
 static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
-	return pte_mkwrite_kernel(pte);
+	pte = pte_mkwrite_kernel(pte);
+
+	if (pte_dirty(pte))
+		pte = pte_clear_saveddirty(pte);
+
+	return pte;
 }
 
 static inline pte_t pte_mkhuge(pte_t pte)
@@ -481,7 +532,15 @@ static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_RW);
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (Write=0,Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pmd_dirty(pmd))
+		pmd = pmd_mksaveddirty(pmd);
+	return pmd;
 }
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
@@ -508,12 +567,23 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pmdval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pmd_write(pmd))
+		dirty = _PAGE_SAVED_DIRTY;
+
+	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	return pmd_clear_saveddirty(pmd);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -533,7 +603,12 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
-	return pmd_set_flags(pmd, _PAGE_RW);
+	pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+	if (pmd_dirty(pmd))
+		pmd = pmd_clear_saveddirty(pmd);
+
+	return pmd;
 }
 
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
@@ -577,17 +652,32 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_RW);
+	pud = pud_clear_flags(pud, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (Write=0,Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pud_dirty(pud))
+		pud = pud_mksaveddirty(pud);
+	return pud;
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pud_write(pud))
+		dirty = _PAGE_SAVED_DIRTY;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -607,7 +697,11 @@ static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_RW);
+	pud = pud_set_flags(pud, _PAGE_RW);
+
+	if (pud_dirty(pud))
+		pud = pud_clear_saveddirty(pud);
+	return pud;
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
@@ -724,6 +818,8 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	pteval_t val = pte_val(pte), oldval = val;
+	bool wr_protected;
+	pte_t pte_result;
 
 	/*
 	 * Chop off the NX bit (if present), and add the NX portion of
@@ -732,17 +828,43 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	val &= _PAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
 	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
-	return __pte(val);
+
+	pte_result = __pte(val);
+
+	/*
+	 * Do the saveddirty fixup if the PTE was just write protected and
+	 * it's dirty.
+	 */
+	wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
+	    (val & _PAGE_DIRTY))
+		pte_result = pte_mksaveddirty(pte_result);
+
+	return pte_result;
 }
 
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
 	pmdval_t val = pmd_val(pmd), oldval = val;
+	bool wr_protected;
+	pmd_t pmd_result;
 
-	val &= _HPAGE_CHG_MASK;
+	val &= (_HPAGE_CHG_MASK & ~_PAGE_DIRTY);
 	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
 	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
-	return __pmd(val);
+
+	pmd_result = __pmd(val);
+
+	/*
+	 * Do the saveddirty fixup if the PMD was just write protected and
+	 * it's dirty.
+	 */
+	wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
+	    (val & _PAGE_DIRTY))
+		pmd_result = pmd_mksaveddirty(pmd_result);
+
+	return pmd_result;
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 17/41] mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (15 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 16/41] x86/mm: Start actually marking _PAGE_SAVED_DIRTY Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 18/41] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
                   ` (23 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu, Peter Xu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

Future patches will introduce a new VM flag VM_SHADOW_STACK that will be
VM_HIGH_ARCH_BIT_5. VM_HIGH_ARCH_BIT_0 through VM_HIGH_ARCH_BIT_4 are
bits 32-36, and bit 37 is the unrelated VM_UFFD_MINOR_BIT. For the sake
of order, make all VM_HIGH_ARCH_BITs stay together by moving
VM_UFFD_MINOR_BIT from 37 to 38. This will allow VM_SHADOW_STACK to be
introduced as 37.
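
The resulting high bit layout (for reference, with the future shadow
stack bit as described above):

	bits 32-36: VM_HIGH_ARCH_BIT_0 .. VM_HIGH_ARCH_BIT_4
	bit  37:    VM_HIGH_ARCH_BIT_5 (future VM_SHADOW_STACK)
	bit  38:    VM_UFFD_MINOR_BIT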

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index af652444fbba..a1b31caae013 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -377,7 +377,7 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT	37
+# define VM_UFFD_MINOR_BIT	38
 # define VM_UFFD_MINOR		BIT(VM_UFFD_MINOR_BIT)	/* UFFD minor faults */
 #else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 # define VM_UFFD_MINOR		VM_NONE
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 18/41] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (16 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 17/41] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 19/41] x86/mm: Check shadow stack page fault errors Rick Edgecombe
                   ` (22 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

New hardware extensions implement support for shadow stack memory, such
as x86 Control-flow Enforcement Technology (CET). Add a new VM flag to
identify these areas, for example, to be used to properly indicate shadow
stack PTEs to the hardware.

Shadow stack VMA creation will be tightly controlled and limited to
anonymous memory, both to keep the implementation simple and because that
is all that is required. The solution will rely on pte_mkwrite() to
create the shadow stack PTEs, so vm_get_page_prot() will not need to
learn how to create shadow stack memory. For this reason, document that
VM_SHADOW_STACK should not be mixed with VM_SHARED.
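
As a minimal sketch of how code can test for such a VMA once the flag
exists (the helper below is hypothetical and only for illustration;
this patch adds just the flag):

	static inline bool vma_is_shadow_stack(struct vm_area_struct *vma)
	{
		/* VM_SHADOW_STACK is VM_NONE without CONFIG_X86_USER_SHADOW_STACK */
		return !!(vma->vm_flags & VM_SHADOW_STACK);
	}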

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v7:
 - Use lightly edited commit log verbiage from (David Hildenbrand)
 - Add explanation for VM_SHARED limitation (David Hildenbrand)

v6:
 - Add comment about VM_SHADOW_STACK not being allowed with VM_SHARED
   (David Hildenbrand)

v3:
 - Drop arch specific change in arch_vma_name(). The memory can show as
   anonymous (Kirill)
 - Change CONFIG_ARCH_HAS_SHADOW_STACK to CONFIG_X86_USER_SHADOW_STACK
   in show_smap_vma_flags() (Boris)
---
 Documentation/filesystems/proc.rst | 1 +
 fs/proc/task_mmu.c                 | 3 +++
 include/linux/mm.h                 | 8 ++++++++
 3 files changed, 12 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 9d5fd9424e8b..8b314df7ccdf 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -564,6 +564,7 @@ encoded manner. The codes are the following:
     mt    arm64 MTE allocation tags are enabled
     um    userfaultfd missing tracking
     uw    userfaultfd wr-protect tracking
+    ss    shadow stack page
     ==    =======================================
 
 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6a96e1713fd5..324b092c2ac9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+		[ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a1b31caae013..097544afb1aa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -326,11 +326,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -346,6 +348,12 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+# define VM_SHADOW_STACK	VM_HIGH_ARCH_5 /* Should not be set with VM_SHARED */
+#else
+# define VM_SHADOW_STACK	VM_NONE
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 19/41] x86/mm: Check shadow stack page fault errors
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (17 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 18/41] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-03 14:00   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 20/41] x86/mm: Teach pte_mkwrite() about stack memory Rick Edgecombe
                   ` (21 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The CPU performs "shadow stack accesses" when it expects to encounter
shadow stack mappings. These accesses can be implicit (via CALL/RET
instructions) or explicit (instructions like WRSS).

Shadow stack accesses to shadow-stack mappings can result in faults in
normal, valid operation just like regular accesses to regular mappings.
Shadow stacks need some of the same features, such as delayed allocation, swap
and copy-on-write. The kernel needs to use faults to implement those
features.

The architecture has concepts of both shadow stack reads and shadow stack
writes. Any shadow stack access to non-shadow stack memory will generate
a fault with the shadow stack error code bit set.

This means that, unlike normal write protection, the fault handler needs
to create a type of memory that can be written to (with instructions that
generate shadow stack writes), even to fulfill a read access. So in the
case of COW memory, the COW needs to take place even with a shadow stack
read. Otherwise the page will be left (shadow stack) writable in
userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
for shadow stack accesses, even if the access was a shadow stack read.

For the purpose of making this clearer, consider the following example.
If a process has a shadow stack, and forks, the shadow stack PTEs will
become read-only due to COW. If the CPU in one process performs a shadow
stack read access to the shadow stack, for example executing a RET and
causing the CPU to read the shadow stack copy of the return address, then
in order for the fault to be resolved the PTE will need to be set with
shadow stack permissions. But then the memory would be changeable from
userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
COW, otherwise the shared page would be changeable from both processes.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping. So also generate the appropriate errors for invalid shadow stack
accesses.
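
Condensed, the checks added below amount to the following (a
restatement for clarity; access_error() and do_user_addr_fault() each
implement one half of it):

	if (error_code & X86_PF_SHSTK) {
		/* Only writable shadow stack VMAs may be accessed ... */
		if (!(vma->vm_flags & VM_SHADOW_STACK) ||
		    !(vma->vm_flags & VM_WRITE))
			return 1;		/* invalid access */
		/* ... and valid accesses are always serviced as writes */
		flags |= FAULT_FLAG_WRITE;
	}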

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v7:
 - Update comment in fault handler (David Hildenbrand)

v6:
 - Update comment due to rename of Cow bit to SavedDirty

v5:
 - Add description of COW example (Boris)
 - Replace "permissioned" (Boris)
 - Remove capitalization of shadow stack (Boris)

v4:
 - Further improve comment talking about FAULT_FLAG_WRITE (Peterz)

v3:
 - Improve comment talking about using FAULT_FLAG_WRITE (Peterz)
---
 arch/x86/include/asm/trap_pf.h |  2 ++
 arch/x86/mm/fault.c            | 31 +++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..afa524325e55 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -11,6 +11,7 @@
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
  *   bit 5 ==				1: protection keys block access
+ *   bit 6 ==				1: shadow stack access fault
  *   bit 15 ==				1: SGX MMU page-fault
  */
 enum x86_pf_error_code {
@@ -20,6 +21,7 @@ enum x86_pf_error_code {
 	X86_PF_RSVD	=		1 << 3,
 	X86_PF_INSTR	=		1 << 4,
 	X86_PF_PK	=		1 << 5,
+	X86_PF_SHSTK	=		1 << 6,
 	X86_PF_SGX	=		1 << 15,
 };
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a498ae1fbe66..776b92339cfe 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1117,8 +1117,22 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
+	/*
+	 * Shadow stack accesses (PF_SHSTK=1) are only permitted to
+	 * shadow stack VMAs. All other accesses result in an error.
+	 */
+	if (error_code & X86_PF_SHSTK) {
+		if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
+			return 1;
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return 1;
+		return 0;
+	}
+
 	if (error_code & X86_PF_WRITE) {
 		/* write, present and write, not present: */
+		if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
+			return 1;
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
 			return 1;
 		return 0;
@@ -1310,6 +1324,23 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
+	/*
+	 * For conventionally writable pages, a read can be serviced with a
+	 * read only PTE. But for shadow stack, there isn't a concept of
+	 * read-only shadow stack memory. If a PTE has the shadow stack
+	 * permission, it can be modified via CALL and RET instructions. So
+	 * core MM needs to fault in a writable PTE and do things it already
+	 * does for write faults.
+	 *
+	 * Shadow stack accesses (read or write) need to be serviced with
+	 * shadow stack permission memory, which always includes write
+	 * permissions. So in the case of a shadow stack read access, treat it
+	 * as a WRITE fault. This will make sure that MM will prepare
+	 * everything (e.g., break COW) such that maybe_mkwrite() can create a
+	 * proper shadow stack PTE.
+	 */
+	if (error_code & X86_PF_SHSTK)
+		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_INSTR)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 20/41] x86/mm: Teach pte_mkwrite() about stack memory
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (18 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 19/41] x86/mm: Check shadow stack page fault errors Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-03 15:37   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 21/41] mm: Add guard pages around a shadow stack Rick Edgecombe
                   ` (20 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So
when it is made writable with pte_mkwrite(), it should create shadow
stack memory, not conventionally writable memory. Now that pte_mkwrite()
takes a VMA, and places where shadow stack memory might be created pass
one, pte_mkwrite() can know when it should do this.

So make pte_mkwrite() create shadow stack memory when the VMA has the
VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite().

This requires referencing VM_SHADOW_STACK in these functions, which are
currently defined in pgtable.h, however mm.h (where VM_SHADOW_STACK is
located) can't be pulled in without causing problems for files that
reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c, where
they can safely reference VM_SHADOW_STACK.
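
For context, this is roughly how core mm ends up creating the correct
PTE type through the VMA-aware helper (a sketch of maybe_mkwrite()
after the earlier pte_mkwrite() signature change in this series, not a
literal copy):

	static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
	{
		if (likely(vma->vm_flags & VM_WRITE))
			pte = pte_mkwrite(pte, vma);	/* shstk or regular RW */
		return pte;
	}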

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v6:
 - New patch
---
 arch/x86/include/asm/pgtable.h | 20 ++------------------
 arch/x86/mm/pgtable.c          | 26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 05dfdbdf96b4..d81e7ec27507 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -456,15 +456,7 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
 
 struct vm_area_struct;
 
-static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	pte = pte_mkwrite_kernel(pte);
-
-	if (pte_dirty(pte))
-		pte = pte_clear_saveddirty(pte);
-
-	return pte;
-}
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
@@ -601,15 +593,7 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_ACCESSED);
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
-{
-	pmd = pmd_set_flags(pmd, _PAGE_RW);
-
-	if (pmd_dirty(pmd))
-		pmd = pmd_clear_saveddirty(pmd);
-
-	return pmd;
-}
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e4f499eb0f29..98856bcc8102 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -880,3 +880,29 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #endif /* CONFIG_X86_64 */
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return pte_mkwrite_shstk(pte);
+
+	pte = pte_mkwrite_kernel(pte);
+
+	if (pte_dirty(pte))
+		pte = pte_clear_saveddirty(pte);
+
+	return pte;
+}
+
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return pmd_mkwrite_shstk(pmd);
+
+	pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+	if (pmd_dirty(pmd))
+		pmd = pmd_clear_saveddirty(pmd);
+
+	return pmd;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 21/41] mm: Add guard pages around a shadow stack.
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (19 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 20/41] x86/mm: Teach pte_mkwrite() about stack memory Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-06  8:08   ` Borislav Petkov
  2023-03-17 17:09   ` Deepak Gupta
  2023-02-27 22:29 ` [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
                   ` (19 subsequent siblings)
  40 siblings, 2 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

The architecture of shadow stack constrains the ability of userspace to
move the shadow stack pointer (SSP) in order to prevent corrupting or
switching to other shadow stacks. The RSTORSSP instruction can move the
SSP to a different shadow stack, but it requires a specially placed token
in order to do this. However, the architecture does not prevent
incrementing the stack pointer to wander onto an adjacent shadow stack. To
prevent this in software, enforce guard pages at the beginning of shadow
stack VMAs, such that there will always be a gap between adjacent shadow
stacks.

Make the gap big enough so that no userspace SSP changing operations
(besides RSTORSSP) can move the SSP from one stack to the next. The
SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and
RET can move the SSP by a maximum of 8 bytes, at which point the shadow
stack would be accessed.

The INCSSP instruction can also increment the shadow stack pointer. It
is the shadow stack analog of an instruction like:

	addq    $0x80, %rsp

However, there is one important difference between an ADD on %rsp and
INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
of the first and last elements that were "popped". It can be thought of
as acting like this:

READ_ONCE(ssp);       // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8);     // read+discard last popped stack element

The maximum distance INCSSP can move the SSP is 2040 bytes, before it
would read the memory. Therefore a single page gap will be enough to
prevent any operation from shifting the SSP to an adjacent stack, since
it would have to land in the gap at least once, causing a fault.
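
To make the arithmetic explicit (a sketch, assuming 4 KB pages):

	#define INCSSPQ_MAX_QWORDS	255
	#define INCSSPQ_MAX_MOVE	(INCSSPQ_MAX_QWORDS * 8)	/* 2040 bytes */

	/*
	 * 2040 < 4096, and INCSSP reads the first and last "popped"
	 * element, so any attempt to step across a one page gap must
	 * read from the gap and fault.
	 */
	_Static_assert(INCSSPQ_MAX_MOVE < 4096, "one guard page suffices");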

This could be accomplished by using VM_GROWSDOWN, but this has a
downside. The behavior would allow shadow stacks to grow, which is
unneeded and adds a strange difference to how most regular stacks work.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v5:
 - Fix typo in commit log

v4:
 - Drop references to 32 bit instructions
 - Switch to generic code to drop __weak (Peterz)

v2:
 - Use __weak instead of #ifdef (Dave Hansen)
 - Only have start gap on shadow stack (Andy Luto)
 - Create stack_guard_start_gap() to not duplicate code
   in an arch version of vm_start_gap() (Dave Hansen)
 - Improve commit log partly with verbiage from (Dave Hansen)

Yu-cheng v25:
 - Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.
---
 include/linux/mm.h | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 097544afb1aa..6a093daced88 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3107,15 +3107,36 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
 	return mtree_load(&mm->mm_mt, addr);
 }
 
+static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_GROWSDOWN)
+		return stack_guard_gap;
+
+	/*
+	 * Shadow stack pointer is moved by CALL, RET, and INCSSPQ.
+	 * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
+	 * and touches the first and the last element in the range, which
+	 * triggers a page fault if the range is not in a shadow stack.
+	 * Because of this, creating 4-KB guard pages around a shadow
+	 * stack prevents these instructions from going beyond.
+	 *
+	 * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
+	 * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
+	 */
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return PAGE_SIZE;
+
+	return 0;
+}
+
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
+	unsigned long gap = stack_guard_start_gap(vma);
 	unsigned long vm_start = vma->vm_start;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
+	vm_start -= gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
 	return vm_start;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (20 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 21/41] mm: Add guard pages around a shadow stack Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-06 13:01   ` Borislav Petkov
                     ` (2 more replies)
  2023-02-27 22:29 ` [PATCH v7 23/41] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
                   ` (18 subsequent siblings)
  40 siblings, 3 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

Account shadow stack pages to stack memory. Do this by adding a
VM_SHADOW_STACK check in is_stack_mapping().
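
The effect is that the stack accounting branch in vm_stat_account()
now also counts shadow stack pages (a sketch of the relevant branch,
assuming the current mm/mmap.c structure):

	if (is_stack_mapping(flags))
		mm->stack_vm += npages;	/* now includes VM_SHADOW_STACK VMAs */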

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v7:
 - Change is_stack_mapping() to know about VM_SHADOW_STACK so the
   additions in vm_stat_account() can be dropped. (David Hildenbrand)

v3:
 - Remove unneeded VM_SHADOW_STACK check in accountable_mapping()
   (Kirill)

v2:
 - Remove is_shadow_stack_mapping() and just change it to directly bitwise
   and VM_SHADOW_STACK.

Yu-cheng v26:
 - Remove redundant #ifdef CONFIG_MMU.

Yu-cheng v25:
 - Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().
---
 mm/internal.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 7920a8b7982e..1d13d5580f64 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -491,14 +491,14 @@ static inline bool is_exec_mapping(vm_flags_t flags)
 }
 
 /*
- * Stack area - automatically grows in one direction
+ * Stack area
  *
- * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous:
- * do_mmap() forbids all other combinations.
+ * VM_GROWSUP, VM_GROWSDOWN VMAs are always private
+ * anonymous. do_mmap() forbids all other combinations.
  */
 static inline bool is_stack_mapping(vm_flags_t flags)
 {
-	return (flags & VM_STACK) == VM_STACK;
+	return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK);
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 23/41] mm: Re-introduce vm_flags to do_mmap()
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (21 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
                   ` (17 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

There were no more callers passing vm_flags to do_mmap(), and vm_flags was
removed from the function's input by:

    commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now.  Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap().  Thus, re-introduce vm_flags to do_mmap().
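
For example, the shadow stack allocation added later in this series
uses the new parameter roughly like this (a sketch, not the exact code
from that patch):

	mmap_write_lock(mm);
	addr = do_mmap(NULL, addr, size, PROT_READ,
		       MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G,
		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
	mmap_write_unlock(mm);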

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: linux-mm@kvack.org
---
 fs/aio.c           |  2 +-
 include/linux/mm.h |  3 ++-
 ipc/shm.c          |  2 +-
 mm/mmap.c          | 10 +++++-----
 mm/nommu.c         |  4 ++--
 mm/util.c          |  2 +-
 6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index b0b17bd098bb..4a7576989719 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -558,7 +558,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
 				 PROT_READ | PROT_WRITE,
-				 MAP_SHARED, 0, &unused, NULL);
+				 MAP_SHARED, 0, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a093daced88..87e46a9e0e93 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3014,7 +3014,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf);
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
 			 bool downgrade);
diff --git a/ipc/shm.c b/ipc/shm.c
index 60e45e7045d4..576a543b7cff 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1662,7 +1662,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 20f21f0949dd..eedae44dfc78 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1191,11 +1191,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
  */
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
-			unsigned long flags, unsigned long pgoff,
-			unsigned long *populate, struct list_head *uf)
+			unsigned long flags, vm_flags_t vm_flags,
+			unsigned long pgoff, unsigned long *populate,
+			struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
-	vm_flags_t vm_flags;
 	int pkey = 0;
 
 	validate_mm(mm);
@@ -1256,7 +1256,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
@@ -2829,7 +2829,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 
 	file = get_file(vma->vm_file);
 	ret = do_mmap(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+			prot, flags, 0, pgoff, &populate, NULL);
 	fput(file);
 out:
 	mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index 57ba243c6a37..f6ddd084671f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1002,6 +1002,7 @@ unsigned long do_mmap(struct file *file,
 			unsigned long len,
 			unsigned long prot,
 			unsigned long flags,
+			vm_flags_t vm_flags,
 			unsigned long pgoff,
 			unsigned long *populate,
 			struct list_head *uf)
@@ -1009,7 +1010,6 @@ unsigned long do_mmap(struct file *file,
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	struct rb_node *rb;
-	vm_flags_t vm_flags;
 	unsigned long capabilities, result;
 	int ret;
 	VMA_ITERATOR(vmi, current->mm, 0);
@@ -1029,7 +1029,7 @@ unsigned long do_mmap(struct file *file,
 
 	/* we've determined that we can make the mapping, now translate what we
 	 * now know into VMA flags */
-	vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+	vm_flags |= determine_vm_flags(file, prot, flags, capabilities);
 
 
 	/* we're going to need to record the mapping */
diff --git a/mm/util.c b/mm/util.c
index b8ed9dbc7fd5..a93e832f4065 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -539,7 +539,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	if (!ret) {
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
-		ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
 			      &uf);
 		mmap_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (22 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 23/41] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-06 13:10   ` Borislav Petkov
  2023-03-17 17:05   ` Deepak Gupta
  2023-02-27 22:29 ` [PATCH v7 25/41] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
                   ` (16 subsequent siblings)
  40 siblings, 2 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

Shadow stack memory is writable only in very specific, controlled ways.
However, since it is writable, the kernel treats it as such. As a result
there remain many ways for userspace to trigger the kernel to write to
shadow stacks via get_user_pages(, FOLL_WRITE) operations. To make this a
little less exposed, block writable GUPs for shadow stack VMAs.

Still allow FOLL_FORCE to write through shadow stack protections, as it
does for read-only protections.
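
In effect (a sketch; the exact get_user_pages() signature depends on
the kernel this series is based on):

	/* Plain write GUP to a shadow stack VMA now fails with -EFAULT */
	ret = get_user_pages(addr, 1, FOLL_WRITE, &page, NULL);

	/* FOLL_FORCE still writes through, as it does for read-only memory */
	ret = get_user_pages(addr, 1, FOLL_WRITE | FOLL_FORCE, &page, NULL);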

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v3:
 - Add comment in __pte_access_permitted() (Dave)
 - Remove unneeded shadow stack specific check in
   __pte_access_permitted() (Jann)
---
 arch/x86/include/asm/pgtable.h | 5 +++++
 mm/gup.c                       | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index d81e7ec27507..2e3d8cca1195 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1638,6 +1638,11 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
 {
 	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
 
+	/*
+	 * Write=0,Dirty=1 PTEs are shadow stack, which the kernel
+	 * shouldn't generally allow access to, but since they
+	 * are already Write=0, the below logic covers both cases.
+	 */
 	if (write)
 		need_pte_bits |= _PAGE_RW;
 
diff --git a/mm/gup.c b/mm/gup.c
index eab18ba045db..e7c7bcc0e268 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -978,7 +978,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		return -EFAULT;
 
 	if (write) {
-		if (!(vm_flags & VM_WRITE)) {
+		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
 			/* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 25/41] x86/mm: Introduce MAP_ABOVE4G
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (23 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-06 18:09   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 26/41] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
                   ` (15 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One of the properties is that the shadow stack pointer (SSP), which is a
CPU register that points to the shadow stack like the stack pointer points
to the stack, can't be pointing outside of the 32 bit address space when
the CPU is executing in 32 bit mode. It is desirable to prevent executing
in 32 bit mode when shadow stack is enabled because the kernel can't easily
support 32 bit signals.

On x86 it is possible to transition to 32 bit mode without any special
interaction with the kernel, by doing a "far call" to a 32 bit segment.
So the shadow stack implementation can use this address space behavior
as a feature, by enforcing that shadow stack memory is always created
outside of the 32 bit address space. This way userspace will trigger a
general protection fault which will in turn trigger a segfault if it
tries to transition to 32 bit mode with shadow stack enabled.

This provides a clean error-generating boundary for the user if they
attempt to use 32 bit mode with shadow stack, rather than leaving the
kernel in a half-working state for userspace to be surprised by.

So to allow future shadow stack enabling patches to map shadow stacks
out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior
is pretty much like MAP_32BIT, except that it has the opposite address
range. There are a few differences though.

If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the
MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit
syscall.

Since the default search behavior is top down, the normal KASLR base can
be used for MAP_ABOVE4G. This is unlike MAP_32BIT, which has to add its
own randomization in the bottom up case.

For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G
both are potentially valid, so both are used. In the bottom up search
path, the default behavior is already consistent with MAP_ABOVE4G since
mmap base should be above 4GB.

Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB.
So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode
with shadow stack enabled would usually segfault anyway. These are
already pretty decent guard rails. But the addition of MAP_ABOVE4G is
some small complexity spent to make the behavior more complete.
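
From userspace the new flag is used like any other mmap() flag (a
sketch; the 0x80 value matches the uapi header added below):

	#ifndef MAP_ABOVE4G
	#define MAP_ABOVE4G 0x80	/* arch/x86/include/uapi/asm/mman.h */
	#endif

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_ABOVE4G, -1, 0);
	/* in a 64 bit process, (unsigned long)p will be >= 1UL << 32 */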

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v5:
 - New patch
---
 arch/x86/include/uapi/asm/mman.h | 1 +
 arch/x86/kernel/sys_x86_64.c     | 6 +++++-
 include/linux/mman.h             | 4 ++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 775dbd3aff73..5a0256e73f1e 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_MMAN_H
 
 #define MAP_32BIT	0x40		/* only give out 32bit addresses */
+#define MAP_ABOVE4G	0x80		/* only map above 4GB */
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 #define arch_calc_vm_prot_bits(prot, key) (		\
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 8cc653ffdccd..06378b5682c1 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -193,7 +193,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
-	info.low_limit = PAGE_SIZE;
+	if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
+		info.low_limit = 0x100000000;
+	else
+		info.low_limit = PAGE_SIZE;
+
 	info.high_limit = get_mmap_base(0);
 
 	/*
diff --git a/include/linux/mman.h b/include/linux/mman.h
index cee1e4b566d8..40d94411d492 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -15,6 +15,9 @@
 #ifndef MAP_32BIT
 #define MAP_32BIT 0
 #endif
+#ifndef MAP_ABOVE4G
+#define MAP_ABOVE4G 0
+#endif
 #ifndef MAP_HUGE_2MB
 #define MAP_HUGE_2MB 0
 #endif
@@ -50,6 +53,7 @@
 		| MAP_STACK \
 		| MAP_HUGETLB \
 		| MAP_32BIT \
+		| MAP_ABOVE4G \
 		| MAP_HUGE_2MB \
 		| MAP_HUGE_1GB)
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 26/41] mm: Warn on shadow stack memory in wrong vma
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (24 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 25/41] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-08  8:53   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot Rick Edgecombe
                   ` (14 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
treated as shadow stack by the CPU, but this combination used to be created by
the kernel on x86. Previous patches have changed the kernel to now avoid
creating these PTEs unless they are for shadow stack memory. In case any
missed corners of the kernel are still creating PTEs like this for
non-shadow stack memory, and to catch any re-introductions of the logic,
warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
stack VMAs when they are being zapped. This won't catch transient cases
but should have decent coverage. It will be compiled out when shadow
stack is not configured.

In order to check if a PTE is shadow stack in core mm code, add two arch
breakouts, arch_check_zapped_pte/pmd(). This will allow shadow stack
specific code to be kept in arch/x86.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v6:
 - Add arch breakout to remove shstk from core MM code.

v5:
 - Fix typo in commit log

v3:
 - New patch
---
 arch/x86/include/asm/pgtable.h |  6 ++++++
 arch/x86/mm/pgtable.c          | 12 ++++++++++++
 include/linux/pgtable.h        | 14 ++++++++++++++
 mm/huge_memory.c               |  1 +
 mm/memory.c                    |  1 +
 5 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2e3d8cca1195..e5b3dce0d9fe 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1684,6 +1684,12 @@ static inline bool arch_has_hw_pte_young(void)
 	return true;
 }
 
+#define arch_check_zapped_pte arch_check_zapped_pte
+void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);
+
+#define arch_check_zapped_pmd arch_check_zapped_pmd
+void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd);
+
 #ifdef CONFIG_XEN_PV
 #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young
 static inline bool arch_has_hw_nonleaf_pmd_young(void)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 98856bcc8102..afab0bc7862b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -906,3 +906,15 @@ pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 
 	return pmd;
 }
+
+void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
+{
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+			pte_shstk(pte));
+}
+
+void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd)
+{
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+			pmd_shstk(pmd));
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c63cd44777ec..4a8970b9fb11 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -291,6 +291,20 @@ static inline bool arch_has_hw_pte_young(void)
 }
 #endif
 
+#ifndef arch_check_zapped_pte
+static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
+					 pte_t pte)
+{
+}
+#endif
+
+#ifndef arch_check_zapped_pmd
+static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
+					 pmd_t pmd)
+{
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aaf815838144..24797be05fcb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1689,6 +1689,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 */
 	orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
 						tlb->fullmm);
+	arch_check_zapped_pmd(vma, orig_pmd);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	if (vma_is_special_huge(vma)) {
 		if (arch_needs_pgtable_deposit())
diff --git a/mm/memory.c b/mm/memory.c
index d0972d2d6f36..c953c2c4588c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1389,6 +1389,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			arch_check_zapped_pte(vma, ptent);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
 						      ptent);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (25 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 26/41] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:54   ` Kees Cook
  2023-03-08  9:23   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 28/41] x86: Introduce userspace API for shadow stack Rick Edgecombe
                   ` (13 subsequent siblings)
  40 siblings, 2 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

When user shadow stack is in use, Write=0,Dirty=1 is treated by the CPU
as shadow stack memory. So for shadow stack memory this bit combination
is valid, but when Dirty=1,Write=1 (conventionally writable) memory is
being write protected, the kernel has been taught to transition the
Dirty=1 bit to SavedDirty=1, to avoid inadvertently creating shadow stack
memory. It does this inside pte_wrprotect() because it knows the PTE is
not intended to be a writable shadow stack entry; it is supposed to be
write protected.

However, when a PTE is created by a raw prot using mk_pte(), mk_pte()
can't know whether to adjust Dirty=1 to SavedDirty=1. It can't
distinguish between the caller intending to create a shadow stack PTE and
needing the SavedDirty shift.

The kernel has been updated to not do this, and so Write=0,Dirty=1
memory should only be created by the pte_mkfoo() helpers. Add a warning
to make sure no new mk_pte() callers start doing this.
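
Concretely, the check added below warns on the first construction and
stays quiet on the second (illustrative fragments only):

	/* Write=0,Dirty=1 from a raw prot: triggers the WARN_ON_ONCE() */
	pte = mk_pte(page, __pgprot(_PAGE_PRESENT | _PAGE_DIRTY));

	/* Write=1,Dirty=1: conventionally writable memory, no warning */
	pte = mk_pte(page, __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY));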

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v6:
 - New patch (Note, this has already been a useful warning, it caught the
   newly added set_memory_rox() doing this)
---
 arch/x86/include/asm/pgtable.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e5b3dce0d9fe..7142f99d3fbb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1032,7 +1032,15 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
  * (Currently stuck as a macro because of indirect forward reference
  * to linux/mm.h:page_to_nid())
  */
-#define mk_pte(page, pgprot)   pfn_pte(page_to_pfn(page), (pgprot))
+#define mk_pte(page, pgprot)						 \
+({									 \
+	pgprot_t __pgprot = pgprot;					 \
+									 \
+	WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_USER_SHSTK) &&	 \
+		    (pgprot_val(__pgprot) & (_PAGE_DIRTY | _PAGE_RW)) == \
+		    _PAGE_DIRTY);					 \
+	pfn_pte(page_to_pfn(page), __pgprot);				 \
+})
 
 static inline int pmd_bad(pmd_t pmd)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (26 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-08 10:27   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 29/41] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
                   ` (12 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Add three new arch_prctl() handles:

 - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified
   feature. Returns 0 on success or an error.

 - ARCH_SHSTK_LOCK prevents future disabling or enabling of the
   specified feature. Returns 0 on success or an error.

The features are handled per-thread and inherited over fork(2)/clone(2),
but reset on exec().

This is a preparation patch. It does not implement any features.
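
Once a later patch defines an actual feature bit, userspace usage
would look roughly like this (a sketch; ARCH_SHSTK_SHSTK is introduced
later in the series, and at this point in the series ARCH_SHSTK_ENABLE
still returns -EINVAL):

	#include <sys/syscall.h>
	#include <unistd.h>

	#define ARCH_SHSTK_ENABLE	0x5001
	#define ARCH_SHSTK_LOCK		0x5003
	#define ARCH_SHSTK_SHSTK	(1ULL << 0)	/* from a later patch */

	/* Enable shadow stack for this thread, then lock the setting */
	syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
	syscall(SYS_arch_prctl, ARCH_SHSTK_LOCK, ARCH_SHSTK_SHSTK);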

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[tweaked with feedback from tglx]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v4:
 - Remove references to CET and replace with shadow stack (Peterz)

v3:
 - Move shstk.c Makefile changes earlier (Kees)
 - Add #ifdef around features_locked and features (Kees)
 - Encapsulate features reset earlier in reset_thread_features() so
   features and features_locked are not referenced in code that would be
   compiled !CONFIG_X86_USER_SHADOW_STACK. (Kees)
 - Fix typo in commit log (Kees)
 - Switch arch_prctl() numbers to avoid conflict with LAM

v2:
 - Only allow one enable/disable per call (tglx)
 - Return error code like a normal arch_prctl() (Alexander Potapenko)
 - Make CET only (tglx)
---
 arch/x86/include/asm/processor.h  |  6 +++++
 arch/x86/include/asm/shstk.h      | 21 +++++++++++++++
 arch/x86/include/uapi/asm/prctl.h |  6 +++++
 arch/x86/kernel/Makefile          |  2 ++
 arch/x86/kernel/process_64.c      |  7 ++++-
 arch/x86/kernel/shstk.c           | 44 +++++++++++++++++++++++++++++++
 6 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/shstk.h
 create mode 100644 arch/x86/kernel/shstk.c

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 8d73004e4cac..bd16e012b3e9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -28,6 +28,7 @@ struct vm86;
 #include <asm/unwind_hints.h>
 #include <asm/vmxfeatures.h>
 #include <asm/vdso/processor.h>
+#include <asm/shstk.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -475,6 +476,11 @@ struct thread_struct {
 	 */
 	u32			pkru;
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	unsigned long		features;
+	unsigned long		features_locked;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
new file mode 100644
index 000000000000..ec753809f074
--- /dev/null
+++ b/arch/x86/include/asm/shstk.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHSTK_H
+#define _ASM_X86_SHSTK_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+long shstk_prctl(struct task_struct *task, int option, unsigned long features);
+void reset_thread_features(void);
+#else
+static inline long shstk_prctl(struct task_struct *task, int option,
+			       unsigned long arg2) { return -EINVAL; }
+static inline void reset_thread_features(void) {}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_SHSTK_H */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..b2b3b7200b2d 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,10 @@
 #define ARCH_MAP_VDSO_32		0x2002
 #define ARCH_MAP_VDSO_64		0x2003
 
+/* Don't use 0x3001-0x3004 because of old glibcs */
+
+#define ARCH_SHSTK_ENABLE		0x5001
+#define ARCH_SHSTK_DISABLE		0x5002
+#define ARCH_SHSTK_LOCK			0x5003
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 92446f1dedd7..b366641703e3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -146,6 +146,8 @@ obj-$(CONFIG_CALL_THUNKS)		+= callthunks.o
 
 obj-$(CONFIG_X86_CET)			+= cet.o
 
+obj-$(CONFIG_X86_USER_SHADOW_STACK)	+= shstk.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4e34b3b68ebd..71094c8a305f 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -514,6 +514,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 		load_gs_index(__USER_DS);
 	}
 
+	reset_thread_features();
+
 	loadsegment(fs, 0);
 	loadsegment(es, _ds);
 	loadsegment(ds, _ds);
@@ -830,7 +832,10 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_MAP_VDSO_64:
 		return prctl_map_vdso(&vdso_image_64, arg2);
 #endif
-
+	case ARCH_SHSTK_ENABLE:
+	case ARCH_SHSTK_DISABLE:
+	case ARCH_SHSTK_LOCK:
+		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
new file mode 100644
index 000000000000..41ed6552e0a5
--- /dev/null
+++ b/arch/x86/kernel/shstk.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * shstk.c - Intel shadow stack support
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Yu-cheng Yu <yu-cheng.yu@intel.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <asm/prctl.h>
+
+void reset_thread_features(void)
+{
+	current->thread.features = 0;
+	current->thread.features_locked = 0;
+}
+
+long shstk_prctl(struct task_struct *task, int option, unsigned long features)
+{
+	if (option == ARCH_SHSTK_LOCK) {
+		task->thread.features_locked |= features;
+		return 0;
+	}
+
+	/* Don't allow via ptrace */
+	if (task != current)
+		return -EINVAL;
+
+	/* Do not allow to change locked features */
+	if (features & task->thread.features_locked)
+		return -EPERM;
+
+	/* Only support enabling/disabling one feature at a time. */
+	if (hweight_long(features) > 1)
+		return -EINVAL;
+
+	if (option == ARCH_SHSTK_DISABLE) {
+		return -EINVAL;
+	}
+
+	/* Handle ARCH_SHSTK_ENABLE */
+	return -EINVAL;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 29/41] x86/shstk: Add user-mode shadow stack support
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (27 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 28/41] x86: Introduce userspace API for shadow stack Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 30/41] x86/shstk: Handle thread shadow stack Rick Edgecombe
                   ` (11 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with the VM_SHADOW_STACK
flag and has a fixed size of min(RLIMIT_STACK, 4GB).

Keep the task's shadow stack address and size in thread_struct. This will
be copied when cloning new threads, but needs to be cleared during exec,
so add a function to do this.

A 32-bit shadow stack is not expected to have many users, and it would
complicate the signal implementation. So do not support IA32 emulation
or x32.
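
The sizing rule works out as follows (a restatement of the
adjust_shstk_size() logic in the diff below, not new behavior): with
RLIMIT_STACK at the common 8 MB default, the shadow stack is 8 MB; with an
unlimited stack rlimit, it is capped at 4 GB:

/* Sketch: default shadow stack size for the enabling thread. */
static unsigned long default_shstk_size(void)
{
	/* min(RLIMIT_STACK, 4 GB), rounded up to a whole page. */
	return PAGE_ALIGN(min_t(unsigned long long,
				rlimit(RLIMIT_STACK), SZ_4G));
}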

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v7:
 - Add explanation for not supporting 32 bit in commit log (Boris)

v5:
 - Switch to EOPNOTSUPP
 - Use MAP_ABOVE4G
 - Move set_clr_bits_msrl() to patch where it is first used

v4:
 - Just set MSR_IA32_U_CET when disabling shadow stack, since we don't
   have IBT yet. (Peterz)

v3:
 - Use define for set_clr_bits_msrl() (Kees)
 - Make some functions static (Kees)
 - Change feature_foo() to features_foo() (Kees)
 - Centralize shadow stack size rlimit checks (Kees)
 - Disable x32 support

v2:
 - Get rid of unnecessary shstk->base checks
 - Don't support IA32 emulation
---
 arch/x86/include/asm/processor.h  |   2 +
 arch/x86/include/asm/shstk.h      |   7 ++
 arch/x86/include/uapi/asm/prctl.h |   3 +
 arch/x86/kernel/shstk.c           | 145 ++++++++++++++++++++++++++++++
 4 files changed, 157 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bd16e012b3e9..ff98cd6d5af2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -479,6 +479,8 @@ struct thread_struct {
 #ifdef CONFIG_X86_USER_SHADOW_STACK
 	unsigned long		features;
 	unsigned long		features_locked;
+
+	struct thread_shstk	shstk;
 #endif
 
 	/* Floating point and extended processor state */
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index ec753809f074..2b1f7c9b9995 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -8,12 +8,19 @@
 struct task_struct;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
+struct thread_shstk {
+	u64	base;
+	u64	size;
+};
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features);
 void reset_thread_features(void);
+void shstk_free(struct task_struct *p);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
+static inline void shstk_free(struct task_struct *p) {}
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index b2b3b7200b2d..7dfd9dc00509 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -26,4 +26,7 @@
 #define ARCH_SHSTK_DISABLE		0x5002
 #define ARCH_SHSTK_LOCK			0x5003
 
+/* ARCH_SHSTK_ features bits */
+#define ARCH_SHSTK_SHSTK		(1ULL <<  0)
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 41ed6552e0a5..3cb85224d856 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -8,14 +8,159 @@
 
 #include <linux/sched.h>
 #include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <asm/msr.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/shstk.h>
+#include <asm/special_insns.h>
+#include <asm/fpu/api.h>
 #include <asm/prctl.h>
 
+static bool features_enabled(unsigned long features)
+{
+	return current->thread.features & features;
+}
+
+static void features_set(unsigned long features)
+{
+	current->thread.features |= features;
+}
+
+static void features_clr(unsigned long features)
+{
+	current->thread.features &= ~features;
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, unused;
+
+	mmap_write_lock(mm);
+	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+
+	mmap_write_unlock(mm);
+
+	return addr;
+}
+
+static unsigned long adjust_shstk_size(unsigned long size)
+{
+	if (size)
+		return PAGE_ALIGN(size);
+
+	return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+}
+
+static void unmap_shadow_stack(u64 base, u64 size)
+{
+	while (1) {
+		int r;
+
+		r = vm_munmap(base, size);
+
+		/*
+		 * vm_munmap() returns -EINTR when mmap_lock is held by
+		 * something else, and that lock should not be held for a
+		 * long time.  Retry in that case.
+		 */
+		if (r == -EINTR) {
+			cond_resched();
+			continue;
+		}
+
+		/*
+		 * For all other types of vm_munmap() failure, either the
+		 * system is out of memory or there is a bug.
+		 */
+		WARN_ON_ONCE(r);
+		break;
+	}
+}
+
+static int shstk_setup(void)
+{
+	struct thread_shstk *shstk = &current->thread.shstk;
+	unsigned long addr, size;
+
+	/* Already enabled */
+	if (features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	/* Also not supported for 32 bit and x32 */
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_32bit_syscall())
+		return -EOPNOTSUPP;
+
+	size = adjust_shstk_size(0);
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+	wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
+	fpregs_unlock();
+
+	shstk->base = addr;
+	shstk->size = size;
+	features_set(ARCH_SHSTK_SHSTK);
+
+	return 0;
+}
+
 void reset_thread_features(void)
 {
+	memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
 	current->thread.features = 0;
 	current->thread.features_locked = 0;
 }
 
+void shstk_free(struct task_struct *tsk)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !features_enabled(ARCH_SHSTK_SHSTK))
+		return;
+
+	if (!tsk->mm)
+		return;
+
+	unmap_shadow_stack(shstk->base, shstk->size);
+}
+
+static int shstk_disable(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	/* Already disabled? */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	fpregs_lock_and_load();
+	/* Disable WRSS too when disabling shadow stack */
+	wrmsrl(MSR_IA32_U_CET, 0);
+	wrmsrl(MSR_IA32_PL3_SSP, 0);
+	fpregs_unlock();
+
+	shstk_free(current);
+	features_clr(ARCH_SHSTK_SHSTK);
+
+	return 0;
+}
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_SHSTK_LOCK) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (28 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 29/41] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-02 17:34   ` Szabolcs Nagy
  2023-03-08 15:26   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 31/41] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
                   ` (10 subsequent siblings)
  40 siblings, 2 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-CET case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as long as
the child doesn't return from the vfork()/CLONE_VFORK calling function and
sticks to a limited set of operations, the parent and child can share the
same stack.

For shadow stack, these scenarios present similar sharing problems. For the
CLONE_VM case, the child and the parent must have separate shadow stacks.
Instead of changing clone to take a shadow stack, have the kernel just
allocate one and switch to it.

Use the stack_size passed from the clone3() syscall for the thread shadow
stack size. A compat-mode thread shadow stack size is further reduced to
1/4. This allows more threads to run in a 32-bit address space. The
clone() syscall does not pass stack_size, which was added in clone3(). In
that case, use the RLIMIT_STACK size and cap to 4 GB.

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing, as long as it doesn't return from the vfork()
and overwrite the parent's entries higher up the shadow stack. The child
can safely overwrite further down the shadow stack, as the parent can just
overwrite this later. So CET does not add any additional limitations for
vfork().

Userspace implementing posix vfork() can actually prevent the child from
returning from the vfork() calling function, using CET. Glibc does this
by adjusting the shadow stack pointer in the child, so that the child
receives a #CP if it tries to return from the vfork() calling function.

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared in the
parent.

During this operation, the shadow stack pointer of the new thread needs
to be updated to point to the newly allocated shadow stack. Since the
ability to do this is confined to the FPU subsystem, change
fpu_clone() to take the new shadow stack pointer, and update it
internally inside the FPU subsystem. This part was suggested by Thomas
Gleixner.
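
For reference, the userspace side that feeds stack_size into this logic is
clone3() (a minimal, hedged sketch; glibc has no clone3() wrapper, so a raw
syscall is assumed, and error handling is omitted):

#include <linux/sched.h>	/* struct clone_args */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static long spawn_thread(void *stack, unsigned long stack_size)
{
	struct clone_args args;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
		     CLONE_THREAD;
	args.stack = (unsigned long)stack;
	/* The kernel sizes the new thread's shadow stack from this. */
	args.stack_size = stack_size;

	return syscall(SYS_clone3, &args, sizeof(args));
}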

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v3:
 - Fix update_fpu_shstk() stub (Mike Rapoport)
 - Fix chunks around alloc_shstk() in wrong patch (Kees)
 - Fix stack_size/flags swap (Kees)
 - Use centralized stack size logic (Kees)

v2:
 - Have fpu_clone() take new shadow stack pointer and update SSP in
   xsave buffer for new task. (tglx)

v1:
 - Expand commit log.
 - Add more comments.
 - Switch to xsave helpers.

Yu-cheng v30:
 - Update comments about clone()/clone3(). (Borislav Petkov)
---
 arch/x86/include/asm/fpu/sched.h   |  3 ++-
 arch/x86/include/asm/mmu_context.h |  2 ++
 arch/x86/include/asm/shstk.h       |  7 +++++
 arch/x86/kernel/fpu/core.c         | 41 +++++++++++++++++++++++++++-
 arch/x86/kernel/process.c          | 18 ++++++++++++-
 arch/x86/kernel/shstk.c            | 43 ++++++++++++++++++++++++++++--
 6 files changed, 109 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index c2d6cd78ed0c..3c2903bbb456 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -11,7 +11,8 @@
 
 extern void save_fpregs_to_fpstate(struct fpu *fpu);
 extern void fpu__drop(struct fpu *fpu);
-extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal);
+extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+		      unsigned long shstk_addr);
 extern void fpu_flush_thread(void);
 
 /*
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index e01aa74a6de7..9714f08d941b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -147,6 +147,8 @@ do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		shstk_free(tsk);		\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 2b1f7c9b9995..1399f4df098b 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -15,11 +15,18 @@ struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long features);
 void reset_thread_features(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+			     unsigned long stack_size,
+			     unsigned long *shstk_addr);
 void shstk_free(struct task_struct *p);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+					   unsigned long clone_flags,
+					   unsigned long stack_size,
+					   unsigned long *shstk_addr) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index f851558b673f..bc3de4aeb661 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
 	}
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+	struct cet_user_state *xstate;
+
+	/* Nothing to do if no ssp update is needed. */
+	if (!ssp)
+		return 0;
+
+	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+				XFEATURE_CET_USER);
+
+	/*
+	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+	 * stack and the fpu state should be up to date since it was just copied
+	 * from the parent in fpu_clone(). So there must be a valid non-init CET
+	 * state location in the buffer.
+	 */
+	if (WARN_ON_ONCE(!xstate))
+		return 1;
+
+	xstate->user_ssp = (u64)ssp;
+
+	return 0;
+}
+#else
+static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
+{
+	return 0;
+}
+#endif
+
 /* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+	      unsigned long ssp)
 {
 	struct fpu *src_fpu = &current->thread.fpu;
 	struct fpu *dst_fpu = &dst->thread.fpu;
@@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
 	if (use_xsave())
 		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
 
+	/*
+	 * Update shadow stack pointer, in case it changed during clone.
+	 */
+	if (update_fpu_shstk(dst, ssp))
+		return 1;
+
 	trace_x86_fpu_copy_src(src_fpu);
 	trace_x86_fpu_copy_dst(dst_fpu);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b650cde3f64d..bf703f53fa49 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -48,6 +48,7 @@
 #include <asm/frame.h>
 #include <asm/unwind.h>
 #include <asm/tdx.h>
+#include <asm/shstk.h>
 
 #include "process.h"
 
@@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
@@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
+	unsigned long shstk_addr = 0;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-	fpu_clone(p, clone_flags, args->fn);
+	/* Allocate a new shadow stack for pthread if needed */
+	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
+				       &shstk_addr);
+	if (ret)
+		return ret;
+
+	fpu_clone(p, clone_flags, args->fn, shstk_addr);
 
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
@@ -220,6 +229,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
+	/*
+	 * If copy_thread() is failing, don't leak the shadow stack possibly
+	 * allocated in shstk_alloc_thread_stack() above.
+	 */
+	if (ret)
+		shstk_free(p);
+
 	return ret;
 }
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 3cb85224d856..1d30295e0066 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -47,7 +47,7 @@ static unsigned long alloc_shstk(unsigned long size)
 	unsigned long addr, unused;
 
 	mmap_write_lock(mm);
-	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
 		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 
 	mmap_write_unlock(mm);
@@ -126,6 +126,39 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+			     unsigned long stack_size, unsigned long *shstk_addr)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long addr, size;
+
+	/*
+	 * If shadow stack is not enabled on the new thread, skip any
+	 * switch to a new shadow stack.
+	 */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	/*
+	 * For CLONE_VM, except vfork, the child needs a separate shadow
+	 * stack.
+	 */
+	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+		return 0;
+
+	size = adjust_shstk_size(stack_size);
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	shstk->base = addr;
+	shstk->size = size;
+
+	*shstk_addr = addr + size;
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -134,7 +167,13 @@ void shstk_free(struct task_struct *tsk)
 	    !features_enabled(ARCH_SHSTK_SHSTK))
 		return;
 
-	if (!tsk->mm)
+	/*
+	 * When fork() with CLONE_VM fails, the child (tsk) already has a
+	 * shadow stack allocated, and exit_thread() calls this function to
+	 * free it.  In this case the parent (current) and the child share
+	 * the same mm struct.
+	 */
+	if (!tsk->mm || tsk->mm != current->mm)
 		return;
 
 	unmap_shadow_stack(shstk->base, shstk->size);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 31/41] x86/shstk: Introduce routines modifying shstk
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (29 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 30/41] x86/shstk: Handle thread shadow stack Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-09 16:48   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 32/41] x86/shstk: Handle signals for shadow stack Rick Edgecombe
                   ` (9 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Shadow stacks are normally written to via CALL/RET or specific CET
instructions like RSTORSSP/SAVEPREVSSP. However, during some Linux
operations the kernel will need to write to the shadow stack directly,
using the ring-0 only WRUSS instruction.

A shadow stack restore token marks a restore point of the shadow stack, and
the address in a token must point directly above the token, which is within
the same shadow stack. This is distinctly different from other pointers
on the shadow stack, since those pointers point to the executable code area.

Introduce token setup and verify routines. Also introduce WRUSS, which is
a kernel-mode instruction but writes directly to user shadow stack.

In future patches that enable shadow stack to work with signals, the kernel
will need something to denote the point in the stack where sigreturn may be
called. This will prevent attackers calling sigreturn at arbitrary places
in the stack, in order to help prevent SROP attacks.

To do this, something that can only be written by the kernel needs to be
placed on the shadow stack. This can be accomplished by setting bit 63 in
the frame written to the shadow stack. Userspace return addresses can't
have this bit set as it is in the kernel range. It also can't be a
valid restore token.
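
Put together, the two on-stack encodings introduced below look like this (a
sketch; both are 8-byte, 8-byte-aligned values):

	/*
	 * Restore token, written at ssp - 8: points just above itself.
	 * Bit 0 is the 64-bit mode bit; ssp is 8-byte aligned, so the
	 * low bits are otherwise zero.
	 */
	u64 token = ssp | BIT(0);

	/*
	 * Kernel-placed shadow stack data: bit 63 set, so it can never
	 * look like a userspace return address or a valid token.
	 */
	u64 slot = data | BIT(63);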

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v5:
 - Fix typo in commit log

v3:
 - Drop shstk_check_rstor_token()
 - Fail put_shstk_data() if bit 63 is set in the data (Kees)
 - Add comment in create_rstor_token() (Kees)
 - Pull in create_rstor_token() changes from future patch (Kees)

v2:
 - Add data helpers for writing to shadow stack.

v1:
 - Use xsave helpers.
---
 arch/x86/include/asm/special_insns.h | 13 +++++
 arch/x86/kernel/shstk.c              | 73 ++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index de48d1389936..d6cd9344f6c7 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -202,6 +202,19 @@ static inline void clwb(volatile void *__p)
 		: [pax] "a" (p));
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static inline int write_user_shstk_64(u64 __user *addr, u64 val)
+{
+	asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: [addr] "r" (addr), [val] "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EFAULT;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
 #define nop() asm volatile ("nop")
 
 static inline void serialize(void)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 1d30295e0066..13c02747386f 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -25,6 +25,8 @@
 #include <asm/fpu/api.h>
 #include <asm/prctl.h>
 
+#define SS_FRAME_SIZE 8
+
 static bool features_enabled(unsigned long features)
 {
 	return current->thread.features & features;
@@ -40,6 +42,35 @@ static void features_clr(unsigned long features)
 	current->thread.features &= ~features;
 }
 
+/*
+ * Create a restore token on the shadow stack.  A token is always 8 bytes
+ * and aligned to 8 bytes.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+	unsigned long addr;
+
+	/* Token must be aligned */
+	if (!IS_ALIGNED(ssp, 8))
+		return -EINVAL;
+
+	addr = ssp - SS_FRAME_SIZE;
+
+	/*
+	 * SSP is aligned, so the reserved bits and the mode bit are zero.
+	 * Just mark the token 64-bit.
+	 */
+	ssp |= BIT(0);
+
+	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+		return -EFAULT;
+
+	if (token_addr)
+		*token_addr = addr;
+
+	return 0;
+}
+
 static unsigned long alloc_shstk(unsigned long size)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
@@ -159,6 +190,48 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 	return 0;
 }
 
+static unsigned long get_user_shstk_addr(void)
+{
+	unsigned long long ssp;
+
+	fpregs_lock_and_load();
+
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	fpregs_unlock();
+
+	return ssp;
+}
+
+static int put_shstk_data(u64 __user *addr, u64 data)
+{
+	if (WARN_ON_ONCE(data & BIT(63)))
+		return -EINVAL;
+
+	/*
+	 * Mark the high bit so that the sigframe can't be processed as a
+	 * return address.
+	 */
+	if (write_user_shstk_64(addr, data | BIT(63)))
+		return -EFAULT;
+	return 0;
+}
+
+static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
+{
+	unsigned long ldata;
+
+	if (unlikely(get_user(ldata, addr)))
+		return -EFAULT;
+
+	if (!(ldata & BIT(63)))
+		return -EINVAL;
+
+	*data = ldata & ~BIT(63);
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 32/41] x86/shstk: Handle signals for shadow stack
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (30 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 31/41] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-09 17:02   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
                   ` (8 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When a signal is handled normally the context is pushed to the stack
before handling it. For shadow stacks, since the shadow stack only tracks
return addresses, there isn't any state that needs to be pushed. However,
there are still a few things that need to be done. These things are
userspace visible and will be kernel ABI for shadow stacks.

One is to make sure the restorer address is written to shadow stack, since
the signal handler (if not changing ucontext) returns to the restorer, and
the restorer calls sigreturn. So add the restorer on the shadow stack
before handling the signal, so there is not a conflict when the signal
handler returns to the restorer.

The other thing to do is to place some type of checkable token on the
thread's shadow stack before handling the signal and check it during
sigreturn. This is an extra layer of protection to hamper attackers
calling sigreturn manually as in SROP-like attacks.

For this token we can use the shadow stack data format defined earlier.
Have the data pushed be the previous SSP. In the future the sigreturn
might want to return back to a different stack. Storing the SSP (instead
of a restore offset or something) allows for future functionality that
may want to restore to a different stack.

So, when handling a signal, push:
 - the SSP, stored in the shadow stack data format
 - the restorer address, below the restore token.

In sigreturn, verify SSP is stored in the data format and pop the shadow
stack.
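
The resulting shadow stack layout at signal delivery looks like this (a
sketch matching the push order above; higher addresses toward the top):

	/*
	 *	| ... older shadow stack frames ...	|
	 *	| old SSP with bit 63 set		| <- shadow stack data
	 *	| restorer address			| <- new SSP
	 *
	 * The signal handler returns to the restorer via the shadow
	 * stack copy; sigreturn then verifies the bit-63 slot, strips
	 * the bit, and restores SSP to the stored value.
	 */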

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>

---
v3:
 - Drop shstk_setup_rstor_token() (Kees)
 - Drop x32 signal support, since x32 support is dropped

v2:
 - Switch to new shstk signal format

v1:
 - Use xsave helpers.
 - Expand commit log.

Yu-cheng v27:
 - Eliminate saving shadow stack pointer to signal context.
---
 arch/x86/include/asm/shstk.h |  5 ++
 arch/x86/kernel/shstk.c      | 98 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/signal.c     |  1 +
 arch/x86/kernel/signal_64.c  |  6 +++
 4 files changed, 110 insertions(+)

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 1399f4df098b..acee68d30a07 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct ksignal;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
 struct thread_shstk {
@@ -19,6 +20,8 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
 			     unsigned long stack_size,
 			     unsigned long *shstk_addr);
 void shstk_free(struct task_struct *p);
+int setup_signal_shadow_stack(struct ksignal *ksig);
+int restore_signal_shadow_stack(void);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
@@ -28,6 +31,8 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
 					   unsigned long stack_size,
 					   unsigned long *shstk_addr) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
+static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
+static inline int restore_signal_shadow_stack(void) { return 0; }
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 13c02747386f..40f0a55762a9 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -232,6 +232,104 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
 	return 0;
 }
 
+static int shstk_push_sigframe(unsigned long *ssp)
+{
+	unsigned long target_ssp = *ssp;
+
+	/* Token must be aligned */
+	if (!IS_ALIGNED(*ssp, 8))
+		return -EINVAL;
+
+	if (!IS_ALIGNED(target_ssp, 8))
+		return -EINVAL;
+
+	*ssp -= SS_FRAME_SIZE;
+	if (put_shstk_data((void __user *)*ssp, target_ssp))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int shstk_pop_sigframe(unsigned long *ssp)
+{
+	unsigned long token_addr;
+	int err;
+
+	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
+	if (unlikely(err))
+		return err;
+
+	/* Restore SSP aligned? */
+	if (unlikely(!IS_ALIGNED(token_addr, 8)))
+		return -EINVAL;
+
+	/* SSP in userspace? */
+	if (unlikely(token_addr >= TASK_SIZE_MAX))
+		return -EINVAL;
+
+	*ssp = token_addr;
+
+	return 0;
+}
+
+int setup_signal_shadow_stack(struct ksignal *ksig)
+{
+	void __user *restorer = ksig->ka.sa.sa_restorer;
+	unsigned long ssp;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	if (!restorer)
+		return -EINVAL;
+
+	ssp = get_user_shstk_addr();
+	if (unlikely(!ssp))
+		return -EINVAL;
+
+	err = shstk_push_sigframe(&ssp);
+	if (unlikely(err))
+		return err;
+
+	/* Push restorer address */
+	ssp -= SS_FRAME_SIZE;
+	err = write_user_shstk_64((u64 __user *)ssp, (u64)restorer);
+	if (unlikely(err))
+		return -EFAULT;
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
+
+	return 0;
+}
+
+int restore_signal_shadow_stack(void)
+{
+	unsigned long ssp;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	ssp = get_user_shstk_addr();
+	if (unlikely(!ssp))
+		return -EINVAL;
+
+	err = shstk_pop_sigframe(&ssp);
+	if (unlikely(err))
+		return err;
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 004cb30b7419..356253e85ce9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -40,6 +40,7 @@
 #include <asm/syscall.h>
 #include <asm/sigframe.h>
 #include <asm/signal.h>
+#include <asm/shstk.h>
 
 static inline int is_ia32_compat_frame(struct ksignal *ksig)
 {
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 0e808c72bf7e..cacf2ede6217 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -175,6 +175,9 @@ int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 	frame = get_sigframe(ksig, regs, sizeof(struct rt_sigframe), &fp);
 	uc_flags = frame_uc_flags(regs);
 
+	if (setup_signal_shadow_stack(ksig))
+		return -EFAULT;
+
 	if (!user_access_begin(frame, sizeof(*frame)))
 		return -EFAULT;
 
@@ -260,6 +263,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
 		goto badframe;
 
+	if (restore_signal_shadow_stack())
+		goto badframe;
+
 	if (restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (31 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 32/41] x86/shstk: Handle signals for shadow stack Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-02 17:22   ` Szabolcs Nagy
  2023-03-10 16:11   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 34/41] x86/shstk: Support WRSS for userspace Rick Edgecombe
                   ` (7 subsequent siblings)
  40 siblings, 2 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads; however, in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace to allocate and
pivot to userspace-managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be set
up with a restore token so that userspace can pivot to them via the
RSTORSSP instruction. But the security design of shadow stacks is that
they should not be written to except in limited circumstances. This
presents a problem for userspace: how can it provision this special data
without allowing the shadow stack to be generally writable?

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found not to be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a couple downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
   ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
   restore tokens being written into the middle of pre-used shadow stacks.
   It is ideal to prevent restore tokens being added at arbitrary
   locations, so the check was to make sure the shadow stack had never been
   written to.
3. It stood out from the rest of the madvise flags, as more of a direct
   action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and set up new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
set up its own shadow stacks using the WRSS instruction. Towards this,
provide a flag so that stacks can optionally be set up securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
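
A hedged sketch of how userspace might then pivot to the new stack with
RSTORSSP (error handling omitted; the token occupies the last 8 bytes of
the mapping, per the token_offset passed to alloc_shstk() below):

unsigned long token = (unsigned long)shstk + stack_size - 8;

/* Consume the restore token and switch SSP to the new stack, then
 * record the previous SSP on the old stack. */
asm volatile("rstorssp (%0)\n" : : "r" (token));
asm volatile("saveprevssp");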

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v5:
 - Fix addr/mapped_addr (Kees)
 - Switch to EOPNOTSUPP (Kees suggested ENOTSUPP, but checkpatch
   suggests this)
 - Return error for addresses below 4G

v3:
 - Change syscall common -> 64 (Kees)
 - Use bit shift notation instead of 0x1 for uapi header (Kees)
 - Call do_mmap() with MAP_FIXED_NOREPLACE (Kees)
 - Block unsupported flags (Kees)
 - Require size >= 8 to set token (Kees)

v2:
 - Change syscall to take address like mmap() for CRIU's usage

v1:
 - New patch (replaces PROT_SHADOW_STACK).
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 arch/x86/include/uapi/asm/mman.h       |  3 ++
 arch/x86/kernel/shstk.c                | 59 ++++++++++++++++++++++----
 include/linux/syscalls.h               |  1 +
 include/uapi/asm-generic/unistd.h      |  2 +-
 kernel/sys_ni.c                        |  1 +
 6 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..f65c671ce3b1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	64	map_shadow_stack	sys_map_shadow_stack
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 5a0256e73f1e..8148bdddbd2c 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -13,6 +13,9 @@
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
+/* Flags for map_shadow_stack(2) */
+#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)	/* Set up a restore token in the shadow stack */
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 40f0a55762a9..0a3decab70ee 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -17,6 +17,7 @@
 #include <linux/compat.h>
 #include <linux/sizes.h>
 #include <linux/user.h>
+#include <linux/syscalls.h>
 #include <asm/msr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/types.h>
@@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
 	return 0;
 }
 
-static unsigned long alloc_shstk(unsigned long size)
+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
+				 unsigned long token_offset, bool set_res_tok)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
 	struct mm_struct *mm = current->mm;
-	unsigned long addr, unused;
+	unsigned long mapped_addr, unused;
 
-	mmap_write_lock(mm);
-	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
-		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+	if (addr)
+		flags |= MAP_FIXED_NOREPLACE;
 
+	mmap_write_lock(mm);
+	mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+			      VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 
-	return addr;
+	if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
+		goto out;
+
+	if (create_rstor_token(mapped_addr + token_offset, NULL)) {
+		vm_munmap(mapped_addr, size);
+		return -EINVAL;
+	}
+
+out:
+	return mapped_addr;
 }
 
 static unsigned long adjust_shstk_size(unsigned long size)
@@ -134,7 +147,7 @@ static int shstk_setup(void)
 		return -EOPNOTSUPP;
 
 	size = adjust_shstk_size(0);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -178,7 +191,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 		return 0;
 
 	size = adjust_shstk_size(stack_size);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -371,6 +384,36 @@ static int shstk_disable(void)
 	return 0;
 }
 
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+	bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+	unsigned long aligned_size;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	if (flags & ~SHADOW_STACK_SET_TOKEN)
+		return -EINVAL;
+
+	/* If there isn't space for a token */
+	if (set_tok && size < 8)
+		return -EINVAL;
+
+	if (addr && addr <= 0xFFFFFFFF)
+		return -EINVAL;
+
+	/*
+	 * An overflow would result in attempting to write the restore token
+	 * to the wrong location. Not catastrophic, but just return the right
+	 * error code and block it.
+	 */
+	aligned_size = PAGE_ALIGN(size);
+	if (aligned_size < size)
+		return -EOVERFLOW;
+
+	return alloc_shstk(addr, aligned_size, size, set_tok);
+}
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_SHSTK_LOCK) {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33a0ee3bcb2e..392dc11e3556 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1058,6 +1058,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b12940ec5926 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..cb9aebd34646 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -381,6 +381,7 @@ COND_SYSCALL(vm86old);
 COND_SYSCALL(modify_ldt);
 COND_SYSCALL(vm86);
 COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);
 
 /* s390 */
 COND_SYSCALL(s390_pci_mmio_read);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 34/41] x86/shstk: Support WRSS for userspace
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (32 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-10 16:44   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 35/41] x86: Expose thread features in /proc/$PID/status Rick Edgecombe
                   ` (6 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

For the current shadow stack implementation, shadow stack contents can't
easily be provisioned with arbitrary data. This property helps apps
protect themselves better, but also restricts any potential apps that may
want to do exotic things at the expense of a little security.

The x86 shadow stack feature introduces a new instruction, WRSS, which
can be enabled to write directly to shadow stack permissioned memory from
userspace. Allow it to get enabled via the prctl interface.

Only enable the userspace WRSS instruction, which allows writes to
userspace shadow stacks from userspace. Do not allow it to be enabled
independently of shadow stack, as HW does not support using WRSS when
shadow stack is disabled.

From a fault handler perspective, WRSS will behave very similarly to WRUSS,
which is treated like a user access from a #PF err code perspective.
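
A minimal usage sketch, assuming the prctl interface from earlier in the
series (wrssq is the 64-bit form of the instruction, and ssp_slot must
point into shadow stack memory):

/* Shadow stack must already be enabled for this to succeed. */
if (syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS))
	return -1;

/* Write val directly into shadow stack memory. */
asm volatile("wrssq %0, (%1)" : : "r" (val), "r" (ssp_slot) : "memory");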

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v6:
 - Make set_clr_bits_msrl() avoid side effects in 'msr'

v5:
 - Switch to EOPNOTSUPP
 - Move set_clr_bits_msrl() to patch where it is first used
 - Commit log formatting

v3:
 - Make wrss_control() static
 - Fix verbiage in commit log (Kees)

v2:
 - Add some commit log verbiage from (Dave Hansen)

v1:
 - New patch.
---
 arch/x86/include/asm/msr.h        | 11 +++++++++++
 arch/x86/include/uapi/asm/prctl.h |  1 +
 arch/x86/kernel/shstk.c           | 32 ++++++++++++++++++++++++++++++-
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..2d3b35c957ad 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -310,6 +310,17 @@ void msrs_free(struct msr *msrs);
 int msr_set_bit(u32 msr, u8 bit);
 int msr_clear_bit(u32 msr, u8 bit);
 
+/* Helper that can never get accidentally un-inlined. */
+#define set_clr_bits_msrl(msr, set, clear)	do {	\
+	u64 __val, __new_val, __msr = msr;		\
+							\
+	rdmsrl(__msr, __val);				\
+	__new_val = (__val & ~(clear)) | (set);		\
+							\
+	if (__new_val != __val)				\
+		wrmsrl(__msr, __new_val);		\
+} while (0)
+
 #ifdef CONFIG_SMP
 int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
 int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 7dfd9dc00509..e31495668056 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -28,5 +28,6 @@
 
 /* ARCH_SHSTK_ features bits */
 #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
+#define ARCH_SHSTK_WRSS			(1ULL <<  1)
 
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 0a3decab70ee..009cb3fa0ae5 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -363,6 +363,36 @@ void shstk_free(struct task_struct *tsk)
 	unmap_shadow_stack(shstk->base, shstk->size);
 }
 
+static int wrss_control(bool enable)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	/*
+	 * Only enable wrss if shadow stack is enabled. If shadow stack is not
+	 * enabled, wrss will already be disabled, so don't bother clearing it
+	 * when disabling.
+	 */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return -EPERM;
+
+	/* Already enabled/disabled? */
+	if (features_enabled(ARCH_SHSTK_WRSS) == enable)
+		return 0;
+
+	fpregs_lock_and_load();
+	if (enable) {
+		set_clr_bits_msrl(MSR_IA32_U_CET, CET_WRSS_EN, 0);
+		features_set(ARCH_SHSTK_WRSS);
+	} else {
+		set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_WRSS_EN);
+		features_clr(ARCH_SHSTK_WRSS);
+	}
+	fpregs_unlock();
+
+	return 0;
+}
+
 static int shstk_disable(void)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
@@ -379,7 +409,7 @@ static int shstk_disable(void)
 	fpregs_unlock();
 
 	shstk_free(current);
-	features_clr(ARCH_SHSTK_SHSTK);
+	features_clr(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS);
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 35/41] x86: Expose thread features in /proc/$PID/status
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (33 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 34/41] x86/shstk: Support WRSS for userspace Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 36/41] x86/shstk: Wire in shadow stack interface Rick Edgecombe
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

Applications and loaders can have logic to decide whether to enable
shadow stack. They usually don't report whether shadow stack has been
enabled or not, so there is no way to verify whether an application is
actually protected by shadow stack.

Add two lines in /proc/$PID/status to report enabled and locked features.

Since this involves referring to arch-specific defines in asm/prctl.h,
implement an arch breakout to emit the feature lines.
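
With shadow stack and WRSS enabled, and shadow stack locked, the new lines
would read (derived from the seq_puts() strings in the diff below):

x86_Thread_features:		shstk wrss
x86_Thread_features_locked:	shstk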

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[Switched to CET, added to commit log]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v4:
 - Remove "CET" references

v3:
 - Move to /proc/pid/status (Kees)

v2:
 - New patch
---
 arch/x86/kernel/cpu/proc.c | 23 +++++++++++++++++++++++
 fs/proc/array.c            |  6 ++++++
 include/linux/proc_fs.h    |  2 ++
 3 files changed, 31 insertions(+)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 099b6f0d96bd..31c0e68f6227 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -4,6 +4,8 @@
 #include <linux/string.h>
 #include <linux/seq_file.h>
 #include <linux/cpufreq.h>
+#include <asm/prctl.h>
+#include <linux/proc_fs.h>
 
 #include "cpu.h"
 
@@ -175,3 +177,24 @@ const struct seq_operations cpuinfo_op = {
 	.stop	= c_stop,
 	.show	= show_cpuinfo,
 };
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static void dump_x86_features(struct seq_file *m, unsigned long features)
+{
+	if (features & ARCH_SHSTK_SHSTK)
+		seq_puts(m, "shstk ");
+	if (features & ARCH_SHSTK_WRSS)
+		seq_puts(m, "wrss ");
+}
+
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task)
+{
+	seq_puts(m, "x86_Thread_features:\t");
+	dump_x86_features(m, task->thread.features);
+	seq_putc(m, '\n');
+
+	seq_puts(m, "x86_Thread_features_locked:\t");
+	dump_x86_features(m, task->thread.features_locked);
+	seq_putc(m, '\n');
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49283b8103c7..7ac43ecda1c2 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -428,6 +428,11 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+__weak void arch_proc_pid_thread_features(struct seq_file *m,
+					  struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -451,6 +456,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	arch_proc_pid_thread_features(m, task);
 	return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 0260f5ea98fe..80ff8e533cbd 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -158,6 +158,8 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task);
 #endif /* CONFIG_PROC_PID_ARCH_STATUS */
 
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task);
+
 #else /* CONFIG_PROC_FS */
 
 static inline void proc_root_init(void)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 36/41] x86/shstk: Wire in shadow stack interface
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (34 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 35/41] x86: Expose thread features in /proc/$PID/status Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 37/41] selftests/x86: Add shadow stack test Rick Edgecombe
                   ` (4 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

The kernel now has the main shadow stack functionality to support
applications. Wire in the WRSS and shadow stack enable/disable functions
into the existing shadow stack API skeleton.
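
For illustration, userspace is expected to drive this interface roughly
as follows (a hypothetical snippet, not part of the patch; note that the
selftest later in the series avoids the syscall() wrapper when enabling
shadow stack itself, since returning from a function with an empty
shadow stack would fault):

	#include <asm/prctl.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Enable WRSS for the current thread (shadow stack already on) */
	if (syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS))
		perror("arch_prctl(ARCH_SHSTK_ENABLE)");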

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v4:
 - Remove "CET" references

v2:
 - Split from other patches
---
 arch/x86/kernel/shstk.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 009cb3fa0ae5..2faf9b45ac72 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -464,9 +464,17 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 		return -EINVAL;
 
 	if (option == ARCH_SHSTK_DISABLE) {
+		if (features & ARCH_SHSTK_WRSS)
+			return wrss_control(false);
+		if (features & ARCH_SHSTK_SHSTK)
+			return shstk_disable();
 		return -EINVAL;
 	}
 
 	/* Handle ARCH_SHSTK_ENABLE */
+	if (features & ARCH_SHSTK_SHSTK)
+		return shstk_setup();
+	if (features & ARCH_SHSTK_WRSS)
+		return wrss_control(true);
 	return -EINVAL;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 37/41] selftests/x86: Add shadow stack test
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (35 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 36/41] x86/shstk: Wire in shadow stack interface Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 38/41] x86/fpu: Add helper for initing features Rick Edgecombe
                   ` (3 subsequent siblings)
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

Add a simple selftest for exercising some shadow stack behavior:
 - map_shadow_stack syscall and pivot
 - Faulting in shadow stack memory
 - Handling shadow stack violations
 - GUP of shadow stack memory
 - mprotect() of shadow stack memory
 - Userfaultfd on shadow stack memory
 - 32 bit segmentation

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v7:
 - Remove KHDR_INCLUDES and just add a copy of the defines (Boris)

v6:
 - Tweak mprotect test
 - Code style tweaks

v5:
 - Update 32 bit signal test with new ABI and better asm

v4:
 - Add test for 32 bit signal ABI blocking

v3:
 - Change "+m" to "=m" in write_shstk() (Andrew Cooper)
 - Fix userfaultfd test with transparent huge pages by doing a
   MADV_DONTNEED, since the token write faults in the whole stack with
   huge pages.
---
 tools/testing/selftests/x86/Makefile          |   2 +-
 .../testing/selftests/x86/test_shadow_stack.c | 695 ++++++++++++++++++
 2 files changed, 696 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index ca9374b56ead..cfc8a26ad151 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
 TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
-			corrupt_xstate_header amx
+			corrupt_xstate_header amx test_shadow_stack
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
new file mode 100644
index 000000000000..94eb223456f6
--- /dev/null
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -0,0 +1,695 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program test's basic kernel shadow stack support. It enables shadow
+ * stack manual via the arch_prctl(), instead of relying on glibc. It's
+ * Makefile doesn't compile with shadow stack support, so it doesn't rely on
+ * any particular glibc. As a result it can't do any operations that require
+ * special glibc shadow stack support (longjmp(), swapcontext(), etc). Just
+ * stick to the basics and hope the compiler doesn't do anything strange.
+ */
+
+#define _GNU_SOURCE
+
+#include <sys/syscall.h>
+#include <asm/mman.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <x86intrin.h>
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#include <stdint.h>
+#include <signal.h>
+#include <pthread.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <setjmp.h>
+
+/*
+ * Define the ABI defines if needed, so people can run the tests
+ * without building the headers.
+ */
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack	451
+
+#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)
+
+#define ARCH_SHSTK_ENABLE	0x5001
+#define ARCH_SHSTK_DISABLE	0x5002
+#define ARCH_SHSTK_LOCK		0x5003
+#define ARCH_SHSTK_UNLOCK	0x5004
+#define ARCH_SHSTK_STATUS	0x5005
+
+#define ARCH_SHSTK_SHSTK	(1ULL <<  0)
+#define ARCH_SHSTK_WRSS		(1ULL <<  1)
+#endif
+
+#define SS_SIZE 0x200000
+
+#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
+int main(int argc, char *argv[])
+{
+	printf("[SKIP]\tCompiler does not support CET.\n");
+	return 0;
+}
+#else
+void write_shstk(unsigned long *addr, unsigned long val)
+{
+	asm volatile("wrssq %[val], (%[addr])\n"
+		     : "=m" (*addr)
+		     : [addr] "r" (addr), [val] "r" (val));
+}
+
+static inline unsigned long __attribute__((always_inline)) get_ssp(void)
+{
+	unsigned long ret = 0;
+
+	asm volatile("xor %0, %0; rdsspq %0" : "=r" (ret));
+	return ret;
+}
+
+/*
+ * For use in inline enablement of shadow stack.
+ *
+ * The program can't return from the point where shadow stack gets enabled
+ * because there will be no address on the shadow stack. So it can't use
+ * syscall() for enablement, since it is a function.
+ *
+ * Based on code from nolibc.h. Keep a copy here because this can't pull in all
+ * of nolibc.h.
+ */
+#define ARCH_PRCTL(arg1, arg2)					\
+({								\
+	long _ret;						\
+	register long _num  asm("eax") = __NR_arch_prctl;	\
+	register long _arg1 asm("rdi") = (long)(arg1);		\
+	register long _arg2 asm("rsi") = (long)(arg2);		\
+								\
+	asm volatile (						\
+		"syscall\n"					\
+		: "=a"(_ret)					\
+		: "r"(_arg1), "r"(_arg2),			\
+		  "0"(_num)					\
+		: "rcx", "r11", "memory", "cc"			\
+	);							\
+	_ret;							\
+})
+
+void *create_shstk(void *addr)
+{
+	return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
+}
+
+void *create_normal_mem(void *addr)
+{
+	return mmap(addr, SS_SIZE, PROT_READ | PROT_WRITE,
+		    MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+}
+
+void free_shstk(void *shstk)
+{
+	munmap(shstk, SS_SIZE);
+}
+
+int reset_shstk(void *shstk)
+{
+	return madvise(shstk, SS_SIZE, MADV_DONTNEED);
+}
+
+void try_shstk(unsigned long new_ssp)
+{
+	unsigned long ssp;
+
+	printf("[INFO]\tnew_ssp = %lx, *new_ssp = %lx\n",
+	       new_ssp, *((unsigned long *)new_ssp));
+
+	ssp = get_ssp();
+	printf("[INFO]\tchanging ssp from %lx to %lx\n", ssp, new_ssp);
+
+	asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
+	asm volatile("saveprevssp");
+	printf("[INFO]\tssp is now %lx\n", get_ssp());
+
+	/* Switch back to original shadow stack */
+	ssp -= 8;
+	asm volatile("rstorssp (%0)\n":: "r" (ssp));
+	asm volatile("saveprevssp");
+}
+
+int test_shstk_pivot(void)
+{
+	void *shstk = create_shstk(0);
+
+	if (shstk == MAP_FAILED) {
+		printf("[FAIL]\tError creating shadow stack: %d\n", errno);
+		return 1;
+	}
+	try_shstk((unsigned long)shstk + SS_SIZE - 8);
+	free_shstk(shstk);
+
+	printf("[OK]\tShadow stack pivot\n");
+	return 0;
+}
+
+int test_shstk_faults(void)
+{
+	unsigned long *shstk = create_shstk(0);
+
+	/* Read shadow stack, test that it's zero so the read isn't optimized out */
+	if (*shstk != 0)
+		goto err;
+
+	/* Wrss memory that was already read. */
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	/* Page out memory, so we can wrss it again. */
+	if (reset_shstk((void *)shstk))
+		goto err;
+
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	printf("[OK]\tShadow stack faults\n");
+	return 0;
+
+err:
+	return 1;
+}
+
+unsigned long saved_ssp;
+unsigned long saved_ssp_val;
+volatile bool segv_triggered;
+
+void __attribute__((noinline)) violate_ss(void)
+{
+	saved_ssp = get_ssp();
+	saved_ssp_val = *(unsigned long *)saved_ssp;
+
+	/* Corrupt shadow stack */
+	printf("[INFO]\tCorrupting shadow stack\n");
+	write_shstk((void *)saved_ssp, 0);
+}
+
+void segv_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tGenerated shadow stack violation successfully\n");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	write_shstk((void *)saved_ssp, saved_ssp_val);
+}
+
+int test_shstk_violation(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = segv_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* Make sure segv_triggered is set before violate_ss() */
+	asm volatile("" : : : "memory");
+
+	violate_ss();
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow stack violation test\n");
+
+	return !segv_triggered;
+}
+
+/* Gup test state */
+#define MAGIC_VAL 0x12345678
+bool is_shstk_access;
+void *shstk_ptr;
+int fd;
+
+void reset_test_shstk(void *addr)
+{
+	if (shstk_ptr)
+		free_shstk(shstk_ptr);
+	shstk_ptr = create_shstk(addr);
+}
+
+void test_access_fix_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tViolation from %s\n", is_shstk_access ? "shstk access" : "normal write");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	if (is_shstk_access) {
+		reset_test_shstk(shstk_ptr);
+		return;
+	}
+
+	free_shstk(shstk_ptr);
+	create_normal_mem(shstk_ptr);
+}
+
+bool test_shstk_access(void *ptr)
+{
+	is_shstk_access = true;
+	segv_triggered = false;
+	write_shstk(ptr, MAGIC_VAL);
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool test_write_access(void *ptr)
+{
+	is_shstk_access = false;
+	segv_triggered = false;
+	*(unsigned long *)ptr = MAGIC_VAL;
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool gup_write(void *ptr)
+{
+	unsigned long val;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (write(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+bool gup_read(void *ptr)
+{
+	unsigned long val;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (read(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+int test_gup(void)
+{
+	struct sigaction sa = {};
+	int status;
+	pid_t pid;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	fd = open("/proc/self/mem", O_RDWR);
+	if (fd == -1)
+		return 1;
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> write access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> write access success\n");
+
+	close(fd);
+
+	/* COW/gup test */
+	reset_test_shstk(0);
+	pid = fork();
+	if (!pid) {
+		fd = open("/proc/self/mem", O_RDWR);
+		if (fd == -1)
+			exit(1);
+
+		if (gup_write(shstk_ptr)) {
+			close(fd);
+			exit(1);
+		}
+		close(fd);
+		exit(0);
+	}
+	waitpid(pid, &status, 0);
+	if (WEXITSTATUS(status)) {
+		printf("[FAIL]\tWrite in child failed\n");
+		return 1;
+	}
+	if (*(unsigned long *)shstk_ptr == MAGIC_VAL) {
+		printf("[FAIL]\tWrite in child wrote through to shared memory\n");
+		return 1;
+	}
+
+	printf("[INFO]\tCow gup write -> write access success\n");
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow stack gup test\n");
+
+	return 0;
+}
+
+int test_mprotect(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* mprotect a shadow stack as read only */
+	reset_test_shstk(0);
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+		return 1;
+	}
+
+	/* try to wrss it and fail */
+	if (!test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to read-only memory succeeded\n");
+		return 1;
+	}
+
+	/*
+	 * The shadow stack was reset above to resolve the fault, make the new one
+	 * read-only.
+	 */
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+		return 1;
+	}
+
+	/* then back to writable */
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_WRITE | PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_WRITE) failed\n");
+		return 1;
+	}
+
+	/* then wrss to it and succeed */
+	if (test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to mprotect() writable memory failed\n");
+		return 1;
+	}
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tmprotect() test\n");
+
+	return 0;
+}
+
+char zero[4096];
+
+static void *uffd_thread(void *arg)
+{
+	struct uffdio_copy req;
+	int uffd = *(int *)arg;
+	struct uffd_msg msg;
+
+	if (read(uffd, &msg, sizeof(msg)) <= 0)
+		return (void *)1;
+
+	req.dst = msg.arg.pagefault.address;
+	req.src = (__u64)zero;
+	req.len = 4096;
+	req.mode = 0;
+
+	if (ioctl(uffd, UFFDIO_COPY, &req))
+		return (void *)1;
+
+	return (void *)0;
+}
+
+int test_userfaultfd(void)
+{
+	struct uffdio_register uffdio_register;
+	struct uffdio_api uffdio_api;
+	struct sigaction sa = {};
+	pthread_t thread;
+	void *res;
+	int uffd;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0) {
+		printf("[SKIP]\tUserfaultfd unavailable.\n");
+		return 0;
+	}
+
+	reset_test_shstk(0);
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+		goto err;
+
+	uffdio_register.range.start = (__u64)shstk_ptr;
+	uffdio_register.range.len = 4096;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		goto err;
+
+	if (pthread_create(&thread, NULL, &uffd_thread, &uffd))
+		goto err;
+
+	reset_shstk(shstk_ptr);
+	test_shstk_access(shstk_ptr);
+
+	if (pthread_join(thread, &res))
+		goto err;
+
+	if (test_shstk_access(shstk_ptr))
+		goto err;
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	if (!res)
+		printf("[OK]\tUserfaultfd test\n");
+	return !!res;
+err:
+	free_shstk(shstk_ptr);
+	close(uffd);
+	signal(SIGSEGV, SIG_DFL);
+	return 1;
+}
+
+/*
+ * Too complicated to pull this out of the 32 bit headers while also
+ * getting the 64 bit ones needed above. Just define a copy here.
+ */
+#define __NR_compat_sigaction 67
+
+/*
+ * Call 32 bit signal handler to get 32 bit signals ABI. Make sure
+ * to push the registers that will get clobbered.
+ */
+int sigaction32(int signum, const struct sigaction *restrict act,
+		struct sigaction *restrict oldact)
+{
+	register long syscall_reg asm("eax") = __NR_compat_sigaction;
+	register long signum_reg asm("ebx") = signum;
+	register long act_reg asm("ecx") = (long)act;
+	register long oldact_reg asm("edx") = (long)oldact;
+	int ret = 0;
+
+	asm volatile ("int $0x80;"
+		      : "=a"(ret), "=m"(oldact)
+		      : "r"(syscall_reg), "r"(signum_reg), "r"(act_reg),
+			"r"(oldact_reg)
+		      : "r8", "r9", "r10", "r11"
+		     );
+
+	return ret;
+}
+
+sigjmp_buf jmp_buffer;
+
+void segv_gp_handler(int signum, siginfo_t *si, void *uc)
+{
+	segv_triggered = true;
+
+	/*
+	 * To work with old glibc, this can't rely on siglongjmp working with
+	 * shadow stack enabled, so disable shadow stack before siglongjmp().
+	 */
+	ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
+	siglongjmp(jmp_buffer, -1);
+}
+
+/*
+ * Transition to 32 bit mode and check that a #GP triggers a segfault.
+ */
+int test_32bit(void)
+{
+	struct sigaction sa = {};
+	struct sigaction *sa32;
+
+	/* Create sigaction in 32 bit address range */
+	sa32 = mmap(0, 4096, PROT_READ | PROT_WRITE,
+		    MAP_32BIT | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	sa32->sa_flags = SA_SIGINFO;
+
+	sa.sa_sigaction = segv_gp_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* Make sure segv_triggered is set before triggering the #GP */
+	asm volatile("" : : : "memory");
+
+	/*
+	 * Set handler to somewhere in 32 bit address space
+	 */
+	sa32->sa_handler = (void *)sa32;
+	if (sigaction32(SIGUSR1, sa32, NULL))
+		return 1;
+
+	if (!sigsetjmp(jmp_buffer, 1))
+		raise(SIGUSR1);
+
+	if (segv_triggered)
+		printf("[OK]\t32 bit test\n");
+
+	return !segv_triggered;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = 0;
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+		printf("[SKIP]\tCould not enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+		printf("[SKIP]\tCould not re-enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS)) {
+		printf("[SKIP]\tCould not enable WRSS\n");
+		ret = 1;
+		goto out;
+	}
+
+	/* Should have succeeded if here, but this is a test, so double check. */
+	if (!get_ssp()) {
+		printf("[FAIL]\tShadow stack disabled\n");
+		return 1;
+	}
+
+	if (test_shstk_pivot()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack pivot\n");
+		goto out;
+	}
+
+	if (test_shstk_faults()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack fault test\n");
+		goto out;
+	}
+
+	if (test_shstk_violation()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack violation test\n");
+		goto out;
+	}
+
+	if (test_gup()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack gup\n");
+		goto out;
+	}
+
+	if (test_mprotect()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack mprotect test\n");
+		goto out;
+	}
+
+	if (test_userfaultfd()) {
+		ret = 1;
+		printf("[FAIL]\tUserfaultfd test\n");
+		goto out;
+	}
+
+	if (test_32bit()) {
+		ret = 1;
+		printf("[FAIL]\t32 bit test\n");
+	}
+
+	return ret;
+
+out:
+	/*
+	 * Disable shadow stack before the function returns, or there will be a
+	 * shadow stack violation.
+	 */
+	if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	return ret;
+}
+#endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 38/41] x86/fpu: Add helper for initing features
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (36 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 37/41] selftests/x86: Add shadow stack test Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-11 12:54   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 39/41] x86: Add PTRACE interface for shadow stack Rick Edgecombe
                   ` (2 subsequent siblings)
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

If an xfeature is saved in a buffer, the xfeature's bit will be set in
xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
is in its init state. In this case the xfeature buffer address cannot
be retrieved with get_xsave_addr().

Future patches will need to handle the case of writing to an xfeature
that may not be saved. So provide helpers to init an xfeature in an
xsave buffer.

This could of course be done directly by reaching into the xsave buffer;
however, this would not be robust against future changes that optimize
the xsave buffer by compacting it. In that case the code reaching into
the buffer would need to be re-arranged as well. So the logic properly
belongs encapsulated in a helper where it can be unified.
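
For illustration, the intended calling pattern for these helpers looks
roughly like this (a sketch mirroring the ptrace patch later in the
series):

	if (!xfeature_saved(xsave, XFEATURE_CET_USER) &&
	    init_xfeature(xsave, XFEATURE_CET_USER))
		return -ENODEV;

	/* The feature bit is now set, so this will not return NULL */
	cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);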

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v2:
 - New patch
---
 arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++-------
 arch/x86/kernel/fpu/xstate.h |  6 ++++
 2 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 13a80521dd51..3ff80be0a441 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -934,6 +934,24 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
 }
 
+static int xsave_buffer_access_checks(int xfeature_nr)
+{
+	/*
+	 * Do we even *have* xsave state?
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVE))
+		return 1;
+
+	/*
+	 * We should not ever be requesting features that we
+	 * have not enabled.
+	 */
+	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+		return 1;
+
+	return 0;
+}
+
 /*
  * Given the xsave area and a state inside, this function returns the
  * address of the state.
@@ -954,17 +972,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
  */
 void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 {
-	/*
-	 * Do we even *have* xsave state?
-	 */
-	if (!boot_cpu_has(X86_FEATURE_XSAVE))
-		return NULL;
-
-	/*
-	 * We should not ever be requesting features that we
-	 * have not enabled.
-	 */
-	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+	if (xsave_buffer_access_checks(xfeature_nr))
 		return NULL;
 
 	/*
@@ -984,6 +992,34 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	return __raw_xsave_addr(xsave, xfeature_nr);
 }
 
+/*
+ * Given the xsave area and a state inside, this function
+ * initializes an xfeature in the buffer.
+ *
+ * get_xsave_addr() will return NULL if the feature bit is
+ * not present in the header. This function will make it so
+ * the xfeature buffer address is ready to be retrieved by
+ * get_xsave_addr().
+ *
+ * Inputs:
+ *	xstate: the thread's storage area for all FPU data
+ *	xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
+ *	XFEATURE_SSE, etc...)
+ * Output:
+ *	1 if the feature cannot be inited, 0 on success
+ */
+int init_xfeature(struct xregs_state *xsave, int xfeature_nr)
+{
+	if (xsave_buffer_access_checks(xfeature_nr))
+		return 1;
+
+	/*
+	 * Mark the feature inited.
+	 */
+	xsave->header.xfeatures |= BIT_ULL(xfeature_nr);
+	return 0;
+}
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 
 /*
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..dc06f63063ee 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -54,6 +54,12 @@ extern void fpu__init_cpu_xstate(void);
 extern void fpu__init_system_xstate(unsigned int legacy_size);
 
 extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+extern int init_xfeature(struct xregs_state *xsave, int xfeature_nr);
+
+static inline int xfeature_saved(struct xregs_state *xsave, int xfeature_nr)
+{
+	return xsave->header.xfeatures & BIT_ULL(xfeature_nr);
+}
 
 static inline u64 xfeatures_mask_supervisor(void)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 39/41] x86: Add PTRACE interface for shadow stack
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (37 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 38/41] x86/fpu: Add helper for initing features Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-11 15:06   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 40/41] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
  2023-02-27 22:29 ` [PATCH v7 41/41] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Some applications (like GDB) would like to tweak shadow stack state via
ptrace. This allows for existing functionality to continue to work for
seized shadow stack applications. Provide a regset interface for
manipulating the shadow stack pointer (SSP).

There is already ptrace functionality for accessing xstate, but this
does not include supervisor xfeatures. So there is not a completely
clear place for where to put the shadow stack state. Adding it to the
user xfeatures regset would complicate that code, as it currently shares
logic with signals which should not have supervisor features.

Don't add a general supervisor xfeature regset like the user one,
because it is better to maintain flexibility for other supervisor
xfeatures to define their own interface. For example, an xfeature may
decide not to expose all of its state to userspace, as is actually the
case for shadow stack ptrace functionality. A lot of enum values remain
to be used, so just put it in a dedicated shadow stack regset.

The only downside to not having a generic supervisor xfeature regset
is that apps need to be enlightened of any new supervisor xfeature
exposed this way (i.e. they can't try to have generic save/restore
logic). But maybe that is a good thing, because they have to think
through each new xfeature instead of encountering issues when a new
supervisor xfeature is added.

Adding a shadow stack regset also has the effect of including the
shadow stack state in a core dump, which could be useful for debugging.

The shadow stack specific xstate includes the SSP, and the shadow stack
and WRSS enablement status. Enabling shadow stack or wrss in the kernel
involves more than just flipping the bit. The kernel is made aware that
it has to do extra things when cloning or handling signals. That logic
is triggered off of separate feature enablement state kept in the task
struct. So flipping on HW shadow stack enforcement without notifying
the kernel to change its behavior would severely limit what an application
could do without crashing, and the results would depend on kernel
internal implementation details. There is also no known use for controlling
this state via ptrace today. So only expose the SSP, which is something
that userspace already has indirect control over.
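
For illustration, a debugger could read the tracee's SSP via the new
regset roughly like this (a hedged sketch, assuming the tracee is
already ptrace-stopped and that NT_X86_SHSTK is visible in the headers
being built against):

	#include <elf.h>
	#include <sys/ptrace.h>
	#include <sys/types.h>
	#include <sys/uio.h>

	static long read_ssp(pid_t pid, unsigned long *ssp)
	{
		struct iovec iov = {
			.iov_base = ssp,
			.iov_len = sizeof(*ssp),
		};

		return ptrace(PTRACE_GETREGSET, pid, NT_X86_SHSTK, &iov);
	}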

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

---
v5:
 - Check shadow stack enablement status for tracee (rppt)
 - Fix typo in comment

v4:
 - Make shadow stack only. Reduce to only supporting SSP register, and
   remove CET references (peterz)
 - Add comment to not use 0x203, because binutils already looks for it in
   coredumps. (Christina Schimpe)

v3:
 - Drop dependence on thread.shstk.size, and use thread.features bits
 - Drop 32 bit support

v2:
 - Check alignment on ssp.
 - Block IBT bits.
 - Handle init states instead of returning error.
 - Add verbose commit log justifying the design.
---
 arch/x86/include/asm/fpu/regset.h |  7 +--
 arch/x86/kernel/fpu/regset.c      | 86 +++++++++++++++++++++++++++++++
 arch/x86/kernel/ptrace.c          | 12 +++++
 include/uapi/linux/elf.h          |  2 +
 4 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/regset.h b/arch/x86/include/asm/fpu/regset.h
index 4f928d6a367b..697b77e96025 100644
--- a/arch/x86/include/asm/fpu/regset.h
+++ b/arch/x86/include/asm/fpu/regset.h
@@ -7,11 +7,12 @@
 
 #include <linux/regset.h>
 
-extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active;
+extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active,
+				ssp_active;
 extern user_regset_get2_fn fpregs_get, xfpregs_get, fpregs_soft_get,
-				 xstateregs_get;
+				 xstateregs_get, ssp_get;
 extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
-				 xstateregs_set;
+				 xstateregs_set, ssp_set;
 
 /*
  * xstateregs_active == regset_fpregs_active. Please refer to the comment
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 6d056b68f4ed..c806952d9496 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -8,6 +8,7 @@
 #include <asm/fpu/api.h>
 #include <asm/fpu/signal.h>
 #include <asm/fpu/regset.h>
+#include <asm/prctl.h>
 
 #include "context.h"
 #include "internal.h"
@@ -174,6 +175,91 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	return ret;
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+int ssp_active(struct task_struct *target, const struct user_regset *regset)
+{
+	if (target->thread.features & ARCH_SHSTK_SHSTK)
+		return regset->n;
+
+	return 0;
+}
+
+int ssp_get(struct task_struct *target, const struct user_regset *regset,
+	    struct membuf to)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct cet_user_state *cetregs;
+
+	if (!boot_cpu_has(X86_FEATURE_USER_SHSTK))
+		return -ENODEV;
+
+	sync_fpstate(fpu);
+	cetregs = get_xsave_addr(&fpu->fpstate->regs.xsave, XFEATURE_CET_USER);
+	if (!cetregs) {
+		/*
+		 * The registers are in the init state. The init values for
+		 * these regs are zero, so just zero the output buffer.
+		 */
+		membuf_zero(&to, sizeof(cetregs->user_ssp));
+		return 0;
+	}
+
+	return membuf_write(&to, (unsigned long *)&cetregs->user_ssp,
+			    sizeof(cetregs->user_ssp));
+}
+
+int ssp_set(struct task_struct *target, const struct user_regset *regset,
+	    unsigned int pos, unsigned int count,
+	    const void *kbuf, const void __user *ubuf)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct xregs_state *xsave = &fpu->fpstate->regs.xsave;
+	struct cet_user_state *cetregs;
+	unsigned long user_ssp;
+	int r;
+
+	if (!boot_cpu_has(X86_FEATURE_USER_SHSTK) ||
+	    !ssp_active(target, regset))
+		return -ENODEV;
+
+	r = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_ssp, 0, -1);
+	if (r)
+		return r;
+
+	/*
+	 * Some kernel instructions (IRET, etc) can cause exceptions in the case
+	 * of disallowed CET register values. Just prevent invalid values.
+	 */
+	if ((user_ssp >= TASK_SIZE_MAX) || !IS_ALIGNED(user_ssp, 8))
+		return -EINVAL;
+
+	fpu_force_restore(fpu);
+
+	/*
+	 * Don't want to init the xfeature until the kernel will definitely
+	 * overwrite it, otherwise if it inits and then fails out, it would
+	 * end up initing it to random data.
+	 */
+	if (!xfeature_saved(xsave, XFEATURE_CET_USER) &&
+	    WARN_ON(init_xfeature(xsave, XFEATURE_CET_USER)))
+		return -ENODEV;
+
+	cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
+	if (WARN_ON(!cetregs)) {
+		/*
+		 * This shouldn't ever be NULL because it was successfully
+		 * inited above if needed. The only scenario would be if an
+		 * xfeature was somehow saved in a buffer, but not enabled in
+		 * xsave.
+		 */
+		return -ENODEV;
+	}
+
+	cetregs->user_ssp = user_ssp;
+	return 0;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
 
 /*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index dfaa270a7cc9..095f04bdabdc 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -58,6 +58,7 @@ enum x86_regset_64 {
 	REGSET64_FP,
 	REGSET64_IOPERM,
 	REGSET64_XSTATE,
+	REGSET64_SSP,
 };
 
 #define REGSET_GENERAL \
@@ -1267,6 +1268,17 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
 		.active		= ioperm_active,
 		.regset_get	= ioperm_get
 	},
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	[REGSET64_SSP] = {
+		.core_note_type	= NT_X86_SHSTK,
+		.n		= 1,
+		.size		= sizeof(u64),
+		.align		= sizeof(u64),
+		.active		= ssp_active,
+		.regset_get	= ssp_get,
+		.set		= ssp_set
+	},
+#endif
 };
 
 static const struct user_regset_view user_x86_64_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 68de6f4c4eee..103a1f2da86e 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -406,6 +406,8 @@ typedef struct elf64_shdr {
 #define NT_386_TLS	0x200		/* i386 TLS slots (struct user_desc) */
 #define NT_386_IOPERM	0x201		/* x86 io permission bitmap (1=deny) */
 #define NT_X86_XSTATE	0x202		/* x86 extended state using xsave */
+/* Old binutils treats 0x203 as a CET state */
+#define NT_X86_SHSTK	0x204		/* x86 SHSTK state */
 #define NT_S390_HIGH_GPRS	0x300	/* s390 upper register halves */
 #define NT_S390_TIMER	0x301		/* s390 timer register */
 #define NT_S390_TODCMP	0x302		/* s390 TOD clock comparator register */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 40/41] x86/shstk: Add ARCH_SHSTK_UNLOCK
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (38 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 39/41] x86: Add PTRACE interface for shadow stack Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  2023-03-11 15:11   ` Borislav Petkov
  2023-02-27 22:29 ` [PATCH v7 41/41] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe
  40 siblings, 1 reply; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

Userspace loaders may lock features before a CRIU restore operation has
the chance to set them to whatever state is required by the process
being restored. Allow a way for CRIU to unlock features. Add it as an
arch_prctl() like the other shadow stack operations, but restrict it to
being called via the ptrace arch_prctl() interface.
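
For illustration, CRIU could unlock a seized task roughly as follows (a
hedged sketch; it assumes x86-64's PTRACE_ARCH_PRCTL request, which
passes 'data' as the option and 'addr' as the argument through to
do_arch_prctl_64() for the traced child):

	#include <sys/ptrace.h>
	#include <sys/types.h>

	static long shstk_unlock(pid_t pid, unsigned long features)
	{
		/* 'addr' carries the features mask, 'data' the option */
		return ptrace(PTRACE_ARCH_PRCTL, pid, features,
			      ARCH_SHSTK_UNLOCK);
	}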

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
[Merged into recent API changes, added commit log and docs]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v4:
 - Add to docs that it is ptrace only.
 - Remove "CET" references

v3:
 - Depend on CONFIG_CHECKPOINT_RESTORE (Kees)
---
 Documentation/x86/shstk.rst       | 4 ++++
 arch/x86/include/uapi/asm/prctl.h | 1 +
 arch/x86/kernel/process_64.c      | 1 +
 arch/x86/kernel/shstk.c           | 9 +++++++--
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
index f2e6f323cf68..e8ed5fc0f7ae 100644
--- a/Documentation/x86/shstk.rst
+++ b/Documentation/x86/shstk.rst
@@ -73,6 +73,10 @@ arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
     are ignored. The mask is ORed with the existing value. So any feature bits
     set here cannot be enabled or disabled afterwards.
 
+arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
+    Unlock features. 'features' is a mask of all features to unlock. All
+    bits set are processed, unset bits are ignored. Only works via ptrace.
+
 The return values are as follows. On success, return 0. On error, errno can
 be::
 
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index e31495668056..200efbbe5809 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -25,6 +25,7 @@
 #define ARCH_SHSTK_ENABLE		0x5001
 #define ARCH_SHSTK_DISABLE		0x5002
 #define ARCH_SHSTK_LOCK			0x5003
+#define ARCH_SHSTK_UNLOCK		0x5004
 
 /* ARCH_SHSTK_ features bits */
 #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 71094c8a305f..d368854fa9c4 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -835,6 +835,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_SHSTK_ENABLE:
 	case ARCH_SHSTK_DISABLE:
 	case ARCH_SHSTK_LOCK:
+	case ARCH_SHSTK_UNLOCK:
 		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 2faf9b45ac72..3197ff824809 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -451,9 +451,14 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 		return 0;
 	}
 
-	/* Don't allow via ptrace */
-	if (task != current)
+	/* Only allow via ptrace */
+	if (task != current) {
+		if (option == ARCH_SHSTK_UNLOCK && IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
+			task->thread.features_locked &= ~features;
+			return 0;
+		}
 		return -EINVAL;
+	}
 
 	/* Do not allow to change locked features */
 	if (features & task->thread.features_locked)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v7 41/41] x86/shstk: Add ARCH_SHSTK_STATUS
  2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
                   ` (39 preceding siblings ...)
  2023-02-27 22:29 ` [PATCH v7 40/41] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
@ 2023-02-27 22:29 ` Rick Edgecombe
  40 siblings, 0 replies; 159+ messages in thread
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

CRIU and GDB need to get the current shadow stack and WRSS enablement
status. This information is already available via /proc/pid/status, but
that is inconvenient for CRIU because it involves parsing text output
in an area of the code where doing so is difficult. Provide a status
arch_prctl(), ARCH_SHSTK_STATUS for retrieving the status. Have arg2 be a
userspace address, and make the new arch_prctl simply copy the features
out to userspace.
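
For illustration, a minimal sketch of querying the status from userspace
(a hypothetical snippet, not part of the patch):

	#include <asm/prctl.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	unsigned long features;

	if (!syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, &features))
		printf("shstk=%d wrss=%d\n",
		       !!(features & ARCH_SHSTK_SHSTK),
		       !!(features & ARCH_SHSTK_WRSS));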

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v5:
 - Fix typo in commit log

v4:
 - New patch
---
 Documentation/x86/shstk.rst       | 6 ++++++
 arch/x86/include/asm/shstk.h      | 2 +-
 arch/x86/include/uapi/asm/prctl.h | 1 +
 arch/x86/kernel/process_64.c      | 1 +
 arch/x86/kernel/shstk.c           | 8 +++++++-
 5 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
index e8ed5fc0f7ae..7f4af798794e 100644
--- a/Documentation/x86/shstk.rst
+++ b/Documentation/x86/shstk.rst
@@ -77,6 +77,11 @@ arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
     Unlock features. 'features' is a mask of all features to unlock. All
     bits set are processed, unset bits are ignored. Only works via ptrace.
 
+arch_prctl(ARCH_SHSTK_STATUS, unsigned long addr)
+    Copy the currently enabled features to the address passed in addr. The
+    features are described using the same bits that the other arch_prctl()s
+    accept in 'features'.
+
 The return values are as follows. On success, return 0. On error, errno can
 be::
 
@@ -84,6 +89,7 @@ be::
         -ENOTSUPP if the feature is not supported by the hardware or
          kernel.
         -EINVAL arguments (non existing feature, etc)
+        -EFAULT if information could not be copied back to userspace
 
 The feature's bits supported are::
 
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index acee68d30a07..be9267897211 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -14,7 +14,7 @@ struct thread_shstk {
 	u64	size;
 };
 
-long shstk_prctl(struct task_struct *task, int option, unsigned long features);
+long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
 void reset_thread_features(void);
 int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
 			     unsigned long stack_size,
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 200efbbe5809..1b85bc876c2d 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -26,6 +26,7 @@
 #define ARCH_SHSTK_DISABLE		0x5002
 #define ARCH_SHSTK_LOCK			0x5003
 #define ARCH_SHSTK_UNLOCK		0x5004
+#define ARCH_SHSTK_STATUS		0x5005
 
 /* ARCH_SHSTK_ features bits */
 #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d368854fa9c4..dde43caf196e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -836,6 +836,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_SHSTK_DISABLE:
 	case ARCH_SHSTK_LOCK:
 	case ARCH_SHSTK_UNLOCK:
+	case ARCH_SHSTK_STATUS:
 		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 3197ff824809..4069d5bbbe8c 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -444,8 +444,14 @@ SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsi
 	return alloc_shstk(addr, aligned_size, size, set_tok);
 }
 
-long shstk_prctl(struct task_struct *task, int option, unsigned long features)
+long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
 {
+	unsigned long features = arg2;
+
+	if (option == ARCH_SHSTK_STATUS) {
+		return put_user(task->thread.features, (unsigned long __user *)arg2);
+	}
+
 	if (option == ARCH_SHSTK_LOCK) {
 		task->thread.features_locked |= features;
 		return 0;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot
  2023-02-27 22:29 ` [PATCH v7 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot Rick Edgecombe
@ 2023-02-27 22:54   ` Kees Cook
  2023-03-08  9:23   ` Borislav Petkov
  1 sibling, 0 replies; 159+ messages in thread
From: Kees Cook @ 2023-02-27 22:54 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe

On February 27, 2023 2:29:43 PM PST, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
>When user shadow stack is in use, Write=0,Dirty=1 is treated by the CPU as
>shadow stack memory. So for shadow stack memory this bit combination is
>valid, but when Dirty=1,Write=1 (conventionally writable) memory is being
>write protected, the kernel has been taught to transition the Dirty=1
>bit to SavedDirty=1, to avoid inadvertently creating shadow stack
>memory. It does this inside pte_wrprotect() because it knows the PTE is
>not intended to be a writable shadow stack entry, it is supposed to be
>write protected.
>
>However, when a PTE is created by a raw prot using mk_pte(), mk_pte()
>can't know whether to adjust Dirty=1 to SavedDirty=1. It can't
>distinguish between the caller intending to create a shadow stack PTE or
>needing the SavedDirty shift.
>
>The kernel has been updated to not do this, and so Write=0,Dirty=1
>memory should only be created by the pte_mkfoo() helpers. Add a warning
>to make sure no new mk_pte() start doing this.
>
>Tested-by: Pengfei Xu <pengfei.xu@intel.com>
>Tested-by: John Allen <john.allen@amd.com>
>Tested-by: Kees Cook <keescook@chromium.org>
>Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
>Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>


-- 
Kees Cook

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 13/41] mm: Make pte_mkwrite() take a VMA
  2023-02-27 22:29 ` [PATCH v7 13/41] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
@ 2023-03-01  7:03   ` Christophe Leroy
  2023-03-01  8:16     ` David Hildenbrand
  2023-03-02 12:19   ` Borislav Petkov
  1 sibling, 1 reply; 159+ messages in thread
From: Christophe Leroy @ 2023-03-01  7:03 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: linux-alpha, linux-snps-arc, linux-arm-kernel, linux-csky,
	linux-hexagon, linux-ia64, loongarch, linux-m68k, Michal Simek,
	Dinh Nguyen, linux-mips, linux-openrisc, linux-parisc,
	linuxppc-dev, linux-riscv, linux-s390, linux-sh, sparclinux,
	linux-um, xen-devel



On 27/02/2023 at 23:29, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite().
> 
> In addition to VM_WRITE, the shadow stack VMA's will have a flag denoting
> that they are special shadow stack flavor of writable memory. So make
> pte_mkwrite() take a VMA, so that the x86 implementation of it can know to
> create regular writable memory or shadow stack memory.
> 
> Apply the same changes for pmd_mkwrite() and huge_pte_mkwrite().

I'm not sure it is a good idea to add a second argument to 
pte_mkwrite(). All pte_mkxxxx() only take a pte and nothing else.

I think you should do the same as commit d9ed9faac283 ("mm: add new 
arch_make_huge_pte() method for tile support")
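
For illustration, that shape would be a generic helper with an arch
override rather than a signature change (a hypothetical sketch; the
names here are invented, not taken from that commit):

	/* generic fallback; an arch can override this to consult the VMA */
	#ifndef arch_pte_mkwrite
	#define arch_pte_mkwrite(pte, vma)	pte_mkwrite(pte)
	#endif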

Christophe

> 
> No functional change.
> 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-snps-arc@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-csky@vger.kernel.org
> Cc: linux-hexagon@vger.kernel.org
> Cc: linux-ia64@vger.kernel.org
> Cc: loongarch@lists.linux.dev
> Cc: linux-m68k@lists.linux-m68k.org
> Cc: Michal Simek <monstr@monstr.eu>
> Cc: Dinh Nguyen <dinguyen@kernel.org>
> Cc: linux-mips@vger.kernel.org
> Cc: linux-openrisc@vger.kernel.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-riscv@lists.infradead.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-um@lists.infradead.org
> Cc: xen-devel@lists.xenproject.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> Hi Non-x86 Arch’s,
> 
> x86 has a feature that allows for the creation of a special type of
> writable memory (shadow stack) that is only writable in limited specific
> ways. Previously, changes were proposed to core MM code to teach it to
> decide when to create normally writable memory or the special shadow stack
> writable memory, but David Hildenbrand suggested[0] to change
> pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
> moved into x86 code.
> 
> Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
> changes. So that is why you are seeing some patches out of a big x86
> series pop up in your arch mailing list. There is no functional change.
> After this refactor, the shadow stack series goes on to use the arch
> helpers to push shadow stack memory details inside arch/x86.
> 
> Testing was just 0-day build testing.
> 
> Hopefully that is enough context. Thanks!
> 
> [0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redhat.com/#t
> 
> v6:
>   - New patch
> ---
>   Documentation/mm/arch_pgtable_helpers.rst    |  9 ++++++---
>   arch/alpha/include/asm/pgtable.h             |  6 +++++-
>   arch/arc/include/asm/hugepage.h              |  2 +-
>   arch/arc/include/asm/pgtable-bits-arcv2.h    |  7 ++++++-
>   arch/arm/include/asm/pgtable-3level.h        |  7 ++++++-
>   arch/arm/include/asm/pgtable.h               |  2 +-
>   arch/arm64/include/asm/pgtable.h             |  4 ++--
>   arch/csky/include/asm/pgtable.h              |  2 +-
>   arch/hexagon/include/asm/pgtable.h           |  2 +-
>   arch/ia64/include/asm/pgtable.h              |  2 +-
>   arch/loongarch/include/asm/pgtable.h         |  4 ++--
>   arch/m68k/include/asm/mcf_pgtable.h          |  2 +-
>   arch/m68k/include/asm/motorola_pgtable.h     |  6 +++++-
>   arch/m68k/include/asm/sun3_pgtable.h         |  6 +++++-
>   arch/microblaze/include/asm/pgtable.h        |  2 +-
>   arch/mips/include/asm/pgtable.h              |  6 +++---
>   arch/nios2/include/asm/pgtable.h             |  2 +-
>   arch/openrisc/include/asm/pgtable.h          |  2 +-
>   arch/parisc/include/asm/pgtable.h            |  6 +++++-
>   arch/powerpc/include/asm/book3s/32/pgtable.h |  2 +-
>   arch/powerpc/include/asm/book3s/64/pgtable.h |  4 ++--
>   arch/powerpc/include/asm/nohash/32/pgtable.h |  2 +-
>   arch/powerpc/include/asm/nohash/32/pte-8xx.h |  2 +-
>   arch/powerpc/include/asm/nohash/64/pgtable.h |  2 +-
>   arch/riscv/include/asm/pgtable.h             |  6 +++---
>   arch/s390/include/asm/hugetlb.h              |  4 ++--
>   arch/s390/include/asm/pgtable.h              |  4 ++--
>   arch/sh/include/asm/pgtable_32.h             | 10 ++++++++--
>   arch/sparc/include/asm/pgtable_32.h          |  2 +-
>   arch/sparc/include/asm/pgtable_64.h          |  6 +++---
>   arch/um/include/asm/pgtable.h                |  2 +-
>   arch/x86/include/asm/pgtable.h               |  6 ++++--
>   arch/xtensa/include/asm/pgtable.h            |  2 +-
>   include/asm-generic/hugetlb.h                |  4 ++--
>   include/linux/mm.h                           |  2 +-
>   mm/debug_vm_pgtable.c                        | 16 ++++++++--------
>   mm/huge_memory.c                             |  6 +++---
>   mm/hugetlb.c                                 |  4 ++--
>   mm/memory.c                                  |  4 ++--
>   mm/migrate_device.c                          |  2 +-
>   mm/mprotect.c                                |  2 +-
>   mm/userfaultfd.c                             |  2 +-
>   42 files changed, 106 insertions(+), 69 deletions(-)
> 
> diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
> index 30d9a09f01f4..78ac3ff2fe1d 100644
> --- a/Documentation/mm/arch_pgtable_helpers.rst
> +++ b/Documentation/mm/arch_pgtable_helpers.rst
> @@ -46,7 +46,8 @@ PTE Page Table Helpers
>   +---------------------------+--------------------------------------------------+
>   | pte_mkclean               | Creates a clean PTE                              |
>   +---------------------------+--------------------------------------------------+
> -| pte_mkwrite               | Creates a writable PTE                           |
> +| pte_mkwrite               | Creates a writable PTE of the type specified by  |
> +|                           | the VMA.                                         |
>   +---------------------------+--------------------------------------------------+
>   | pte_wrprotect             | Creates a write protected PTE                    |
>   +---------------------------+--------------------------------------------------+
> @@ -118,7 +119,8 @@ PMD Page Table Helpers
>   +---------------------------+--------------------------------------------------+
>   | pmd_mkclean               | Creates a clean PMD                              |
>   +---------------------------+--------------------------------------------------+
> -| pmd_mkwrite               | Creates a writable PMD                           |
> +| pmd_mkwrite               | Creates a writable PMD of the type specified by  |
> +|                           | the VMA.                                         |
>   +---------------------------+--------------------------------------------------+
>   | pmd_wrprotect             | Creates a write protected PMD                    |
>   +---------------------------+--------------------------------------------------+
> @@ -222,7 +224,8 @@ HugeTLB Page Table Helpers
>   +---------------------------+--------------------------------------------------+
>   | huge_pte_mkdirty          | Creates a dirty HugeTLB                          |
>   +---------------------------+--------------------------------------------------+
> -| huge_pte_mkwrite          | Creates a writable HugeTLB                       |
> +| huge_pte_mkwrite          | Creates a writable HugeTLB of the type specified |
> +|                           | by the VMA.                                      |
>   +---------------------------+--------------------------------------------------+
>   | huge_pte_wrprotect        | Creates a write protected HugeTLB                |
>   +---------------------------+--------------------------------------------------+
> diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
> index ba43cb841d19..fb5d207c2a89 100644
> --- a/arch/alpha/include/asm/pgtable.h
> +++ b/arch/alpha/include/asm/pgtable.h
> @@ -256,9 +256,13 @@ extern inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED;
>   extern inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) |= _PAGE_FOW; return pte; }
>   extern inline pte_t pte_mkclean(pte_t pte)	{ pte_val(pte) &= ~(__DIRTY_BITS); return pte; }
>   extern inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~(__ACCESS_BITS); return pte; }
> -extern inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) &= ~_PAGE_FOW; return pte; }
>   extern inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= __DIRTY_BITS; return pte; }
>   extern inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= __ACCESS_BITS; return pte; }
> +extern inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> +	pte_val(pte) &= ~_PAGE_FOW;
> +	return pte;
> +}
>   
>   /*
>    * The smp_rmb() in the following functions are required to order the load of
> diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
> index 5001b796fb8d..223a96967188 100644
> --- a/arch/arc/include/asm/hugepage.h
> +++ b/arch/arc/include/asm/hugepage.h
> @@ -21,7 +21,7 @@ static inline pmd_t pte_pmd(pte_t pte)
>   }
>   
>   #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
> -#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
> +#define pmd_mkwrite(pmd, vma)	pte_pmd(pte_mkwrite(pmd_pte(pmd), (vma)))
>   #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
>   #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
>   #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
> diff --git a/arch/arc/include/asm/pgtable-bits-arcv2.h b/arch/arc/include/asm/pgtable-bits-arcv2.h
> index 6e9f8ca6d6a1..a5b8bc955015 100644
> --- a/arch/arc/include/asm/pgtable-bits-arcv2.h
> +++ b/arch/arc/include/asm/pgtable-bits-arcv2.h
> @@ -87,7 +87,6 @@
>   
>   PTE_BIT_FUNC(mknotpresent,     &= ~(_PAGE_PRESENT));
>   PTE_BIT_FUNC(wrprotect,	&= ~(_PAGE_WRITE));
> -PTE_BIT_FUNC(mkwrite,	|= (_PAGE_WRITE));
>   PTE_BIT_FUNC(mkclean,	&= ~(_PAGE_DIRTY));
>   PTE_BIT_FUNC(mkdirty,	|= (_PAGE_DIRTY));
>   PTE_BIT_FUNC(mkold,	&= ~(_PAGE_ACCESSED));
> @@ -95,6 +94,12 @@ PTE_BIT_FUNC(mkyoung,	|= (_PAGE_ACCESSED));
>   PTE_BIT_FUNC(mkspecial,	|= (_PAGE_SPECIAL));
>   PTE_BIT_FUNC(mkhuge,	|= (_PAGE_HW_SZ));
>   
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> +	pte_val(pte) |= (_PAGE_WRITE);
> +	return pte;
> +}
> +
>   static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>   {
>   	return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
> diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> index 106049791500..df071a807610 100644
> --- a/arch/arm/include/asm/pgtable-3level.h
> +++ b/arch/arm/include/asm/pgtable-3level.h
> @@ -202,11 +202,16 @@ static inline pmd_t pmd_##fn(pmd_t pmd) { pmd_val(pmd) op; return pmd; }
>   
>   PMD_BIT_FUNC(wrprotect,	|= L_PMD_SECT_RDONLY);
>   PMD_BIT_FUNC(mkold,	&= ~PMD_SECT_AF);
> -PMD_BIT_FUNC(mkwrite,   &= ~L_PMD_SECT_RDONLY);
>   PMD_BIT_FUNC(mkdirty,   |= L_PMD_SECT_DIRTY);
>   PMD_BIT_FUNC(mkclean,   &= ~L_PMD_SECT_DIRTY);
>   PMD_BIT_FUNC(mkyoung,   |= PMD_SECT_AF);
>   
> +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> +{
> > +	pmd_val(pmd) &= ~L_PMD_SECT_RDONLY;
> +	return pmd;
> +}
> +
>   #define pmd_mkhuge(pmd)		(__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
>   
>   #define pmd_pfn(pmd)		(((pmd_val(pmd) & PMD_MASK) & PHYS_MASK) >> PAGE_SHIFT)
> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> index a58ccbb406ad..39ad1ae1308d 100644
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -227,7 +227,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
>   	return set_pte_bit(pte, __pgprot(L_PTE_RDONLY));
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return clear_pte_bit(pte, __pgprot(L_PTE_RDONLY));
>   }
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index cccf8885792e..913bf370f74a 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -187,7 +187,7 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return pte_mkwrite_kernel(pte);
>   }
> @@ -492,7 +492,7 @@ static inline int pmd_trans_huge(pmd_t pmd)
>   #define pmd_cont(pmd)		pte_cont(pmd_pte(pmd))
>   #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
>   #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
> -#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
> +#define pmd_mkwrite(pmd, vma)	pte_pmd(pte_mkwrite(pmd_pte(pmd), (vma)))
>   #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
>   #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
>   #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
> diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
> index d4042495febc..c2f92c991e37 100644
> --- a/arch/csky/include/asm/pgtable.h
> +++ b/arch/csky/include/asm/pgtable.h
> @@ -176,7 +176,7 @@ static inline pte_t pte_mkold(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	pte_val(pte) |= _PAGE_WRITE;
>   	if (pte_val(pte) & _PAGE_MODIFIED)
> diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h
> index 59393613d086..14ab9c789c0e 100644
> --- a/arch/hexagon/include/asm/pgtable.h
> +++ b/arch/hexagon/include/asm/pgtable.h
> @@ -300,7 +300,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
>   }
>   
>   /* pte_mkwrite - mark page as writable */
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	pte_val(pte) |= _PAGE_WRITE;
>   	return pte;
> diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
> index 21c97e31a28a..f879dd626da6 100644
> --- a/arch/ia64/include/asm/pgtable.h
> +++ b/arch/ia64/include/asm/pgtable.h
> @@ -268,7 +268,7 @@ ia64_phys_addr_valid (unsigned long addr)
>    * access rights:
>    */
>   #define pte_wrprotect(pte)	(__pte(pte_val(pte) & ~_PAGE_AR_RW))
> -#define pte_mkwrite(pte)	(__pte(pte_val(pte) | _PAGE_AR_RW))
> +#define pte_mkwrite(pte, vma)	(__pte(pte_val(pte) | _PAGE_AR_RW))
>   #define pte_mkold(pte)		(__pte(pte_val(pte) & ~_PAGE_A))
>   #define pte_mkyoung(pte)	(__pte(pte_val(pte) | _PAGE_A))
>   #define pte_mkclean(pte)	(__pte(pte_val(pte) & ~_PAGE_D))
> diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
> index d28fb9dbec59..ebf645f40298 100644
> --- a/arch/loongarch/include/asm/pgtable.h
> +++ b/arch/loongarch/include/asm/pgtable.h
> @@ -390,7 +390,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	pte_val(pte) |= _PAGE_WRITE;
>   	if (pte_val(pte) & _PAGE_MODIFIED)
> @@ -490,7 +490,7 @@ static inline int pmd_write(pmd_t pmd)
>   	return !!(pmd_val(pmd) & _PAGE_WRITE);
>   }
>   
> -static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>   {
>   	pmd_val(pmd) |= _PAGE_WRITE;
>   	if (pmd_val(pmd) & _PAGE_MODIFIED)
> diff --git a/arch/m68k/include/asm/mcf_pgtable.h b/arch/m68k/include/asm/mcf_pgtable.h
> index 13741c1245e1..37d77e055016 100644
> --- a/arch/m68k/include/asm/mcf_pgtable.h
> +++ b/arch/m68k/include/asm/mcf_pgtable.h
> @@ -211,7 +211,7 @@ static inline pte_t pte_mkold(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	pte_val(pte) |= CF_PAGE_WRITABLE;
>   	return pte;
> diff --git a/arch/m68k/include/asm/motorola_pgtable.h b/arch/m68k/include/asm/motorola_pgtable.h
> index ec0dc19ab834..c4e8eb76286d 100644
> --- a/arch/m68k/include/asm/motorola_pgtable.h
> +++ b/arch/m68k/include/asm/motorola_pgtable.h
> @@ -155,7 +155,6 @@ static inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED;
>   static inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) |= _PAGE_RONLY; return pte; }
>   static inline pte_t pte_mkclean(pte_t pte)	{ pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
>   static inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~_PAGE_ACCESSED; return pte; }
> -static inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) &= ~_PAGE_RONLY; return pte; }
>   static inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
>   static inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
>   static inline pte_t pte_mknocache(pte_t pte)
> @@ -168,6 +167,11 @@ static inline pte_t pte_mkcache(pte_t pte)
>   	pte_val(pte) = (pte_val(pte) & _CACHEMASK040) | m68k_supervisor_cachemode;
>   	return pte;
>   }
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> +	pte_val(pte) &= ~_PAGE_RONLY;
> +	return pte;
> +}
>   
>   #define swapper_pg_dir kernel_pg_dir
>   extern pgd_t kernel_pg_dir[128];
> diff --git a/arch/m68k/include/asm/sun3_pgtable.h b/arch/m68k/include/asm/sun3_pgtable.h
> index e582b0484a55..2a06bea51a1e 100644
> --- a/arch/m68k/include/asm/sun3_pgtable.h
> +++ b/arch/m68k/include/asm/sun3_pgtable.h
> @@ -143,10 +143,14 @@ static inline int pte_young(pte_t pte)		{ return pte_val(pte) & SUN3_PAGE_ACCESS
>   static inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) &= ~SUN3_PAGE_WRITEABLE; return pte; }
>   static inline pte_t pte_mkclean(pte_t pte)	{ pte_val(pte) &= ~SUN3_PAGE_MODIFIED; return pte; }
>   static inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~SUN3_PAGE_ACCESSED; return pte; }
> -static inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_WRITEABLE; return pte; }
>   static inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_MODIFIED; return pte; }
>   static inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_ACCESSED; return pte; }
>   static inline pte_t pte_mknocache(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_NOCACHE; return pte; }
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> +	pte_val(pte) |= SUN3_PAGE_WRITEABLE;
> +	return pte;
> +}
>   // use this version when caches work...
>   //static inline pte_t pte_mkcache(pte_t pte)	{ pte_val(pte) &= SUN3_PAGE_NOCACHE; return pte; }
>   // until then, use:
> diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/include/asm/pgtable.h
> index d1b8272abcd9..5b83e82f8d7e 100644
> --- a/arch/microblaze/include/asm/pgtable.h
> +++ b/arch/microblaze/include/asm/pgtable.h
> @@ -266,7 +266,7 @@ static inline pte_t pte_mkread(pte_t pte) \
>   	{ pte_val(pte) |= _PAGE_USER; return pte; }
>   static inline pte_t pte_mkexec(pte_t pte) \
>   	{ pte_val(pte) |= _PAGE_USER | _PAGE_EXEC; return pte; }
> -static inline pte_t pte_mkwrite(pte_t pte) \
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma) \
>   	{ pte_val(pte) |= _PAGE_RW; return pte; }
>   static inline pte_t pte_mkdirty(pte_t pte) \
>   	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
> diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
> index 791389bf3c12..06efd567144a 100644
> --- a/arch/mips/include/asm/pgtable.h
> +++ b/arch/mips/include/asm/pgtable.h
> @@ -309,7 +309,7 @@ static inline pte_t pte_mkold(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	pte.pte_low |= _PAGE_WRITE;
>   	if (pte.pte_low & _PAGE_MODIFIED) {
> @@ -364,7 +364,7 @@ static inline pte_t pte_mkold(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	pte_val(pte) |= _PAGE_WRITE;
>   	if (pte_val(pte) & _PAGE_MODIFIED)
> @@ -626,7 +626,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
>   	return pmd;
>   }
>   
> -static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>   {
>   	pmd_val(pmd) |= _PAGE_WRITE;
>   	if (pmd_val(pmd) & _PAGE_MODIFIED)
> diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
> index 0f5c2564e9f5..edd458518e0e 100644
> --- a/arch/nios2/include/asm/pgtable.h
> +++ b/arch/nios2/include/asm/pgtable.h
> @@ -129,7 +129,7 @@ static inline pte_t pte_mkold(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	pte_val(pte) |= _PAGE_WRITE;
>   	return pte;
> diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h
> index 3eb9b9555d0d..fd40aec189d1 100644
> --- a/arch/openrisc/include/asm/pgtable.h
> +++ b/arch/openrisc/include/asm/pgtable.h
> @@ -250,7 +250,7 @@ static inline pte_t pte_mkold(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	pte_val(pte) |= _PAGE_WRITE;
>   	return pte;
> diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
> index e2950f5db7c9..89f62137e67f 100644
> --- a/arch/parisc/include/asm/pgtable.h
> +++ b/arch/parisc/include/asm/pgtable.h
> @@ -331,8 +331,12 @@ static inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~_PAGE_ACCESSED; retu
>   static inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) &= ~_PAGE_WRITE; return pte; }
>   static inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
>   static inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
> -static inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) |= _PAGE_WRITE; return pte; }
>   static inline pte_t pte_mkspecial(pte_t pte)	{ pte_val(pte) |= _PAGE_SPECIAL; return pte; }
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> +	pte_val(pte) |= _PAGE_WRITE;
> +	return pte;
> +}
>   
>   /*
>    * Huge pte definitions.
> diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
> index 7bf1fe7297c6..10d9a1d2aca9 100644
> --- a/arch/powerpc/include/asm/book3s/32/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
> @@ -498,7 +498,7 @@ static inline pte_t pte_mkpte(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return __pte(pte_val(pte) | _PAGE_RW);
>   }
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 4acc9690f599..be0636522d36 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -600,7 +600,7 @@ static inline pte_t pte_mkexec(pte_t pte)
>   	return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC));
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	/*
>   	 * write implies read, hence set both
> @@ -1071,7 +1071,7 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
>   #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
>   #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
>   #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
> -#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
> +#define pmd_mkwrite(pmd, vma)	pte_pmd(pte_mkwrite(pmd_pte(pmd), (vma)))
>   
>   #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
>   #define pmd_soft_dirty(pmd)    pte_soft_dirty(pmd_pte(pmd))
> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
> index fec56d965f00..7bfbcb9ba55b 100644
> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
> @@ -171,7 +171,7 @@ void unmap_kernel_page(unsigned long va);
>   	do { pte_update(mm, addr, ptep, ~0, 0, 0); } while (0)
>   
>   #ifndef pte_mkwrite
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return __pte(pte_val(pte) | _PAGE_RW);
>   }
> diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
> index 1a89ebdc3acc..f32450eb270a 100644
> --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
> +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
> @@ -101,7 +101,7 @@ static inline int pte_write(pte_t pte)
>   
>   #define pte_write pte_write
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return __pte(pte_val(pte) & ~_PAGE_RO);
>   }
> diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
> index 287e25864ffa..589009555877 100644
> --- a/arch/powerpc/include/asm/nohash/64/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
> @@ -85,7 +85,7 @@
>   #ifndef __ASSEMBLY__
>   /* pte_clear moved to later in this file */
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return __pte(pte_val(pte) | _PAGE_RW);
>   }
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index d8d8de0ded99..fed1b81fbe07 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -338,7 +338,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
>   
>   /* static inline pte_t pte_mkread(pte_t pte) */
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return __pte(pte_val(pte) | _PAGE_WRITE);
>   }
> @@ -624,9 +624,9 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
>   	return pte_pmd(pte_mkyoung(pmd_pte(pmd)));
>   }
>   
> -static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>   {
> -	return pte_pmd(pte_mkwrite(pmd_pte(pmd)));
> +	return pte_pmd(pte_mkwrite(pmd_pte(pmd), vma));
>   }
>   
>   static inline pmd_t pmd_wrprotect(pmd_t pmd)
> diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
> index ccdbccfde148..558f7eef9c4d 100644
> --- a/arch/s390/include/asm/hugetlb.h
> +++ b/arch/s390/include/asm/hugetlb.h
> @@ -102,9 +102,9 @@ static inline int huge_pte_dirty(pte_t pte)
>   	return pte_dirty(pte);
>   }
>   
> -static inline pte_t huge_pte_mkwrite(pte_t pte)
> +static inline pte_t huge_pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
> -	return pte_mkwrite(pte);
> +	return pte_mkwrite(pte, vma);
>   }
>   
>   static inline pte_t huge_pte_mkdirty(pte_t pte)
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index deeb918cae1d..8f2c743da0eb 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1013,7 +1013,7 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
>   	return pte;
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return pte_mkwrite_kernel(pte);
>   }
> @@ -1499,7 +1499,7 @@ static inline pmd_t pmd_mkwrite_kernel(pmd_t pmd)
>   	return pmd;
>   }
>   
> -static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>   {
>   	return pmd_mkwrite_kernel(pmd);
>   }
> diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h
> index 21952b094650..9f2dcb9eafc8 100644
> --- a/arch/sh/include/asm/pgtable_32.h
> +++ b/arch/sh/include/asm/pgtable_32.h
> @@ -351,6 +351,12 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
>   
>   #define PTE_BIT_FUNC(h,fn,op) \
>   static inline pte_t pte_##fn(pte_t pte) { pte.pte_##h op; return pte; }
> +#define PTE_BIT_FUNC_VMA(h,fn,op) \
> +static inline pte_t pte_##fn(pte_t pte, struct vm_area_struct *vma) \
> +{ \
> +	pte.pte_##h op; \
> +	return pte; \
> +}
>   
>   #ifdef CONFIG_X2TLB
>   /*
> @@ -359,11 +365,11 @@ static inline pte_t pte_##fn(pte_t pte) { pte.pte_##h op; return pte; }
>    * kernel permissions), we attempt to couple them a bit more sanely here.
>    */
>   PTE_BIT_FUNC(high, wrprotect, &= ~(_PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE));
> -PTE_BIT_FUNC(high, mkwrite, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
> +PTE_BIT_FUNC_VMA(high, mkwrite, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
>   PTE_BIT_FUNC(high, mkhuge, |= _PAGE_SZHUGE);
>   #else
>   PTE_BIT_FUNC(low, wrprotect, &= ~_PAGE_RW);
> -PTE_BIT_FUNC(low, mkwrite, |= _PAGE_RW);
> +PTE_BIT_FUNC_VMA(low, mkwrite, |= _PAGE_RW);
>   PTE_BIT_FUNC(low, mkhuge, |= _PAGE_SZHUGE);
>   #endif
>   
> diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
> index d4330e3c57a6..3e8836179456 100644
> --- a/arch/sparc/include/asm/pgtable_32.h
> +++ b/arch/sparc/include/asm/pgtable_32.h
> @@ -241,7 +241,7 @@ static inline pte_t pte_mkold(pte_t pte)
>   	return __pte(pte_val(pte) & ~SRMMU_REF);
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return __pte(pte_val(pte) | SRMMU_WRITE);
>   }
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 2dc8d4641734..c5cd5c03f557 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -466,7 +466,7 @@ static inline pte_t pte_mkclean(pte_t pte)
>   	return __pte(val);
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	unsigned long val = pte_val(pte), mask;
>   
> @@ -756,11 +756,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
>   	return __pmd(pte_val(pte));
>   }
>   
> -static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>   {
>   	pte_t pte = __pte(pmd_val(pmd));
>   
> -	pte = pte_mkwrite(pte);
> +	pte = pte_mkwrite(pte, vma);
>   
>   	return __pmd(pte_val(pte));
>   }
> diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
> index a70d1618eb35..963479c133b7 100644
> --- a/arch/um/include/asm/pgtable.h
> +++ b/arch/um/include/asm/pgtable.h
> @@ -207,7 +207,7 @@ static inline pte_t pte_mkyoung(pte_t pte)
>   	return(pte);
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	if (unlikely(pte_get_bits(pte,  _PAGE_RW)))
>   		return pte;
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 3607f2572f9e..66c514808276 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -369,7 +369,9 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
>   	return pte_set_flags(pte, _PAGE_RW);
>   }
>   
> -static inline pte_t pte_mkwrite(pte_t pte)
> +struct vm_area_struct;
> +
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	return pte_mkwrite_kernel(pte);
>   }
> @@ -470,7 +472,7 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
>   	return pmd_set_flags(pmd, _PAGE_ACCESSED);
>   }
>   
> -static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>   {
>   	return pmd_set_flags(pmd, _PAGE_RW);
>   }
> diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
> index fc7a14884c6c..d72632d9c53c 100644
> --- a/arch/xtensa/include/asm/pgtable.h
> +++ b/arch/xtensa/include/asm/pgtable.h
> @@ -262,7 +262,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
>   	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
>   static inline pte_t pte_mkyoung(pte_t pte)
>   	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
> -static inline pte_t pte_mkwrite(pte_t pte)
> +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   	{ pte_val(pte) |= _PAGE_WRITABLE; return pte; }
>   
>   #define pgprot_noncached(prot) \
> diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
> index d7f6335d3999..e86c830728de 100644
> --- a/include/asm-generic/hugetlb.h
> +++ b/include/asm-generic/hugetlb.h
> @@ -20,9 +20,9 @@ static inline unsigned long huge_pte_dirty(pte_t pte)
>   	return pte_dirty(pte);
>   }
>   
> -static inline pte_t huge_pte_mkwrite(pte_t pte)
> +static inline pte_t huge_pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
> -	return pte_mkwrite(pte);
> +	return pte_mkwrite(pte, vma);
>   }
>   
>   #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1f79667824eb..af652444fbba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1163,7 +1163,7 @@ void free_compound_page(struct page *page);
>   static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   {
>   	if (likely(vma->vm_flags & VM_WRITE))
> -		pte = pte_mkwrite(pte);
> +		pte = pte_mkwrite(pte, vma);
>   	return pte;
>   }
>   
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index af59cc7bd307..7bc5592900bc 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -109,10 +109,10 @@ static void __init pte_basic_tests(struct pgtable_debug_args *args, int idx)
>   	WARN_ON(!pte_same(pte, pte));
>   	WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte))));
>   	WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte))));
> -	WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte))));
> +	WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte), args->vma)));
>   	WARN_ON(pte_young(pte_mkold(pte_mkyoung(pte))));
>   	WARN_ON(pte_dirty(pte_mkclean(pte_mkdirty(pte))));
> -	WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte))));
> +	WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte, args->vma))));
>   	WARN_ON(pte_dirty(pte_wrprotect(pte_mkclean(pte))));
>   	WARN_ON(!pte_dirty(pte_wrprotect(pte_mkdirty(pte))));
>   }
> @@ -153,7 +153,7 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
>   	pte = pte_mkclean(pte);
>   	set_pte_at(args->mm, args->vaddr, args->ptep, pte);
>   	flush_dcache_page(page);
> -	pte = pte_mkwrite(pte);
> +	pte = pte_mkwrite(pte, args->vma);
>   	pte = pte_mkdirty(pte);
>   	ptep_set_access_flags(args->vma, args->vaddr, args->ptep, pte, 1);
>   	pte = ptep_get(args->ptep);
> @@ -199,10 +199,10 @@ static void __init pmd_basic_tests(struct pgtable_debug_args *args, int idx)
>   	WARN_ON(!pmd_same(pmd, pmd));
>   	WARN_ON(!pmd_young(pmd_mkyoung(pmd_mkold(pmd))));
>   	WARN_ON(!pmd_dirty(pmd_mkdirty(pmd_mkclean(pmd))));
> -	WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd))));
> +	WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd), args->vma)));
>   	WARN_ON(pmd_young(pmd_mkold(pmd_mkyoung(pmd))));
>   	WARN_ON(pmd_dirty(pmd_mkclean(pmd_mkdirty(pmd))));
> -	WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd))));
> +	WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd, args->vma))));
>   	WARN_ON(pmd_dirty(pmd_wrprotect(pmd_mkclean(pmd))));
>   	WARN_ON(!pmd_dirty(pmd_wrprotect(pmd_mkdirty(pmd))));
>   	/*
> @@ -253,7 +253,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
>   	pmd = pmd_mkclean(pmd);
>   	set_pmd_at(args->mm, vaddr, args->pmdp, pmd);
>   	flush_dcache_page(page);
> -	pmd = pmd_mkwrite(pmd);
> +	pmd = pmd_mkwrite(pmd, args->vma);
>   	pmd = pmd_mkdirty(pmd);
>   	pmdp_set_access_flags(args->vma, vaddr, args->pmdp, pmd, 1);
>   	pmd = READ_ONCE(*args->pmdp);
> @@ -928,8 +928,8 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args)
>   	pte = mk_huge_pte(page, args->page_prot);
>   
>   	WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte)));
> -	WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte))));
> -	WARN_ON(huge_pte_write(huge_pte_wrprotect(huge_pte_mkwrite(pte))));
> +	WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte), args->vma)));
> +	WARN_ON(huge_pte_write(huge_pte_wrprotect(huge_pte_mkwrite(pte, args->vma))));
>   
>   #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
>   	pte = pfn_pte(args->fixed_pmd_pfn, args->page_prot);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4fc43859e59a..aaf815838144 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -555,7 +555,7 @@ __setup("transparent_hugepage=", setup_transparent_hugepage);
>   pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>   {
>   	if (likely(vma->vm_flags & VM_WRITE))
> -		pmd = pmd_mkwrite(pmd);
> +		pmd = pmd_mkwrite(pmd, vma);
>   	return pmd;
>   }
>   
> @@ -1580,7 +1580,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>   	pmd = pmd_mkyoung(pmd);
>   	if (writable)
> -		pmd = pmd_mkwrite(pmd);
> +		pmd = pmd_mkwrite(pmd, vma);
>   	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
>   	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
>   	spin_unlock(vmf->ptl);
> @@ -1926,7 +1926,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   	/* See change_pte_range(). */
>   	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
>   	    can_change_pmd_writable(vma, addr, entry))
> -		entry = pmd_mkwrite(entry);
> +		entry = pmd_mkwrite(entry, vma);
>   
>   	ret = HPAGE_PMD_NR;
>   	set_pmd_at(mm, addr, pmd, entry);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 07abcb6eb203..6af471bdcff8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4900,7 +4900,7 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
>   
>   	if (writable) {
>   		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
> -					 vma->vm_page_prot)));
> +					 vma->vm_page_prot)), vma);
>   	} else {
>   		entry = huge_pte_wrprotect(mk_huge_pte(page,
>   					   vma->vm_page_prot));
> @@ -4916,7 +4916,7 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma,
>   {
>   	pte_t entry;
>   
> -	entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(ptep)));
> +	entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(ptep)), vma);
>   	if (huge_ptep_set_access_flags(vma, address, ptep, entry, 1))
>   		update_mmu_cache(vma, address, ptep);
>   }
> diff --git a/mm/memory.c b/mm/memory.c
> index f456f3b5049c..d0972d2d6f36 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4067,7 +4067,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>   	entry = mk_pte(&folio->page, vma->vm_page_prot);
>   	entry = pte_sw_mkyoung(entry);
>   	if (vma->vm_flags & VM_WRITE)
> -		entry = pte_mkwrite(pte_mkdirty(entry));
> +		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>   
>   	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>   			&vmf->ptl);
> @@ -4755,7 +4755,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>   	pte = pte_modify(old_pte, vma->vm_page_prot);
>   	pte = pte_mkyoung(pte);
>   	if (writable)
> -		pte = pte_mkwrite(pte);
> +		pte = pte_mkwrite(pte, vma);
>   	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
>   	update_mmu_cache(vma, vmf->address, vmf->pte);
>   	pte_unmap_unlock(vmf->pte, vmf->ptl);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index d30c9de60b0d..df3f5e9d5f76 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -646,7 +646,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>   		}
>   		entry = mk_pte(page, vma->vm_page_prot);
>   		if (vma->vm_flags & VM_WRITE)
> -			entry = pte_mkwrite(pte_mkdirty(entry));
> +			entry = pte_mkwrite(pte_mkdirty(entry), vma);
>   	}
>   
>   	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1d4843c97c2a..381163a41e88 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -198,7 +198,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>   			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>   			    !pte_write(ptent) &&
>   			    can_change_pte_writable(vma, addr, ptent))
> -				ptent = pte_mkwrite(ptent);
> +				ptent = pte_mkwrite(ptent, vma);
>   
>   			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>   			if (pte_needs_flush(oldpte, ptent))
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 53c3d916ff66..3db6f87c0aca 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -75,7 +75,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>   	if (page_in_cache && !vm_shared)
>   		writable = false;
>   	if (writable)
> -		_dst_pte = pte_mkwrite(_dst_pte);
> +		_dst_pte = pte_mkwrite(_dst_pte, dst_vma);
>   	if (wp_copy)
>   		_dst_pte = pte_mkuffd_wp(_dst_pte);
>   

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 13/41] mm: Make pte_mkwrite() take a VMA
  2023-03-01  7:03   ` Christophe Leroy
@ 2023-03-01  8:16     ` David Hildenbrand
  0 siblings, 0 replies; 159+ messages in thread
From: David Hildenbrand @ 2023-03-01  8:16 UTC (permalink / raw)
  To: Christophe Leroy, Rick Edgecombe, x86, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, linux-kernel, linux-doc, linux-mm,
	linux-arch, linux-api, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	debug
  Cc: linux-alpha, linux-snps-arc, linux-arm-kernel, linux-csky,
	linux-hexagon, linux-ia64, loongarch, linux-m68k, Michal Simek,
	Dinh Nguyen, linux-mips, linux-openrisc, linux-parisc,
	linuxppc-dev, linux-riscv, linux-s390, linux-sh, sparclinux,
	linux-um, xen-devel

On 01.03.23 08:03, Christophe Leroy wrote:
> 
> 
> On 27/02/2023 at 23:29, Rick Edgecombe wrote:
>> The x86 Control-flow Enforcement Technology (CET) feature includes a new
>> type of memory called shadow stack. This shadow stack memory has some
>> unusual properties, which requires some core mm changes to function
>> properly.
>>
>> One of these unusual properties is that shadow stack memory is writable,
>> but only in limited ways. These limits are applied via a specific PTE
>> bit combination. Nevertheless, the memory is writable, and core mm code
>> will need to apply the writable permissions in the typical paths that
>> call pte_mkwrite().
>>
>> In addition to VM_WRITE, the shadow stack VMA's will have a flag denoting
>> that they are special shadow stack flavor of writable memory. So make
>> pte_mkwrite() take a VMA, so that the x86 implementation of it can know to
>> create regular writable memory or shadow stack memory.
>>
>> Apply the same changes for pmd_mkwrite() and huge_pte_mkwrite().
> 
> I'm not sure it is a good idea to add a second argument to
> pte_mkwrite(). All the other pte_mkxxxx() helpers take only a pte.

We touched on this in previous revisions and so far there was no strong 
push back. This turned out to be cleaner and easier than the 
alternatives we evaluated.

pte_modify(), for example, takes another argument. Sure, we could try 
thinking about passing something else than a VMA to identify the 
writability type, but I am not convinced that will look particularly better.

> 
> I think you should do the same as commit d9ed9faac283 ("mm: add new
> arch_make_huge_pte() method for tile support")
> 

We already have 3 architectures intending to support shadow stacks in 
one way or the other. Replacing all pte_mkwrite() with 
arch_pte_mkwrite() doesn't sound particularly appealing to me.
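
To make the trade-off concrete, a sketch of the two shapes (the second
one is hypothetical, modeled on arch_make_huge_pte()):

  /* this series: the helper itself takes the VMA */
  static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);

  /* alternative: keep pte_mkwrite(pte) and layer a per-arch hook on top */
  #ifndef arch_pte_mkwrite
  #define arch_pte_mkwrite(pte, vma)  (pte)  /* generic default: no-op */
  #endif

With the first form, the writability decision lives in one helper per
arch; with the second, generic callers would have to do
arch_pte_mkwrite(pte_mkwrite(pte), vma) at every site.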


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-02-27 22:29 ` [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description Rick Edgecombe
@ 2023-03-01 14:21   ` Szabolcs Nagy
  2023-03-01 14:38     ` Szabolcs Nagy
  2023-03-01 18:07     ` Edgecombe, Rick P
  0 siblings, 2 replies; 159+ messages in thread
From: Szabolcs Nagy @ 2023-03-01 14:21 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Mark Brown
  Cc: Yu-cheng Yu

The 02/27/2023 14:29, Rick Edgecombe wrote:
> +Application Enabling
> +====================
> +
> +An application's CET capability is marked in its ELF note and can be verified
> +from readelf/llvm-readelf output::
> +
> +    readelf -n <application> | grep -a SHSTK
> +        properties: x86 feature: SHSTK
> +
> +The kernel does not process these application markers directly. Applications
> +or loaders must enable CET features using the interface described in section 4.
> +Typically this would be done in the dynamic loader or static runtime objects,
> +as is the case in GLIBC.

Note that this has to be an early decision in libc (ld.so or static
exe start code), which will be difficult to hook into system wide
security policy settings. (e.g. to force shstk on marked binaries.)

From a userspace POV I'd prefer if a static exe did not have to parse
its own ELF notes (i.e. the kernel enabled shstk based on the marking).
But I realize that if complex shstk enable/disable decisions are needed,
those are better made in userspace, and if the kernel decision can be
overridden anyway then it might as well all be in userspace.

> +Enabling arch_prctl()'s
> +=======================
> +
> +ELF features should be enabled by the loader using the arch_prctl() calls
> +below. They are only supported in 64-bit user applications.
> +
> +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
> +    Enable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
> +    Disable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
> +    Lock in features at their current enabled or disabled status. 'features'
> +    is a mask of all features to lock. All bits set are processed, unset bits
> +    are ignored. The mask is ORed with the existing value. So any feature bits
> +    set here cannot be enabled or disabled afterwards.

The multi-thread behaviour should be documented here: Only the
current thread is affected. So an application can only change the
setting while single-threaded which is only guaranteed before any
user code is executed. Using the prctl later is complicated, and most
C runtimes would not want to do that (async-signalling all threads and
calling the prctl from the handler).

In particular these interfaces are not suitable to turn shstk off
at dlopen time when an unmarked binary is loaded. Or any other
late shstk policy change will not work, so as far as i can see
the "permissive" mode in glibc does not work.

Does the main thread have shadow stack allocated before shstk is
enabled? is the shadow stack freed when it is disabled? (e.g.
what would the instruction reading the SSP do in disabled state?)

> +Proc Status
> +===========
> +To check if an application is actually running with shadow stack, the
> +user can read the /proc/$PID/status. It will report "wrss" or "shstk"
> +depending on what is enabled. The lines look like this::
> +
> +    x86_Thread_features: shstk wrss
> +    x86_Thread_features_locked: shstk wrss

Presumably /proc/$TID/status and /proc/$PID/task/$TID/status also
show the setting, and it is only valid for the specific thread (not the
entire process). So I would note that this is for one thread only.

> +Implementation of the Shadow Stack
> +==================================
> +
> +Shadow Stack Size
> +-----------------
> +
> +A task's shadow stack is allocated from memory to a fixed size of
> +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
> +the maximum size of the normal stack, but capped to 4 GB. However,
> +a compat-mode application's address space is smaller, so each of its
> +threads' shadow stacks is sized MIN(1/4 RLIMIT_STACK, 4 GB).

This policy tries to handle all threads with the same shadow stack
size logic, which has limitations; a sketch of the current policy
follows this list. I think it should be improved (otherwise some
applications will have to turn shstk off):

- RLIMIT_STACK is not an upper bound for the main thread stack size
  (rlimit can increase/decrease dynamically).
- RLIMIT_STACK only applies to the main thread, so it is not an upper
  bound for non-main thread stacks.
- i.e. stack size >> startup RLIMIT_STACK is possible and then shadow
  stack can overflow.
- stack size << startup RLIMIT_STACK is also possible and then VA
  space is wasted (can lead to OOM with strict memory overcommit).
- clone3 tells the kernel the thread stack size so that should be
  used instead of RLIMIT_STACK. (clone does not though.)
- I think it's better to have a new limit specifically for shadow
  stack size (which by default can be RLIMIT_STACK) so userspace
  can adjust it if needed (another reason is that stack size is
  not always a good indicator of max call depth).
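
To make the critique concrete, the quoted sizing policy amounts to
something like the following (the function name is made up; the kernel
helpers used are standard ones):

  static unsigned long shstk_default_size(void)
  {
          unsigned long size = rlimit(RLIMIT_STACK);

          /* compat tasks have a smaller address space, so a smaller cap */
          if (in_ia32_syscall())
                  size /= 4;

          return min_t(unsigned long, size, SZ_4G);
  }

Note this samples RLIMIT_STACK once, at allocation time, which is
exactly why the dynamic-rlimit and clone3 stack-size points above
matter.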

> +Signal
> +------
> +
> +By default, the main program and its signal handlers use the same shadow
> +stack. Because the shadow stack stores only return addresses, a large
> +shadow stack covers the condition that both the program stack and the
> +signal alternate stack run out.

What does "by default" mean here? Is there a case when the signal handler
is not entered with SSP set to the handling thread's shadow stack?

> +When a signal happens, the old pre-signal state is pushed on the stack. When
> +shadow stack is enabled, the shadow stack specific state is pushed onto the
> +shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
> +in a special format with bit 63 set. On sigreturn this old SSP token is
> +verified and restored by the kernel. The kernel will also push the normal
> +restorer address to the shadow stack to help userspace avoid a shadow stack
> +violation on the sigreturn path that goes through the restorer.

The kernel pushes onto the shadow stack on signal entry, so shadow
stack overflow cannot be handled. Please document this as a
non-recoverable failure.

I think it can be made recoverable if signals with an alternate stack
run on a different shadow stack, and the top of the thread shadow stack
is just clobbered, instead of being pushed to, in the overflow case.
Then a longjmp out can be made to work (common in stack overflow
handling), and reliable crash reporting from the signal handler works
(also common).

Does SSP get stored into the sigcontext struct somewhere?
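
For concreteness, my reading of the quoted token format is something
like this (a sketch only; the exact layout is not spelled out above):

  /* signal entry: push the pre-signal SSP, tagged with bit 63 */
  u64 token = old_ssp | (1ULL << 63);

  /* sigreturn: reject anything that is not a tagged token */
  if (!(token & (1ULL << 63)))
          return -EINVAL;
  new_ssp = token & ~(1ULL << 63);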

> +Fork
> +----
> +
> +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
> +to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
> +shadow access triggers a page fault with the shadow stack access bit set
> +in the page fault error code.
> +
> +When a task forks a child, its shadow stack PTEs are copied and both the
> +parent's and the child's shadow stack PTEs are cleared of the dirty bit.
> +Upon the next shadow stack access, the resulting shadow stack page fault
> +is handled by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new shadow stack
> +for the new thread. New shadow stack's behave like mmap() with respect to
> +ASLR behavior.

Please document the shadow stack lifetimes here:

I think thread exit unmaps the shadow stack, while vfork shares the
shadow stack with the parent, so exit does not unmap it in that case.

I think the map_shadow_stack syscall should be mentioned in this
document too.

ABI for initial shadow stack entries:

If one wants to scan the shadow stack (e.g. for a fast backtrace), how
is the end detected? Is it useful to put an invalid value (-1) there?
(affects map_shadow_stack syscall too).

thanks.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-01 14:21   ` Szabolcs Nagy
@ 2023-03-01 14:38     ` Szabolcs Nagy
  2023-03-01 18:07     ` Edgecombe, Rick P
  1 sibling, 0 replies; 159+ messages in thread
From: Szabolcs Nagy @ 2023-03-01 14:38 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Mark Brown
  Cc: Yu-cheng Yu, nd

The 03/01/2023 14:21, Szabolcs Nagy wrote:
>...
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

sorry,
ignore this.



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 07/41] x86: Move control protection handler to separate file
  2023-02-27 22:29 ` [PATCH v7 07/41] x86: Move control protection handler to separate file Rick Edgecombe
@ 2023-03-01 15:38   ` Borislav Petkov
  0 siblings, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-01 15:38 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:23PM -0800, Rick Edgecombe wrote:
> Subject: Re: [PATCH v7 07/41] x86: Move control protection handler to separate file

x86/traps: ...

but that can be fixed while committing.

> Today the control protection handler is defined in traps.c and used only
> for the kernel IBT feature. To reduce ifdeffery, move it to its own file.
> In future patches, functionality will be added to make this handler also
> handle user shadow stack faults. So name the file cet.c.
> 
> No functional change.

...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler
  2023-02-27 22:29 ` [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler Rick Edgecombe
@ 2023-03-01 18:06   ` Borislav Petkov
  2023-03-01 18:14     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-01 18:06 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu, Michael Kerrisk

On Mon, Feb 27, 2023 at 02:29:24PM -0800, Rick Edgecombe wrote:
> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> index 7ad22b705b64..cc10d8be9d74 100644
> --- a/arch/x86/kernel/cet.c
> +++ b/arch/x86/kernel/cet.c
> @@ -4,10 +4,6 @@
>  #include <asm/bugs.h>
>  #include <asm/traps.h>
>  
> -static __ro_after_init bool ibt_fatal = true;
> -
> -extern void ibt_selftest_ip(void); /* code label defined in asm below */
> -
>  enum cp_error_code {
>  	CP_EC        = (1 << 15) - 1,

From a previous review:

"That looks like a mask, so

        CP_EC_MASK

I guess."

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-01 14:21   ` Szabolcs Nagy
  2023-03-01 14:38     ` Szabolcs Nagy
@ 2023-03-01 18:07     ` Edgecombe, Rick P
  2023-03-01 18:32       ` Edgecombe, Rick P
  2023-03-02 16:14       ` szabolcs.nagy
  1 sibling, 2 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-01 18:07 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, szabolcs.nagy, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, broonie,
	linux-arch, kcc, bp, oleg, hjl.tools, pavel, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	Yang, Weijiang, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: Yu, Yu-cheng

On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote:
> The 02/27/2023 14:29, Rick Edgecombe wrote:
> > +Application Enabling
> > +====================
> > +
> > +An application's CET capability is marked in its ELF note and can
> > be verified
> > +from readelf/llvm-readelf output::
> > +
> > +    readelf -n <application> | grep -a SHSTK
> > +        properties: x86 feature: SHSTK
> > +
> > +The kernel does not process these application markers directly.
> > +Applications or loaders must enable CET features using the interface
> > +described in section 4. Typically this would be done in the dynamic
> > +loader or static runtime objects, as is the case in GLIBC.
> 
> Note that this has to be an early decision in libc (ld.so or static
> exe start code), which will be difficult to hook into system wide
> security policy settings. (e.g. to force shstk on marked binaries.)

In the eager enabling (by the kernel) scenario, how is this improved?
The loader has to have the option to disable the shadow stack if
enabling conditions are not met, so it still has to trust userspace to
not do that. Did you have any more specifics on how the policy would
work?

> 
> From a userspace POV I'd prefer if a static exe did not have to parse
> its own ELF notes (i.e. the kernel enabled shstk based on the marking).

This is actually exactly what happens in the glibc patches. My
understanding was that it had already been discussed amongst glibc folks.

> But I realize that if complex shstk enable/disable decisions are needed,
> those are better made in userspace, and if the kernel decision can be
> overridden anyway then it might as well all be in userspace.

A complication with shadow stack in general is that it has to be
enabled very early. Otherwise when the program returns from main(), it
will get a shadow stack underflow. The old logic in this series would
enable shadow stack if the loader had the SHSTK bit (by parsing the
header in the kernel). Then later if the conditions were not met to use
shadow stack, the loader would call into the kernel again to disable
shadow stack.

One problem (there were several in this area) with the eager
enabling was that the kernel ended up mapping, briefly using, and then
unmapping the shadow stack in the case of an executable not supporting
shadow stack. What the glibc patches do today is pretty much the same
behavior as before, just with the header parsing moved into userspace.
I think letting the component with the most information make the
decision leaves open the best opportunity for making it efficient. I
wonder whether it would be possible for glibc to enable it later than it
currently does in the patches and improve the dynamic loader case, but
I don't know that code well enough.

> 
> > +Enabling arch_prctl()'s
> > +=======================
> > +
> > +Elf features should be enabled by the loader using the below
> > arch_prctl's. They
> > +are only supported in 64 bit user applications.
> > +
> > +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
> > +    Enable a single feature specified in 'feature'. Can only
> > operate on
> > +    one feature at a time.
> > +
> > +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
> > +    Disable a single feature specified in 'feature'. Can only
> > operate on
> > +    one feature at a time.
> > +
> > +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
> > +    Lock in features at their current enabled or disabled status.
> > 'features'
> > +    is a mask of all features to lock. All bits set are processed,
> > unset bits
> > +    are ignored. The mask is ORed with the existing value. So any
> > feature bits
> > +    set here cannot be enabled or disabled afterwards.
> 
> The multi-thread behaviour should be documented here: Only the
> current thread is affected. So an application can only change the
> setting while single-threaded which is only guaranteed before any
> user code is executed. Later using the prctl is complicated and
> most c runtimes would not want to do that (async signalling all
> threads and prctl from the handler).

It is kind of covered in the fork() docs, but yes there should probably
be a reference here too.
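
For concreteness, a minimal sketch of the early enable-then-lock
sequence a loader might perform with these arch_prctl()s. The constant
values below are as defined in this series' uapi header, and a raw
syscall(2) invocation is assumed since no libc wrapper exists yet:

    #include <sys/syscall.h>
    #include <unistd.h>

    #define ARCH_SHSTK_ENABLE	0x5001		/* from this series */
    #define ARCH_SHSTK_LOCK	0x5003		/* from this series */
    #define ARCH_SHSTK_SHSTK	(1ULL << 0)

    /* Early in ld.so or static start code, while still single-threaded: */
    syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
    syscall(SYS_arch_prctl, ARCH_SHSTK_LOCK, ARCH_SHSTK_SHSTK);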

> 
> In particular these interfaces are not suitable to turn shstk off
> at dlopen time when an unmarked binary is loaded. Or any other
> late shstk policy change will not work, so as far as i can see
> the "permissive" mode in glibc does not work.

Yes, that is correct. Glibc permissive mode does not fully work. There
are some ongoing discussions on how to make it work. Some options don't
require kernel changes, and some do. Making it per-thread is
complicated for x86 because when shadow stack is off, some of the
special shadow stack instructions will cause a #UD exception. Glibc (and
probably other apps in the future) could be in the middle of executing
these instructions when dlopen() was called. So if there was a process
wide disable option it would have to be resilient to these #UDs. And
even then the code that used them could not be guaranteed to continue
to work. For example, if you call the gcc intrinsic _get_ssp() when
shadow stack is enabled it could be expected to point to the shadow
stack in most cases. If shadow stack gets disabled, rdssp will return
0, in which case reading the shadow stack would segfault. So the all-
process disabling solution can't be fully robust when there is any
shadow stack specific logic.

The other option discussed was creating trampolines between the
linked legacy objects that could know to tell the kernel to disable
shadow stack if needed. In this case, shadow stack is disabled for each
thread as it calls into the DSO. It's not clear if there can be enough
information gleaned from the legacy binaries to know when to generate
the trampolines in exotic cases.

A third option might be to have some synchronization between the kernel
and userspace around anything using the shadow stack instructions. But
there is not much detail filled in there.

So in summary, it's not as simple as making the disable per-process.

> 
> Does the main thread have shadow stack allocated before shstk is
> enabled?

No.

> is the shadow stack freed when it is disabled? (e.g.
> what would the instruction reading the SSP do in disabled state?)

Yes.

When shadow stack is disabled rdssp is a NOP, the intrinsic returns
NULL.
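
To illustrate with gcc's -mshstk intrinsic, a sketch of why callers
must handle the disabled case (inspect_shadow_stack() is a hypothetical
helper):

    #include <immintrin.h>	/* _get_ssp(); build with -mshstk */

    unsigned long long ssp = _get_ssp();

    /*
     * With shadow stack disabled, RDSSP is a NOP and the intrinsic's
     * zero-initialized destination is returned, i.e. NULL here.
     * Dereferencing it would segfault.
     */
    if (ssp)
            inspect_shadow_stack((void *)ssp);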

> 
> > +Proc Status
> > +===========
> > +To check if an application is actually running with shadow stack,
> > the
> > +user can read the /proc/$PID/status. It will report "wrss" or
> > "shstk"
> > +depending on what is enabled. The lines look like this::
> > +
> > +    x86_Thread_features: shstk wrss
> > +    x86_Thread_features_locked: shstk wrss
> 
> Presumably /proc/$TID/status and /proc/$PID/task/$TID/status also
> show the setting and are only valid for the specific thread (not the
> entire process). So i would note that this is for one thread only.

Since enabling/disabling is per-thread, and the field is called
"x86_Thread_features" I thought it was clear. It's easy to add some
more detail though.

> 
> > +Implementation of the Shadow Stack
> > +==================================
> > +
> > +Shadow Stack Size
> > +-----------------
> > +
> > +A task's shadow stack is allocated from memory to a fixed size of
> > +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is
> > allocated to
> > +the maximum size of the normal stack, but capped to 4 GB. However,
> > +a compat-mode application's address space is smaller, each of its
> > thread's
> > +shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).
> 
> This policy tries to handle all threads with the same shadow stack
> size logic, which has limitations. I think it should be improved
> (otherwise some applications will have to turn shstk off):
> 
> - RLIMIT_STACK is not an upper bound for the main thread stack size
>   (rlimit can increase/decrease dynamically).
> - RLIMIT_STACK only applies to the main thread, so it is not an upper
>   bound for non-main thread stacks.
> - i.e. stack size >> startup RLIMIT_STACK is possible and then shadow
>   stack can overflow.
> - stack size << startup RLIMIT_STACK is also possible and then VA
>   space is wasted (can lead to OOM with strict memory overcommit).
> - clone3 tells the kernel the thread stack size so that should be
>   used instead of RLIMIT_STACK. (clone does not though.)

This actually happens already. I can update the docs.

> - I think it's better to have a new limit specifically for shadow
>   stack size (which by default can be RLIMIT_STACK) so userspace
>   can adjust it if needed (another reason is that stack size is
>   not always a good indicator of max call depth).

Hmm, yea. This seems like a good idea, but I don't see why it can't be
a follow on. The series is quite big just to get the basics. I have
tried to save some of the enhancements (like alt shadow stack) for the
future.
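
Restating the sizing policy quoted above as a sketch (not the kernel's
actual code; SZ_4G is the kernel's 4 GB size constant):

    static unsigned long shstk_size(unsigned long rlimit_stack, bool compat)
    {
            unsigned long size = compat ? rlimit_stack / 4 : rlimit_stack;

            return size > SZ_4G ? SZ_4G : size;
    }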

> 
> > +Signal
> > +------
> > +
> > +By default, the main program and its signal handlers use the same
> > shadow
> > +stack. Because the shadow stack stores only return addresses, a
> > large
> > +shadow stack covers the condition that both the program stack and
> > the
> > +signal alternate stack run out.
> 
> What does "by default" mean here? Is there a case when the signal
> handler is not entered with SSP set to the handling thread's shadow
> stack?

Ah, yea, that could be updated. It is in reference to an alt shadow
stack implementation that was held for later.

> 
> > +When a signal happens, the old pre-signal state is pushed on the
> > stack. When
> > +shadow stack is enabled, the shadow stack specific state is pushed
> > onto the
> > +shadow stack. Today this is only the old SSP (shadow stack
> > pointer), pushed
> > +in a special format with bit 63 set. On sigreturn this old SSP
> > token is
> > +verified and restored by the kernel. The kernel will also push the
> > normal
> > +restorer address to the shadow stack to help userspace avoid a
> > shadow stack
> > +violation on the sigreturn path that goes through the restorer.
> 
> The kernel pushes on the shadow stack on signal entry so shadow stack
> overflow cannot be handled. Please document this as non-recoverable
> failure.

It doesn't hurt to call it out. Please see the below link for future
plans to handle this scenario (alt shadow stack).
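
For reference, a sketch of what the quoted text describes the shadow
stack holding at signal delivery (inferred from the doc, not verified
against the code):

    |  ...                   |
    |  pre-signal SSP token  |  <- old SSP, bit 63 set, checked at sigreturn
    |  restorer address      |  <- satisfies the RET through the restorer
    +------------------------+  <- SSP while the handler runs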

> 
> I think it can be made recoverable if signals with alternate stack
> run
> on a different shadow stack. And the top of the thread shadow stack
> is
> just corrupted instead of pushed in the overflow case. Then longjmp
> out
> can be made to work (common in stack overflow handling cases), and
> reliable crash report from the signal handler works (also common).
> 
> Does SSP get stored into the sigcontext struct somewhere?

No, it's pushed to the shadow stack only. See the v2 coverletter for the
discussion of the design and reasoning:

https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

> 
> > +Fork
> > +----
> > +
> > +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are
> > required
> > +to be read-only and dirty. When a shadow stack PTE is not RO and
> > dirty, a
> > +shadow access triggers a page fault with the shadow stack access
> > bit set
> > +in the page fault error code.
> > +
> > +When a task forks a child, its shadow stack PTEs are copied and
> > both the
> > +parent's and the child's shadow stack PTEs are cleared of the
> > dirty bit.
> > +Upon the next shadow stack access, the resulting shadow stack page
> > fault
> > +is handled by page copy/re-use.
> > +
> > +When a pthread child is created, the kernel allocates a new shadow
> > stack
> > +for the new thread. New shadow stack's behave like mmap() with
> > respect to
> > +ASLR behavior.
> 
> Please document the shadow stack lifetimes here:
> 
> I think thread exit unmaps shadow stack and vfork shares shadow stack
> with parent so exit does not unmap.

Sure, this can be updated.

> 
> I think the map_shadow_stack syscall should be mentioned in this
> document too.

There is a man page prepared for this. I plan to update the docs to
reference it when it exists and not duplicate the text. There can be a
blurb for the time being but it would be short lived.

> If one wants to scan the shadow stack how to detect the end (e.g.
> fast
> backtrace)? Is it useful to put an invalid value (-1) there?
> (affects map_shadow_stack syscall too).

Interesting idea. I think it's probably not a breaking ABI change if we
wanted to add it later.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler
  2023-03-01 18:06   ` Borislav Petkov
@ 2023-03-01 18:14     ` Edgecombe, Rick P
  2023-03-01 18:37       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-01 18:14 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, mtk.manpages, corbet, linux-kernel, linux-api, gorcunov

On Wed, 2023-03-01 at 19:06 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:24PM -0800, Rick Edgecombe wrote:
> > diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> > index 7ad22b705b64..cc10d8be9d74 100644
> > --- a/arch/x86/kernel/cet.c
> > +++ b/arch/x86/kernel/cet.c
> > @@ -4,10 +4,6 @@
> >   #include <asm/bugs.h>
> >   #include <asm/traps.h>
> >   
> > -static __ro_after_init bool ibt_fatal = true;
> > -
> > -extern void ibt_selftest_ip(void); /* code label defined in asm
> > below */
> > -
> >   enum cp_error_code {
> >        CP_EC        = (1 << 15) - 1,
> 
> From a previous review:
> 
> "That looks like a mask, so
> 
>         CP_EC_MASK
> 
> I guess."

It is from the existing code:

https://lore.kernel.org/lkml/393a03d063dee5831af93ca67636df75a76481c3.camel@intel.com/#t

The rename certainly doesn't belong in the code move patch, but I took
the previous discussion to mean it didn't belong in this patch either.
Do you want me to add it to this one or a separate one?
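
For context, a sketch of the field in question — the low 15 bits of the
#CP error code hold the violation reason, which is why the CP_EC_MASK
name was suggested (illustrative, not the actual patch):

    enum cp_error_code {
            CP_EC	= (1 << 15) - 1,	/* bits 14:0: violation reason */
            /* ... */
    };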

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-01 18:07     ` Edgecombe, Rick P
@ 2023-03-01 18:32       ` Edgecombe, Rick P
  2023-03-02 16:34         ` szabolcs.nagy
  2023-03-02 16:14       ` szabolcs.nagy
  1 sibling, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-01 18:32 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, szabolcs.nagy, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, broonie,
	linux-arch, kcc, bp, oleg, hjl.tools, pavel, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	Yang, Weijiang, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: Yu, Yu-cheng

On Wed, 2023-03-01 at 10:07 -0800, Rick Edgecombe wrote:
> > If one wants to scan the shadow stack how to detect the end (e.g.
> > fast
> > backtrace)? Is it useful to put an invalid value (-1) there?
> > (affects map_shadow_stack syscall too).
> 
> Interesting idea. I think it's probably not a breaking ABI change if
> we
> wanted to add it later.

One complication could be how to handle shadow stacks created outside
of thread creation. map_shadow_stack would typically add a token at the
end so it could be pivoted to. So then the backtracing algorithm would
have to know to skip it or something to find a special start of stack
marker.

Alternatively, the thread shadow stacks could get an already used token
pushed at the end, to try to match what an in-use map_shadow_stack
shadow stack would look like. Then the backtracing algorithm could just
look for the same token in both cases. It might get confused in exotic
cases and mistake a token in the middle of the stack for the end of the
allocation though. Hmm...

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler
  2023-03-01 18:14     ` Edgecombe, Rick P
@ 2023-03-01 18:37       ` Borislav Petkov
  0 siblings, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-01 18:37 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, mtk.manpages, corbet, linux-kernel, linux-api, gorcunov

On Wed, Mar 01, 2023 at 06:14:44PM +0000, Edgecombe, Rick P wrote:
> Do you want me to add it to this one or a separate one?

Don't bother. I'll fix it up myself in the future, if I notice it again.
No need to disturb the set just for that.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 13/41] mm: Make pte_mkwrite() take a VMA
  2023-02-27 22:29 ` [PATCH v7 13/41] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
  2023-03-01  7:03   ` Christophe Leroy
@ 2023-03-02 12:19   ` Borislav Petkov
  1 sibling, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-02 12:19 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, linux-ia64, loongarch, linux-m68k,
	Michal Simek, Dinh Nguyen, linux-mips, linux-openrisc,
	linux-parisc, linuxppc-dev, linux-riscv, linux-s390, linux-sh,
	sparclinux, linux-um, xen-devel

On Mon, Feb 27, 2023 at 02:29:29PM -0800, Rick Edgecombe wrote:
> [0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redhat.com/#t

I guess that sub-thread about how you arrived at this "pass a VMA"
decision should be in the Link tag. But that's for the committer, I'd
say.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY
  2023-02-27 22:29 ` [PATCH v7 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY Rick Edgecombe
@ 2023-03-02 12:48   ` Borislav Petkov
  2023-03-02 17:01     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-02 12:48 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu

On Mon, Feb 27, 2023 at 02:29:30PM -0800, Rick Edgecombe wrote:
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 0646ad00178b..56b374d1bffb 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -21,7 +21,8 @@
>  #define _PAGE_BIT_SOFTW2	10	/* " */
>  #define _PAGE_BIT_SOFTW3	11	/* " */
>  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
> -#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
> +#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
> +#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
>  #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
>  #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
>  #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
> @@ -34,6 +35,15 @@
>  #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
>  #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
>  
> +/*
> + * Indicates a Saved Dirty bit page.
> + */
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +#define _PAGE_BIT_SAVED_DIRTY		_PAGE_BIT_SOFTW5 /* Saved Dirty bit */
> +#else
> +#define _PAGE_BIT_SAVED_DIRTY		0
> +#endif
> +
>  /* If _PAGE_BIT_PRESENT is clear, we use these: */
>  /* - if the user mapped it with PROT_NONE; pte_present gives true */
>  #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
> @@ -117,6 +127,25 @@
>  #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
>  #endif
>  
> +/*
> + * The hardware requires shadow stack to be Write=0,Dirty=1. However,
> + * there are valid cases where the kernel might create read-only PTEs that
> + * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty  tracking). In
> + * this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-dirty bit,
> + * to avoid creating a wrong "shadow stack" PTEs. Such PTEs have
> + * (Write=0,SavedDirty=1,Dirty=0) set.
> + *
> + * Note that on processors without shadow stack support, the 

.git/rebase-apply/patch:154: trailing whitespace.
 * Note that on processors without shadow stack support, the 
warning: 1 line adds whitespace errors.

Hm, apparently git checks for that too - not only trailing empty lines.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-01 18:07     ` Edgecombe, Rick P
  2023-03-01 18:32       ` Edgecombe, Rick P
@ 2023-03-02 16:14       ` szabolcs.nagy
  2023-03-02 21:17         ` Edgecombe, Rick P
  1 sibling, 1 reply; 159+ messages in thread
From: szabolcs.nagy @ 2023-03-02 16:14 UTC (permalink / raw)
  To: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, dave.hansen, kirill.shutemov,
	Eranian, Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	broonie, linux-arch, kcc, bp, oleg, hjl.tools, pavel, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, Yang, Weijiang, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: Yu, Yu-cheng, nd

The 03/01/2023 18:07, Edgecombe, Rick P wrote:
> On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote:
> > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > +Application Enabling
> > > +====================
> > > +
> > > +An application's CET capability is marked in its ELF note and can
> > > be verified
> > > +from readelf/llvm-readelf output::
> > > +
> > > +    readelf -n <application> | grep -a SHSTK
> > > +        properties: x86 feature: SHSTK
> > > +
> > > +The kernel does not process these applications markers directly.
> > > Applications
> > > +or loaders must enable CET features using the interface described
> > > in section 4.
> > > +Typically this would be done in dynamic loader or static runtime
> > > objects, as is
> > > +the case in GLIBC.
> > 
> > Note that this has to be an early decision in libc (ld.so or static
> > exe start code), which will be difficult to hook into system wide
> > security policy settings. (e.g. to force shstk on marked binaries.)
> 
> In the eager enabling (by the kernel) scenario, how is this improved?
> The loader has to have the option to disable the shadow stack if
> enabling conditions are not met, so it still has to trust userspace to
> not do that. Did you have any more specifics on how the policy would
> work?

i guess my issue is that the arch prctls only allow self policing.
there is no kernel mechanism to set policy from outside the process
that is either inherited or asynchronously set. policy is completely
managed by libc (and done very early).

now i understand that async disable does not work (thanks for the
explanation), but some control for forced enable/locking inherited
across exec could work.

> > From userspace POV I'd prefer if a static exe did not have to parse
> > its own ELF notes (i.e. kernel enabled shstk based on the marking).
> 
> This is actually exactly what happens in the glibc patches. My
> understanding was that it had already been discussed amongst glibc folks.

there were many glibc patches, some of which are committed despite not
having an accepted linux abi, so i'm trying to review the linux abi
contracts and expect this patch to be authoritative, please bear with me.

> > - I think it's better to have a new limit specifically for shadow
> >   stack size (which by default can be RLIMIT_STACK) so userspace
> >   can adjust it if needed (another reason is that stack size is
> >   not always a good indicator of max call depth).
> 
> Hmm, yea. This seems like a good idea, but I don't see why it can't be
> a follow on. The series is quite big just to get the basics. I have
> tried to save some of the enhancements (like alt shadow stack) for the
> future.

it is actually not obvious how to introduce a limit so it is inherited
or reset in a sensible way so i think it is useful to discuss it
together with other issues.

> > The kernel pushes on the shadow stack on signal entry so shadow stack
> > overflow cannot be handled. Please document this as non-recoverable
> > failure.
> 
> It doesn't hurt to call it out. Please see the below link for future
> plans to handle this scenario (alt shadow stack).
> 
> > 
> > I think it can be made recoverable if signals with alternate stack
> > run
> > on a different shadow stack. And the top of the thread shadow stack
> > is
> > just corrupted instead of pushed in the overflow case. Then longjmp
> > out
> > can be made to work (common in stack overflow handling cases), and
> > reliable crash report from the signal handler works (also common).
> > 
> > Does SSP get stored into the sigcontext struct somewhere?
> 
> No, it's pushed to the shadow stack only. See the v2 coverletter of the
> discussion on the design and reasoning:
> 
> https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

i think this should be part of the initial design as it may be hard
to change later.

"sigaltshstk() is separate from sigaltstack(). You can have one
without the other, neither or both together. Because the shadow
stack specific state is pushed to the shadow stack, the two
features don’t need to know about each other."

this means they cannot be changed together atomically.

i'd expect most sigaltstack users to want to be resilient
against shadow stack overflow which means non-portable
code changes.

i don't see why automatic alt shadow stack allocation would
not work (kernel manages it transparently when an alt stack
is installed or disabled).

"Since shadow alt stacks are a new feature, longjmp()ing from an
alt shadow stack will simply not be supported. If a libc want’s
to support this it will need to enable WRSS and write it’s own
restore token."

i think longjmp should work without enabling writes to the shadow
stack in the libc. this can also affect unwinding across signal
handlers (not for c++ but e.g. glibc thread cancellation).

i'd prefer overwriting the shadow stack top entry on overflow to
disallowing longjmp out of a shadow stack overflow handler.

> > I think the map_shadow_stack syscall should be mentioned in this
> > document too.
> 
> There is a man page prepared for this. I plan to update the docs to
> reference it when it exists and not duplicate the text. There can be a
> blurb for the time being but it would be short lived.

i wanted to comment on the syscall because i think it may be better
to have a magic mmap MAP_ flag that takes care of everything.

but i can go comment on the specific patch then.

thanks.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-01 18:32       ` Edgecombe, Rick P
@ 2023-03-02 16:34         ` szabolcs.nagy
  2023-03-03 22:35           ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: szabolcs.nagy @ 2023-03-02 16:34 UTC (permalink / raw)
  To: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, dave.hansen, kirill.shutemov,
	Eranian, Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	broonie, linux-arch, kcc, bp, oleg, hjl.tools, pavel, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, Yang, Weijiang, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: Yu, Yu-cheng, nd

The 03/01/2023 18:32, Edgecombe, Rick P wrote:
> On Wed, 2023-03-01 at 10:07 -0800, Rick Edgecombe wrote:
> > > If one wants to scan the shadow stack how to detect the end (e.g.
> > > fast
> > > backtrace)? Is it useful to put an invalid value (-1) there?
> > > (affects map_shadow_stack syscall too).
> > 
> > Interesting idea. I think it's probably not a breaking ABI change if
> > we
> > wanted to add it later.
> 
> One complication could be how to handle shadow stacks created outside
> of thread creation. map_shadow_stack would typically add a token at the
> end so it could be pivoted to. So then the backtracing algorithm would
> have to know to skip it or something to find a special start of stack
> marker.

i'd expect the pivot token to disappear once you pivot to it
(and a pivot token to appear on the stack you pivoted away
from, so you can go back later) otherwise i don't see how
swapcontext works.

i'd push an end token and a pivot token on new shadow stacks.

> Alternatively, the thread shadow stacks could get an already used token
> pushed at the end, to try to match what an in-use map_shadow_stack
> shadow stack would look like. Then the backtracing algorithm could just
> look for the same token in both cases. It might get confused in exotic
> cases and mistake a token in the middle of the stack for the end of the
> allocation though. Hmm...

a backtracer would search for an end token on an active shadow
stack. it should be able to skip other tokens that don't seem
to be code addresses. the end token needs to be identifiable
and not break security properties. i think it's enough if the
backtrace is best effort correct, there can be corner-cases when
shadow stack is difficult to interpret, but e.g. a profiler can
still make good use of this feature.
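
A best-effort scanner along these lines might look like the following
sketch. Both looks_like_code_address() and the end-token encoding are
assumptions, since neither is defined yet:

    /*
     * Walk entries from the current SSP toward the stack's base, stopping
     * at the first entry that cannot be a return address (e.g. a token).
     */
    static int shstk_backtrace(const unsigned long *ssp,
                               unsigned long *buf, int max)
    {
            int n = 0;

            while (n < max && looks_like_code_address(*ssp))
                    buf[n++] = *ssp++;

            return n;	/* best effort; mid-stack tokens cut it short */
    }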

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY
  2023-03-02 12:48   ` Borislav Petkov
@ 2023-03-02 17:01     ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-02 17:01 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Thu, 2023-03-02 at 13:48 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:30PM -0800, Rick Edgecombe wrote:
> > diff --git a/arch/x86/include/asm/pgtable_types.h
> > b/arch/x86/include/asm/pgtable_types.h
> > index 0646ad00178b..56b374d1bffb 100644
> > --- a/arch/x86/include/asm/pgtable_types.h
> > +++ b/arch/x86/include/asm/pgtable_types.h
> > @@ -21,7 +21,8 @@
> >   #define _PAGE_BIT_SOFTW2     10      /* " */
> >   #define _PAGE_BIT_SOFTW3     11      /* " */
> >   #define _PAGE_BIT_PAT_LARGE  12      /* On 2MB or 1GB pages */
> > -#define _PAGE_BIT_SOFTW4     58      /* available for programmer
> > */
> > +#define _PAGE_BIT_SOFTW4     57      /* available for programmer
> > */
> > +#define _PAGE_BIT_SOFTW5     58      /* available for programmer
> > */
> >   #define _PAGE_BIT_PKEY_BIT0  59      /* Protection Keys, bit 1/4
> > */
> >   #define _PAGE_BIT_PKEY_BIT1  60      /* Protection Keys, bit 2/4
> > */
> >   #define _PAGE_BIT_PKEY_BIT2  61      /* Protection Keys, bit 3/4
> > */
> > @@ -34,6 +35,15 @@
> >   #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty
> > tracking */
> >   #define _PAGE_BIT_DEVMAP     _PAGE_BIT_SOFTW4
> >   
> > +/*
> > + * Indicates a Saved Dirty bit page.
> > + */
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +#define _PAGE_BIT_SAVED_DIRTY                _PAGE_BIT_SOFTW5 /*
> > Saved Dirty bit */
> > +#else
> > +#define _PAGE_BIT_SAVED_DIRTY                0
> > +#endif
> > +
> >   /* If _PAGE_BIT_PRESENT is clear, we use these: */
> >   /* - if the user mapped it with PROT_NONE; pte_present gives true
> > */
> >   #define _PAGE_BIT_PROTNONE   _PAGE_BIT_GLOBAL
> > @@ -117,6 +127,25 @@
> >   #define _PAGE_SOFTW4 (_AT(pteval_t, 0))
> >   #endif
> >   
> > +/*
> > + * The hardware requires shadow stack to be Write=0,Dirty=1.
> > However,
> > + * there are valid cases where the kernel might create read-only
> > PTEs that
> > + * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty 
> > tracking). In
> > + * this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-
> > dirty bit,
> > + * to avoid creating a wrong "shadow stack" PTEs. Such PTEs have
> > + * (Write=0,SavedDirty=1,Dirty=0) set.
> > + *
> > + * Note that on processors without shadow stack support, the 
> 
> .git/rebase-apply/patch:154: trailing whitespace.
>  * Note that on processors without shadow stack support, the 
> warning: 1 line adds whitespace errors.
> 
> Hm, apparently git checks for that too - not only trailing empty
> lines.

Weird. And oops on the space. Just wondering how checkpatch missed
this. It didn't; it was just in a pile of false positives on that patch
and I didn't notice it in there.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-02-27 22:29 ` [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
@ 2023-03-02 17:22   ` Szabolcs Nagy
  2023-03-02 21:21     ` Edgecombe, Rick P
  2023-03-09 18:55     ` Deepak Gupta
  2023-03-10 16:11   ` Borislav Petkov
  1 sibling, 2 replies; 159+ messages in thread
From: Szabolcs Nagy @ 2023-03-02 17:22 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: nd, al.grant

The 02/27/2023 14:29, Rick Edgecombe wrote:
> Previously, a new PROT_SHADOW_STACK was attempted,
...
> So rather than repurpose two existing syscalls (mmap, madvise) that don't
> quite fit, just implement a new map_shadow_stack syscall to allow
> userspace to map and setup new shadow stacks in one step. While ucontext
> is the primary motivator, userspace may have other unforeseen reasons to
> setup it's own shadow stacks using the WRSS instruction. Towards this
> provide a flag so that stacks can be optionally setup securely for the
> common case of ucontext without enabling WRSS. Or potentially have the
> kernel set up the shadow stack in some new way.
...
> The following example demonstrates how to create a new shadow stack with
> map_shadow_stack:
> void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);

i think

mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);

could do the same with less disruption to users (new syscalls
are harder to deal with than new flags). it would do the
guard page and initial token setup too (there is no flag for
it but could be squeezed in).

most of the mmap features need not be available (EINVAL) when
MAP_SHADOW_STACK is specified.

the main drawback is running out of mmap flags so extension
is limited. (but the new syscall has limitations too).
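
Side by side, the two forms under discussion, as sketches — the
MAP_SHADOW_STACK flag is hypothetical, and __NR_map_shadow_stack stands
in for whatever syscall number gets assigned:

    /* proposed syscall (this series): */
    shstk = (void *)syscall(__NR_map_shadow_stack, 0, size,
                            SHADOW_STACK_SET_TOKEN);	/* from this series */

    /* suggested mmap alternative: */
    shstk = mmap(0, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);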

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
  2023-02-27 22:29 ` [PATCH v7 30/41] x86/shstk: Handle thread shadow stack Rick Edgecombe
@ 2023-03-02 17:34   ` Szabolcs Nagy
  2023-03-02 21:48     ` Edgecombe, Rick P
  2023-03-08 15:26   ` Borislav Petkov
  1 sibling, 1 reply; 159+ messages in thread
From: Szabolcs Nagy @ 2023-03-02 17:34 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: Yu-cheng Yu, nd, al.grant

The 02/27/2023 14:29, Rick Edgecombe wrote:
> For shadow stack enabled vfork(), the parent and child can share the same
> shadow stack, like they can share a normal stack. Since the parent is
> suspended until the child terminates, the child will not interfere with
> the parent while executing as long as it doesn't return from the vfork()
> and overwrite up the shadow stack. The child can safely overwrite down
> the shadow stack, as the parent can just overwrite this later. So CET does
> not add any additional limitations for vfork().
> 
> Userspace implementing posix vfork() can actually prevent the child from
> returning from the vfork() calling function, using CET. Glibc does this
> by adjusting the shadow stack pointer in the child, so that the child
> receives a #CP if it tries to return from vfork() calling function.

this commit message implies there is protection against
the vfork child clobbering the parent's shadow stack,
but actually the child can INCSSP (or longjmp) and then
clobber it.

so the glibc code just tries to catch bugs and accidents
not a strong security mechanism. i'd skip this paragraph.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-02 16:14       ` szabolcs.nagy
@ 2023-03-02 21:17         ` Edgecombe, Rick P
  2023-03-03 16:30           ` szabolcs.nagy
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-02 21:17 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, szabolcs.nagy,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	broonie, kcc, linux-arch, bp, oleg, hjl.tools, Yang, Weijiang,
	Lutomirski, Andy, pavel, arnd, tglx, Schimpe, Christina,
	mike.kravetz, x86, linux-doc, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: Yu, Yu-cheng, nd

On Thu, 2023-03-02 at 16:14 +0000, szabolcs.nagy@arm.com wrote:
> The 03/01/2023 18:07, Edgecombe, Rick P wrote:
> > On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote:
> > > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > > +Application Enabling
> > > > +====================
> > > > +
> > > > +An application's CET capability is marked in its ELF note and
> > > > can
> > > > be verified
> > > > +from readelf/llvm-readelf output::
> > > > +
> > > > +    readelf -n <application> | grep -a SHSTK
> > > > +        properties: x86 feature: SHSTK
> > > > +
> > > > +The kernel does not process these applications markers
> > > > directly.
> > > > Applications
> > > > +or loaders must enable CET features using the interface
> > > > described
> > > > in section 4.
> > > > +Typically this would be done in dynamic loader or static
> > > > runtime
> > > > objects, as is
> > > > +the case in GLIBC.
> > > 
> > > Note that this has to be an early decision in libc (ld.so or
> > > static
> > > exe start code), which will be difficult to hook into system wide
> > > security policy settings. (e.g. to force shstk on marked
> > > binaries.)
> > 
> > In the eager enabling (by the kernel) scenario, how is this
> > improved?
> > The loader has to have the option to disable the shadow stack if
> > enabling conditions are not met, so it still has to trust userspace
> > to
> > not do that. Did you have any more specifics on how the policy
> > would
> > work?
> 
> i guess my issue is that the arch prctls only allow self policing.
> there is no kernel mechanism to set policy from outside the process
> that is either inherited or asynchronously set. policy is completely
> managed by libc (and done very early).
> 
> now i understand that async disable does not work (thanks for the
> explanation), but some control for forced enable/locking inherited
> across exec could work.

Is the idea that shadow stack would be forced on regardless of whether
the linked libraries support it? In which case it could be allowed to
crash if they do not?

I think the majority of users would prefer the other case where shadow
stack is only used if supported, so this sounds like a special case.
Rather than lose the flexibility for the typical case, I would think
something like this could be an additional enabling mode. glibc could
check if shadow stack is already enabled by the kernel using the
arch_prctl()s in this case.
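
As a sketch, that check could use the ARCH_SHSTK_STATUS arch_prctl()
added later in this series (raw syscall assumed):

    unsigned long features = 0;
    bool already_enabled;

    /* Ask the kernel which shadow stack features are already enabled. */
    syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, &features);
    already_enabled = features & ARCH_SHSTK_SHSTK;	/* skip self-enabling */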

We are having to work around the existing broken glibc binaries by not
triggering off the elf bits automatically in the kernel, but I suppose
if this was a special "I don't care if it crashes" feature, maybe it
would be ok. Otherwise we would need to change the elf header bit to
exclude the old binaries to even be able to do this, and there was
extreme resistance to this idea from the userspace side.

> 
> > > From userspace POV I'd prefer if a static exe did not have to
> > > parse
> > > its own ELF notes (i.e. kernel enabled shstk based on the
> > > marking).
> > 
> > This is actually exactly what happens in the glibc patches. My
> > understand was that it already been discussed amongst glibc folks.
> 
> there were many glibc patches some of which are committed despite not
> having an accepted linux abi, so i'm trying to review the linux abi
> contracts and expect this patch to be authorative, please bear with
> me.

H.J. has some recent ones that work against this kernel series that
might interest you. The existing upstream glibc support will not get
used due to the enabling interface change to arch_prctl() (this was one
of the inspirations of the change actually).

> 
> > > - I think it's better to have a new limit specifically for shadow
> > >   stack size (which by default can be RLIMIT_STACK) so userspace
> > >   can adjust it if needed (another reason is that stack size is
> > >   not always a good indicator of max call depth).
> > 
> > Hmm, yea. This seems like a good idea, but I don't see why it can't
> > be
> > a follow on. The series is quite big just to get the basics. I have
> > tried to save some of the enhancements (like alt shadow stack) for
> > the
> > future.
> 
> it is actually not obvious how to introduce a limit so it is
> inherited
> or reset in a sensible way so i think it is useful to discuss it
> together with other issues.

Looking at this again, I'm not sure why a new rlimit is needed. It
seems many of those points were just variations on the assumption that
the clone3 stack size was not used, but it actually is used and just
not documented. If you disagree perhaps you could elaborate on what the
requirements are and we can see if it seems tricky to do in a follow up.
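
For what it's worth, a sketch of the userspace side of what is being
described — clone3 supplying the stack size, which the kernel then also
uses to size the new thread's shadow stack:

    struct clone_args args = {
            .flags		= CLONE_VM | CLONE_THREAD | CLONE_SIGHAND,
            .stack		= (__u64)(uintptr_t)stack,
            .stack_size	= stack_sz,	/* also bounds the shadow stack */
    };

    pid = syscall(SYS_clone3, &args, sizeof(args));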

> 
> > > The kernel pushes on the shadow stack on signal entry so shadow
> > > stack
> > > overflow cannot be handled. Please document this as non-
> > > recoverable
> > > failure.
> > 
> > It doesn't hurt to call it out. Please see the below link for
> > future
> > plans to handle this scenario (alt shadow stack).
> > 
> > > 
> > > I think it can be made recoverable if signals with alternate
> > > stack
> > > run
> > > on a different shadow stack. And the top of the thread shadow
> > > stack
> > > is
> > > just corrupted instead of pushed in the overflow case. Then
> > > longjmp
> > > out
> > > can be made to work (common in stack overflow handling cases),
> > > and
> > > reliable crash report from the signal handler works (also
> > > common).
> > > 
> > > Does SSP get stored into the sigcontext struct somewhere?
> > 
> > No, it's pushed to the shadow stack only. See the v2 coverletter of
> > the
> > discussion on the design and reasoning:
> > 
> > 
https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/
> 
> i think this should be part of the initial design as it may be hard
> to change later.

This is actually how it came up. Andy Lutomirski said, paraphrasing,
"what if we want alt shadow stacks someday, does the signal frame ABI
support it?". So I created an ABI that supports it and an initial POC,
and said let's hold off on the implementation for the first version and
just use the sigframe ABI that will allow it for the future. So the
point was to make sure the signal format supported alt shadow stacks to
make it easier in the future.

> 
> "sigaltshstk() is separate from sigaltstack(). You can have one
> without the other, neither or both together. Because the shadow
> stack specific state is pushed to the shadow stack, the two
> features don’t need to know about each other."
> 
> this means they cannot be changed together atomically.

Not sure why this is needed since they can be used separately. So why
tie them together?

> 
> i'd expect most sigaltstack users to want to be resilient
> against shadow stack overflow which means non-portable
> code changes.

Portable between architectures? Or between shadow stack vs non-shadow
stack?

It does seem like it would not be uncommon for users to want both
together, but see below.

> 
> i don't see why automatic alt shadow stack allocation would
> not work (kernel manages it transparently when an alt stack
> is installed or disabled).

Ah, I think I see where maybe I can fill you in. Andy Luto had
discounted this idea out of hand originally, but I didn't see it at
first. sigaltstack lets you set, retrieve, or disable the shadow stack,
right... But this doesn't allocate anything, it just sets where the
next signal will be handled. This is different than things like threads
where there is a new resources being allocated and it makes coming up
with logic to guess when to de-allocate the alt shadow stack difficult.
You probably already know...

But because of this there can be some modes where the shadow stack is
changed while on it. For one example, SS_AUTODISARM will disable the
alt shadow stack while switching to it and restore when sigreturning.
At which point a new altstack can be set. In the non-shadow stack case
this is nice because future signals won't clobber the alt stack if you
switch away from it (swapcontext(), etc). But it also means you can
"change" the alt stack while on it ("change" sort of, the auto disarm
results in the kernel forgetting it temporarily).

I hear where you are coming from with the desire to have it "just work"
with existing code, but I think the resulting ABI around the alt shadow
stack allocation lifecycle would be way too complicated even if it
could be made to work. Hence making a new interface. But also, the idea
was that the x86 signal ABI should support handling alt shadow stacks,
which is what we have done with this series. If a different interface
for configuring it is better than the one from the POC, I'm not seeing
a problem jump out. Is there any specific concern about backwards
compatibility here?

> 
> "Since shadow alt stacks are a new feature, longjmp()ing from an
> alt shadow stack will simply not be supported. If a libc want’s
> to support this it will need to enable WRSS and write it’s own
> restore token."
> 
> i think longjmp should work without enabling writes to the shadow
> stack in the libc. this can also affect unwinding across signal
> handlers (not for c++ but e.g. glibc thread cancellation).

glibc today does not support longjmp()ing from a different stack (for
example, after a swapcontext()) when shadow stack is used. If
glibc used wrss it could maybe be supported, but otherwise I don't see
how the HW can support it.
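
For the same-stack case that does work, a rough sketch of the
INCSSP-style unwind (using the gcc -mshstk intrinsics; this is not
glibc's actual code, and it assumes shadow stack is enabled):

    #include <immintrin.h>

    /* Pop shadow stack entries until SSP reaches the value saved at
     * setjmp() time. INCSSP can pop at most 255 entries at once. */
    static void shstk_unwind_to(unsigned long long target_ssp)
    {
            while (_get_ssp() < target_ssp) {
                    unsigned long long left = (target_ssp - _get_ssp()) / 8;

                    _inc_ssp(left > 255 ? 255 : (unsigned int)left);
            }
    }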

HJ and I were actually just discussing this the other day. Are you
looking at this series with respect to the arm shadow stack feature by
any chance? I would love it if glibc/tools would document what the
shadow stack limitations are. If all the arches have the same or similar
limitations perhaps this could be one developer guide. For the most
part though, the limitations I've encountered are in glibc and the
kernel is more the building blocks.

> 
> i'd prefer overwriting the shadow stack top entry on overflow to
> disallowing longjmp out of a shadow stack overflow handler.
> 
> > > I think the map_shadow_stack syscall should be mentioned in this
> > > document too.
> > 
> > There is a man page prepared for this. I plan to update the docs to
> > reference it when it exists and not duplicate the text. There can
> > be a
> > blurb for the time being but it would be short lived.
> 
> i wanted to comment on the syscall because i think it may be better
> to have a magic mmap MAP_ flag that takes care of everything.
> 
> but i can go comment on the specific patch then.
> 
> thanks.

A general comment. Not sure if you are aware, but this shadow stack
enabling effort is quite old at this point and there have been many
discussions on these topics stretching back years. The latest
conversation was around getting this series into linux-next soon to get
some testing on the MM pieces. I really appreciate getting this ABI
feedback as it is always tricky to get right, but at this stage I would
hope to be focusing mostly on concrete problems.

I also expect to have some amount of ABI growth going forward, with all
the normal things that entails. Shadow stack is not so special that it
can arrive fully finalized without the iterative feedback that comes
from real-world usage. At some point we need to move forward with
something, and we have quite a bit of initial change at this point.

So I would like to minimize the initial implementation unless anyone
sees any likely problems with future growth. Can you be clear if you
see any concrete problems at this point, or are you more looking to
evaluate the design reasoning? I'm under the assumption there is nothing that
would prohibit linux-next testing while any ABI shakedown happens
concurrently at least?

Thanks,
Rick

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-02 17:22   ` Szabolcs Nagy
@ 2023-03-02 21:21     ` Edgecombe, Rick P
  2023-03-09 18:55     ` Deepak Gupta
  1 sibling, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-02 21:21 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, szabolcs.nagy, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, oleg, hjl.tools, pavel, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	Yang, Weijiang, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: nd, al.grant

On Thu, 2023-03-02 at 17:22 +0000, Szabolcs Nagy wrote:
> The 02/27/2023 14:29, Rick Edgecombe wrote:
> > Previously, a new PROT_SHADOW_STACK was attempted,
> 
> ...
> > So rather than repurpose two existing syscalls (mmap, madvise) that
> > don't
> > quite fit, just implement a new map_shadow_stack syscall to allow
> > userspace to map and setup new shadow stacks in one step. While
> > ucontext
> > is the primary motivator, userspace may have other unforeseen
> > reasons to
> > setup it's own shadow stacks using the WRSS instruction. Towards
> > this
> > provide a flag so that stacks can be optionally setup securely for
> > the
> > common case of ucontext without enabling WRSS. Or potentially have
> > the
> > kernel set up the shadow stack in some new way.
> 
> ...
> > The following example demonstrates how to create a new shadow stack
> > with
> > map_shadow_stack:
> > void *shstk = map_shadow_stack(addr, stack_size,
> > SHADOW_STACK_SET_TOKEN);
> 
> i think
> 
> mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);
> 
> could do the same with less disruption to users (new syscalls
> are harder to deal with than new flags). it would do the
> guard page and initial token setup too (there is no flag for
> it but could be squeezed in).
> 
> most of the mmap features need not be available (EINVAL) when
> MAP_SHADOW_STACK is specified.
> 
> the main drawback is running out of mmap flags so extension
> is limited. (but the new syscall has limitations too).

Deepak Gupta (working on riscv shadow stack) asked something similar.
Can you see if this thread answers your questions?

https://lore.kernel.org/lkml/20230223000340.GB945966@debug.ba.rivosinc.com/

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
  2023-03-02 17:34   ` Szabolcs Nagy
@ 2023-03-02 21:48     ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-02 21:48 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, szabolcs.nagy, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, oleg, hjl.tools, pavel, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	Yang, Weijiang, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: Yu, Yu-cheng, nd, al.grant

On Thu, 2023-03-02 at 17:34 +0000, Szabolcs Nagy wrote:
> The 02/27/2023 14:29, Rick Edgecombe wrote:
> > For shadow stack enabled vfork(), the parent and child can share
> > the same
> > shadow stack, like they can share a normal stack. Since the parent
> > is
> > suspended until the child terminates, the child will not interfere
> > with
> > the parent while executing as long as it doesn't return from the
> > vfork()
> > and overwrite up the shadow stack. The child can safely overwrite
> > down
> > the shadow stack, as the parent can just overwrite this later. So
> > CET does
> > not add any additional limitations for vfork().
> > 
> > Userspace implementing posix vfork() can actually prevent the child
> > from
> > returning from the vfork() calling function, using CET. Glibc does
> > this
> > by adjusting the shadow stack pointer in the child, so that the
> > child
> > receives a #CP if it tries to return from vfork() calling function.
> 
> this commit message implies there is protection against
> the vfork child clobbering the parent's shadow stack,
> but actually the child can INCSSP (or longjmp) and then
> clobber it.

It's true the vfork child could use INCSSP and then clobber the
parent's shadow stack, so this is not a strong guarantee of shadow
stack integrity.
But that's not claimed either. It does "prevent the child from
returning from the vfork() calling function" as much as shadow stack
protections apply, which I think would be reasonably understood. The
vfork child could also use wrss to write the return address to the
shadow stack and actually return, or disable shadow stack and return,
as other ways to create problems.

> 
> so the glibc code just tries to catch bugs and accidents
> not a strong security mechanism. i'd skip this paragraph.

Yep. I think it's very much a "nice to have" thing and not intended for
security. The paragraph is an aside anyway, because it is specifics
about another project. I don't have any objection to dropping it if the
opportunity comes up. IIRC it was added because someone thought vfork
couldn't work with shadow stack, so people might like to have the
details of how it can be done.

I wouldn't even be too bothered if the discussed glibc behavior was
dropped either. vfork() can go wrong many ways regardless of shadow
stack. Is it worth the extra special behavior? Maybe just barely...

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 19/41] x86/mm: Check shadow stack page fault errors
  2023-02-27 22:29 ` [PATCH v7 19/41] x86/mm: Check shadow stack page fault errors Rick Edgecombe
@ 2023-03-03 14:00   ` Borislav Petkov
  2023-03-03 14:39     ` Dave Hansen
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-03 14:00 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu

On Mon, Feb 27, 2023 at 02:29:35PM -0800, Rick Edgecombe wrote:
> @@ -1310,6 +1324,23 @@ void do_user_addr_fault(struct pt_regs *regs,
>  
>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  
> +	/*
> +	 * For conventionally writable pages, a read can be serviced with a
> +	 * read only PTE. But for shadow stack, there isn't a concept of
> +	 * read-only shadow stack memory. If it a PTE has the shadow stack

s/it //

> +	 * permission, it can be modified via CALL and RET instructions. So
> +	 * core MM needs to fault in a writable PTE and do things it already
> +	 * does for write faults.
> +	 *
> +	 * Shadow stack accesses (read or write) need to be serviced with
> +	 * shadow stack permission memory, which always include write
> +	 * permissions. So in the case of a shadow stack read access, treat it
> +	 * as a WRITE fault. This will make sure that MM will prepare
> +	 * everything (e.g., break COW) such that maybe_mkwrite() can create a
> +	 * proper shadow stack PTE.
> +	 */
> +	if (error_code & X86_PF_SHSTK)
> +		flags |= FAULT_FLAG_WRITE;
>  	if (error_code & X86_PF_WRITE)
>  		flags |= FAULT_FLAG_WRITE;
>  	if (error_code & X86_PF_INSTR)
> -- 
> 2.17.1
> 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 19/41] x86/mm: Check shadow stack page fault errors
  2023-03-03 14:00   ` Borislav Petkov
@ 2023-03-03 14:39     ` Dave Hansen
  0 siblings, 0 replies; 159+ messages in thread
From: Dave Hansen @ 2023-03-03 14:39 UTC (permalink / raw)
  To: Borislav Petkov, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu

On 3/3/23 06:00, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:35PM -0800, Rick Edgecombe wrote:
>> @@ -1310,6 +1324,23 @@ void do_user_addr_fault(struct pt_regs *regs,
>>  
>>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>>  
>> +	/*
>> +	 * For conventionally writable pages, a read can be serviced with a
>> +	 * read only PTE. But for shadow stack, there isn't a concept of
>> +	 * read-only shadow stack memory. If it a PTE has the shadow stack
> s/it //
> 
>> +	 * permission, it can be modified via CALL and RET instructions. So
>> +	 * core MM needs to fault in a writable PTE and do things it already
>> +	 * does for write faults.
>> +	 *
>> +	 * Shadow stack accesses (read or write) need to be serviced with
>> +	 * shadow stack permission memory, which always include write
>> +	 * permissions. So in the case of a shadow stack read access, treat it
>> +	 * as a WRITE fault. This will make sure that MM will prepare
>> +	 * everything (e.g., break COW) such that maybe_mkwrite() can create a
>> +	 * proper shadow stack PTE.

I ended up just chopping that top paragraph out and rewording it a bit.
I think this still expresses the intent in a lot less space:

        /*
         * Read-only permissions can not be expressed in shadow stack PTEs.
         * Treat all shadow stack accesses as WRITE faults. This ensures
         * that the MM will prepare everything (e.g., break COW) such that
         * maybe_mkwrite() can create a proper shadow stack PTE.
         */
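
I.e., paired with the X86_PF_SHSTK check from the hunk, the end result
would be roughly:

        /*
         * Read-only permissions can not be expressed in shadow stack PTEs.
         * Treat all shadow stack accesses as WRITE faults. This ensures
         * that the MM will prepare everything (e.g., break COW) such that
         * maybe_mkwrite() can create a proper shadow stack PTE.
         */
        if (error_code & X86_PF_SHSTK)
                flags |= FAULT_FLAG_WRITE;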


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 20/41] x86/mm: Teach pte_mkwrite() about stack memory
  2023-02-27 22:29 ` [PATCH v7 20/41] x86/mm: Teach pte_mkwrite() about stack memory Rick Edgecombe
@ 2023-03-03 15:37   ` Borislav Petkov
  0 siblings, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-03 15:37 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:36PM -0800, Rick Edgecombe wrote:
> If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So
> when it is made writable with pte_mkwrite(), it should create shadow
> stack memory, not conventionally writable memory. Now that pte_mkwrite()
> takes a VMA, and places where shadow stack memory might be created pass
> one, pte_mkwrite() can know when it should do this.
^^^^

This sentence needs rewriting.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-02 21:17         ` Edgecombe, Rick P
@ 2023-03-03 16:30           ` szabolcs.nagy
  2023-03-03 16:57             ` H.J. Lu
  2023-03-03 17:41             ` Edgecombe, Rick P
  0 siblings, 2 replies; 159+ messages in thread
From: szabolcs.nagy @ 2023-03-03 16:30 UTC (permalink / raw)
  To: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Eranian, Stephane,
	kirill.shutemov, dave.hansen, linux-mm, fweimer, nadav.amit,
	jannh, dethoma, broonie, kcc, linux-arch, bp, oleg, hjl.tools,
	Yang, Weijiang, Lutomirski, Andy, pavel, arnd, tglx, Schimpe,
	Christina, mike.kravetz, x86, linux-doc, debug, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov, akpm
  Cc: Yu, Yu-cheng, nd

The 03/02/2023 21:17, Edgecombe, Rick P wrote:
> Is the idea that shadow stack would be forced on regardless of whether the
> linked libraries support it? In which case it could be allowed to crash
> if they do not?

execute a binary
- with shstk enabled and locked (only if marked?).
- with shstk disabled and locked.
could be managed in userspace, but it is libc dependent then.

> > > > - I think it's better to have a new limit specifically for shadow
> > > >   stack size (which by default can be RLIMIT_STACK) so userspace
> > > >   can adjust it if needed (another reason is that stack size is
> > > >   not always a good indicator of max call depth).
> 
> Looking at this again, I'm not sure why a new rlimit is needed. It
> seems many of those points were variations on the assumption that the
> clone3 stack size was not used, but it actually is, just not documented. If
> you disagree perhaps you could elaborate on what the requirements are
> and we can see if it seems tricky to do in a follow up.

- tiny thread stack and deep signal stack.
(note that this does not really work with glibc because it has
implementation internal signals that don't run on alt stack,
cannot be masked and don't fit on a tiny thread stack, but
with other runtimes this can be a valid use-case, e.g. musl
allows tiny thread stacks, < pagesize.)

- thread runtimes with clone (glibc uses clone3 but some don't).

- huge stacks but small call depth (problem if some va limit
  is hit or memory overcommit is disabled).

> > "sigaltshstk() is separate from sigaltstack(). You can have one
> > without the other, neither or both together. Because the shadow
> > stack specific state is pushed to the shadow stack, the two
> > features don’t need to know about each other."
...
> > i don't see why automatic alt shadow stack allocation would
> > not work (kernel manages it transparently when an alt stack
> > is installed or disabled).
> 
> Ah, I think I see where maybe I can fill you in. Andy Luto had
> discounted this idea out of hand originally, but I didn't see it at
> first. sigaltstack lets you set, retrieve, or disable the shadow stack,
> right... But this doesn't allocate anything, it just sets where the
> next signal will be handled. This is different than things like threads
> where there is a new resource being allocated and it makes coming up
> with logic to guess when to de-allocate the alt shadow stack difficult.
> You probably already know...
> 
> But because of this there can be some modes where the shadow stack is
> changed while on it. For one example, SS_AUTODISARM will disable the
> alt shadow stack while switching to it and restore when sigreturning.
> At which point a new altstack can be set. In the non-shadow stack case
> this is nice because future signals won't clobber the alt stack if you
> switch away from it (swapcontext(), etc). But it also means you can
> "change" the alt stack while on it ("change" sort of, the auto disarm
> results in the kernel forgetting it temporarily).

the problem with swapcontext is that it may unmask signals
that run on the alt stack, which means the code cannot jump
back after another signal clobbered the alt stack.

the non-standard SS_AUTODISARM aims to solve this by disabling
alt stack settings on signal entry until the handler returns.
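
for reference, SS_AUTODISARM is requested via sigaltstack(), e.g.
(minimal sketch, error handling omitted):

        #define _GNU_SOURCE
        #include <signal.h>

        static char altbuf[64 * 1024];

        static void install_autodisarm_altstack(void)
        {
                stack_t ss = {
                        .ss_sp = altbuf,
                        .ss_size = sizeof(altbuf),
                        /* disarmed on handler entry, restored on return */
                        .ss_flags = SS_AUTODISARM,
                };

                sigaltstack(&ss, NULL);
        }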

so this use case is not about supporting swapcontext out, but
about jumping back. however that does not work reliably with
this patchset: if swapcontext goes to the thread stack (and
not to another stack e.g. used by makecontext), then jump back
fails. (and if there is a sigaltshstk installed then even jump
out fails.)

assuming
- jump out from alt shadow stack can be made to work.
- alt shadow stack management can be automatic.
then this can be improved so jump back works reliably.

> I hear where you are coming from with the desire to have it "just work"
> with existing code, but I think the resulting ABI around the alt shadow
> stack allocation lifecycle would be way too complicated even if it
> could be made to work. Hence making a new interface. But also, the idea
> was that the x86 signal ABI should support handling alt shadow stacks,
> which is what we have done with this series. If a different interface
> for configuring it is better than the one from the POC, I'm not seeing
> a problem jump out. Is there any specific concern about backwards
> compatibility here?

sigaltstack syscall behaviour may be hard to change later
and currently
- shadow stack overflow cannot be recovered from.
- longjmp out of signal handler fails (with sigaltshstk).
- SS_AUTODISARM does not work (jump back can fail).

> > "Since shadow alt stacks are a new feature, longjmp()ing from an
> > alt shadow stack will simply not be supported. If a libc wants
> > to support this it will need to enable WRSS and write its own
> > restore token."
> > 
> > i think longjmp should work without enabling writes to the shadow
> > stack in the libc. this can also affect unwinding across signal
> > handlers (not for c++ but e.g. glibc thread cancellation).
> 
> glibc today does not support longjmp()ing from a different stack (for
> example even today after a swapcontext()) when shadow stack is used. If
> glibc used wrss it could be supported maybe, but otherwise I don't see
> how the HW can support it.
> 
> HJ and I were actually just discussing this the other day. Are you
> looking at this series with respect to the arm shadow stack feature by
> any chance? I would love if glibc/tools would document what the shadow
> stack limitations are. If all the archs have the same or similar
> limitations perhaps this could be one developer guide. For the most
> part though, the limitations I've encountered are in glibc and the
> kernel is more the building blocks.

well we hope that shadow stack behaviour and limitations can
be similar across targets.

longjmp to different stack should work: it can do the same as
setcontext/swapcontext: scan for the pivot token. then only
longjmp out of alt shadow stack fails. (this is non-conforming
longjmp use, but e.g. qemu relies on it.)

for longjmp out of alt shadow stack, the target shadow stack
needs a pivot token, which implies the kernel needs to push that
on signal entry, which can overflow. but i suspect that can be
handled the same way as stackoverflow on signal entry is handled.

> A general comment. Not sure if you are aware, but this shadow stack
> enabling effort is quite old at this point and there have been many
> discussions on these topics stretching back years. The latest
> conversation was around getting this series into linux-next soon to get
> some testing on the MM pieces. I really appreciate getting this ABI
> feedback as it is always tricky to get right, but at this stage I would
> hope to be focusing mostly on concrete problems.
> 
> I also expect to have some amount of ABI growth going forward with all
> the normal things that entails. Shadow stack is not special in that it
> can come fully finalized without the need for the real world usage
> iterative feedback process. At some point we need to move forward with
> something, and we have quite a bit of initial changes at this point.
> 
> So I would like to minimize the initial implementation unless anyone
> sees any likely problems with future growth. Can you be clear if you
> see any concrete problems at this point or are more looking to evaluate
> the design reasoning? I'm under the assumption there is nothing that
> would prohibit linux-next testing while any ABI shakedown happens
> concurrently at least?

understood.

the points that i think are worth raising:

- shadow stack size logic may need to change later.
  (it can be too big, or too small in practice.)
- shadow stack overflow is not recoverable and the
  possible fix for that (sigaltshstk) breaks longjmp
  out of signal handlers.
- jump back after SS_AUTODISARM swapcontext cannot be
  reliable if alt signal uses thread shadow stack.
- the above two concerns may be mitigated by different
  sigaltstack behaviour which may be hard to add later.
- end token for backtrace may be useful, if added
  later it can be hard to check.

thanks.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-03 16:30           ` szabolcs.nagy
@ 2023-03-03 16:57             ` H.J. Lu
  2023-03-03 17:39               ` szabolcs.nagy
  2023-03-03 17:41             ` Edgecombe, Rick P
  1 sibling, 1 reply; 159+ messages in thread
From: H.J. Lu @ 2023-03-03 16:57 UTC (permalink / raw)
  To: szabolcs.nagy
  Cc: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Eranian, Stephane,
	kirill.shutemov, dave.hansen, linux-mm, fweimer, nadav.amit,
	jannh, dethoma, broonie, kcc, linux-arch, bp, oleg, Yang,
	Weijiang, Lutomirski, Andy, pavel, arnd, tglx, Schimpe,
	Christina, mike.kravetz, x86, linux-doc, debug, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov, akpm, Yu, Yu-cheng, nd

On Fri, Mar 3, 2023 at 8:31 AM szabolcs.nagy@arm.com
<szabolcs.nagy@arm.com> wrote:
>
> The 03/02/2023 21:17, Edgecombe, Rick P wrote:
> > Is the idea that shadow stack would be forced on regardless of whether the
> > linked libraries support it? In which case it could be allowed to crash
> > if they do not?
>
> execute a binary
> - with shstk enabled and locked (only if marked?).
> - with shstk disabled and locked.
> could be managed in userspace, but it is libc dependent then.
>
> > > > > - I think it's better to have a new limit specifically for shadow
> > > > >   stack size (which by default can be RLIMIT_STACK) so userspace
> > > > >   can adjust it if needed (another reason is that stack size is
> > > > >   not always a good indicator of max call depth).
> >
> > Looking at this again, I'm not sure why a new rlimit is needed. It
> > seems many of those points were variations on the assumption that the
> > clone3 stack size was not used, but it actually is, just not documented. If
> > you disagree perhaps you could elaborate on what the requirements are
> > and we can see if it seems tricky to do in a follow up.
>
> - tiny thread stack and deep signal stack.
> (note that this does not really work with glibc because it has
> implementation internal signals that don't run on alt stack,
> cannot be masked and don't fit on a tiny thread stack, but
> with other runtimes this can be a valid use-case, e.g. musl
> allows tiny thread stacks, < pagesize.)
>
> - thread runtimes with clone (glibc uses clone3 but some don't).
>
> - huge stacks but small call depth (problem if some va limit
>   is hit or memory overcommit is disabled).
>
> > > "sigaltshstk() is separate from sigaltstack(). You can have one
> > > without the other, neither or both together. Because the shadow
> > > stack specific state is pushed to the shadow stack, the two
> > > features don’t need to know about each other."
> ...
> > > i don't see why automatic alt shadow stack allocation would
> > > not work (kernel manages it transparently when an alt stack
> > > is installed or disabled).
> >
> > Ah, I think I see where maybe I can fill you in. Andy Luto had
> > discounted this idea out of hand originally, but I didn't see it at
> > first. sigaltstack lets you set, retrieve, or disable the shadow stack,
> > right... But this doesn't allocate anything, it just sets where the
> > next signal will be handled. This is different than things like threads
> > where there is a new resource being allocated and it makes coming up
> > with logic to guess when to de-allocate the alt shadow stack difficult.
> > You probably already know...
> >
> > But because of this there can be some modes where the shadow stack is
> > changed while on it. For one example, SS_AUTODISARM will disable the
> > alt shadow stack while switching to it and restore when sigreturning.
> > At which point a new altstack can be set. In the non-shadow stack case
> > this is nice because future signals won't clobber the alt stack if you
> > switch away from it (swapcontext(), etc). But it also means you can
> > "change" the alt stack while on it ("change" sort of, the auto disarm
> > results in the kernel forgetting it temporarily).
>
> the problem with swapcontext is that it may unmask signals
> that run on the alt stack, which means the code cannot jump
> back after another signal clobbered the alt stack.
>
> the non-standard SS_AUTODISARM aims to solve this by disabling
> alt stack settings on signal entry until the handler returns.
>
> so this use case is not about supporting swapcontext out, but
> about jumping back. however that does not work reliably with
> this patchset: if swapcontext goes to the thread stack (and
> not to another stack e.g. used by makecontext), then jump back
> fails. (and if there is a sigaltshstk installed then even jump
> out fails.)
>
> assuming
> - jump out from alt shadow stack can be made to work.
> - alt shadow stack management can be automatic.
> then this can be improved so jump back works reliably.
>
> > I hear where you are coming from with the desire to have it "just work"
> > with existing code, but I think the resulting ABI around the alt shadow
> > stack allocation lifecycle would be way too complicated even if it
> > could be made to work. Hence making a new interface. But also, the idea
> > was that the x86 signal ABI should support handling alt shadow stacks,
> > which is what we have done with this series. If a different interface
> > for configuring it is better than the one from the POC, I'm not seeing
> > a problem jump out. Is there any specific concern about backwards
> > compatibility here?
>
> sigaltstack syscall behaviour may be hard to change later
> and currently
> - shadow stack overflow cannot be recovered from.
> - longjmp out of signal handler fails (with sigaltshstk).
> - SS_AUTODISARM does not work (jump back can fail).
>
> > > "Since shadow alt stacks are a new feature, longjmp()ing from an
> > > alt shadow stack will simply not be supported. If a libc wants
> > > to support this it will need to enable WRSS and write its own
> > > restore token."
> > >
> > > i think longjmp should work without enabling writes to the shadow
> > > stack in the libc. this can also affect unwinding across signal
> > > handlers (not for c++ but e.g. glibc thread cancellation).
> >
> > glibc today does not support longjmp()ing from a different stack (for
> > example even today after a swapcontext()) when shadow stack is used. If
> > glibc used wrss it could be supported maybe, but otherwise I don't see
> > how the HW can support it.
> >
> > HJ and I were actually just discussing this the other day. Are you
> > looking at this series with respect to the arm shadow stack feature by
> > any chance? I would love if glibc/tools would document what the shadow
> > stack limitations are. If all the archs have the same or similar
> > limitations perhaps this could be one developer guide. For the most
> > part though, the limitations I've encountered are in glibc and the
> > kernel is more the building blocks.
>
> well we hope that shadow stack behaviour and limitations can
> be similar across targets.
>
> longjmp to different stack should work: it can do the same as
> setcontext/swapcontext: scan for the pivot token. then only
> longjmp out of alt shadow stack fails. (this is non-conforming
> longjmp use, but e.g. qemu relies on it.)

A restore token may not be used with longjmp.  Unlike
setcontext/swapcontext, longjmp is optional.  If longjmp isn't called,
there will be an extra token on the shadow stack and RET will fail.

> for longjmp out of alt shadow stack, the target shadow stack
> needs a pivot token, which implies the kernel needs to push that
> on signal entry, which can overflow. but i suspect that can be
> handled the same way as stackoverflow on signal entry is handled.
>
> > A general comment. Not sure if you are aware, but this shadow stack
> > enabling effort is quite old at this point and there have been many
> > discussions on these topics stretching back years. The latest
> > conversation was around getting this series into linux-next soon to get
> > some testing on the MM pieces. I really appreciate getting this ABI
> > feedback as it is always tricky to get right, but at this stage I would
> > hope to be focusing mostly on concrete problems.
> >
> > I also expect to have some amount of ABI growth going forward with all
> > the normal things that entails. Shadow stack is not special in that it
> > can come fully finalized without the need for the real world usage
> > iterative feedback process. At some point we need to move forward with
> > something, and we have quite a bit of initial changes at this point.
> >
> > So I would like to minimize the initial implementation unless anyone
> > sees any likely problems with future growth. Can you be clear if you
> > see any concrete problems at this point or are more looking to evaluate
> > the design reasoning? I'm under the assumption there is nothing that
> > would prohibit linux-next testing while any ABI shakedown happens
> > concurrently at least?
>
> understood.
>
> the points that i think are worth raising:
>
> - shadow stack size logic may need to change later.
>   (it can be too big, or too small in practice.)
> - shadow stack overflow is not recoverable and the
>   possible fix for that (sigaltshstk) breaks longjmp
>   out of signal handlers.
> - jump back after SS_AUTODISARM swapcontext cannot be
>   reliable if alt signal uses thread shadow stack.
> - the above two concerns may be mitigated by different
>   sigaltstack behaviour which may be hard to add later.
> - end token for backtrace may be useful, if added
>   later it can be hard to check.
>
> thanks.



-- 
H.J.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-03 16:57             ` H.J. Lu
@ 2023-03-03 17:39               ` szabolcs.nagy
  2023-03-03 17:50                 ` H.J. Lu
  0 siblings, 1 reply; 159+ messages in thread
From: szabolcs.nagy @ 2023-03-03 17:39 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Eranian, Stephane,
	kirill.shutemov, dave.hansen, linux-mm, fweimer, nadav.amit,
	jannh, dethoma, broonie, kcc, linux-arch, bp, oleg, Yang,
	Weijiang, Lutomirski, Andy, pavel, arnd, tglx, Schimpe,
	Christina, mike.kravetz, x86, linux-doc, debug, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov, akpm, Yu, Yu-cheng, nd

The 03/03/2023 08:57, H.J. Lu wrote:
> On Fri, Mar 3, 2023 at 8:31 AM szabolcs.nagy@arm.com
> <szabolcs.nagy@arm.com> wrote:
> > longjmp to different stack should work: it can do the same as
> > setcontext/swapcontext: scan for the pivot token. then only
> > longjmp out of alt shadow stack fails. (this is non-conforming
> > longjmp use, but e.g. qemu relies on it.)
> 
> A restore token may not be used with longjmp.  Unlike
> setcontext/swapcontext, longjmp is optional.  If longjmp isn't called,
> there will be an extra token on the shadow stack and RET will fail.

what do you mean longjmp is optional?

it can scan the target shadow stack and decide if it's the
same as the current one or not and in the latter case there
should be a restore token to switch to. then it can INCSSP
to reach the target SSP state.

qemu does setjmp, then swapcontext, then longjmp back.
swapcontext can change the stack, but leaves a token behind
so longjmp can switch back.
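
a minimal sketch of that qemu-style pattern (names and the stack size
are illustrative, error handling omitted):

        #include <setjmp.h>
        #include <stdio.h>
        #include <ucontext.h>

        static jmp_buf env;
        static ucontext_t co, main_ctx;
        static char co_stack[64 * 1024];

        static void co_entry(void)
        {
                /* on the makecontext() stack now; jump back to the
                 * setjmp() point on the thread stack. with shadow
                 * stack this relies on longjmp finding the restore
                 * token that swapcontext left behind. */
                longjmp(env, 1);
        }

        int main(void)
        {
                getcontext(&co);
                co.uc_stack.ss_sp = co_stack;
                co.uc_stack.ss_size = sizeof(co_stack);
                makecontext(&co, co_entry, 0);

                if (!setjmp(env))
                        swapcontext(&main_ctx, &co);

                puts("back on the thread stack");
                return 0;
        }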

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-03 16:30           ` szabolcs.nagy
  2023-03-03 16:57             ` H.J. Lu
@ 2023-03-03 17:41             ` Edgecombe, Rick P
  1 sibling, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-03 17:41 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, szabolcs.nagy,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	broonie, kcc, linux-arch, bp, oleg, hjl.tools, Yang, Weijiang,
	Lutomirski, Andy, pavel, arnd, tglx, Schimpe, Christina,
	mike.kravetz, x86, linux-doc, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: Yu, Yu-cheng, nd

On Fri, 2023-03-03 at 16:30 +0000, szabolcs.nagy@arm.com wrote:
> the points that i think are worth raising:
> 
> - shadow stack size logic may need to change later.
>   (it can be too big, or too small in practice.)

Looking at making it more efficient in the future seems great. But we
are not in the position of being able to make shadow stacks completely
seamless (see below).

> - shadow stack overflow is not recoverable and the
>   possible fix for that (sigaltshstk) breaks longjmp
>   out of signal handlers.
> - jump back after SS_AUTODISARM swapcontext cannot be
>   reliable if alt signal uses thread shadow stack.
> - the above two concerns may be mitigated by different
>   sigaltstack behaviour which may be hard to add later.

Are you aware that you can't simply emit a restore token on x86 without
first restoring to another restore token? This is why (I'm assuming)
glibc uses incssp to implement longjmp instead of just jumping back to
the setjmp point with a shadow stack restore. So of course then longjmp
can't jump between shadow stacks. So there are sort of two categories
of restrictions on binaries that mark the SHSTK elf bit. The first
category is that they have to take special steps when switching stacks
or jumping around on the stack. Once they handle this, they can work
with shadow stack.

The second category is that they can't do certain patterns of jumping
around on stacks, regardless of the steps they take. So certain
previously allowed software patterns are now impossible, including ones
implemented in glibc. (And the exact restrictions on the glibc APIs are
not documented and this should be fixed).

If applications will violate either type of these restrictions they
should not mark the SHSTK elf bit.

Now that said, there is an exception to these restrictions on x86,
which is the WRSS instruction, which can write to the shadow stack. The
arch_prctl() interface allows this to be optionally enabled and locked.
The v2 signal analysis I pointed to earlier mentions how this might be
used by glibc to support more of the currently restricted patterns.
Please take a look if you haven't (section "setjmp()/longjmp()"). It
also explains why in the non-WRSS scenarios the kernel can't easily
help improve the situation.
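
As a rough sketch, the opt-in looks something like this (using the
ARCH_SHSTK_* constants from this series' uapi header):

        #include <asm/prctl.h>          /* ARCH_SHSTK_* from this series */
        #include <sys/syscall.h>
        #include <unistd.h>

        static void enable_and_lock_wrss(void)
        {
                /* shadow stack itself must already be enabled */
                syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS);

                /* lock the feature set so it can't be changed later */
                syscall(SYS_arch_prctl, ARCH_SHSTK_LOCK,
                        ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS);
        }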

WRSS opens up writing to the shadow stack, and so a glibc-WRSS mode
would be making a security/compatibility tradeoff. I think starting
with the more restricted mode was ultimately good in creating a kernel
ABI that can support both. If userspace could paper over ABI gaps with
WRSS, we might not have realized the issues we did.

> - end token for backtrace may be useful, if added
>   later it can be hard to check.

Yes this seems like a good idea. Thanks for the suggestion. I'm not
sure it can't be added later though. I'll POC it and do some more
thinking.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-03 17:39               ` szabolcs.nagy
@ 2023-03-03 17:50                 ` H.J. Lu
  0 siblings, 0 replies; 159+ messages in thread
From: H.J. Lu @ 2023-03-03 17:50 UTC (permalink / raw)
  To: szabolcs.nagy
  Cc: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Eranian, Stephane,
	kirill.shutemov, dave.hansen, linux-mm, fweimer, nadav.amit,
	jannh, dethoma, broonie, kcc, linux-arch, bp, oleg, Yang,
	Weijiang, Lutomirski, Andy, pavel, arnd, tglx, Schimpe,
	Christina, mike.kravetz, x86, linux-doc, debug, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov, akpm, Yu, Yu-cheng, nd

On Fri, Mar 3, 2023 at 9:40 AM szabolcs.nagy@arm.com
<szabolcs.nagy@arm.com> wrote:
>
> The 03/03/2023 08:57, H.J. Lu wrote:
> > On Fri, Mar 3, 2023 at 8:31 AM szabolcs.nagy@arm.com
> > <szabolcs.nagy@arm.com> wrote:
> > > longjmp to different stack should work: it can do the same as
> > > setcontext/swapcontext: scan for the pivot token. then only
> > > longjmp out of alt shadow stack fails. (this is non-conforming
> > > longjmp use, but e.g. qemu relies on it.)
> >
> > A restore token may not be used with longjmp.  Unlike
> > setcontext/swapcontext, longjmp is optional.  If longjmp isn't called,
> > there will be an extra token on the shadow stack and RET will fail.
>
> what do you mean longjmp is optional?

In some cases, longjmp is called to handle an error condition and
longjmp won't be called if there is no error.

> it can scan the target shadow stack and decide if it's the
> same as the current one or not and in the latter case there
> should be a restore token to switch to. then it can INCSSP
> to reach the target SSP state.
>
> qemu does setjmp, then swapcontext, then longjmp back.
> swapcontext can change the stack, but leaves a token behind
> so longjmp can switch back.

This needs changes to support shadow stack.  Replacing setjmp with
getcontext and longjmp with setcontext may work for shadow stack.
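
Something along these lines (a minimal sketch, error handling
omitted):

        #include <ucontext.h>

        static ucontext_t err_ctx;
        static volatile int failed;

        static void fail(void)
        {
                failed = 1;
                setcontext(&err_ctx);   /* in place of longjmp() */
        }

        static int run(void (*work)(void))
        {
                failed = 0;
                getcontext(&err_ctx);   /* in place of setjmp() */
                if (failed)
                        return -1;
                work();                 /* may call fail() on error */
                return 0;
        }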

BTW, there is no testcase in glibc for this usage.

-- 
H.J.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-02 16:34         ` szabolcs.nagy
@ 2023-03-03 22:35           ` Edgecombe, Rick P
  2023-03-06 16:20             ` szabolcs.nagy
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-03 22:35 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, szabolcs.nagy,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	broonie, kcc, linux-arch, bp, oleg, hjl.tools, Yang, Weijiang,
	Lutomirski, Andy, pavel, arnd, tglx, Schimpe, Christina,
	mike.kravetz, x86, linux-doc, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: Yu, Yu-cheng, nd

On Thu, 2023-03-02 at 16:34 +0000, szabolcs.nagy@arm.com wrote:
> > Alternatively, the thread shadow stacks could get an already used
> > token
> > pushed at the end, to try to match what an in-use map_shadow_stack
> > shadow stack would look like. Then the backtracing algorithm could
> > just
> > look for the same token in both cases. It might get confused in
> > exotic
> > cases and mistake a token in the middle of the stack for the end of
> > the
> > allocation though. Hmm...
> 
> a backtracer would search for an end token on an active shadow
> stack. it should be able to skip other tokens that don't seem
> to be code addresses. the end token needs to be identifiable
> and not break security properties. i think it's enough if the
> backtrace is best effort correct, there can be corner-cases when
> shadow stack is difficult to interpret, but e.g. a profiler can
> still make good use of this feature.

So just taking a look at this and remembering we used to have an
arch_prctl() that returned the thread's shadow stack base and size.
Glibc needed it, but we found a way around and dropped it. If we added
something like that back, then it could be used for backtracing in the
typical thread case and also potentially similar things to what glibc
was doing. This also saves ~8 bytes per shadow stack over an end-of-
stack marker, so it's a tiny bit better on memory use.

For the end-of-stack-marker solution:
In the case of thread shadow stacks, I'm not seeing any issues when
testing the addition of markers at the end. So adding this on top of the
existing series for just thread shadow stacks seems a low regression
risk, especially if we do it in the near term.

For ucontext/map_shadow_stack, glibc expects a token to be at the size
passed in. So we would either have to create a larger allocation (to
include the marker) or create a new map_shadow_stack flag to do this
(it was expected that there might be new types of initial shadow stack
data that the kernel might need to create). It is also possible to pass
a non-page-aligned size and get zeros at the end of the allocation. In
fact glibc does this today in the common case. So that is also an
option.
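
For reference, creating one of these via the new syscall looks roughly
like the below (SHADOW_STACK_SET_TOKEN and __NR_map_shadow_stack come
from this series' headers):

        #include <stddef.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static void *alloc_shstk(size_t size)
        {
                /* addr = 0: let the kernel pick (above 4G) */
                return (void *)syscall(__NR_map_shadow_stack, 0, size,
                                       SHADOW_STACK_SET_TOKEN);
        }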

I think I slightly prefer the former arch_prctl() based solution for a
few reasons:
 - When you need to find the start or end of the shadow stack, you can
just ask for it instead of searching. It can be faster and simpler.
 - It saves 8 bytes of memory per shadow stack.

If this turns out to be wrong and we want to do the marker solution
much later at some point, the safest option would probably be to create
new flags.

But just discussing this with HJ, can you share more on what the usage
is? Like which backtracing operation specifically needs the marker? How
much does it care about the ucontext case?
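
For illustration, the kind of best-effort scan you describe might look
roughly like this (the end marker value is an assumption here, since
its encoding isn't settled):

        #include <stddef.h>
        #include <stdint.h>

        #define SHSTK_END_MARKER 0      /* assumed terminator value */

        static size_t shstk_backtrace(const uint64_t *ssp, size_t nents,
                                      uint64_t *buf, size_t max)
        {
                size_t n = 0;

                for (size_t i = 0; i < nents && n < max; i++) {
                        if (ssp[i] == SHSTK_END_MARKER)
                                break;          /* end of the allocation */
                        if (ssp[i] & 0x7)
                                continue;       /* token, not a code address */
                        buf[n++] = ssp[i];
                }
                return n;
        }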

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 21/41] mm: Add guard pages around a shadow stack.
  2023-02-27 22:29 ` [PATCH v7 21/41] mm: Add guard pages around a shadow stack Rick Edgecombe
@ 2023-03-06  8:08   ` Borislav Petkov
  2023-03-07  1:29     ` Edgecombe, Rick P
  2023-03-17 17:09   ` Deepak Gupta
  1 sibling, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-06  8:08 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu

Just typos:

On Mon, Feb 27, 2023 at 02:29:37PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> The architecture of shadow stack constrains the ability of userspace to
> move the shadow stack pointer (SSP) in order to  prevent corrupting or
> switching to other shadow stacks. The RSTORSSP can move the ssp to
						^
						instruction

s/ssp/SSP/g


> different shadow stacks, but it requires a specially placed token in order
> to do this. However, the architecture does not prevent incrementing the
> stack pointer to wander onto an adjacent shadow stack. To prevent this in
> software, enforce guard pages at the beginning of shadow stack vmas, such

VMAs

> that there will always be a gap between adjacent shadow stacks.
> 
> Make the gap big enough so that no userspace SSP changing operations
> (besides RSTORSSP), can move the SSP from one stack to the next. The
> SSP can increment or decrement by CALL, RET  and INCSSP. CALL and RET

"can be incremented or decremented"

> can move the SSP by a maximum of 8 bytes, at which point the shadow
> stack would be accessed.
> 
> The INCSSP instruction can also increment the shadow stack pointer. It
> is the shadow stack analog of an instruction like:
> 
> 	addq    $0x80, %rsp
> 
> However, there is one important difference between an ADD on %rsp and
> INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
> of the first and last elements that were "popped". It can be thought of
> as acting like this:
> 
> READ_ONCE(ssp);       // read+discard top element on stack
> ssp += nr_to_pop * 8; // move the shadow stack
> READ_ONCE(ssp-8);     // read+discard last popped stack element
> 
> The maximum distance INCSSP can move the SSP is 2040 bytes, before it
> would read the memory. Therefore a single page gap will be enough to
				  ^
				  ,


> prevent any operation from shifting the SSP to an adjacent stack, since
> it would have to land in the gap at least once, causing a fault.
> 
> This could be accomplished by using VM_GROWSDOWN, but this has a
> downside. The behavior would allow shadow stack's to grow, which is

s/stack's/stacks/

> unneeded and adds a strange difference to how most regular stacks work.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> 
> ---
> v5:
>  - Fix typo in commit log
> 
> v4:
>  - Drop references to 32 bit instructions
>  - Switch to generic code to drop __weak (Peterz)
> 
> v2:
>  - Use __weak instead of #ifdef (Dave Hansen)
>  - Only have start gap on shadow stack (Andy Luto)
>  - Create stack_guard_start_gap() to not duplicate code
>    in an arch version of vm_start_gap() (Dave Hansen)
>  - Improve commit log partly with verbiage from (Dave Hansen)
> 
> Yu-cheng v25:
>  - Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.
> ---
>  include/linux/mm.h | 31 ++++++++++++++++++++++++++-----
>  1 file changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 097544afb1aa..6a093daced88 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3107,15 +3107,36 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
>  	return mtree_load(&mm->mm_mt, addr);
>  }
>  
> +static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_flags & VM_GROWSDOWN)
> +		return stack_guard_gap;
> +
> +	/*
> +	 * Shadow stack pointer is moved by CALL, RET, and INCSSPQ.
> +	 * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
> +	 * and touches the first and the last element in the range, which
> +	 * triggers a page fault if the range is not in a shadow stack.
> +	 * Because of this, creating 4-KB guard pages around a shadow
> +	 * stack prevents these instructions from going beyond.

I'd prefer the equivalent explanation above from the commit message - it
is more precise.

> +	 *
> +	 * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
> +	 * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
> +	 */
> +	if (vma->vm_flags & VM_SHADOW_STACK)
> +		return PAGE_SIZE;
> +
> +	return 0;
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-02-27 22:29 ` [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
@ 2023-03-06 13:01   ` Borislav Petkov
  2023-03-06 18:11     ` Edgecombe, Rick P
  2023-03-07 10:42   ` David Hildenbrand
  2023-03-17 17:12   ` Deepak Gupta
  2 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-06 13:01 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu

On Mon, Feb 27, 2023 at 02:29:38PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> Account shadow stack pages to stack memory. Do this by adding a
> VM_SHADOW_STACK check in is_stack_mapping().

That last sentence is superfluous.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory
  2023-02-27 22:29 ` [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
@ 2023-03-06 13:10   ` Borislav Petkov
  2023-03-06 18:15     ` Andy Lutomirski
  2023-03-17 17:05   ` Deepak Gupta
  1 sibling, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-06 13:10 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:40PM -0800, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> Shadow stack memory is writable only in very specific, controlled ways.
> However, since it is writable, the kernel treats it as such. As a result
									  ^
									  ,

> there remain many ways for userspace to trigger the kernel to write to
> shadow stack's via get_user_pages(, FOLL_WRITE) operations. To make this a

"stacks"

or "to write to a shadow stack via..."

> little less exposed, block writable GUPs for shadow stack VMAs.

GUPs?

I suppose this means "prevent get_user_pages() from pinning pages
whose corresponding VMA is a shadow stack one"?

Or something like that which is less mm-internal speak...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-03 22:35           ` Edgecombe, Rick P
@ 2023-03-06 16:20             ` szabolcs.nagy
  2023-03-06 16:31               ` Florian Weimer
  2023-03-06 18:05               ` Edgecombe, Rick P
  0 siblings, 2 replies; 159+ messages in thread
From: szabolcs.nagy @ 2023-03-06 16:20 UTC (permalink / raw)
  To: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Eranian, Stephane,
	kirill.shutemov, dave.hansen, linux-mm, fweimer, nadav.amit,
	jannh, dethoma, broonie, kcc, linux-arch, bp, oleg, hjl.tools,
	Yang, Weijiang, Lutomirski, Andy, pavel, arnd, tglx, Schimpe,
	Christina, mike.kravetz, x86, linux-doc, debug, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov, akpm
  Cc: Yu, Yu-cheng, nd

The 03/03/2023 22:35, Edgecombe, Rick P wrote:
> I think I slightly prefer the former arch_prctl() based solution for a
> few reasons:
>  - When you need to find the start or end of the shadow stack, you can
> just ask for it instead of searching. It can be faster and simpler.
>  - It saves 8 bytes of memory per shadow stack.
> 
> If this turns out to be wrong and we want to do the marker solution
> much later at some point, the safest option would probably be to create
> new flags.

i see two problems with a get bounds syscall:

- syscall overhead.

- discontinuous shadow stack (e.g. alt shadow stack ends with a
  pointer to the interrupted thread shadow stack, so stack trace
  can continue there, except you don't know the bounds of that).

> But just discussing this with HJ, can you share more on what the usage
> is? Like which backtracing operation specifically needs the marker? How
> much does it care about the ucontext case?

it could be an option for perf or ptracers to sample the stack trace.

in-process collection of stack trace for profiling or crash reporting
(e.g. when stack is corrupted) or cross checking stack integrity may
use it too.

sometimes parsing /proc/self/smaps may be enough, but the idea was to
enable light-weight backtrace collection in an async-signal-safe way.

syscall overhead in case of frequent stack trace collection can be
avoided by caching (in tls) when ssp falls within the thread shadow
stack bounds. otherwise caching does not work as the shadow stack may
be reused (alt shadow stack or ucontext case).
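
a sketch of that caching idea (shstk_get_bounds() stands in for a
hypothetical bounds query, get_ssp() for an rdsspq read):

        #include <stdint.h>

        struct shstk_bounds {
                uint64_t base;
                uint64_t size;
        };

        extern int shstk_get_bounds(struct shstk_bounds *b); /* hypothetical */
        extern uint64_t get_ssp(void);                       /* e.g. rdsspq */

        static __thread struct shstk_bounds cache;      /* per-thread, in tls */

        static int ssp_on_thread_shstk(void)
        {
                uint64_t ssp = get_ssp();

                /* fast path: no syscall while ssp stays in cached bounds */
                if (cache.size && ssp - cache.base < cache.size)
                        return 1;

                if (shstk_get_bounds(&cache) != 0)
                        return 0;

                return ssp - cache.base < cache.size;
        }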

unfortunately i don't know if syscall overhead is actually a problem
(probably not) or if backtrace across signal handlers needs to work
with alt shadow stack (i guess it should work for crash reporting).

thanks.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-06 16:20             ` szabolcs.nagy
@ 2023-03-06 16:31               ` Florian Weimer
  2023-03-06 18:08                 ` Edgecombe, Rick P
  2023-03-06 18:05               ` Edgecombe, Rick P
  1 sibling, 1 reply; 159+ messages in thread
From: Florian Weimer @ 2023-03-06 16:31 UTC (permalink / raw)
  To: szabolcs.nagy
  Cc: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Eranian, Stephane,
	kirill.shutemov, dave.hansen, linux-mm, nadav.amit, jannh,
	dethoma, broonie, kcc, linux-arch, bp, oleg, hjl.tools, Yang,
	Weijiang, Lutomirski, Andy, pavel, arnd, tglx, Schimpe,
	Christina, mike.kravetz, x86, linux-doc, debug, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov, akpm, Yu, Yu-cheng, nd

* szabolcs:

> syscall overhead in case of frequent stack trace collection can be
> avoided by caching (in tls) when ssp falls within the thread shadow
> stack bounds. otherwise caching does not work as the shadow stack may
> be reused (alt shadow stack or ucontext case).

Do we need to perform the system call at each page boundary only?  That
should reduce overhead to the degree that it should not matter.

> unfortunately i don't know if syscall overhead is actually a problem
> (probably not) or if backtrace across signal handlers need to work
> with alt shadow stack (i guess it should work for crash reporting).

Ideally, we would implement the backtrace function (in glibc) as just a
shadow stack copy.  But this needs to follow the chain of alternate
stacks, and it may also need some form of markup for signal handler
frames (which need program counter adjustment to reflect that a
*non-signal* frame is conceptually nested within the previous
instruction, and not the function the return address points to).  But I
think we can add support for this incrementally.

I assume there is no desire at all on the kernel side that sigaltstack
transparently allocates the shadow stack?  Because there is no
deallocation function today for sigaltstack?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-06 16:20             ` szabolcs.nagy
  2023-03-06 16:31               ` Florian Weimer
@ 2023-03-06 18:05               ` Edgecombe, Rick P
  2023-03-06 20:31                 ` Liang, Kan
  1 sibling, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-06 18:05 UTC (permalink / raw)
  To: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, szabolcs.nagy,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	broonie, kcc, linux-arch, bp, oleg, hjl.tools, Yang, Weijiang,
	Lutomirski, Andy, pavel, arnd, tglx, Schimpe, Christina,
	mike.kravetz, x86, linux-doc, debug, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm
  Cc: Liang, Kan, Yu, Yu-cheng, nd

+Kan for shadow stack perf discussion.

On Mon, 2023-03-06 at 16:20 +0000, szabolcs.nagy@arm.com wrote:
> The 03/03/2023 22:35, Edgecombe, Rick P wrote:
> > I think I slightly prefer the former arch_prctl() based solution
> > for a
> > few reasons:
> >   - When you need to find the start or end of the shadow stack,
> > you can just ask for it instead of searching. It can be faster and
> > simpler.
> >   - It saves 8 bytes of memory per shadow stack.
> > 
> > If this turns out to be wrong and we want to do the marker solution
> > much later at some point, the safest option would probably be to
> > create
> > new flags.
> 
> i see two problems with a get bounds syscall:
> 
> - syscall overhead.
> 
> - discontinuous shadow stack (e.g. alt shadow stack ends with a
>   pointer to the interrupted thread shadow stack, so stack trace
>   can continue there, except you don't know the bounds of that).
> 
> > But just discussing this with HJ, can you share more on what the
> > usage
> > is? Like which backtracing operation specifically needs the marker?
> > How
> > much does it care about the ucontext case?
> 
> it could be an option for perf or ptracers to sample the stack trace.
> 
> in-process collection of stack trace for profiling or crash reporting
> (e.g. when stack is corrupted) or cross checking stack integrity may
> use it too.
> 
> sometimes parsing /proc/self/smaps may be enough, but the idea was to
> enable light-weight backtrace collection in an async-signal-safe way.
> 
> syscall overhead in case of frequent stack trace collection can be
> avoided by caching (in tls) when ssp falls within the thread shadow
> stack bounds. otherwise caching does not work as the shadow stack may
> be reused (alt shadow stack or ucontext case).
> 
> unfortunately i don't know if syscall overhead is actually a problem
> (probably not) or if backtrace across signal handlers needs to work
> with alt shadow stack (i guess it should work for crash reporting).

There was a POC done of perf integration. I'm not too knowledgeable on
perf, but the patch itself didn't need any new shadow stack bounds ABI.
Since it was implemented in the kernel, it could just refer to the
kernel's internal data for the thread's shadow stack bounds.

I asked about ucontext (similar to alt shadow stacks with regard to the
lack of a bounds ABI), and apparently perf usually focuses on the thread
stacks. Hopefully Kan can lend some more confidence to that assertion.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-06 16:31               ` Florian Weimer
@ 2023-03-06 18:08                 ` Edgecombe, Rick P
  2023-03-07 13:03                   ` szabolcs.nagy
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-06 18:08 UTC (permalink / raw)
  To: fweimer, szabolcs.nagy
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, nadav.amit, jannh, dethoma, broonie, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, tglx, Schimpe, Christina, debug, x86,
	linux-doc, mike.kravetz, pavel, andrew.cooper3, john.allen, rppt,
	nd, mingo, corbet, linux-kernel, linux-api, gorcunov, akpm

On Mon, 2023-03-06 at 17:31 +0100, Florian Weimer wrote:
> Ideally, we would implement the backtrace function (in glibc) as just
> a
> shadow stack copy.  But this needs to follow the chain of alternate
> stacks, and it may also need some form of markup for signal handler
> frames (which need program counter adjustment to reflect that a
> *non-signal* frame is conceptually nested within the previous
> instruction, and not the function the return address points to).

In the alt shadow stack case, the shadow stack sigframe will have a
special shadow stack frame with a pointer to the shadow stack stack it
came from. This may be a thread stack, or some other stack. This
writeup in the v2 of the series has more details and analysis on the
signal piece:

https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

So in that design, you should be able to backtrace out of a chain of
alt stacks.

>   But I
> think we can add support for this incrementally.

Yea, I think so too.

> 
> I assume there is no desire at all on the kernel side that
> sigaltstack
> transparently allocates the shadow stack?  

It could have some nice benefit for some apps, so I did look into it.

> Because there is no
> deallocation function today for sigaltstack?

Yea, this is why we can't do it transparently. There was some
discussion up the thread on this.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 25/41] x86/mm: Introduce MAP_ABOVE4G
  2023-02-27 22:29 ` [PATCH v7 25/41] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
@ 2023-03-06 18:09   ` Borislav Petkov
  2023-03-07  1:10     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-06 18:09 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:41PM -0800, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which require some core mm changes to function
> properly.
> 
> One of the properties is that the shadow stack pointer (SSP), which is a
> CPU register that points to the shadow stack like the stack pointer points
> to the stack, can't be pointing outside of the 32 bit address space when
> the CPU is executing in 32 bit mode. It is desirable to prevent executing
> in 32 bit mode when shadow stack is enabled because the kernel can't easily
> support 32 bit signals.
> 
> On x86 it is possible to transition to 32 bit mode without any special
> interaction with the kernel, by doing a "far call" to a 32 bit segment.
> So the shadow stack implementation can use this address space behavior
> as a feature, by enforcing that shadow stack memory is always crated
								^^^^^^^

"created"

and I'd say "mapped" or "allocated" here. "Created" sounds weird.

> outside of the 32 bit address space. This way userspace will trigger a
> general protection fault which will in turn trigger a segfault if it
> tries to transition to 32 bit mode with shadow stack enabled.
> 
> This provides a clean error generating border for the user if they
> attempt to do 32 bit mode shadow stack, rather than leave the kernel in a
> half working state for userspace to be surprised by.
> 
> So to allow future shadow stack enabling patches to map shadow stacks
> out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior

I guess this needs to be documented in the mmap() manpage too.

> is pretty much like MAP_32BIT, except that it has the opposite address
> range. There are a few differences though.
> 
> If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the
> MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit
> syscall.
> 
> Since the default search behavior is top down, the normal kaslr base can
> be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add it's
								     ^^^^


"its"

> own randomization in the bottom up case.

...

> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 8cc653ffdccd..06378b5682c1 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -193,7 +193,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
> -	info.low_limit = PAGE_SIZE;
> +	if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
> +		info.low_limit = 0x100000000;

We have a human readable define for that: SZ_4G

> +	else
> +		info.low_limit = PAGE_SIZE;
> +
>  	info.high_limit = get_mmap_base(0);
>  
>  	/*
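
IOW, something like this (untested sketch; SZ_4G comes from
<linux/sizes.h>):

	if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
		info.low_limit = SZ_4G;
	else
		info.low_limit = PAGE_SIZE;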

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-03-06 13:01   ` Borislav Petkov
@ 2023-03-06 18:11     ` Edgecombe, Rick P
  2023-03-06 18:16       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-06 18:11 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Mon, 2023-03-06 at 14:01 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:38PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > The x86 Control-flow Enforcement Technology (CET) feature includes
> > a new
> > type of memory called shadow stack. This shadow stack memory has
> > some
> > unusual properties, which requires some core mm changes to function
> > properly.
> > 
> > Account shadow stack pages to stack memory. Do this by adding a
> > VM_SHADOW_STACK check in is_stack_mapping().
> 
> That last sentence is superfluous.

Before this version it was open coded, but David Hildenbrand suggested
this is_stack_mapping() solution. Should it be explained more, or just
dropped?
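
For reference, the check it boils down to is roughly this in
mm/internal.h (sketch, the diff has the exact form):

	static inline bool is_stack_mapping(vm_flags_t flags)
	{
		return ((flags & VM_STACK) == VM_STACK) ||
		       (flags & VM_SHADOW_STACK);
	}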

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory
  2023-03-06 13:10   ` Borislav Petkov
@ 2023-03-06 18:15     ` Andy Lutomirski
  2023-03-06 18:33       ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2023-03-06 18:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H . J . Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Weijiang Yang, Kirill A . Shutemov,
	John Allen, kcc, eranian, rppt, jamorris, dethoma, akpm,
	Andrew.Cooper3, christina.schimpe, david, debug

On Mon, Mar 6, 2023 at 5:10 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, Feb 27, 2023 at 02:29:40PM -0800, Rick Edgecombe wrote:
> > The x86 Control-flow Enforcement Technology (CET) feature includes a new
> > type of memory called shadow stack. This shadow stack memory has some
> > unusual properties, which requires some core mm changes to function
> > properly.
> >
> > Shadow stack memory is writable only in very specific, controlled ways.
> > However, since it is writable, the kernel treats it as such. As a result
>                                                                           ^
>                                                                           ,
>
> > there remain many ways for userspace to trigger the kernel to write to
> > shadow stack's via get_user_pages(, FOLL_WRITE) operations. To make this a

Is there an alternate mechanism, or do we still want to allow
FOLL_FORCE so that debuggers can write it?

--Andy

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-03-06 18:11     ` Edgecombe, Rick P
@ 2023-03-06 18:16       ` Borislav Petkov
  0 siblings, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-06 18:16 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Mon, Mar 06, 2023 at 06:11:32PM +0000, Edgecombe, Rick P wrote:
> Before this version it was open coded, but David Hildenbrand suggested
> this is_stack_mapping() solution. Should it be explained more, or just
> dropped?

Well, "adding a VM_SHADOW_STACK check in is_stack_mapping()" is what's
in the diff already and it is kinda obvious. So why write what the patch
does when one can simply look at the diff?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory
  2023-03-06 18:15     ` Andy Lutomirski
@ 2023-03-06 18:33       ` Edgecombe, Rick P
  2023-03-06 18:57         ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-06 18:33 UTC (permalink / raw)
  To: Lutomirski, Andy, bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, linux-doc, arnd,
	jamorris, tglx, Schimpe, Christina, mike.kravetz, x86, akpm,
	debug, andrew.cooper3, rppt, john.allen, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2023-03-06 at 10:15 -0800, Andy Lutomirski wrote:
> On Mon, Mar 6, 2023 at 5:10 AM Borislav Petkov <bp@alien8.de> wrote:
> > 
> > On Mon, Feb 27, 2023 at 02:29:40PM -0800, Rick Edgecombe wrote:
> > > The x86 Control-flow Enforcement Technology (CET) feature
> > > includes a new
> > > type of memory called shadow stack. This shadow stack memory has
> > > some
> > > unusual properties, which requires some core mm changes to
> > > function
> > > properly.
> > > 
> > > Shadow stack memory is writable only in very specific, controlled
> > > ways.
> > > However, since it is writable, the kernel treats it as such. As a
> > > result
> > 
> >                                                                    
> >         ^
> >                                                                    
> >         ,
> > 
> > > there remain many ways for userspace to trigger the kernel to
> > > write to
> > > shadow stack's via get_user_pages(, FOLL_WRITE) operations. To
> > > make this a
> 
> Is there an alternate mechanism, or do we still want to allow
> FOLL_FORCE so that debuggers can write it?

Yes, GDB shadow stack support uses it via both ptrace poke and
/proc/pid/mem apparently. So some ability to write through is needed
for debuggers. But not CRIU actually. It uses WRSS.

There was also some discussion[0] previously about how apps might
prefer to block /proc/self/mem for general security reasons. Blocking
shadow stack writes while you allow text writes is probably not that
impactful security-wise. So I thought it would be better to leave the
logic simpler. Then when /proc/self/mem could be locked down per the
discussion, shadow stack can be locked down the same way.

[0] 
https://lore.kernel.org/lkml/E857CF98-EEB2-4F83-8305-0A52B463A661@kernel.org/

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory
  2023-03-06 18:33       ` Edgecombe, Rick P
@ 2023-03-06 18:57         ` Andy Lutomirski
  2023-03-07  1:47           ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2023-03-06 18:57 UTC (permalink / raw)
  To: Rick P Edgecombe, Borislav Petkov
  Cc: David Hildenbrand, Balbir Singh, H. Peter Anvin,
	Eugene Syromiatnikov, Peter Zijlstra (Intel),
	Randy Dunlap, Kees Cook, Dave Hansen, Kirill A. Shutemov,
	Eranian, Stephane, linux-mm, Florian Weimer, Nadav Amit,
	Jann Horn, dethoma, kcc, linux-arch, Pavel Machek, Oleg Nesterov,
	H.J. Lu, Weijiang Yang, linux-doc, Arnd Bergmann, jamorris,
	Thomas Gleixner, Schimpe, Christina, Mike Kravetz,
	the arch/x86 maintainers, Andrew Morton, debug, Andrew Cooper,
	Mike Rapoport, john.allen, Ingo Molnar, Jonathan Corbet,
	Linux Kernel Mailing List, Linux API, Cyrill Gorcunov

On Mon, Mar 6, 2023, at 10:33 AM, Edgecombe, Rick P wrote:
> On Mon, 2023-03-06 at 10:15 -0800, Andy Lutomirski wrote:
>> On Mon, Mar 6, 2023 at 5:10 AM Borislav Petkov <bp@alien8.de> wrote:
>> > 
>> > On Mon, Feb 27, 2023 at 02:29:40PM -0800, Rick Edgecombe wrote:
>> > > The x86 Control-flow Enforcement Technology (CET) feature
>> > > includes a new
>> > > type of memory called shadow stack. This shadow stack memory has
>> > > some
>> > > unusual properties, which requires some core mm changes to
>> > > function
>> > > properly.
>> > > 
>> > > Shadow stack memory is writable only in very specific, controlled
>> > > ways.
>> > > However, since it is writable, the kernel treats it as such. As a
>> > > result
>> > 
>> >                                                                    
>> >         ^
>> >                                                                    
>> >         ,
>> > 
>> > > there remain many ways for userspace to trigger the kernel to
>> > > write to
>> > > shadow stack's via get_user_pages(, FOLL_WRITE) operations. To
>> > > make this a
>> 
>> Is there an alternate mechanism, or do we still want to allow
>> FOLL_FORCE so that debuggers can write it?
>
> Yes, GDB shadow stack support uses it via both ptrace poke and
> /proc/pid/mem apparently. So some ability to write through is needed
> for debuggers. But not CRIU actually. It uses WRSS.
>
> There was also some discussion[0] previously about how apps might
> prefer to block /proc/self/mem for general security reasons. Blocking
> shadow stack writes while you allow text writes is probably not that
> impactful security-wise. So I thought it would be better to leave the
> logic simpler. Then when /proc/self/mem could be locked down per the
> discussion, shadow stack can be locked down the same way.

Ah, I am guilty of reading your changelog but not the code.

You said:

Shadow stack memory is writable only in very specific, controlled ways.
However, since it is writable, the kernel treats it as such. As a result
there remain many ways for userspace to trigger the kernel to write to
shadow stack's via get_user_pages(, FOLL_WRITE) operations. To make this a
little less exposed, block writable GUPs for shadow stack VMAs.

I read that as *denying* FOLL_FORCE.  Maybe clarify the changelog?

>
> [0] 
> https://lore.kernel.org/lkml/E857CF98-EEB2-4F83-8305-0A52B463A661@kernel.org/

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-06 18:05               ` Edgecombe, Rick P
@ 2023-03-06 20:31                 ` Liang, Kan
  0 siblings, 0 replies; 159+ messages in thread
From: Liang, Kan @ 2023-03-06 20:31 UTC (permalink / raw)
  To: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Eranian, Stephane,
	kirill.shutemov, szabolcs.nagy, dave.hansen, linux-mm, fweimer,
	nadav.amit, jannh, dethoma, broonie, kcc, linux-arch, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd, tglx,
	Schimpe, Christina, mike.kravetz, x86, linux-doc, debug,
	jamorris, john.allen, rppt, andrew.cooper3, mingo, corbet,
	linux-kernel, linux-api, gorcunov, akpm
  Cc: Liang, Kan, Yu, Yu-cheng, nd



On 2023-03-06 1:05 p.m., Edgecombe, Rick P wrote:
> +Kan for shadow stack perf discussion.
> 
> On Mon, 2023-03-06 at 16:20 +0000, szabolcs.nagy@arm.com wrote:
>> The 03/03/2023 22:35, Edgecombe, Rick P wrote:
>>> I think I slightly prefer the former arch_prctl() based solution
>>> for a
>>> few reasons:
>>>   - When you need to find the start or end of the shadow stack you
>>> can just ask for it instead of searching. It can be faster and
>>> simpler.
>>>   - It saves 8 bytes of memory per shadow stack.
>>>
>>> If this turns out to be wrong and we want to do the marker solution
>>> much later at some point, the safest option would probably be to
>>> create
>>> new flags.
>>
>> i see two problems with a get bounds syscall:
>>
>> - syscall overhead.
>>
>> - discontinuous shadow stack (e.g. alt shadow stack ends with a
>>   pointer to the interrupted thread shadow stack, so stack trace
>>   can continue there, except you don't know the bounds of that).
>>
>>> But just discussing this with HJ, can you share more on what the
>>> usage
>>> is? Like which backtracing operation specifically needs the marker?
>>> How
>>> much does it care about the ucontext case?
>>
>> it could be an option for perf or ptracers to sample the stack trace.
>>
>> in-process collection of stack trace for profiling or crash reporting
>> (e.g. when stack is corrupted) or cross checking stack integrity may
>> use it too.
>>
>> sometimes parsing /proc/self/smaps may be enough, but the idea was to
>> enable light-weight backtrace collection in an async-signal-safe way.
>>
>> syscall overhead in case of frequent stack trace collection can be
>> avoided by caching (in tls) when ssp falls within the thread shadow
>> stack bounds. otherwise caching does not work as the shadow stack may
>> be reused (alt shadow stack or ucontext case).
>>
>> unfortunately i don't know if syscall overhead is actually a problem
>> (probably not) or if backtraces across signal handlers need to work
>> with alt shadow stack (i guess it should work for crash reporting).
> 
> There was a POC done of perf integration. I'm not too knowledgeable on
> perf, but the patch itself didn't need any new shadow stack bounds ABI.
> Since it was implemented in the kernel, it could just refer to the
> kernel's internal data for the thread's shadow stack bounds.
> 
> I asked about ucontext (similar to alt shadow stacks in regards to lack
> of bounds ABI), and apparently perf usually focuses on the thread
> stacks. Hopefully Kan can lend some more confidence to that assertion.

The POC perf patch I implemented tries to use the shadow stack to
replace the frame pointer to construct a callchain of a user space
thread. Yes, it's in the kernel, perf_callchain_user(). I don't think
the current X86 perf implementation handles the alt stack either. So the
kernel internal data for the thread's shadow stack bounds should be good
enough for the perf case.
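
Roughly, the POC just walks the shadow stack as an array of return
addresses (simplified sketch, not the actual patch; signal frames etc.
need extra care):

	/* ssp/end: the thread's shadow stack bounds from the kernel */
	while (ssp < end && entry->nr < entry->max_stack) {
		u64 ret_addr;

		if (__get_user(ret_addr, (u64 __user *)ssp))
			break;
		perf_callchain_store(entry, ret_addr);
		ssp += sizeof(u64);
	}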

Thanks,
Kan

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 25/41] x86/mm: Introduce MAP_ABOVE4G
  2023-03-06 18:09   ` Borislav Petkov
@ 2023-03-07  1:10     ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-07  1:10 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Mon, 2023-03-06 at 19:09 +0100, Borislav Petkov wrote:
> > diff --git a/arch/x86/kernel/sys_x86_64.c
> > b/arch/x86/kernel/sys_x86_64.c
> > index 8cc653ffdccd..06378b5682c1 100644
> > --- a/arch/x86/kernel/sys_x86_64.c
> > +++ b/arch/x86/kernel/sys_x86_64.c
> > @@ -193,7 +193,11 @@ arch_get_unmapped_area_topdown(struct file
> > *filp, const unsigned long addr0,
> >   
> >        info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> >        info.length = len;
> > -     info.low_limit = PAGE_SIZE;
> > +     if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
> > +             info.low_limit = 0x100000000;
> 
> We have a human readable define for that: SZ_4G

Uhh, yes that's much better. And the typos. Thanks.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 21/41] mm: Add guard pages around a shadow stack.
  2023-03-06  8:08   ` Borislav Petkov
@ 2023-03-07  1:29     ` Edgecombe, Rick P
  2023-03-07 10:32       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-07  1:29 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Mon, 2023-03-06 at 09:08 +0100, Borislav Petkov wrote:
> Just typos:

All seem reasonable to me. Thanks. 

As for using the log verbiage for the comment, it is quite big. Does
something like this seem reasonable?

/*
 * The shadow stack pointer (SSP) is moved by CALL, RET, and INCSSPQ.
 * The INCSSP instruction can increment the shadow stack pointer. It
 * is the shadow stack analog of an instruction like:
 *
 *   addq $0x80, %rsp
 *
 * However, there is one important difference between an ADD on %rsp 
 * and INCSSP. In addition to modifying SSP, INCSSP also reads from the
 * memory of the first and last elements that were "popped". It can be
 * thought of as acting like this:
 *
 * READ_ONCE(ssp);       // read+discard top element on stack
 * ssp += nr_to_pop * 8; // move the shadow stack
 * READ_ONCE(ssp-8);     // read+discard last popped stack element
 *
 * The maximum distance INCSSP can move the SSP is 2040 bytes, before
 * it would read the memory. Therefore a single page gap will be enough
 * to prevent any operation from shifting the SSP to an adjacent stack,
 * since it would have to land in the gap at least once, causing a
 * fault.
 *
 * Prevent using INCSSP to move the SSP between shadow stacks by
 * having a PAGE_SIZE guard gap.
 */
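
The check the comment would sit above is itself tiny, something like
(sketch):

	static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
	{
		if (vma->vm_flags & VM_GROWSDOWN)
			return stack_guard_gap;

		/* See the comment above for why shadow stacks need a gap */
		if (vma->vm_flags & VM_SHADOW_STACK)
			return PAGE_SIZE;

		return 0;
	}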

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory
  2023-03-06 18:57         ` Andy Lutomirski
@ 2023-03-07  1:47           ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-07  1:47 UTC (permalink / raw)
  To: Lutomirski, Andy, bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Yang, Weijiang, jamorris, arnd,
	tglx, Schimpe, Christina, mike.kravetz, debug, linux-doc, x86,
	andrew.cooper3, rppt, john.allen, mingo, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2023-03-06 at 10:57 -0800, Andy Lutomirski wrote:
> On Mon, Mar 6, 2023, at 10:33 AM, Edgecombe, Rick P wrote:
> > On Mon, 2023-03-06 at 10:15 -0800, Andy Lutomirski wrote:
> > > On Mon, Mar 6, 2023 at 5:10 AM Borislav Petkov <bp@alien8.de>
> > > wrote:
> > > > 
> > > > On Mon, Feb 27, 2023 at 02:29:40PM -0800, Rick Edgecombe wrote:
> > > > > The x86 Control-flow Enforcement Technology (CET) feature
> > > > > includes a new
> > > > > type of memory called shadow stack. This shadow stack memory
> > > > > has
> > > > > some
> > > > > unusual properties, which requires some core mm changes to
> > > > > function
> > > > > properly.
> > > > > 
> > > > > Shadow stack memory is writable only in very specific,
> > > > > controlled
> > > > > ways.
> > > > > However, since it is writable, the kernel treats it as such.
> > > > > As a
> > > > > result
> > > > 
> > > >                                                                
> > > >      
> > > >          ^
> > > >                                                                
> > > >      
> > > >          ,
> > > > 
> > > > > there remain many ways for userspace to trigger the kernel to
> > > > > write to
> > > > > shadow stack's via get_user_pages(, FOLL_WRITE) operations.
> > > > > To
> > > > > make this a
> > > 
> > > Is there an alternate mechanism, or do we still want to allow
> > > FOLL_FORCE so that debuggers can write it?
> > 
> > Yes, GDB shadow stack support uses it via both ptrace poke and
> > /proc/pid/mem apparently. So some ability to write through is
> > needed
> > for debuggers. But not CRIU actually. It uses WRSS.
> > 
> > There was also some discussion[0] previously about how apps might
> > prefer to block /proc/self/mem for general security reasons.
> > Blocking
> > shadow stack writes while you allow text writes is probably not
> > that
> > impactful security-wise. So I thought it would be better to leave
> > the
> > logic simpler. Then when /proc/self/mem could be locked down per
> > the
> > discussion, shadow stack can be locked down the same way.
> 
> Ah, I am guilty of reading your changelog but not the code.
> 
> You said:
> 
> Shadow stack memory is writable only in very specific, controlled
> ways.
> However, since it is writable, the kernel treats it as such. As a
> result
> there remain many ways for userspace to trigger the kernel to write
> to
> shadow stack's via get_user_pages(, FOLL_WRITE) operations. To make
> this a
> little less exposed, block writable GUPs for shadow stack VMAs.
> 
> I read that as *denying* FOLL_FORCE.  Maybe clarify the changelog?

I think maybe some helpful text got dropped from the quote in Boris'
comment about other issues: "Still allow FOLL_FORCE to write through
shadow stack protections, as it does for read-only protections."

But, yea, the tenses are hard to parse. Maybe something like this:

The x86 Control-flow Enforcement Technology (CET) feature includes a
new type of memory called shadow stack. This shadow stack memory has
some unusual properties, which requires some core mm changes to
function properly.

In userspace, shadow stack memory is writable only in very specific,
controlled ways. However, since userspace can, even in the limited
ways, modify shadow stack contents, the kernel treats it as writable
memory. As a result, without additional work there would remain many
ways for userspace to trigger the kernel to write arbitrary data to
shadow stacks via get_user_pages(, FOLL_WRITE) based operations. To
help userspace protect their shadow stacks, make this a little less
exposed by blocking writable get_user_pages() operations for shadow
stack VMAs.

Still allow FOLL_FORCE to write through shadow stack protections, as it
does for read-only protections. This is required for debugging use
cases.
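
The code side of it is small, roughly this in check_vma_flags() (sketch,
see the patch for the exact hunk):

	if (write) {
		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
			if (!(gup_flags & FOLL_FORCE))
				return -EFAULT;
			...
		}
	}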


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 21/41] mm: Add guard pages around a shadow stack.
  2023-03-07  1:29     ` Edgecombe, Rick P
@ 2023-03-07 10:32       ` Borislav Petkov
  2023-03-07 10:44         ` David Hildenbrand
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-07 10:32 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Tue, Mar 07, 2023 at 01:29:50AM +0000, Edgecombe, Rick P wrote:
> On Mon, 2023-03-06 at 09:08 +0100, Borislav Petkov wrote:
> > Just typos:
> 
> All seem reasonable to me. Thanks. 
> 
> For using the log verbiage for the comment, it is quite big. Does
> something like this seem reasonable?

Yeah, it does. I wouldn't want to lose that explanation in a commit
message.

However, this special aspect pertains to the shstk implementation in x86
but the code is generic mm and such arch-specific comments are kinda
unfitting there.

I wonder if it would be better if you could stick that explanation
somewhere in arch/x86/ and only refer to it in a short comment above
VM_SHADOW_STACK check in stack_guard_start_gap()...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-02-27 22:29 ` [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
  2023-03-06 13:01   ` Borislav Petkov
@ 2023-03-07 10:42   ` David Hildenbrand
  2023-03-17 17:12   ` Deepak Gupta
  2 siblings, 0 replies; 159+ messages in thread
From: David Hildenbrand @ 2023-03-07 10:42 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	debug
  Cc: Yu-cheng Yu

On 27.02.23 23:29, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> Account shadow stack pages to stack memory. Do this by adding a
> VM_SHADOW_STACK check in is_stack_mapping().
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> 
> ---
> v7:
>   - Change is_stack_mapping() to know about VM_SHADOW_STACK so the
>     additions in vm_stat_account() can be dropped. (David Hildenbrand)
> 
> v3:
>   - Remove unneeded VM_SHADOW_STACK check in accountable_mapping()
>     (Kirill)
> 
> v2:
>   - Remove is_shadow_stack_mapping() and just change it to directly bitwise
>     and VM_SHADOW_STACK.
> 
> Yu-cheng v26:
>   - Remove redundant #ifdef CONFIG_MMU.
> 
> Yu-cheng v25:
>   - Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().
> ---
>   mm/internal.h | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 7920a8b7982e..1d13d5580f64 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -491,14 +491,14 @@ static inline bool is_exec_mapping(vm_flags_t flags)
>   }
>   
>   /*
> - * Stack area - automatically grows in one direction
> + * Stack area


Maybe "Stack area (including shadow stacks)"


Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 21/41] mm: Add guard pages around a shadow stack.
  2023-03-07 10:32       ` Borislav Petkov
@ 2023-03-07 10:44         ` David Hildenbrand
  2023-03-08 22:48           ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: David Hildenbrand @ 2023-03-07 10:44 UTC (permalink / raw)
  To: Borislav Petkov, Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On 07.03.23 11:32, Borislav Petkov wrote:
> On Tue, Mar 07, 2023 at 01:29:50AM +0000, Edgecombe, Rick P wrote:
>> On Mon, 2023-03-06 at 09:08 +0100, Borislav Petkov wrote:
>>> Just typos:
>>
>> All seem reasonable to me. Thanks.
>>
>> For using the log verbiage for the comment, it is quite big. Does
>> something like this seem reasonable?
> 
> Yeah, it does. I wouldn't want to lose that explanation in a commit
> message.
> 
> However, this special aspect pertains to the shstk implementation in x86
> but the code is generic mm and such arch-specific comments are kinda
> unfitting there.
> 
> I wonder if it would be better if you could stick that explanation
> somewhere in arch/x86/ and only refer to it in a short comment above
> VM_SHADOW_STACK check in stack_guard_start_gap()...

+1

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-06 18:08                 ` Edgecombe, Rick P
@ 2023-03-07 13:03                   ` szabolcs.nagy
  2023-03-07 14:00                     ` Florian Weimer
  0 siblings, 1 reply; 159+ messages in thread
From: szabolcs.nagy @ 2023-03-07 13:03 UTC (permalink / raw)
  To: Edgecombe, Rick P, fweimer
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, nadav.amit, jannh, dethoma, broonie, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, tglx, Schimpe, Christina, debug, x86,
	linux-doc, mike.kravetz, pavel, andrew.cooper3, john.allen, rppt,
	nd, mingo, corbet, linux-kernel, linux-api, gorcunov, akpm

The 03/06/2023 18:08, Edgecombe, Rick P wrote:
> On Mon, 2023-03-06 at 17:31 +0100, Florian Weimer wrote:
> > I assume there is no desire at all on the kernel side that
> > sigaltstack
> > transparently allocates the shadow stack?  
> 
> It could have some nice benefit for some apps, so I did look into it.
> 
> > Because there is no
> > deallocation function today for sigaltstack?
> 
> Yea, this is why we can't do it transparently. There was some
> discussion up the thread on this.

changing/disabling the alt stack is not valid while a handler is
executing on it. if we don't allow jumping out and back to an
alt stack (swapcontext) then there can be only one alt stack
live per thread and change/disable can do the shadow stack free.

if jump back is allowed (linux even makes it race-free with
SS_AUTODISARM) then the life-time of alt stack is extended
beyond change/disable (jump back to an unregistered alt stack).

to support jump back to an alt stack the requirements are

1) user has to manage an alt shadow stack together with the alt
>    stack (requires user code change, not just libc).

2) kernel has to push a restore token on the thread shadow stack
   on signal entry (at least in case of alt shadow stack, and
   deal with corner cases around shadow stack overflow).

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-07 13:03                   ` szabolcs.nagy
@ 2023-03-07 14:00                     ` Florian Weimer
  2023-03-07 16:14                       ` Szabolcs Nagy
  0 siblings, 1 reply; 159+ messages in thread
From: Florian Weimer @ 2023-03-07 14:00 UTC (permalink / raw)
  To: szabolcs.nagy
  Cc: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Yu, Yu-cheng, Eranian,
	Stephane, kirill.shutemov, dave.hansen, linux-mm, nadav.amit,
	jannh, dethoma, broonie, kcc, linux-arch, bp, oleg, hjl.tools,
	Yang, Weijiang, Lutomirski, Andy, jamorris, arnd, tglx, Schimpe,
	Christina, debug, x86, linux-doc, mike.kravetz, pavel,
	andrew.cooper3, john.allen, rppt, nd, mingo, corbet,
	linux-kernel, linux-api, gorcunov, akpm

* szabolcs:

> changing/disabling the alt stack is not valid while a handler is
> executing on it. if we don't allow jumping out and back to an
> alt stack (swapcontext) then there can be only one alt stack
> live per thread and change/disable can do the shadow stack free.
>
> if jump back is allowed (linux even makes it race-free with
> SS_AUTODISARM) then the life-time of alt stack is extended
> beyond change/disable (jump back to an unregistered alt stack).
>
> to support jump back to an alt stack the requirements are
>
> 1) user has to manage an alt shadow stack together with the alt
> >    stack (requires user code change, not just libc).
>
> 2) kernel has to push a restore token on the thread shadow stack
>    on signal entry (at least in case of alt shadow stack, and
>    deal with corner cases around shadow stack overflow).

We need to have a story for stackful coroutine switching as well, not
just for sigaltstack.  I hope that we can use OpenJDK (Project Loom) and
QEMU as guinea pigs.  If we have something that works for both,
hopefully that covers a broad range of scenarios.  Userspace
coordination can eventually be handled by glibc; we can deallocate
alternate stacks on thread exit fairly easily (at least compared to the
current stack 8-).

Thanks,
Florian


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
  2023-03-07 14:00                     ` Florian Weimer
@ 2023-03-07 16:14                       ` Szabolcs Nagy
  0 siblings, 0 replies; 159+ messages in thread
From: Szabolcs Nagy @ 2023-03-07 16:14 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Edgecombe, Rick P, david, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Yu, Yu-cheng, Eranian,
	Stephane, kirill.shutemov, dave.hansen, linux-mm, nadav.amit,
	jannh, dethoma, broonie, kcc, linux-arch, bp, oleg, hjl.tools,
	Yang, Weijiang, Lutomirski, Andy, jamorris, arnd, tglx, Schimpe,
	Christina, debug, x86, linux-doc, mike.kravetz, pavel,
	andrew.cooper3, john.allen, rppt, nd, mingo, corbet,
	linux-kernel, linux-api, gorcunov, akpm

The 03/07/2023 15:00, Florian Weimer wrote:
> * szabolcs:
> 
> > changing/disabling the alt stack is not valid while a handler is
> > executing on it. if we don't allow jumping out and back to an
> > alt stack (swapcontext) then there can be only one alt stack
> > live per thread and change/disable can do the shadow stack free.
> >
> > if jump back is allowed (linux even makes it race-free with
> > SS_AUTODISARM) then the life-time of alt stack is extended
> > beyond change/disable (jump back to an unregistered alt stack).
> >
> > to support jump back to an alt stack the requirements are
> >
> > 1) user has to manage an alt shadow stack together with the alt
> >    stack (requires user code change, not just libc).
> >
> > 2) kernel has to push a restore token on the thread shadow stack
> >    on signal entry (at least in case of alt shadow stack, and
> >    deal with corner cases around shadow stack overflow).
> 
> We need to have a story for stackful coroutine switching as well, not
> just for sigaltstack.  I hope that we can use OpenJDK (Project Loom) and
> QEMU as guinea pigs.  If we have something that works for both,
> hopefully that covers a broad range of scenarios.  Userspace
> coordination can eventually be handled by glibc; we can deallocate
> alternate stacks on thread exit fairly easily (at least compared to the
> current stack 8-).

for stackful coroutines we just need a way to

- allocate a shadow stack with a restore token on it.

- switch to a target shadow stack with a restore token on it,
  while leaving behind a restore token on the old shadow stack.

this is supported via the map_shadow_stack syscall and the
rstorssp, saveprevssp instruction pair.
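
e.g. allocating one from userspace looks roughly like this (sketch;
flag name as defined in this series):

	/* new shadow stack with a restore token at the top */
	long ss = syscall(__NR_map_shadow_stack, 0, size,
			  SHADOW_STACK_SET_TOKEN);
	if (ss < 0)
		/* failed, e.g. shadow stack not enabled */;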

otoh there can be many alt shadow stacks per thread alive if
we allow jump back (only one of them registered at a time) in
fact they can be jumped to even from another thread, so their
life-time is not tied to the thread (at least if we allow
swapcontext across threads) so i think the libc cannot manage
the alt shadow stacks, only user code can in the general case.

and in case a signal runs on an alt shadow stack, the restore
token can only be placed by the kernel on the old shadow stack.

thanks.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 26/41] mm: Warn on shadow stack memory in wrong vma
  2023-02-27 22:29 ` [PATCH v7 26/41] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
@ 2023-03-08  8:53   ` Borislav Petkov
  2023-03-08 23:36     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-08  8:53 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:42PM -0800, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
> treated as shadow by the CPU, but this combination used to be created by
> the kernel on x86. Previous patches have changed the kernel to now avoid
> creating these PTEs unless they are for shadow stack memory. In case any
> missed corners of the kernel are still creating PTEs like this for
> non-shadow stack memory, and to catch any re-introductions of the logic,
> warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
> stack VMAs when they are being zapped. This won't catch transient cases
> but should have decent coverage. It will be compiled out when shadow
> stack is not configured.
> 
> In order to check if a pte is shadow stack in core mm code, add two arch

s/pte/PTE/

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot
  2023-02-27 22:29 ` [PATCH v7 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot Rick Edgecombe
  2023-02-27 22:54   ` Kees Cook
@ 2023-03-08  9:23   ` Borislav Petkov
  2023-03-08 23:35     ` Edgecombe, Rick P
  1 sibling, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-08  9:23 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:43PM -0800, Rick Edgecombe wrote:
> When user shadow stack is use, Write=0,Dirty=1 is treated by the CPU as
			   ^
			   in

> shadow stack memory. So for shadow stack memory this bit combination is
> valid, but when Dirty=1,Write=1 (conventionally writable) memory is being
> write protected, the kernel has been taught to transition the Dirty=1
> bit to SavedDirty=1, to avoid inadvertently creating shadow stack
> memory. It does this inside pte_wrprotect() because it knows the PTE is
> not intended to be a writable shadow stack entry, it is supposed to be
> write protected.


> 
> However, when a PTE is created by a raw prot using mk_pte(), mk_pte()
> can't know whether to adjust Dirty=1 to SavedDirty=1. It can't
> distinguish between the caller intending to create a shadow stack PTE or
> needing the SavedDirty shift.
> 
> The kernel has been updated to not do this, and so Write=0,Dirty=1
> memory should only be created by the pte_mkfoo() helpers. Add a warning
> to make sure no new mk_pte() start doing this.

Might wanna add the note from below here:

"... start doing this, like, for example, set_memory_rox() did."

> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> v6:
>  - New patch (Note, this has already been a useful warning, it caught the
>    newly added set_memory_rox() doing this)

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-02-27 22:29 ` [PATCH v7 28/41] x86: Introduce userspace API for shadow stack Rick Edgecombe
@ 2023-03-08 10:27   ` Borislav Petkov
  2023-03-08 23:32     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-08 10:27 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:44PM -0800, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Add three new arch_prctl() handles:
> 
>  - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified
>    feature. Returns 0 on success or an error.

"... or a negative value on error."

>  - ARCH_SHSTK_LOCK prevents future disabling or enabling of the
>    specified feature. Returns 0 on success or an error

ditto.

What is the use case of the feature locking?

I'm under the simple assumption that once shstk is enabled for an app,
it remains so. I guess my question is rather, what's the use case for
enabling shadow stack and then disabling it later for an app...?

> The features are handled per-thread and inherited over fork(2)/clone(2),
> but reset on exec().
> 
> This is preparation patch. It does not implement any features.

That belongs under the "---" line I guess.

> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> [tweaked with feedback from tglx]
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> v4:
>  - Remove references to CET and replace with shadow stack (Peterz)
> 
> v3:
>  - Move shstk.c Makefile changes earlier (Kees)
>  - Add #ifdef around features_locked and features (Kees)
>  - Encapsulate features reset earlier in reset_thread_features() so
>    features and features_locked are not referenced in code that would be
>    compiled !CONFIG_X86_USER_SHADOW_STACK. (Kees)
>  - Fix typo in commit log (Kees)
>  - Switch arch_prctl() numbers to avoid conflict with LAM
> 
> v2:
>  - Only allow one enable/disable per call (tglx)
>  - Return error code like a normal arch_prctl() (Alexander Potapenko)
>  - Make CET only (tglx)
> ---
>  arch/x86/include/asm/processor.h  |  6 +++++
>  arch/x86/include/asm/shstk.h      | 21 +++++++++++++++
>  arch/x86/include/uapi/asm/prctl.h |  6 +++++
>  arch/x86/kernel/Makefile          |  2 ++
>  arch/x86/kernel/process_64.c      |  7 ++++-
>  arch/x86/kernel/shstk.c           | 44 +++++++++++++++++++++++++++++++
>  6 files changed, 85 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/include/asm/shstk.h
>  create mode 100644 arch/x86/kernel/shstk.c

...

> +long shstk_prctl(struct task_struct *task, int option, unsigned long features)
> +{
> +	if (option == ARCH_SHSTK_LOCK) {
> +		task->thread.features_locked |= features;
> +		return 0;
> +	}
> +
> +	/* Don't allow via ptrace */
> +	if (task != current)
> +		return -EINVAL;
> +
> +	/* Do not allow to change locked features */
> +	if (features & task->thread.features_locked)
> +		return -EPERM;
> +
> +	/* Only support enabling/disabling one feature at a time. */
> +	if (hweight_long(features) > 1)
> +		return -EINVAL;
> +
> +	if (option == ARCH_SHSTK_DISABLE) {
> +		return -EINVAL;
> +	}

{} braces left over from some previous version. Can go now.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
  2023-02-27 22:29 ` [PATCH v7 30/41] x86/shstk: Handle thread shadow stack Rick Edgecombe
  2023-03-02 17:34   ` Szabolcs Nagy
@ 2023-03-08 15:26   ` Borislav Petkov
  2023-03-08 20:03     ` Edgecombe, Rick P
  1 sibling, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-08 15:26 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu

On Mon, Feb 27, 2023 at 02:29:46PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When a process is duplicated, but the child shares the address space with
> the parent, there is potential for the threads sharing a single stack to
> cause conflicts for each other. In the normal non-cet case this is handled

"non-CET"

> in two ways.
> 
> With regular CLONE_VM a new stack is provided by userspace such that the
> parent and child have different stacks.
> 
> For vfork, the parent is suspended until the child exits. So as long as
> the child doesn't return from the vfork()/CLONE_VFORK calling function and
> sticks to a limited set of operations, the parent and child can share the
> same stack.
> 
> For shadow stack, these scenarios present similar sharing problems. For the
> CLONE_VM case, the child and the parent must have separate shadow stacks.
> Instead of changing clone to take a shadow stack, have the kernel just
> allocate one and switch to it.
> 
> Use stack_size passed from clone3() syscall for thread shadow stack size. A
> compat-mode thread shadow stack size is further reduced to 1/4. This
> allows more threads to run in a 32-bit address space. The clone() does not
> pass stack_size, which was added to clone3(). In that case, use
> RLIMIT_STACK size and cap to 4 GB.
> 
> For shadow stack enabled vfork(), the parent and child can share the same
> shadow stack, like they can share a normal stack. Since the parent is
> suspended until the child terminates, the child will not interfere with
> the parent while executing as long as it doesn't return from the vfork()
> and overwrite up the shadow stack. The child can safely overwrite down
> the shadow stack, as the parent can just overwrite this later. So CET does
> not add any additional limitations for vfork().
> 
> Userspace implementing posix vfork() can actually prevent the child from

"POSIX"

...

> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index f851558b673f..bc3de4aeb661 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
>  	}
>  }
>  
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
> +{
> +	struct cet_user_state *xstate;
> +
> +	/* If ssp update is not needed. */
> +	if (!ssp)
> +		return 0;
> +
> +	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
> +				XFEATURE_CET_USER);
> +
> +	/*
> +	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
> +	 * stack and the fpu state should be up to date since it was just copied
> +	 * from the parent in fpu_clone(). So there must be a valid non-init CET
> +	 * state location in the buffer.
> +	 */
> +	if (WARN_ON_ONCE(!xstate))
> +		return 1;
> +
> +	xstate->user_ssp = (u64)ssp;
> +
> +	return 0;
> +}
> +#else
> +static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
								      ^^^^^^^^^^^
ssp, like above.

Better yet:

static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
{
#ifdef CONFIG_X86_USER_SHADOW_STACK
	...
#endif
	return 0;
}

and less ifdeffery.



> +{
> +	return 0;
> +}
> +#endif
> +
>  /* Clone current's FPU state on fork */
> -int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
> +int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
> +	      unsigned long ssp)
>  {
>  	struct fpu *src_fpu = &current->thread.fpu;
>  	struct fpu *dst_fpu = &dst->thread.fpu;
> @@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
>  	if (use_xsave())
>  		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
>  
> +	/*
> +	 * Update shadow stack pointer, in case it changed during clone.
> +	 */
> +	if (update_fpu_shstk(dst, ssp))
> +		return 1;
> +
>  	trace_x86_fpu_copy_src(src_fpu);
>  	trace_x86_fpu_copy_dst(dst_fpu);
>  
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b650cde3f64d..bf703f53fa49 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -48,6 +48,7 @@
>  #include <asm/frame.h>
>  #include <asm/unwind.h>
>  #include <asm/tdx.h>
> +#include <asm/shstk.h>
>  
>  #include "process.h"
>  
> @@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
>  
>  	free_vm86(t);
>  
> +	shstk_free(tsk);
>  	fpu__drop(fpu);
>  }
>  
> @@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>  	struct inactive_task_frame *frame;
>  	struct fork_frame *fork_frame;
>  	struct pt_regs *childregs;
> +	unsigned long shstk_addr = 0;
>  	int ret = 0;
>  
>  	childregs = task_pt_regs(p);
> @@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>  	frame->flags = X86_EFLAGS_FIXED;
>  #endif
>  
> -	fpu_clone(p, clone_flags, args->fn);
> +	/* Allocate a new shadow stack for pthread if needed */
> +	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
> +				       &shstk_addr);

That function will return 0 even if shstk_addr hasn't been written in it
and you will continue merrily and call

	fpu_clone(..., shstk_addr=0);

why don't you return the shadow stack address or negative on error
instead of adding an I/O parameter which is pretty much always nasty to
deal with.
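
IOW (completely untested sketch):

	shstk_addr = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
	if (IS_ERR_VALUE(shstk_addr))
		return (int)shstk_addr;

	fpu_clone(p, clone_flags, args->fn, shstk_addr);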



> +	if (ret)
> +		return ret;
> +
> +	fpu_clone(p, clone_flags, args->fn, shstk_addr);
>  
>  	/* Kernel thread ? */
>  	if (unlikely(p->flags & PF_KTHREAD)) {

...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
  2023-03-08 15:26   ` Borislav Petkov
@ 2023-03-08 20:03     ` Edgecombe, Rick P
  2023-03-09 14:12       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-08 20:03 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Wed, 2023-03-08 at 16:26 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:46PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > When a process is duplicated, but the child shares the address
> > space with
> > the parent, there is potential for the threads sharing a single
> > stack to
> > cause conflicts for each other. In the normal non-cet case this is
> > handled
> 
> "non-CET"

Sure.

> 
> > in two ways.
> > 
> > With regular CLONE_VM a new stack is provided by userspace such
> > that the
> > parent and child have different stacks.
> > 
> > For vfork, the parent is suspended until the child exits. So as
> > long as
> > the child doesn't return from the vfork()/CLONE_VFORK calling
> > function and
> > sticks to a limited set of operations, the parent and child can
> > share the
> > same stack.
> > 
> > For shadow stack, these scenarios present similar sharing problems.
> > For the
> > CLONE_VM case, the child and the parent must have separate shadow
> > stacks.
> > Instead of changing clone to take a shadow stack, have the kernel
> > just
> > allocate one and switch to it.
> > 
> > Use stack_size passed from clone3() syscall for thread shadow stack
> > size. A
> > compat-mode thread shadow stack size is further reduced to 1/4.
> > This
> > allows more threads to run in a 32-bit address space. The clone()
> > does not
> > pass stack_size, which was added to clone3(). In that case, use
> > RLIMIT_STACK size and cap to 4 GB.
> > 
> > For shadow stack enabled vfork(), the parent and child can share
> > the same
> > shadow stack, like they can share a normal stack. Since the parent
> > is
> > suspended until the child terminates, the child will not interfere
> > with
> > the parent while executing as long as it doesn't return from the
> > vfork()
> > and overwrite up the shadow stack. The child can safely overwrite
> > down
> > the shadow stack, as the parent can just overwrite this later. So
> > CET does
> > not add any additional limitations for vfork().
> > 
> > Userspace implementing posix vfork() can actually prevent the child
> > from
> 
> "POSIX"

Ok.

> 
> ...
> 
> > diff --git a/arch/x86/kernel/fpu/core.c
> > b/arch/x86/kernel/fpu/core.c
> > index f851558b673f..bc3de4aeb661 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct
> > fpu *dst_fpu)
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long
> > ssp)
> > +{
> > +	struct cet_user_state *xstate;
> > +
> > +	/* If ssp update is not needed. */
> > +	if (!ssp)
> > +		return 0;
> > +
> > +	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
> > +				XFEATURE_CET_USER);
> > +
> > +	/*
> > +	 * If there is a non-zero ssp, then 'dst' must be configured
> > with a shadow
> > +	 * stack and the fpu state should be up to date since it was
> > just copied
> > +	 * from the parent in fpu_clone(). So there must be a valid
> > non-init CET
> > +	 * state location in the buffer.
> > +	 */
> > +	if (WARN_ON_ONCE(!xstate))
> > +		return 1;
> > +
> > +	xstate->user_ssp = (u64)ssp;
> > +
> > +	return 0;
> > +}
> > +#else
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long
> > shstk_addr)
> 
> 								      ^
> ^^^^^^^^^^
> ssp, like above.
> 
> Better yet:
> 
> static int update_fpu_shstk(struct task_struct *dst, unsigned long
> ssp)
> {
> #ifdef CONFIG_X86_USER_SHADOW_STACK
> 	...
> #endif
> 	return 0;
> }
> 
> and less ifdeffery.

Sure. Sometimes people tell me to only ifdef out whole functions to
make it easier to read. I suppose in this case it's not hard to see.
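
I.e. folding it into a single function per your suggestion would look
something like this (untested):

static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
{
#ifdef CONFIG_X86_USER_SHADOW_STACK
	struct cet_user_state *xstate;

	/* If ssp update is not needed. */
	if (!ssp)
		return 0;

	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
				XFEATURE_CET_USER);

	/*
	 * A non-zero ssp means 'dst' was just configured with a shadow
	 * stack, so a valid non-init CET state location must exist.
	 */
	if (WARN_ON_ONCE(!xstate))
		return 1;

	xstate->user_ssp = (u64)ssp;
#endif
	return 0;
}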


> 
> 
> 
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  /* Clone current's FPU state on fork */
> > -int fpu_clone(struct task_struct *dst, unsigned long clone_flags,
> > bool minimal)
> > +int fpu_clone(struct task_struct *dst, unsigned long clone_flags,
> > bool minimal,
> > +	      unsigned long ssp)
> >  {
> >  	struct fpu *src_fpu = &current->thread.fpu;
> >  	struct fpu *dst_fpu = &dst->thread.fpu;
> > @@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst,
> > unsigned long clone_flags, bool minimal)
> >  	if (use_xsave())
> >  		dst_fpu->fpstate->regs.xsave.header.xfeatures &=
> > ~XFEATURE_MASK_PASID;
> >  
> > +	/*
> > +	 * Update shadow stack pointer, in case it changed during
> > clone.
> > +	 */
> > +	if (update_fpu_shstk(dst, ssp))
> > +		return 1;
> > +
> >  	trace_x86_fpu_copy_src(src_fpu);
> >  	trace_x86_fpu_copy_dst(dst_fpu);
> >  
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index b650cde3f64d..bf703f53fa49 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -48,6 +48,7 @@
> >  #include <asm/frame.h>
> >  #include <asm/unwind.h>
> >  #include <asm/tdx.h>
> > +#include <asm/shstk.h>
> >  
> >  #include "process.h"
> >  
> > @@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
> >  
> >  	free_vm86(t);
> >  
> > +	shstk_free(tsk);
> >  	fpu__drop(fpu);
> >  }
> >  
> > @@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const
> > struct kernel_clone_args *args)
> >  	struct inactive_task_frame *frame;
> >  	struct fork_frame *fork_frame;
> >  	struct pt_regs *childregs;
> > +	unsigned long shstk_addr = 0;
> >  	int ret = 0;
> >  
> >  	childregs = task_pt_regs(p);
> > @@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const
> > struct kernel_clone_args *args)
> >  	frame->flags = X86_EFLAGS_FIXED;
> >  #endif
> >  
> > -	fpu_clone(p, clone_flags, args->fn);
> > +	/* Allocate a new shadow stack for pthread if needed */
> > +	ret = shstk_alloc_thread_stack(p, clone_flags, args-
> > >stack_size,
> > +				       &shstk_addr);
> 
> That function will return 0 even if shstk_addr hasn't been written in
> it
> and you will continue merrily and call
> 
> 	fpu_clone(..., shstk_addr=0);
> 
> why don't you return the shadow stack address or negative on error
> instead of adding an I/O parameter which is pretty much always nasty
> to
> deal with.

On a shadow stack allocation error, we fail the copy_thread(). When
shadow stack is enabled, the app might be able to handle a clone
failure, but would not be able to handle starting a new thread without
getting a new shadow stack.

So in your suggestion I guess we would have two types of failure: one
that signifies shadow stack is enabled and the allocation failed, and
another that signifies that shadow stack is not enabled, so zero needs
to be passed into fpu_clone()?

We need the output param in shstk_alloc_thread_stack() because we need
to update the SSP to the new shadow stack. If we want the non-shadow
stack case handled differently, I think the extra conditionals are
worse, like:
/* Allocate a new shadow stack for pthread if needed */
ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
				&shstk_addr);
if (ret == -EOPNOTSUPP)
	fpu_clone(p, clone_flags, args->fn, 0);
else if (ret < 0)
	return ret;
else
	fpu_clone(p, clone_flags, args->fn, shstk_addr);

What do you think?

It used to be that shstk_alloc_thread_stack() reached into FPU
internals to do the SSP update itself. Then the ability to do this was
removed. So I came up with an interface for allowing features to modify
XSAVE buffers from outside the FPU code. On further discussion, letting
code outside the FPU have flexible access to the XSAVE buffer could
constrain the FPU code from adding optimizations. So Thomas suggested
passing the SSP along into the FPU code so that the FPU modification
could stay monolithic and flexible.

If the default SSP value logic is too hidden, what about some clearer
code and comments, like this?

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bf703f53fa49..bd123527fcca 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -142,7 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
        struct inactive_task_frame *frame;
        struct fork_frame *fork_frame;
        struct pt_regs *childregs;
-       unsigned long shstk_addr = 0;
+       unsigned long new_ssp;
        int ret = 0;
 
        childregs = task_pt_regs(p);
@@ -177,13 +177,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
        frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-       /* Allocate a new shadow stack for pthread if needed */
+       /*
+        * Allocate a new shadow stack for the thread if needed. If shadow
+        * stack is disabled, new_ssp will remain 0, and fpu_clone() will
+        * know not to update it.
+        */
+       new_ssp = 0;
        ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
-                                      &shstk_addr);
+                                      &new_ssp);
        if (ret)
                return ret;
 
-       fpu_clone(p, clone_flags, args->fn, shstk_addr);
+       fpu_clone(p, clone_flags, args->fn, new_ssp);
 
        /* Kernel thread ? */
        if (unlikely(p->flags & PF_KTHREAD)) {



^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 21/41] mm: Add guard pages around a shadow stack.
  2023-03-07 10:44         ` David Hildenbrand
@ 2023-03-08 22:48           ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-08 22:48 UTC (permalink / raw)
  To: david, bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz,
	debug, akpm, x86, andrew.cooper3, john.allen, linux-doc, rppt,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Tue, 2023-03-07 at 11:44 +0100, David Hildenbrand wrote:
> On 07.03.23 11:32, Borislav Petkov wrote:
> > On Tue, Mar 07, 2023 at 01:29:50AM +0000, Edgecombe, Rick P wrote:
> > > On Mon, 2023-03-06 at 09:08 +0100, Borislav Petkov wrote:
> > > > Just typos:
> > > 
> > > All seem reasonable to me. Thanks.
> > > 
> > > For using the log verbiage for the comment, it is quite big. Does
> > > something like this seem reasonable?
> > 
> > Yeah, it does. I wouldn't want to lose that explanation in a commit
> > message.
> > 
> > However, this special aspect pertains to the shstk implementation
> > in x86
> > but the code is generic mm and such arch-specific comments are
> > kinda
> > unfitting there.
> > 
> > I wonder if it would be better if you could stick that explanation
> > somewhere in arch/x86/ and only refer to it in a short comment
> > above
> > VM_SHADOW_STACK check in stack_guard_start_gap()...
> 
> +1

I can't find a good place for it in the arch code. Basically there is
no arch/x86 functionality that has to do with guard pages. The closest
is pte_mkwrite() because it at least references VM_SHADOW_STACK but it
doesn't really fit.

We could add an arch version of stack_guard_start_gap(), but we had
that and removed it for other style reasons. Code duplication, IIRC.

So I thought to just move it elsewhere in mm.h where VM_SHADOW_STACK is
defined.
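
I.e. keep the long explanation next to the VM_SHADOW_STACK definition,
and leave just a pointer at the check, something like (sketch):

static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_GROWSDOWN)
		return stack_guard_gap;

	/* See the comment above the VM_SHADOW_STACK definition in mm.h */
	if (vma->vm_flags & VM_SHADOW_STACK)
		return PAGE_SIZE;

	return 0;
}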

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-08 10:27   ` Borislav Petkov
@ 2023-03-08 23:32     ` Edgecombe, Rick P
  2023-03-09 12:57       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-08 23:32 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Wed, 2023-03-08 at 11:27 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:44PM -0800, Rick Edgecombe wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Add three new arch_prctl() handles:
> > 
> >  - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified
> >    feature. Returns 0 on success or an error.
> 
> "... or a negative value on error."

Sure.

> 
> >  - ARCH_SHSTK_LOCK prevents future disabling or enabling of the
> >    specified feature. Returns 0 on success or an error
> 
> ditto.
> 
> What is the use case of the feature locking?
> 
> I'm under the simple assumption that once shstk is enabled for an
> app,
> it remains so. I guess my question is rather, what's the use case for
> enabling shadow stack and then disabling it later for an app...?

This would be for things like the "permissive mode", where glibc
determines much later that it has to do something like dlopen() a DSO
that doesn't support shadow stack.

But being able to lock the features late is required for glibc's normal
working behavior as well. Glibc enables shadow stack very early, then
disables it later if it finds that any of the normal dynamic libraries
don't support it. It only locks shadow stack after this point, even in
non-permissive mode.

The selftest also does a lot of enabling and disabling.
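
Roughly, that loader sequence in terms of this series' API would look
like this (userspace sketch; ARCH_SHSTK_SHSTK is the shadow stack
feature bit, and the DSO check is a stand-in for glibc's ELF property
check):

	/* Early in the loader: turn shadow stack on */
	arch_prctl(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);

	/* ... load and link the normal dynamic libraries ... */

	if (!all_dsos_support_shstk)
		arch_prctl(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);

	/* Only now prevent further enable/disable */
	arch_prctl(ARCH_SHSTK_LOCK, ARCH_SHSTK_SHSTK);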

> 
> > The features are handled per-thread and inherited over
> > fork(2)/clone(2),
> > but reset on exec().
> > 
> > This is preparation patch. It does not implement any features.
> 
> That belongs under the "---" line I guess.

Oh, yes.

> 
> > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> > Tested-by: John Allen <john.allen@amd.com>
> > Tested-by: Kees Cook <keescook@chromium.org>
> > Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > [tweaked with feedback from tglx]
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > ---
> > v4:
> >  - Remove references to CET and replace with shadow stack (Peterz)
> > 
> > v3:
> >  - Move shstk.c Makefile changes earlier (Kees)
> >  - Add #ifdef around features_locked and features (Kees)
> >  - Encapsulate features reset earlier in reset_thread_features() so
> >    features and features_locked are not referenced in code that
> > would be
> >    compiled !CONFIG_X86_USER_SHADOW_STACK. (Kees)
> >  - Fix typo in commit log (Kees)
> >  - Switch arch_prctl() numbers to avoid conflict with LAM
> > 
> > v2:
> >  - Only allow one enable/disable per call (tglx)
> >  - Return error code like a normal arch_prctl() (Alexander
> > Potapenko)
> >  - Make CET only (tglx)
> > ---
> >  arch/x86/include/asm/processor.h  |  6 +++++
> >  arch/x86/include/asm/shstk.h      | 21 +++++++++++++++
> >  arch/x86/include/uapi/asm/prctl.h |  6 +++++
> >  arch/x86/kernel/Makefile          |  2 ++
> >  arch/x86/kernel/process_64.c      |  7 ++++-
> >  arch/x86/kernel/shstk.c           | 44
> > +++++++++++++++++++++++++++++++
> >  6 files changed, 85 insertions(+), 1 deletion(-)
> >  create mode 100644 arch/x86/include/asm/shstk.h
> >  create mode 100644 arch/x86/kernel/shstk.c
> 
> ...
> 
> > +long shstk_prctl(struct task_struct *task, int option, unsigned
> > long features)
> > +{
> > +	if (option == ARCH_SHSTK_LOCK) {
> > +		task->thread.features_locked |= features;
> > +		return 0;
> > +	}
> > +
> > +	/* Don't allow via ptrace */
> > +	if (task != current)
> > +		return -EINVAL;
> > +
> > +	/* Do not allow to change locked features */
> > +	if (features & task->thread.features_locked)
> > +		return -EPERM;
> > +
> > +	/* Only support enabling/disabling one feature at a time. */
> > +	if (hweight_long(features) > 1)
> > +		return -EINVAL;
> > +
> > +	if (option == ARCH_SHSTK_DISABLE) {
> > +		return -EINVAL;
> > +	}
> 
> {} braces left over from some previous version. Can go now.
> 

This was intentional, but I wasn't sure about it. The reason is that
it makes the diff cleaner in later patches.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot
  2023-03-08  9:23   ` Borislav Petkov
@ 2023-03-08 23:35     ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-08 23:35 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Wed, 2023-03-08 at 10:23 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:43PM -0800, Rick Edgecombe wrote:
> > When user shadow stack is use, Write=0,Dirty=1 is treated by the
> > CPU as
> 
>                            ^
>                            in

Oops, yes.
> 
> > shadow stack memory. So for shadow stack memory this bit
> > combination is
> > valid, but when Dirty=1,Write=1 (conventionally writable) memory is
> > being
> > write protected, the kernel has been taught to transition the
> > Dirty=1
> > bit to SavedDirty=1, to avoid inadvertently creating shadow stack
> > memory. It does this inside pte_wrprotect() because it knows the
> > PTE is
> > not intended to be a writable shadow stack entry, it is supposed to
> > be
> > write protected.
> 
> 
> > 
> > However, when a PTE is created by a raw prot using mk_pte(),
> > mk_pte()
> > can't know whether to adjust Dirty=1 to SavedDirty=1. It can't
> > distinguish between the caller intending to create a shadow stack
> > PTE or
> > needing the SavedDirty shift.
> > 
> > The kernel has been updated to not do this, and so Write=0,Dirty=1
> > memory should only be created by the pte_mkfoo() helpers. Add a
> > warning
> > to make sure no new mk_pte() start doing this.
> 
> Might wanna add the note from below here:
> 
> "... start doing this, like, for example, set_memory_rox() did."

Fine by me.

Thanks.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 26/41] mm: Warn on shadow stack memory in wrong vma
  2023-03-08  8:53   ` Borislav Petkov
@ 2023-03-08 23:36     ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-08 23:36 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Wed, 2023-03-08 at 09:53 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:42PM -0800, Rick Edgecombe wrote:
> > The x86 Control-flow Enforcement Technology (CET) feature includes
> > a new
> > type of memory called shadow stack. This shadow stack memory has
> > some
> > unusual properties, which requires some core mm changes to function
> > properly.
> > 
> > One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
> > treated as shadow by the CPU, but this combination used to be
> > created by
> > the kernel on x86. Previous patches have changed the kernel to now
> > avoid
> > creating these PTEs unless they are for shadow stack memory. In
> > case any
> > missed corners of the kernel are still creating PTEs like this for
> > non-shadow stack memory, and to catch any re-introductions of the
> > logic,
> > warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-
> > shadow
> > stack VMAs when they are being zapped. This won't catch transient
> > cases
> > but should have decent coverage. It will be compiled out when
> > shadow
> > stack is not configured.
> > 
> > In order to check if a pte is shadow stack in core mm code, add two
> > arch
> 
> s/pte/PTE/

Yes, it matches the rest.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-08 23:32     ` Edgecombe, Rick P
@ 2023-03-09 12:57       ` Borislav Petkov
  2023-03-09 16:56         ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-09 12:57 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Wed, Mar 08, 2023 at 11:32:36PM +0000, Edgecombe, Rick P wrote:
> This would be for things like the "permissive mode", where glibc
> determines much later that it has to do something like dlopen() a DSO
> that doesn't support shadow stack.
> 
> But being able to lock the features late is required for glibc's normal
> working behavior as well. Glibc enables shadow stack very early, then
> disables it later if it finds that any of the normal dynamic libraries
> don't support it. It only locks shadow stack after this point, even in
> non-permissive mode.

So this all sounds weird. Especially from a user point of view.

Now let's imagine there's a Linux user called Boris and he goes and buys
a CPU which supports shadow stack, gets a distro which has shadow stack
enabled. All good.

Now, at some point he loads a program which pulls in an old library
which hasn't been enabled for shadow stack yet.

In the name of not breaking stuff, his glibc is configured in permissive
mode by default so that program loads and shadow stack for it is
disabled.

And Boris doesn't even know and continues on his merry way thinking that
he has all that cool ROP protection.

So where is the knob that says, "disable permissive mode"?

Or at least where does the user get a warning saying, "hey, this app
doesn't do shadow stack and we disabled it for ya so that it can still
work"?

Or am I way off?

I hope you're catching my drift. Because if there's no enforcement of
shstk and we do this permissive mode by default, this whole overhead is
just an unnecessary nuisance...

But maybe that'll come later and I should keep going through the set...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
  2023-03-08 20:03     ` Edgecombe, Rick P
@ 2023-03-09 14:12       ` Borislav Petkov
  2023-03-09 16:59         ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-09 14:12 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Wed, Mar 08, 2023 at 08:03:17PM +0000, Edgecombe, Rick P wrote:

Btw,

pls try to trim your replies as I need to scroll through pages of quoted
text to find the response.

> Sure. Sometimes people tell me to only ifdef out whole functions to
> make it easier to read. I suppose in this case it's not hard to see.

Yeah, the less ifdeffery we have, the better.

> If the default SSP value logic is too hidden, what about some clearer
> code and comments, like this?

The problem with this function is that it needs to return three things:

* success:
 ** 0
 or
 ** shadow stack address
* failure: due to allocation.

How about this below instead? (totally untested ofc):

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bf703f53fa49..6e323d4e32fc 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -142,7 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
-	unsigned long shstk_addr = 0;
+	unsigned long shstk_addr;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -178,10 +178,9 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 #endif
 
 	/* Allocate a new shadow stack for pthread if needed */
-	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
-				       &shstk_addr);
-	if (ret)
-		return ret;
+	shstk_addr = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+	if (IS_ERR_VALUE(shstk_addr))
+		return PTR_ERR((void *)shstk_addr);
 
 	fpu_clone(p, clone_flags, args->fn, shstk_addr);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 13c02747386f..b1668b499e9a 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -157,8 +157,8 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
-int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
-			     unsigned long stack_size, unsigned long *shstk_addr)
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+				       unsigned long stack_size)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
 	unsigned long addr, size;
@@ -180,14 +180,12 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 	size = adjust_shstk_size(stack_size);
 	addr = alloc_shstk(size);
 	if (IS_ERR_VALUE(addr))
-		return PTR_ERR((void *)addr);
+		return addr;
 
 	shstk->base = addr;
 	shstk->size = size;
 
-	*shstk_addr = addr + size;
-
-	return 0;
+	return addr + size;
 }
 
 static unsigned long get_user_shstk_addr(void)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 31/41] x86/shstk: Introduce routines modifying shstk
  2023-02-27 22:29 ` [PATCH v7 31/41] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
@ 2023-03-09 16:48   ` Borislav Petkov
  2023-03-09 17:03     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-09 16:48 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu

On Mon, Feb 27, 2023 at 02:29:47PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Shadow stacks are normally written to via CALL/RET or specific CET
				       ^
				       indirectly.

> instructions like RSTORSSP/SAVEPREVSSP. However during some Linux
> operations the kernel will need to write to directly using the ring-0 only

"However, sometimes the kernel will need to..."

> WRUSS instruction.
> 
> A shadow stack restore token marks a restore point of the shadow stack, and
> the address in a token must point directly above the token, which is within
> the same shadow stack. This is distinctively different from other pointers
> on the shadow stack, since those pointers point to executable code area.
> 
> Introduce token setup and verify routines. Also introduce WRUSS, which is
> a kernel-mode instruction but writes directly to user shadow stack.
> 
> In future patches that enable shadow stack to work with signals, the kernel
> will need something to denote the point in the stack where sigreturn may be
> called. This will prevent attackers calling sigreturn at arbitrary places
> in the stack, in order to help prevent SROP attacks.
> 
> To do this, something that can only be written by the kernel needs to be
> placed on the shadow stack. This can be accomplished by setting bit 63 in
> the frame written to the shadow stack. Userspace return addresses can't
> have this bit set as it is in the kernel range. It is also can't be a

s/is //

> valid restore token.

...

> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index de48d1389936..d6cd9344f6c7 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -202,6 +202,19 @@ static inline void clwb(volatile void *__p)
>  		: [pax] "a" (p));
>  }
>  
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> +{
> +	asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> +			  _ASM_EXTABLE(1b, %l[fail])
> +			  :: [addr] "r" (addr), [val] "r" (val)
> +			  :: fail);
> +	return 0;
> +fail:
> +	return -EFAULT;

Nice!

> +}
> +#endif /* CONFIG_X86_USER_SHADOW_STACK */
> +
>  #define nop() asm volatile ("nop")
>  
>  static inline void serialize(void)

...

> +static int put_shstk_data(u64 __user *addr, u64 data)
> +{
> +	if (WARN_ON_ONCE(data & BIT(63)))

Dunno, maybe something like:

/*
 * A comment explaining what that is...
 */
#define SHSTK_SIGRETURN_TOKEN	BIT_ULL(63)

or so?

And use that instead of that magical bit 63.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-09 12:57       ` Borislav Petkov
@ 2023-03-09 16:56         ` Edgecombe, Rick P
  2023-03-09 23:51           ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-09 16:56 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Yang, Weijiang, Lutomirski, Andy,
	jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz, debug,
	linux-doc, x86, andrew.cooper3, john.allen, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, 2023-03-09 at 13:57 +0100, Borislav Petkov wrote:
> So this all sounds weird. Especially from a user point of view.
> 
> Now let's imagine there's a Linux user called Boris and he goes and
> buys
> a CPU which supports shadow stack, gets a distro which has shadow
> stack
> enabled. All good.
> 
> Now, at some point he loads a program which pulls in an old library
> which hasn't been enabled for shadow stack yet.
> 
> In the name of not breaking stuff, his glibc is configured in
> permissive
> mode by default so that program loads and shadow stack for it is
> disabled.
> 
> And Boris doesn't even know and continues on his merry way thinking
> that
> he has all that cool ROP protection.

There is a /proc field that shows whether shadow stack is enabled in a
thread. It does indeed come later in the series.
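
With that patch, checking a thread would look something like this (the
field name is the one from the later patch in this series):

  $ grep x86_Thread_features /proc/$PID/status
  x86_Thread_features:    shstk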

> 
> So where is the knob that says, "disable permissive mode"?

glibc has an environment variable that can change the loader's
behavior. There is also a compile time config for the default mode. But
this "permissive mode" is a glibc thing. The kernel doesn't implement
it per se, just provides the building blocks.
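
(For the glibc side, from memory the knob is a tunable along the lines
of GLIBC_TUNABLES=glibc.cpu.x86_shstk=permissive; best to double-check
the glibc docs for the exact name and values.)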

> 
> Or at least where does the user get a warning saying, "hey, this app
> doesn't do shadow stack and we disabled it for ya so that it can
> still
> work"?
> 
> Or am I way off?

I don't think so. The whole "when to enable shadow stack" question is
thornier than it might seem though, and what we have here is based on
some trade-offs in the details.

> 
> I hope you're catching my drift. Because if there's no enforcement of
> shstk and we do this permissive mode by default, this whole overhead is
> just an unnecessary nuisance...

In the existing glibc patches, and this is highly related to glibc
behavior because the decisions around enabling and locking have been
pushed there, there are two reasons why shadow stack would get disabled
on a supporting executable after it gets enabled.
1. An executable is loaded and one of the shared objects (the ones that
come out of ldd) does not support shadow stack
2. An executable is loaded in permissive mode, and much later during
execution dlopen()s a DSO that does not support shadow stack.

One of the challenges with enabling shadow stack is that you only start
recording shadow stack history once you enable it. If you enable it
inside some function and then return out of that function, you
underflow the shadow stack and get a violation. So if the shadow stack
will be locked, it has to be enabled before the earliest point the
thread might later return past (for example, returning from main()).

So in 1, the existing logic of glibc is to enable shadow stack at the
very beginning of the loader. Then go through the whole loading/linking
process. If problems are found, disable shadow stack. If no problems
are found, then lock it.

I've wondered if this super early glibc enabling behavior is really
needed, and if glibc could instead enable it after processing the
libraries linked in the ELF. That would save the work of enabling and
disabling shadow stack in situations that don't support it. To me this
is the big wart in the whole thing, but I don't think the kernel can
help resolve it. If glibc can enable later, then we can combine the
locking and enabling into a single operation. But that only saves a
syscall, and it might prevent some other libc that needs to do things
the way glibc currently does from being able to make it work at all.

In 2, the enabling happens like normal and the locking is skipped, so
that shadow stack can be enabled during a dlopen(). But glibc
permissive mode promises more than it delivers. Since it can only
disable shadow stack per-thread, it leaves the other threads enabled.
Making a working permissive mode is sort of an unsolved problem. There
are some proposals to make it work in just userspace, and some that
would need additional kernel support. If you are interested I can go
into why per-process disabling is not straightforward.

So the locking is needed for the basic support in 1 and the weak
permissive mode in 2 uses it. I am considering this series to support
1, but people may end up using 2 to get some permissiveness. In
general the idea of this API is to push the enabling decisions into
userspace because that is where the information for making the decision
is known. We previously tried to add some batch operations to improve
the performance, but tglx had suggested to start with something simple.
So we end up with this simple composable API.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
  2023-03-09 14:12       ` Borislav Petkov
@ 2023-03-09 16:59         ` Edgecombe, Rick P
  2023-03-09 17:04           ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-09 16:59 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, akpm, Yang, Weijiang,
	Lutomirski, Andy, linux-doc, arnd, tglx, Schimpe, Christina,
	mike.kravetz, debug, x86, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov

On Thu, 2023-03-09 at 15:12 +0100, Borislav Petkov wrote:
> On Wed, Mar 08, 2023 at 08:03:17PM +0000, Edgecombe, Rick P wrote:
> 
> Btw,
> 
> pls try to trim your replies as I need to scroll through pages of quoted
> text to find the response.

Sure, sorry.


[...]
> 
> 
> > If the default SSP value logic is too hidden, what about some clearer
> > code and comments, like this?
> 
> The problem with this function is that it needs to return three
> things:
> 
> * success:
>  ** 0
>  or
>  ** shadow stack address
> * failure: due to allocation.
> 
> How about this below instead? (totally untested ofc):

Ah, I see what you were saying now. It looks like it will work to me if
you think it is better stylistically.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 32/41] x86/shstk: Handle signals for shadow stack
  2023-02-27 22:29 ` [PATCH v7 32/41] x86/shstk: Handle signals for shadow stack Rick Edgecombe
@ 2023-03-09 17:02   ` Borislav Petkov
  2023-03-09 17:16     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-09 17:02 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu

On Mon, Feb 27, 2023 at 02:29:48PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When a signal is handled normally the context is pushed to the stack

s/normally //

> before handling it. For shadow stacks, since the shadow stack only track's

"tracks"

> return addresses, there isn't any state that needs to be pushed. However,
> there are still a few things that need to be done. These things are
> userspace visible and which will be kernel ABI for shadow stacks.

"visible to userspace"

s/which //

> One is to make sure the restorer address is written to shadow stack, since
> the signal handler (if not changing ucontext) returns to the restorer, and
> the restorer calls sigreturn. So add the restorer on the shadow stack
> before handling the signal, so there is not a conflict when the signal
> handler returns to the restorer.
> 
> The other thing to do is to place some type of checkable token on the
> thread's shadow stack before handling the signal and check it during
> sigreturn. This is an extra layer of protection to hamper attackers
> calling sigreturn manually as in SROP-like attacks.
> 
> For this token we can use the shadow stack data format defined earlier.
		^^^

Please use passive voice in your commit message: no "we" or "I", etc.

> Have the data pushed be the previous SSP. In the future the sigreturn
> might want to return back to a different stack. Storing the SSP (instead
> of a restore offset or something) allows for future functionality that
> may want to restore to a different stack.
> 
> So, when handling a signal push
>  - the SSP pointing in the shadow stack data format
>  - the restorer address below the restore token.
> 
> In sigreturn, verify SSP is stored in the data format and pop the shadow
> stack.

...

> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 13c02747386f..40f0a55762a9 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -232,6 +232,104 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
>  	return 0;
>  }
>  
> +static int shstk_push_sigframe(unsigned long *ssp)
> +{
> +	unsigned long target_ssp = *ssp;
> +
> +	/* Token must be aligned */
> +	if (!IS_ALIGNED(*ssp, 8))
> +		return -EINVAL;
> +
> +	if (!IS_ALIGNED(target_ssp, 8))
> +		return -EINVAL;

Those two statements are identical AFAICT.

> +	*ssp -= SS_FRAME_SIZE;
> +	if (put_shstk_data((void *__user)*ssp, target_ssp))
> +		return -EFAULT;
> +
> +	return 0;
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 31/41] x86/shstk: Introduce routines modifying shstk
  2023-03-09 16:48   ` Borislav Petkov
@ 2023-03-09 17:03     ` Edgecombe, Rick P
  2023-03-09 17:22       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-09 17:03 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Thu, 2023-03-09 at 17:48 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:47PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > Shadow stacks are normally written to via CALL/RET or specific CET
> 
>                                        ^
>                                        indirectly.

Dunno here, RSTORSSP/SAVEPREVSSP are kind of direct.

> 
> > instructions like RSTORSSP/SAVEPREVSSP. However during some Linux
> > operations the kernel will need to write to directly using the
> > ring-0 only
> 
> "However, sometimes the kernel will need to..."

Ok.

> 
> > WRUSS instruction.
> > 
> > A shadow stack restore token marks a restore point of the shadow
> > stack, and
> > the address in a token must point directly above the token, which
> > is within
> > the same shadow stack. This is distinctively different from other
> > pointers
> > on the shadow stack, since those pointers point to executable code
> > area.
> > 
> > Introduce token setup and verify routines. Also introduce WRUSS,
> > which is
> > a kernel-mode instruction but writes directly to user shadow stack.
> > 
> > In future patches that enable shadow stack to work with signals,
> > the kernel
> > will need something to denote the point in the stack where
> > sigreturn may be
> > called. This will prevent attackers calling sigreturn at arbitrary
> > places
> > in the stack, in order to help prevent SROP attacks.
> > 
> > To do this, something that can only be written by the kernel needs
> > to be
> > placed on the shadow stack. This can be accomplished by setting bit
> > 63 in
> > the frame written to the shadow stack. Userspace return addresses
> > can't
> > have this bit set as it is in the kernel range. It is also can't be
> > a
> 
> s/is //

Yep, thanks.

> 
> > valid restore token.
> 
> ...
> 
> > diff --git a/arch/x86/include/asm/special_insns.h
> > b/arch/x86/include/asm/special_insns.h
> > index de48d1389936..d6cd9344f6c7 100644
> > --- a/arch/x86/include/asm/special_insns.h
> > +++ b/arch/x86/include/asm/special_insns.h
> > @@ -202,6 +202,19 @@ static inline void clwb(volatile void *__p)
> >                : [pax] "a" (p));
> >   }
> >   
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> > +{
> > +     asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> > +                       _ASM_EXTABLE(1b, %l[fail])
> > +                       :: [addr] "r" (addr), [val] "r" (val)
> > +                       :: fail);
> > +     return 0;
> > +fail:
> > +     return -EFAULT;
> 
> Nice!
> 
> > +}
> > +#endif /* CONFIG_X86_USER_SHADOW_STACK */
> > +
> >   #define nop() asm volatile ("nop")
> >   
> >   static inline void serialize(void)
> 
> ...
> 
> > +static int put_shstk_data(u64 __user *addr, u64 data)
> > +{
> > +     if (WARN_ON_ONCE(data & BIT(63)))
> 
> Dunno, maybe something like:
> 
> /*
>  * A comment explaining what that is...
>  */
> #define SHSTK_SIGRETURN_TOKEN   BIT_ULL(63)
> 
> or so?
> 
> And use that instead of that magical bit 63.

Seems very reasonable. Since we are calling this the "data format", I
might go with SHSTK_DATA_BIT.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
  2023-03-09 16:59         ` Edgecombe, Rick P
@ 2023-03-09 17:04           ` Borislav Petkov
  2023-03-09 20:29             ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-09 17:04 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, akpm, Yang, Weijiang,
	Lutomirski, Andy, linux-doc, arnd, tglx, Schimpe, Christina,
	mike.kravetz, debug, x86, jamorris, john.allen, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov

On Thu, Mar 09, 2023 at 04:59:52PM +0000, Edgecombe, Rick P wrote:
> Ah, I see what you were saying now. It looks like it will work to me if
> you think it is better stylistically.

Yeah, having a function return an error *and* an I/O parameter at the
same time is more complicated and error prone than having a single
retval and only input parameters.

I'd say.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 32/41] x86/shstk: Handle signals for shadow stack
  2023-03-09 17:02   ` Borislav Petkov
@ 2023-03-09 17:16     ` Edgecombe, Rick P
  2023-03-09 23:35       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-09 17:16 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Thu, 2023-03-09 at 18:02 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:48PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > When a signal is handled normally the context is pushed to the
> > stack
> 
> s/normally //

It is trying to say "When a signal is handled without shadow stack, the
context is pushed to the stack"

> 
> > before handling it. For shadow stacks, since the shadow stack only
> > track's
> 
> "tracks"

Right.

> 
> > return addresses, there isn't any state that needs to be pushed.
> > However,
> > there are still a few things that need to be done. These things are
> > userspace visible and which will be kernel ABI for shadow stacks.
> 
> "visible to userspace"

Sure.

> 
> s/which //

Ok.

> 
> > One is to make sure the restorer address is written to shadow
> > stack, since
> > the signal handler (if not changing ucontext) returns to the
> > restorer, and
> > the restorer calls sigreturn. So add the restorer on the shadow
> > stack
> > before handling the signal, so there is not a conflict when the
> > signal
> > handler returns to the restorer.
> > 
> > The other thing to do is to place some type of checkable token on
> > the
> > thread's shadow stack before handling the signal and check it
> > during
> > sigreturn. This is an extra layer of protection to hamper attackers
> > calling sigreturn manually as in SROP-like attacks.
> > 
> > For this token we can use the shadow stack data format defined
> > earlier.
> 
> 		^^^
> 
> Please use passive voice in your commit message: no "we" or "I", etc.

Argh, right. And it looks like I wrote this one.

> 
> > Have the data pushed be the previous SSP. In the future the
> > sigreturn
> > might want to return back to a different stack. Storing the SSP
> > (instead
> > of a restore offset or something) allows for future functionality
> > that
> > may want to restore to a different stack.
> > 
> > So, when handling a signal push
> >  - the SSP pointing in the shadow stack data format
> >  - the restorer address below the restore token.
> > 
> > In sigreturn, verify SSP is stored in the data format and pop the
> > shadow
> > stack.
> 
> ...
> 
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index 13c02747386f..40f0a55762a9 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -232,6 +232,104 @@ static int get_shstk_data(unsigned long
> > *data, unsigned long __user *addr)
> >  	return 0;
> >  }
> >  
> > +static int shstk_push_sigframe(unsigned long *ssp)
> > +{
> > +	unsigned long target_ssp = *ssp;
> > +
> > +	/* Token must be aligned */
> > +	if (!IS_ALIGNED(*ssp, 8))
> > +		return -EINVAL;
> > +
> > +	if (!IS_ALIGNED(target_ssp, 8))
> > +		return -EINVAL;
> 
> Those two statements are identical AFAICT.

Uhh, yes they are. Not sure what happened here.

> 
> > +	*ssp -= SS_FRAME_SIZE;
> > +	if (put_shstk_data((void *__user)*ssp, target_ssp))
> > +		return -EFAULT;
> > +
> > +	return 0;
> > +}
> 
> 

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 31/41] x86/shstk: Introduce routines modifying shstk
  2023-03-09 17:03     ` Edgecombe, Rick P
@ 2023-03-09 17:22       ` Borislav Petkov
  0 siblings, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-09 17:22 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Thu, Mar 09, 2023 at 05:03:26PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2023-03-09 at 17:48 +0100, Borislav Petkov wrote:
> > On Mon, Feb 27, 2023 at 02:29:47PM -0800, Rick Edgecombe wrote:
> > > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > > 
> > > Shadow stacks are normally written to via CALL/RET or specific CET
> > 
> >                                        ^
> >                                        indirectly.
> 
> Dunno here, RSTORSSP/SAVEPREVSSP are kind of direct.
> 
> > 
> > > instructions like RSTORSSP/SAVEPREVSSP. However during some Linux
> > > operations the kernel will need to write to directly using the
						  ^^^^^^^^^

Yes, I was trying to make the contrast more obvious because you say
"directly" here.

But not too important.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-02 17:22   ` Szabolcs Nagy
  2023-03-02 21:21     ` Edgecombe, Rick P
@ 2023-03-09 18:55     ` Deepak Gupta
  2023-03-09 19:39       ` Edgecombe, Rick P
  2023-03-14  7:19       ` Mike Rapoport
  1 sibling, 2 replies; 159+ messages in thread
From: Deepak Gupta @ 2023-03-09 18:55 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, nd, al.grant

On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote:
>The 02/27/2023 14:29, Rick Edgecombe wrote:
>> Previously, a new PROT_SHADOW_STACK was attempted,
>...
>> So rather than repurpose two existing syscalls (mmap, madvise) that don't
>> quite fit, just implement a new map_shadow_stack syscall to allow
>> userspace to map and setup new shadow stacks in one step. While ucontext
>> is the primary motivator, userspace may have other unforeseen reasons to
>> setup it's own shadow stacks using the WRSS instruction. Towards this
>> provide a flag so that stacks can be optionally setup securely for the
>> common case of ucontext without enabling WRSS. Or potentially have the
>> kernel set up the shadow stack in some new way.
>...
>> The following example demonstrates how to create a new shadow stack with
>> map_shadow_stack:
>> void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
>
>i think
>
>mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);
>
>could do the same with less disruption to users (new syscalls
>are harder to deal with than new flags). it would do the
>guard page and initial token setup too (there is no flag for
>it but could be squeezed in).

Discussion on this topic in v6
https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/

Again, I know earlier CET patches had a protection flag, and due to pushback
on the mailing list it was decided to go with a special syscall because no one
else had shadow stack.

Seeing a response from Szabolcs, I am assuming arm64 would also want to follow
in using mmap to manufacture shadow stack. For reference, the RFC patches for
risc-v shadow stack use a new protection flag = PROT_SHADOWSTACK.
https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/

I know the earlier discussion had been to let this go and do a refactor later
as other arch support trickles in. But as I thought more on this, I think it
may just be messy from the user mode point of view as well to have to be aware
of two different ways of creating shadow stack: a special syscall (in current
libc) and `mmap` (whenever the future refactor happens).

If it's not too late, it would be wiser to take the `mmap` approach rather
than the special `syscall` approach.


>
>most of the mmap features need not be available (EINVAL) when
>MAP_SHADOW_STACK is specified.
>
>the main drawback is running out of mmap flags so extension
>is limited. (but the new syscall has limitations too).

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-09 18:55     ` Deepak Gupta
@ 2023-03-09 19:39       ` Edgecombe, Rick P
  2023-03-09 21:08         ` Deepak Gupta
  2023-03-14  7:19       ` Mike Rapoport
  1 sibling, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-09 19:39 UTC (permalink / raw)
  To: debug, szabolcs.nagy
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	bp, oleg, hjl.tools, pavel, Lutomirski, Andy, linux-doc, arnd,
	tglx, Schimpe, Christina, mike.kravetz, x86, Yang, Weijiang,
	al.grant, jamorris, john.allen, rppt, nd, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov, akpm

On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote:
> On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote:
> > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > Previously, a new PROT_SHADOW_STACK was attempted,
> > 
> > ...
> > > So rather than repurpose two existing syscalls (mmap, madvise)
> > > that don't
> > > quite fit, just implement a new map_shadow_stack syscall to allow
> > > userspace to map and setup new shadow stacks in one step. While
> > > ucontext
> > > is the primary motivator, userspace may have other unforeseen
> > > reasons to
> > > setup it's own shadow stacks using the WRSS instruction. Towards
> > > this
> > > provide a flag so that stacks can be optionally setup securely
> > > for the
> > > common case of ucontext without enabling WRSS. Or potentially
> > > have the
> > > kernel set up the shadow stack in some new way.
> > 
> > ...
> > > The following example demonstrates how to create a new shadow
> > > stack with
> > > map_shadow_stack:
> > > void *shstk = map_shadow_stack(addr, stack_size,
> > > SHADOW_STACK_SET_TOKEN);
> > 
> > i think
> > 
> > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);
> > 
> > could do the same with less disruption to users (new syscalls
> > are harder to deal with than new flags). it would do the
> > guard page and initial token setup too (there is no flag for
> > it but could be squeezed in).
> 
> Discussion on this topic in v6
> 
https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/
> 
> Again I know earlier CET patches had protection flag and somehow due
> to pushback
> on mailing list,
>  it was adopted to go for special syscall because no one else
> had shadow stack.
> 
> Seeing a response from Szabolcs, I am assuming arm4 would also want
> to follow
> using mmap to manufacture shadow stack. For reference RFC patches for
> risc-v shadow stack,
> use a new protection flag = PROT_SHADOWSTACK.
> 
https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
> 
> I know earlier discussion had been that we let this go and do a re-
> factor later as other
> arch support trickle in. But as I thought more on this and I think it
> may just be
> messy from user mode point of view as well to have cognition of two
> different ways of
> creating shadow stack. One would be special syscall (in current libc)
> and another `mmap`
> (whenever future re-factor happens)
> 
> If it's not too late, it would be more wise to take `mmap`
> approach rather than special `syscall` approach.

There is sort of two things intermixed here when we talk about a
PROT_SHADOW_STACK.

One is: what is the interface for specifying how the shadow stack
should be provisioned with data? Right now there are two ways
supported, all zero or with an X86 shadow stack restore token at the
end. Then there was already some conversation about a third type. In
which case the question would be is using mmap MAP_ flags the right
place for this? How many types of initialization will be needed in the
end and what is the overlap between the architectures?

The other thing is: should shadow stack memory creation be tightly
controlled? For example in x86 we limit this to anonymous memory, etc.
Some reasons for this are x86 specific, but some are not. So if we
disallow most of the options, why allow the interface to take them? And
then you are in the position of carefully maintaining a list of not-
allowed options instead of letting a list of allowed options sit there.

The only benefit I've heard is that it avoids creating a new syscall,
but the new syscall likewise avoids using up several MAP_ flags. That,
and that the RFC for riscv did a PROT_SHADOW_STACK to start. So, yes,
two people asked the same question, but I'm still not seeing any
benefits. Can you give the pros and cons please?

BTW, in glibc map_shadow_stack is called from arch code. So I think
userspace wise, for this to affect other architectures there would need
to be some code that could do things generically, with somehow the
shadow stack pivot abstracted but the shadow stack allocation not.
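
FWIW, from userspace the syscall is just a thin wrapper, something like
this (sketch; the syscall number below is a placeholder for whatever
gets assigned):

#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack	453	/* placeholder */
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)	/* write a restore token at the top */
#endif

static void *map_shadow_stack(void *addr, size_t size, unsigned int flags)
{
	return (void *)syscall(__NR_map_shadow_stack, addr, size, flags);
}

/* e.g.: void *shstk = map_shadow_stack(NULL, size, SHADOW_STACK_SET_TOKEN); */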

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
  2023-03-09 17:04           ` Borislav Petkov
@ 2023-03-09 20:29             ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-09 20:29 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, akpm, Lutomirski, Andy, Yang,
	Weijiang, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	linux-doc, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Thu, 2023-03-09 at 18:04 +0100, Borislav Petkov wrote:
> On Thu, Mar 09, 2023 at 04:59:52PM +0000, Edgecombe, Rick P wrote:
> > Ah, I see what you were saying now. It looks like it will work to
> > me if
> > you think it is better stylistically.
> 
> Yeah, having a function return an error *and* an I/O parameter at the
> same time is more complicated and error prone than when you have
> a single retval and only input parameters.
> 
> I'd say.

Yea, I agree it's better this way, and at first I just missed your
point. By "if you think it's better", I just meant that if someone told
me to do it the other way I wouldn't die on that hill.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-09 19:39       ` Edgecombe, Rick P
@ 2023-03-09 21:08         ` Deepak Gupta
  2023-03-10  0:14           ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Deepak Gupta @ 2023-03-09 21:08 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: szabolcs.nagy, david, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, pavel, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	Yang, Weijiang, al.grant, jamorris, john.allen, rppt, nd,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm

On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P wrote:
>On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote:
>> On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote:
>> > The 02/27/2023 14:29, Rick Edgecombe wrote:
>> > > Previously, a new PROT_SHADOW_STACK was attempted,
>> >
>> > ...
>> > > So rather than repurpose two existing syscalls (mmap, madvise)
>> > > that don't
>> > > quite fit, just implement a new map_shadow_stack syscall to allow
>> > > userspace to map and setup new shadow stacks in one step. While
>> > > ucontext
>> > > is the primary motivator, userspace may have other unforeseen
>> > > reasons to
>> > > setup it's own shadow stacks using the WRSS instruction. Towards
>> > > this
>> > > provide a flag so that stacks can be optionally setup securely
>> > > for the
>> > > common case of ucontext without enabling WRSS. Or potentially
>> > > have the
>> > > kernel set up the shadow stack in some new way.
>> >
>> > ...
>> > > The following example demonstrates how to create a new shadow
>> > > stack with
>> > > map_shadow_stack:
>> > > void *shstk = map_shadow_stack(addr, stack_size,
>> > > SHADOW_STACK_SET_TOKEN);
>> >
>> > i think
>> >
>> > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);
>> >
>> > could do the same with less disruption to users (new syscalls
>> > are harder to deal with than new flags). it would do the
>> > guard page and initial token setup too (there is no flag for
>> > it but could be squeezed in).
>>
>> Discussion on this topic in v6
>>
>https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/
>>
>> Again I know earlier CET patches had protection flag and somehow due
>> to pushback
>> on mailing list,
>>  it was adopted to go for special syscall because no one else
>> had shadow stack.
>>
>> Seeing a response from Szabolcs, I am assuming arm64 would also want
>> to follow
>> using mmap to manufacture shadow stack. For reference RFC patches for
>> risc-v shadow stack,
>> use a new protection flag = PROT_SHADOWSTACK.
>>
>https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
>>
>> I know earlier discussion had been that we let this go and do a re-
>> factor later as other
>> arch support trickle in. But as I thought more on this and I think it
>> may just be
>> messy from user mode point of view as well to have cognition of two
>> different ways of
>> creating shadow stack. One would be special syscall (in current libc)
>> and another `mmap`
>> (whenever future re-factor happens)
>>
>> If it's not too late, it would be more wise to take `mmap`
>> approach rather than special `syscall` approach.
>
>There are sort of two things intermixed here when we talk about a
>PROT_SHADOW_STACK.
>
>One is: what is the interface for specifying how the shadow stack
>should be provisioned with data? Right now there are two ways
>supported, all zero or with an X86 shadow stack restore token at the
>end. Then there was already some conversation about a third type. In
>which case the question would be is using mmap MAP_ flags the right
>place for this? How many types of initialization will be needed in the
>end and what is the overlap between the architectures?

First of all, arches can choose to have a token at the bottom or not.

A token serves the following purposes:
  - It allows one to put a desired value in the shadow stack pointer in a safe/secure manner.
    Note: x86 doesn't provide any opcode encoding to write a value into the SSP register. So having
    a token is kind of a necessity, because x86 doesn't easily allow writing the shadow stack.

  - A token at the bottom acts as a marker / barrier and can be useful in debugging.

  - If (and a big *if*) we ever reach a point in the future where the return address is only pushed
    on the shadow stack (x86 should have motivation to do this because of fewer uops on call/ret),
    a token at the bottom (bottom means lower address) is a sure-shot way of getting
    a fault when the stack is exhausted.

The current RISC-V zisslpcfi proposal doesn't define CPU-based tokens because it's RISC.
It provides mechanisms with which software can define the token format for itself.
Not sure what ARM is doing.

Now coming to the point of all zeros vs. a shadow stack token:
why not always have a token at the bottom?

In the case of x86, why the need for two ways, and why not always have a token at the bottom?
The way x86 is going, user mode is responsible for establishing the shadow stack, so
whenever a shadow stack is created the x86 kernel implementation could always place a token
at the base/bottom.

Now user mode can do the following:
  - If it has access to WRSS, it can go ahead and create a token of its choosing,
    overwrite the kernel-created token, and then do RSTORSSP on its own created token.

  - If it doesn't have access to WRSS (and doesn't need to create its own token), it can do
    RSTORSSP on it. As soon as it does, no other thread in the process can restore to it.
    On `fork`, you get the same un-restorable token.

So why not always have a token at the bottom?
This is my plan for the riscv implementation as well (to have a token at the bottom).
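
To make the token mechanics concrete, a rough x86-64 sketch of
consuming such a token (RSTORSSP pivots SSP to the token's location;
SAVEPREVSSP drops a "previous SSP" token so the old stack can be
switched back to). Illustrative only, not vetted code:

static void pivot_to_shadow_stack(void *base, size_t size)
{
	/* With SHADOW_STACK_SET_TOKEN, the kernel writes the restore
	 * token at the end of the mapping (base + size - 8). */
	void *token = (char *)base + size - 8;

	asm volatile("rstorssp (%0)\n\t"
		     "saveprevssp"
		     : : "r" (token) : "memory");
}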

>
>The other thing is: should shadow stack memory creation be tightly
>controlled? For example in x86 we limit this to anonymous memory, etc.
>Some reasons for this are x86 specific, but some are not. So if we
>disallow most of the options why allow the interface to take them? And
>then you are in the position of carefully maintaining a list of not-
>allowed options instead of letting a list of allowed options sit there.

I am new to the linux kernel and thus may not be able to follow the argument
about limiting this to anonymous memory.

Why is limiting it to anonymous memory a problem? IIRC, ARM's PROT_MTE is applicable
only to anonymous memory. I can probably find a few more examples.

Eventually the syscall will also go ahead and use memory management code to
perform the mapping. So I didn't understand the reasoning here. The same way the
syscall can limit it to anonymous memory, why can't mmap do the same if it sees
PROT_SHADOWSTACK?

>
>The only benefit I've heard is that it saves creating a new syscall,
>but it also saves several MAP_ flags. That, and that the RFC for riscv
>did a PROT_SHADOW_STACK to start. So, yes, two people asked the same
>question, but I'm still not seeing any benefits. Can you give the pros
>and cons please?

Again, the same way the syscall will limit it to anonymous memory, why can't mmap do the same?
There is precedent for it (PROT_MTE is applicable only to anonymous memory).

So if it can be done, then why introduce a new syscall?

>
>BTW, in glibc map_shadow_stack is called from arch code. So I think
>userspace wise, for this to affect other architectures there would need
>to be some code that could do things generically, with somehow the
>shadow stack pivot abstracted but the shadow stack allocation not.

Agreed, yes it can be done in a way where it won't put a tax on other architectures.

But what about fragmentation within x86? Will x86 always choose to use the system
call method to map a shadow stack? If a future re-factor results in x86 also using
the `mmap` method, isn't it a mess for x86 glibc to figure out what to do; whether
to use the system call or `mmap`?


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 32/41] x86/shstk: Handle signals for shadow stack
  2023-03-09 17:16     ` Edgecombe, Rick P
@ 2023-03-09 23:35       ` Borislav Petkov
  0 siblings, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-09 23:35 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Thu, Mar 09, 2023 at 05:16:42PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2023-03-09 at 18:02 +0100, Borislav Petkov wrote:
> > On Mon, Feb 27, 2023 at 02:29:48PM -0800, Rick Edgecombe wrote:
> > > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > > 
> > > When a signal is handled normally the context is pushed to the
> > > stack
> > 
> > s/normally //
> 
> It is trying to say "When a signal is handled without shadow stack, the
> context is pushed to the stack"

Yeah, I see that. But "normally" is implicit in the "normal" case,
without shadow stack.

So you don't really need to say "normally". :-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-09 16:56         ` Edgecombe, Rick P
@ 2023-03-09 23:51           ` Borislav Petkov
  2023-03-10  1:13             ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-09 23:51 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Yang, Weijiang, Lutomirski, Andy,
	jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz, debug,
	linux-doc, x86, andrew.cooper3, john.allen, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, Mar 09, 2023 at 04:56:37PM +0000, Edgecombe, Rick P wrote:
> There is a proc that shows if shadow stack is enabled in a thread. It
> does indeed come later in the series.

Not good enough:

1. buried somewhere in proc where no one knows about it

2. it is per thread so user needs to grep *all*

>  ... We previously tried to add some batch operations to improve the
>  performance, but tglx had suggested to start with something simple.
>  So we end up with this simple composable API.

I agree with starting simple and thanks for explaining this in detail.

TBH, though, it already sounds like a mess to me. I guess a mess we'll
have to deal with because there will always be this case of some
shared object/lib not being enabled for shstk because of raisins.

And TBH #2, I would've done it even simpler: if some shared object can't
do shadow stack, we disable it for the whole process. I mean, what's the
point? Only some of the stack is shadowed, so an attacker could find
a way to make the process run this shstk-unsupporting shared
object more/longer and ROP its way around the system.

But I tend to oversimplify things sometimes so...

What I'd like to have, though, is a kernel cmdline param which disables
permissive mode and userspace can't do anything about it. So that once
you boot your kernel, you can know that everything that runs on the
machine has shstk and is properly protected.

Also, it'll allow for faster fixing of all those shared objects to use
shstk by way of political pressure.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-09 21:08         ` Deepak Gupta
@ 2023-03-10  0:14           ` Edgecombe, Rick P
  2023-03-10 21:00             ` Deepak Gupta
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-10  0:14 UTC (permalink / raw)
  To: debug
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, szabolcs.nagy,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, pavel, arnd, al.grant, tglx, mike.kravetz, x86, Schimpe,
	Christina, jamorris, john.allen, linux-doc, nd, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm

On Thu, 2023-03-09 at 13:08 -0800, Deepak Gupta wrote:
> On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P wrote:
> > On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote:
> > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote:
> > > > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > > > Previously, a new PROT_SHADOW_STACK was attempted,
> > > > 
> > > > ...
> > > > > So rather than repurpose two existing syscalls (mmap,
> > > > > madvise)
> > > > > that don't
> > > > > quite fit, just implement a new map_shadow_stack syscall to
> > > > > allow
> > > > > userspace to map and setup new shadow stacks in one step.
> > > > > While
> > > > > ucontext
> > > > > is the primary motivator, userspace may have other unforeseen
> > > > > reasons to
> > > > > setup it's own shadow stacks using the WRSS instruction.
> > > > > Towards
> > > > > this
> > > > > provide a flag so that stacks can be optionally setup
> > > > > securely
> > > > > for the
> > > > > common case of ucontext without enabling WRSS. Or potentially
> > > > > have the
> > > > > kernel set up the shadow stack in some new way.
> > > > 
> > > > ...
> > > > > The following example demonstrates how to create a new shadow
> > > > > stack with
> > > > > map_shadow_stack:
> > > > > void *shstk = map_shadow_stack(addr, stack_size,
> > > > > SHADOW_STACK_SET_TOKEN);
> > > > 
> > > > i think
> > > > 
> > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);
> > > > 
> > > > could do the same with less disruption to users (new syscalls
> > > > are harder to deal with than new flags). it would do the
> > > > guard page and initial token setup too (there is no flag for
> > > > it but could be squeezed in).
> > > 
> > > Discussion on this topic in v6
> > > 
> > 
> > 
https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/
> > > 
> > > Again I know earlier CET patches had protection flag and somehow
> > > due
> > > to pushback
> > > on mailing list,
> > >  it was adopted to go for special syscall because no one else
> > > had shadow stack.
> > > 
> > > Seeing a response from Szabolcs, I am assuming arm64 would also
> > > want
> > > to follow
> > > using mmap to manufacture shadow stack. For reference RFC patches
> > > for
> > > risc-v shadow stack,
> > > use a new protection flag = PROT_SHADOWSTACK.
> > > 
> > 
> > 
https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
> > > 
> > > I know earlier discussion had been that we let this go and do a
> > > re-
> > > factor later as other
> > > arch support trickle in. But as I thought more on this and I
> > > think it
> > > may just be
> > > messy from user mode point of view as well to have cognition of
> > > two
> > > different ways of
> > > creating shadow stack. One would be special syscall (in current
> > > libc)
> > > and another `mmap`
> > > (whenever future re-factor happens)
> > > 
> > > If it's not too late, it would be more wise to take `mmap`
> > > approach rather than special `syscall` approach.
> > 
> > There are sort of two things intermixed here when we talk about a
> > PROT_SHADOW_STACK.
> > 
> > One is: what is the interface for specifying how the shadow stack
> > should be provisioned with data? Right now there are two ways
> > supported, all zero or with an X86 shadow stack restore token at
> > the
> > end. Then there was already some conversation about a third type.
> > In
> > which case the question would be is using mmap MAP_ flags the right
> > place for this? How many types of initialization will be needed in
> > the
> > end and what is the overlap between the architectures?
> 
> First of all, arches can choose to have token at the bottom or not.
> 
> A token serves the following purposes
>   - It allows one to put desired value in shadow stack pointer in
> safe/secure manner.
>     Note: x86 doesn't provide any opcode encoding to value in SSP
> register. So having
>     a token is kind of a necessity because x86 doesn't easily allow
> writing shadow stack.
> 
>   - A token at the bottom acts as a marker / barrier and can be useful in
> debugging
> 
>   - If (and a big *if*) we ever reach a point in future where return
> address is only pushed
>     on shadow stack (x86 should have motivation to do this because
> less uops on call/ret),
>     a token at the bottom (bottom means lower address) is a sure-shot
> way of getting
>     a fault when exhausted.
> 
> Current RISCV zisslpcfi proposal doesn't define CPU based tokens
> because it's RISC.
> It allows mechanisms using which software can define formatting of
> token for itself.
> Not sure of what ARM is doing.

Ok, so riscv doesn't need to have the kernel write the token, but x86
does.

> 
> Now coming to the point of all zero v/s shadow stack token.
> Why not always have token at the bottom?

With WRSS you can set up the shadow stack however you want. So the user
would then have to take care to erase the token if they didn't want it.
Not the end of the world, but kind of clunky if there is no reason for
it.

> 
> In case of x86, Why need for two ways and why not always have a token
> at the bottom.
> The way x86 is going, user mode is responsible for establishing
> shadow stack and thus
> whenever shadow stack is created then if x86 kernel implementation
> always place a token
> at the base/bottom.

There was also some discussion recently of adding a token AND an end of
stack marker, as a potential solution for backtracing in ucontext
stacks. In this case it could cause an ABI break to just start adding
the end of stack marker where the token was, and so would require a new
map_shadow_stack flag.

> 
> Now user mode can do the following:
>   - If it has access to WRSS, it can sure go ahead and create a token
> of its choosing and
>     overwrite the kernel-created token, and then do RSTORSSP on its own
> created token.
> 
>   - If it doesn't have access to WRSS (and doesn't need to create its
> own token), it can do
>     RSTORSSP on this. As soon as it does, no other thread in process
> can restore to it.
>     On `fork`, you get the same un-restorable token.
> 
> So why not always have a token at the bottom.
> This is my plan for riscv implementation as well (to have a token at
> the bottom)
> 
> > 
> > The other thing is: should shadow stack memory creation be tightly
> > controlled? For example in x86 we limit this to anonymous memory,
> > etc.
> > Some reasons for this are x86 specific, but some are not. So if we
> > disallow most of the options why allow the interface to take them?
> > And
> > then you are in the position of carefully maintaining a list of
> > not-
> > allowed options instead of letting a list of allowed options sit
> > there.
> 
> I am new to linux kernel and thus may be not able to follow the
> argument of
> limiting to anonymous memory.
> 
> Why is limiting it to anonymous memory a problem. IIRC, ARM's
> PROT_MTE is applicable
> only to anonymous memory. I can probably find few more examples. 

Oh I see, they have a special arch VMA flag VM_MTE_ALLOWED that only
gets set if all the rules are followed. Then PROT_MTE can only be set
on that to set VM_MTE. That is kind of nice because certain other
special situations can choose to support it.

It does take another arch vma flag though. For x86 I guess I would need
to figure out how to squeeze VM_SHADOW_STACK into other flags to have a
free flag to use the same method. It also only supports mprotect() and
shadow stack would only want to support mmap(). And you still have the
initialization stuff to plumb through. Yea, I think the PROT_MTE is a
good thing to consider, but it's not super obvious to me how similar
the logic would be for shadow stack.

The question I'm asking though is not "can mmap code and rules be
changed to enforce the required limitations?". I think the answer is
yes. But the question is "why is that plumbing better than a new
syscall?". I
guess to get a better idea, the mmap solution would need to get POCed.
I had half done this at one point, but abandoned the approach.

For your question about why limit it, the special x86 case is the
Dirty=1,Write=0 PTE bit combination for shadow stacks. So for shadow
stack you could have some confusion about whether a PTE is actually
dirty for writeback, etc. I wouldn't say it's known to be impossible to
do MAP_SHARED, but it has not been fully analyzed enough to know what
the changes would be. There were some solvable concrete issues that
tipped the scale as well. It was also not expected to be a common
usage, if at all.

The non-x86, general reasons for it are a smaller benefit. It
blocks a lot of ways shadow stack memory could be written to. Like say
you have a memory mapped writable file, and you also map it shadow
stack. So it has better security properties depending on what your
threat model is.
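
Roughly, the encoding check looks like this (a sketch; the real
helpers in this series live in arch/x86/include/asm/pgtable.h):

static inline bool is_shadow_stack_pte(pte_t pte)
{
	/* Shadow stack PTEs use the otherwise-impossible combination
	 * Write=0,Dirty=1, hence the potential confusion with pages
	 * that are genuinely dirty for writeback. */
	return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}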

> 
> Eventually syscall will also go ahead and use memory management code
> to
> perform mapping. So I didn't understand the reasoning here. The way
> syscall
> can limit it to anonymous memory, why mmap can't do the same if it
> sees
> PROT_SHADOWSTACK.
> 
> > 
> > The only benefit I've heard is that it saves creating a new
> > syscall,
> > but it also saves several MAP_ flags. That, and that the RFC for
> > riscv
> > did a PROT_SHADOW_STACK to start. So, yes, two people asked the
> > same
> > question, but I'm still not seeing any benefits. Can you give the
> > pros
> > and cons please?
> 
> Again the way syscall will limit it to anonymous memory, Why mmap
> can't do same?
> There is precedence for it (like PROT_MTE is applicable only to
> anonymous memory)
> 
> So if it can be done, then why introduce a new syscall?
> 
> > 
> > BTW, in glibc map_shadow_stack is called from arch code. So I think
> > userspace wise, for this to affect other architectures there would
> > need
> > to be some code that could do things generically, with somehow the
> > shadow stack pivot abstracted but the shadow stack allocation not.
> 
> Agreed, yes it can be done in a way where it won't put tax on other
> architectures.
> 
> But what about fragmentation within x86. Will x86 always choose to
> use system call
> method map shadow stack. If future re-factor results in x86 also use
> `mmap` method.
> Isn't it a mess for x86 glibc to figure out what to do; whether to
> use system call
> or `mmap`?
> 

Ok, so this is the downside I guess. What happens if we want to support
the other types of memory in the future and end up using mmap for this?
Then we have 15-20 lines of extra syscall wrapping code to maintain to
support legacy.

For the mmap solution, we have the downside of using extra MAP_ flags,
and *some* amount of currently unknown vm_flag and address range logic,
plus mmap arch breakouts to add to core MM. Like I said earlier, you
would need to POC it out to see how bad that looks and get some core MM
feedback on the new type of MAP flag usage. But, syscalls being pretty
straightforward, it would probably be *some* amount of added complexity
_now_ to support something that might happen in the future. I'm not
seeing either one as a landslide win.

It's kind of an eternal software design philosophical question, isn't
it? How much work should you do to prepare for things that might be
needed in the future? From what I've seen the balance in the kernel
seems to be to try not to paint yourself in to an ABI corner, but
otherwise let the kernel evolve naturally in response to real usages.
If anyone wants to correct this, please do. But otherwise I think the
new syscall is aligned with that.

TBH, you are making me wonder if I'm missing something. It seems you
strongly don't prefer this approach, but I'm not hearing any huge
potential negative impacts. And you also say it won't tax the riscv
implementation. Is it just that something smells bad here? Or would
it shrink the riscv series?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-09 23:51           ` Borislav Petkov
@ 2023-03-10  1:13             ` Edgecombe, Rick P
  2023-03-10  2:03               ` H.J. Lu
  2023-03-10 11:40               ` Borislav Petkov
  0 siblings, 2 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-10  1:13 UTC (permalink / raw)
  To: bp, joao
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Lutomirski, Andy, jamorris, arnd,
	tglx, Schimpe, Christina, mike.kravetz, debug, Yang, Weijiang,
	x86, andrew.cooper3, john.allen, linux-doc, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

+Joao regarding mixed mode designs

On Fri, 2023-03-10 at 00:51 +0100, Borislav Petkov wrote:
> On Thu, Mar 09, 2023 at 04:56:37PM +0000, Edgecombe, Rick P wrote:
> > There is a proc that shows if shadow stack is enabled in a thread.
> > It
> > does indeed come later in the series.
> 
> Not good enough:
> 
> 1. buried somewhere in proc where no one knows about it
> 
> 2. it is per thread so user needs to grep *all*

See "x86: Expose thread features in /proc/$PID/status" for the patch.
We could emit something in dmesg I guess? The logic would be:
 - Record the presence of elf SHSTK bit on exec
 - On shadow stack disable, if it had the elf bit, pr_info("bad!")
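
Something like this, sketched with invented field/helper names:

	/* at execve() time: */
	task->shstk_elf_marked = elf_has_shstk_bit(bprm);	/* hypothetical */

	/* on shadow stack disable: */
	if (task->shstk_elf_marked)
		pr_info("%s[%d]: shadow stack disabled despite ELF SHSTK bit\n",
			task->comm, task_pid_nr(task));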

> 
> >   ... We previously tried to add some batch operations to improve
> > the
> >   performance, but tglx had suggested to start with something
> > simple.
> >   So we end up with this simple composable API.
> 
> I agree with starting simple and thanks for explaining this in
> detail.
> 
> TBH, though, it already sounds like a mess to me. I guess a mess
> we'll
> have to deal with because there will always be this case of some
> shared object/lib not being enabled for shstk because of raisins.

The compatibility problems are totally the mess in this whole thing.
When you try to look at a "permissive" mode that actually works it gets
even more complex. Joao and I have been banging our heads on that
problem for months.

But there are some expected users of this that say: we compile and
check our known set of binaries, we won't get any surprises. So it's
more of a distro problem.

> 
> And TBH #2, I would've done it even simpler: if some shared object
> can't
> do shadow stack, we disable it for the whole process. I mean, what's
> the
> point? 

You mean a late loaded dlopen()ed DSO? The enabling logic can't know
this will happen ahead of time.

If you mean if the shared objects in the elf all support shadow stack,
then this is what happens. The complication is that the loader wants to
enable shadow stack before it has checked the elf libs so it doesn't
underflow the shadow stack when it returns from the function that does
this checking.

So it does:
1. Enable shadow stack
2. Call elf libs checking functions
3. If all good, lock shadow stack. Else, disable shadow stack.
4. Return from elf checking functions and if shstk is enabled, don't
underflow because it was enabled in step 1 and we have return addresses
from step 2 on the shadow stack.

I'm wondering if this can't be improved in glibc to look like:
1. Check elf libs, and record it somewhere
2. Wait until just the right spot
3. If all good, enable and lock shadow stack.

But it depends on the loader code design which I don't know well
enough.
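
For reference, the current sequence in arch_prctl() terms (a sketch
using the ARCH_SHSTK_* constants from this series, error handling
omitted, check_elf_libs() being a stand-in for the loader's checking
code):

	/* 1. Enable early, so returns from the checking code below have
	 *    matching shadow stack entries */
	arch_prctl(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);

	/* 2. Walk the ELF objects and check their SHSTK markings */
	bool ok = check_elf_libs();

	if (ok)
		/* 3. Lock the feature so nothing can disable it later */
		arch_prctl(ARCH_SHSTK_LOCK, ARCH_SHSTK_SHSTK);
	else
		/* ... or back out before any unprotected code runs */
		arch_prctl(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);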

> Only some of the stack is shadowed so an attacker could find
> a way to keep the process perhaps run this shstk-unsupporting shared
> object more/longer and ROP its way around the system.

I hope non-permissive mode is the standard usage eventually.

> 
> But I tend to oversimplify things sometimes so...
> 
> What I'd like to have, though, is a kernel cmdline param which
> disables
> permissive mode and userspace can't do anything about it. So that
> once
> you boot your kernel, you can know that everything that runs on the
> machine has shstk and is properly protected.

Szabolcs Nagy was commenting something similar in another thread, for
supporting kernel enforced security policies. I think the way to do it
would be to have the kernel detect the elf bit itself (like it used to)
and enable shadow stack on exec. If you can't rely on userspace to call
in to enable it, it's not clear at what point the kernel should check
that it did.

But then if you trigger off of the elf bit in the kernel, you get all
the regression issues of the old glibcs at that point. But it is
already an "I don't care if I crash" mode, so...

I think if you trust your libc, glibc could implement this in userspace
too. It would be useful even as a testing override.

> 
> Also, it'll allow for faster fixing of all those shared objects to
> use
> shstk by way of political pressure.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-10  1:13             ` Edgecombe, Rick P
@ 2023-03-10  2:03               ` H.J. Lu
  2023-03-10 20:00                 ` H.J. Lu
  2023-03-10 11:40               ` Borislav Petkov
  1 sibling, 1 reply; 159+ messages in thread
From: H.J. Lu @ 2023-03-10  2:03 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bp, joao, david, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, akpm, Lutomirski, Andy, jamorris, arnd,
	tglx, Schimpe, Christina, mike.kravetz, debug, Yang, Weijiang,
	x86, andrew.cooper3, john.allen, linux-doc, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, Mar 9, 2023 at 5:13 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> +Joao regarding mixed mode designs
>
> On Fri, 2023-03-10 at 00:51 +0100, Borislav Petkov wrote:
> > On Thu, Mar 09, 2023 at 04:56:37PM +0000, Edgecombe, Rick P wrote:
> > > There is a proc that shows if shadow stack is enabled in a thread.
> > > It
> > > does indeed come later in the series.
> >
> > Not good enough:
> >
> > 1. buried somewhere in proc where no one knows about it
> >
> > 2. it is per thread so user needs to grep *all*
>
> See "x86: Expose thread features in /proc/$PID/status" for the patch.
> We could emit something in dmesg I guess? The logic would be:
>  - Record the presence of elf SHSTK bit on exec
>  - On shadow stack disable, if it had the elf bit, pr_info("bad!")
>
> >
> > >   ... We previously tried to add some batch operations to improve
> > > the
> > >   performance, but tglx had suggested to start with something
> > > simple.
> > >   So we end up with this simple composable API.
> >
> > I agree with starting simple and thanks for explaining this in
> > detail.
> >
> > TBH, though, it already sounds like a mess to me. I guess a mess
> > we'll
> > have to deal with because there will always be this case of some
> > shared object/lib not being enabled for shstk because of raisins.
>
> The compatibility problems are totally the mess in this whole thing.
> When you try to look at a "permissive" mode that actually works it gets
> even more complex. Joao and I have been banging our heads on that
> problem for months.
>
> But there are some expected users of this that say: we compile and
> check our known set of binaries, we won't get any surprises. So it's
> more of a distro problem.
>
> >
> > And TBH #2, I would've done it even simpler: if some shared object
> > can't
> > do shadow stack, we disable it for the whole process. I mean, what's
> > the
> > point?
>
> You mean a late loaded dlopen()ed DSO? The enabling logic can't know
> this will happen ahead of time.
>
> If you mean if the shared objects in the elf all support shadow stack,
> then this is what happens. The complication is that the loader wants to
> enable shadow stack before it has checked the elf libs so it doesn't
> underflow the shadow stack when it returns from the function that does
> this checking.
>
> So it does:
> 1. Enable shadow stack
> 2. Call elf libs checking functions
> 3. If all good, lock shadow stack. Else, disable shadow stack.
> 4. Return from elf checking functions and if shstk is enabled, don't
> underflow because it was enabled in step 1 and we have return addresses
> from 2 on the shadow stack
>
> I'm wondering if this can't be improved in glibc to look like:
> 1. Check elf libs, and record it somewhere
> 2. Wait until just the right spot
> 3. If all good, enable and lock shadow stack.

I will try it out.

> But it depends on the loader code design which I don't know well
> enough.
>
> > Only some of the stack is shadowed so an attacker could find
> > a way to keep the process perhaps run this shstk-unsupporting shared
> > object more/longer and ROP its way around the system.
>
> I hope non-permissive mode is the standard usage eventually.
>
> >
> > But I tend to oversimplify things sometimes so...
> >
> > What I'd like to have, though, is a kernel cmdline param which
> > disables
> > permissive mode and userspace can't do anything about it. So that
> > once
> > you boot your kernel, you can know that everything that runs on the
> > machine has shstk and is properly protected.
>
> Szabolcs Nagy was commenting something similar in another thread, for
> supporting kernel enforced security policies. I think the way to do it
> would be to have the kernel detect the elf bit itself (like it used to)
> and enable shadow stack on exec. If you can't rely on userspace to call
> in to enable it, it's not clear at what point the kernel should check
> that it did.
>
> But then if you trigger off of the elf bit in the kernel, you get all
> the regression issues of the old glibcs at that point. But it is
> already an "I don't care if I crash" mode, so...
>
> I think if you trust your libc, glibc could implement this in userspace
> too. It would be useful even as a testing override.
>
> >
> > Also, it'll allow for faster fixing of all those shared objects to
> > use
> > shstk by way of political pressure.



-- 
H.J.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-10  1:13             ` Edgecombe, Rick P
  2023-03-10  2:03               ` H.J. Lu
@ 2023-03-10 11:40               ` Borislav Petkov
  1 sibling, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-10 11:40 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: joao, david, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, akpm, Lutomirski, Andy,
	jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz, debug,
	Yang, Weijiang, x86, andrew.cooper3, john.allen, linux-doc, rppt,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Fri, Mar 10, 2023 at 01:13:42AM +0000, Edgecombe, Rick P wrote:
> See "x86: Expose thread features in /proc/$PID/status" for the patch.
> We could emit something in dmesg I guess? The logic would be:

dmesg is just flaky: ring buffer can get overwritten, users don't see
it, ...

> The compatibility problems are totally the mess in this whole thing.
> When you try to look at a "permissive" mode that actually works it gets
> even more complex. Joao and I have been banging our heads on that
> problem for months.

Oh yeah, I'm soo NOT jealous. :-\

> But there are some expected users of this that say: we compile and
> check our known set of binaries, we won't get any surprises. So it's
> more of a distro problem.

I'm guessing what will happen here is that distros will gradually enable
shstk and once it is ubiquitous, there will be no reason to disable it
at all.

> You mean a late loaded dlopen()ed DSO? The enabling logic can't know
> this will happen ahead of time.

No, I meant the case where you start with shstk enabled and later
disable it when some lib does not support it.

From now on that whole process is marked as "cannot use shstk anymore"
and any other shared object that tries to use shstk simply doesn't get
it.

But meh, this makes the situation even more convoluted as the stuff that
has loaded before the first shstk-not-supporting lib already uses
shstk.

So you have half and half.

What a mess.

> I hope non-permissive mode is the standard usage eventually.

Yah.

> I think if you trust your libc, glibc could implement this in userspace
> too. It would be useful even as as testing override.

No, you cannot trust any userspace. And there are other libcs besides
glibc.

This should be a kernel parameter. I'm not saying we should do it now
but we should do it at some point.

So that user Boris again, he installs his new shiny distro, he checks
that all the use cases and software he uses there is already
shstk-enabled and then he goes and builds the kernel with

	CONFIG_X86_USER_SHADOW_STACK_STRICT=y

or supplies a cmdline param and from now on, nothing can run without
shstk. No checking, no trusting, no nothing.

We fail any thread creation which doesn't init shstk.
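
As a sketch of what such a knob could look like (the parameter name
and variable are hypothetical; nothing like this exists in the
series):

static bool user_shstk_strict __ro_after_init;

static int __init setup_user_shstk(char *str)
{
	/* hypothetical "user_shstk=strict" kernel command line parameter */
	if (str && !strcmp(str, "strict"))
		user_shstk_strict = true;
	return 1;
}
__setup("user_shstk=", setup_user_shstk);

Thread creation (and shstk disabling) would then be refused whenever
user_shstk_strict is set and the task has not enabled shadow stack.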

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-02-27 22:29 ` [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
  2023-03-02 17:22   ` Szabolcs Nagy
@ 2023-03-10 16:11   ` Borislav Petkov
  2023-03-10 17:12     ` Edgecombe, Rick P
  1 sibling, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-10 16:11 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:49PM -0800, Rick Edgecombe wrote:
> When operating with shadow stacks enabled, the kernel will automatically
> allocate shadow stacks for new threads, however in some cases userspace
> will need additional shadow stacks. The main example of this is the
> ucontext family of functions, which require userspace allocating and
> pivoting to userspace managed stacks.
> 
> Unlike most other user memory permissions, shadow stacks need to be
> provisioned with special data in order to be useful. They need to be setup
> with a restore token so that userspace can pivot to them via the RSTORSSP
> instruction. But, the security design of shadow stack's is that they

"stacks"

> should not be written to except in limited circumstances. This presents a
> problem for userspace, as to how userspace can provision this special
> data, without allowing for the shadow stack to be generally writable.
> 
> Previously, a new PROT_SHADOW_STACK was attempted, which could be
> mprotect()ed from RW permissions after the data was provisioned. This was
> found to not be secure enough, as other thread's could write to the

"threads"

> shadow stack during the writable window.
> 
> The kernel can use a special instruction, WRUSS, to write directly to
> userspace shadow stacks. So the solution can be that memory can be mapped
> as shadow stack permissions from the beginning (never generally writable
> in userspace), and the kernel itself can write the restore token.
> 
> First, a new madvise() flag was explored, which could operate on the
> PROT_SHADOW_STACK memory. This had a couple downsides:
					     ^
					     of


> 1. Extra checks were needed in mprotect() to prevent writable memory from
>    ever becoming PROT_SHADOW_STACK.
> 2. Extra checks/vma state were needed in the new madvise() to prevent
>    restore tokens being written into the middle of pre-used shadow stacks.
>    It is ideal to prevent restore tokens being added at arbitrary
>    locations, so the check was to make sure the shadow stack had never been
>    written to.
> 3. It stood out from the rest of the madvise flags, as more of direct
>    action than a hint at future desired behavior.
> 
> So rather than repurpose two existing syscalls (mmap, madvise) that don't
> quite fit, just implement a new map_shadow_stack syscall to allow
> userspace to map and setup new shadow stacks in one step. While ucontext
> is the primary motivator, userspace may have other unforeseen reasons to
> setup it's own shadow stacks using the WRSS instruction. Towards this

"its"

> provide a flag so that stacks can be optionally setup securely for the
> common case of ucontext without enabling WRSS. Or potentially have the
> kernel set up the shadow stack in some new way.
> 
> The following example demonstrates how to create a new shadow stack with
> map_shadow_stack:
> void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);

...

> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..f65c671ce3b1 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448	common	process_mrelease	sys_process_mrelease
>  449	common	futex_waitv		sys_futex_waitv
>  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	64	map_shadow_stack	sys_map_shadow_stack

Yeah, this'll need a manpage too, I presume. But later.

> +SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
> +{
> +	bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
> +	unsigned long aligned_size;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return -EOPNOTSUPP;
> +
> +	if (flags & ~SHADOW_STACK_SET_TOKEN)
> +		return -EINVAL;
> +
> +	/* If there isn't space for a token */
> +	if (set_tok && size < 8)
> +		return -EINVAL;
> +
> +	if (addr && addr <= 0xFFFFFFFF)

			< SZ_4G

> +		return -EINVAL;

Can we use distinct negative retvals in each case so that it is clear to
userspace where it fails, *if* it fails?

> +	/*
> +	 * An overflow would result in attempting to write the restore token
> +	 * to the wrong location. Not catastrophic, but just return the right
> +	 * error code and block it.
> +	 */
> +	aligned_size = PAGE_ALIGN(size);
> +	if (aligned_size < size)
> +		return -EOVERFLOW;
> +
> +	return alloc_shstk(addr, aligned_size, size, set_tok);
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 34/41] x86/shstk: Support WRSS for userspace
  2023-02-27 22:29 ` [PATCH v7 34/41] x86/shstk: Support WRSS for userspace Rick Edgecombe
@ 2023-03-10 16:44   ` Borislav Petkov
  2023-03-10 17:16     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-10 16:44 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:50PM -0800, Rick Edgecombe wrote:
> For the current shadow stack implementation, shadow stacks contents can't
> easily be provisioned with arbitrary data. This property helps apps
> protect themselves better, but also restricts any potential apps that may
> want to do exotic things at the expense of a little security.
> 
> The x86 shadow stack feature introduces a new instruction, WRSS, which
> can be enabled to write directly to shadow stack permissioned memory from

s/permissioned //

By now it is clear that shadow stack memory is a special thing anyway.

> userspace. Allow it to get enabled via the prctl interface.
> 
> Only enable the userspace WRSS instruction, which allows writes to
> userspace shadow stacks from userspace. Do not allow it to be enabled
> independently of shadow stack, as HW does not support using WRSS when
> shadow stack is disabled.
> 
> From a fault handler perspective, WRSS will behave very similar to WRUSS,
> which is treated like a user access from a #PF err code perspective.

...

> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..2d3b35c957ad 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -310,6 +310,17 @@ void msrs_free(struct msr *msrs);
>  int msr_set_bit(u32 msr, u8 bit);
>  int msr_clear_bit(u32 msr, u8 bit);
>  
> +/* Helper that can never get accidentally un-inlined. */
> +#define set_clr_bits_msrl(msr, set, clear)	do {	\

Uff, pls kill this thing.

Our MSR interfaces universe is already insane and arch/x86/lib/msr.c
already has similar attempts to what you're doing here in addition to
all the other gunk in msr.h.

I highly doubt this can't be done the usual way, lemme see...

> +	u64 __val, __new_val, __msr = msr;		\
> +							\
> +	rdmsrl(__msr, __val);				\
> +	__new_val = (__val & ~(clear)) | (set);		\
> +							\
> +	if (__new_val != __val)				\
> +		wrmsrl(__msr, __new_val);		\
> +} while (0)
> +
>  #ifdef CONFIG_SMP
>  int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
>  int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 7dfd9dc00509..e31495668056 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -28,5 +28,6 @@
>  
>  /* ARCH_SHSTK_ features bits */
>  #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
> +#define ARCH_SHSTK_WRSS			(1ULL <<  1)
>  
>  #endif /* _ASM_X86_PRCTL_H */
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 0a3decab70ee..009cb3fa0ae5 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -363,6 +363,36 @@ void shstk_free(struct task_struct *tsk)
>  	unmap_shadow_stack(shstk->base, shstk->size);
>  }
>  
> +static int wrss_control(bool enable)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * Only enable wrss if shadow stack is enabled. If shadow stack is not

"WRSS". Insns in all caps pls.

> +	 * enabled, wrss will already be disabled, so don't bother clearing it

Ditto.

> +	 * when disabling.
> +	 */
> +	if (!features_enabled(ARCH_SHSTK_SHSTK))
> +		return -EPERM;
> +
> +	/* Already enabled/disabled? */
> +	if (features_enabled(ARCH_SHSTK_WRSS) == enable)
> +		return 0;
> +
> +	fpregs_lock_and_load();
> +	if (enable) {
> +		set_clr_bits_msrl(MSR_IA32_U_CET, CET_WRSS_EN, 0);
> +		features_set(ARCH_SHSTK_WRSS);
> +	} else {
> +		set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_WRSS_EN);
> +		features_clr(ARCH_SHSTK_WRSS);
> +	}
> +	fpregs_unlock();

Yes, doing it the "usual" way is more readable because it is a common
code pattern which one encounters all around arch/x86/.

Diff ontop:

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 009cb3fa0ae5..914feff26b23 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -365,6 +365,8 @@ void shstk_free(struct task_struct *tsk)
 
 static int wrss_control(bool enable)
 {
+	u64 msrval;
+
 	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
 		return -EOPNOTSUPP;
 
@@ -381,13 +383,22 @@ static int wrss_control(bool enable)
 		return 0;
 
 	fpregs_lock_and_load();
+	rdmsrl(MSR_IA32_U_CET, msrval);
+
 	if (enable) {
-		set_clr_bits_msrl(MSR_IA32_U_CET, CET_WRSS_EN, 0);
 		features_set(ARCH_SHSTK_WRSS);
+		msrval |= CET_WRSS_EN;
 	} else {
-		set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_WRSS_EN);
 		features_clr(ARCH_SHSTK_WRSS);
+		if (!(msrval & CET_WRSS_EN))
+			goto unlock;
+
+		msrval &= ~CET_WRSS_EN;
 	}
+
+	wrmsrl(MSR_IA32_U_CET, msrval);
+
+unlock:
 	fpregs_unlock();
 
 	return 0;

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-10 16:11   ` Borislav Petkov
@ 2023-03-10 17:12     ` Edgecombe, Rick P
  2023-03-10 20:05       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-10 17:12 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Fri, 2023-03-10 at 17:11 +0100, Borislav Petkov wrote:

[...]

Thanks for all the text edits.

> On Mon, Feb 27, 2023 at 02:29:49PM -0800, Rick Edgecombe wrote:
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..f65c671ce3b1 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> >  448	common	process_mrelease	sys_process_mreleas
> > e
> >  449	common	futex_waitv		sys_futex_waitv
> >  450	common	set_mempolicy_home_node	sys_set_mempolicy_h
> > ome_node
> > +451	64	map_shadow_stack	sys_map_shadow_stack
> 
> Yeah, this'll need a manpage too, I presume. But later.

I have one to submit.

[...]

> > +
> > +	if (addr && addr <= 0xFFFFFFFF)
> 
> 			< SZ_4G
> 
> > +		return -EINVAL;
> 
> Can we use distinct negative retvals in each case so that it is clear
> to
> userspace where it fails, *if* it fails?

Good idea, I think maybe ERANGE.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 34/41] x86/shstk: Support WRSS for userspace
  2023-03-10 16:44   ` Borislav Petkov
@ 2023-03-10 17:16     ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-10 17:16 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Fri, 2023-03-10 at 17:44 +0100, Borislav Petkov wrote:

Thanks for the text edits.

> >   
> > +/* Helper that can never get accidentally un-inlined. */
> > +#define set_clr_bits_msrl(msr, set, clear)   do {    \
> 
> Uff, pls kill this thing.
> 
> Our MSR interfaces universe is already insane and arch/x86/lib/msr.c
> already has similar attempts to what you're doing here in addition to
> all the other gunk in msr.h.
> 
> I highly doubt this can't be done the usual way, lemme see...

Seems reasonable.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-10  2:03               ` H.J. Lu
@ 2023-03-10 20:00                 ` H.J. Lu
  2023-03-10 20:27                   ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: H.J. Lu @ 2023-03-10 20:00 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bp, joao, david, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, keescook, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, akpm, Lutomirski, Andy, jamorris, arnd,
	tglx, Schimpe, Christina, mike.kravetz, debug, Yang, Weijiang,
	x86, andrew.cooper3, john.allen, linux-doc, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, Mar 9, 2023 at 6:03 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, Mar 9, 2023 at 5:13 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > +Joao regarding mixed mode designs
> >
> > On Fri, 2023-03-10 at 00:51 +0100, Borislav Petkov wrote:
> > > On Thu, Mar 09, 2023 at 04:56:37PM +0000, Edgecombe, Rick P wrote:
> > > > There is a proc that shows if shadow stack is enabled in a thread.
> > > > It
> > > > does indeed come later in the series.
> > >
> > > Not good enough:
> > >
> > > 1. buried somewhere in proc where no one knows about it
> > >
> > > 2. it is per thread so user needs to grep *all*
> >
> > See "x86: Expose thread features in /proc/$PID/status" for the patch.
> > We could emit something in dmesg I guess? The logic would be:
> >  - Record the presence of elf SHSTK bit on exec
> >  - On shadow stack disable, if it had the elf bit, pr_info("bad!")
> >
> > >
> > > >   ... We previously tried to add some batch operations to improve
> > > > the
> > > >   performance, but tglx had suggested to start with something
> > > > simple.
> > > >   So we end up with this simple composable API.
> > >
> > > I agree with starting simple and thanks for explaining this in
> > > detail.
> > >
> > > TBH, though, it already sounds like a mess to me. I guess a mess
> > > we'll
> > > have to deal with because there will always be this case of some
> > > shared object/lib not being enabled for shstk because of raisins.
> >
> > The compatibility problems are totally the mess in this whole thing.
> > When you try to look at a "permissive" mode that actually works it gets
> > even more complex. Joao and I have been banging our heads on that
> > problem for months.
> >
> > But there are some expected users of this that say: we compile and
> > check our known set of binaries, we won't get any surprises. So it's
> > more of a distro problem.
> >
> > >
> > > And TBH #2, I would've done it even simpler: if some shared object
> > > can't
> > > do shadow stack, we disable it for the whole process. I mean, what's
> > > the
> > > point?
> >
> > You mean a late loaded dlopen()ed DSO? The enabling logic can't know
> > this will happen ahead of time.
> >
> > If you mean if the shared objects in the elf all support shadow stack,
> > then this is what happens. The complication is that the loader wants to
> > enable shadow stack before it has checked the elf libs so it doesn't
> > underflow the shadow stack when it returns from the function that does
> > this checking.
> >
> > So it does:
> > 1. Enable shadow stack
> > 2. Call elf libs checking functions
> > 3. If all good, lock shadow stack. Else, disable shadow stack.
> > 4. Return from elf checking functions and if shstk is enabled, don't
> > underflow because it was enabled in step 1 and we have return addresses
> > from 2 on the shadow stack
> >
> > I'm wondering if this can't be improved in glibc to look like:
> > 1. Check elf libs, and record it somewhere
> > 2. Wait until just the right spot
> > 3. If all good, enable and lock shadow stack.
>
> I will try it out.
>

Currently glibc enables shadow stack as early as possible.  There
are only a few places where a function call in glibc never returns.
We could enable shadow stack just before calling main instead, but
then quite a few code paths would run without shadow stack
protection.  Is this an issue?
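
For illustration, the enable-first sequence quoted above might look
roughly like this in the loader (a minimal sketch, assuming the
ARCH_SHSTK_* arch_prctl() interface from this series;
all_objects_support_shstk() is a made-up placeholder for the real
elf checking):

	#include <asm/prctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static long shstk_prctl(int option, unsigned long features)
	{
		return syscall(SYS_arch_prctl, option, features);
	}

	/* Placeholder for walking the elf libs */
	static int all_objects_support_shstk(void) { return 1; }

	static void setup_shstk(void)
	{
		/* 1. Enable shadow stack before any deep call chains */
		if (shstk_prctl(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK))
			return;

		/* 2. Check the elf libs; calls made from here on push
		 * matching entries onto the shadow stack */
		if (all_objects_support_shstk())
			/* 3a. All good: lock the feature */
			shstk_prctl(ARCH_SHSTK_LOCK, ARCH_SHSTK_SHSTK);
		else
			/* 3b. Some DSO lacks shstk: back out */
			shstk_prctl(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);

		/* 4. Returning from here does not underflow, because
		 * the shadow stack has been live since step 1 */
	}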

H.J.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-10 17:12     ` Edgecombe, Rick P
@ 2023-03-10 20:05       ` Borislav Petkov
  2023-03-10 20:19         ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-10 20:05 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Fri, Mar 10, 2023 at 05:12:40PM +0000, Edgecombe, Rick P wrote:
> > Can we use distinct negative retvals in each case so that it is clear
> > to
> > userspace where it fails, *if* it fails?
> 
> Good idea, I think maybe ERANGE.

For those two, right?

        /* If there isn't space for a token */
        if (set_tok && size < 8)
                return -EINVAL;

        if (addr && addr <= 0xFFFFFFFF)
                return -EINVAL;

They are kinda range-checking of sorts. A wider range but still
similar... 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-10 20:05       ` Borislav Petkov
@ 2023-03-10 20:19         ` Edgecombe, Rick P
  2023-03-10 20:26           ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-10 20:19 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Yang, Weijiang, Lutomirski, Andy,
	jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz, debug,
	linux-doc, x86, andrew.cooper3, john.allen, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2023-03-10 at 21:05 +0100, Borislav Petkov wrote:
> On Fri, Mar 10, 2023 at 05:12:40PM +0000, Edgecombe, Rick P wrote:
> > > Can we use distinct negative retvals in each case so that it is
> > > clear
> > > to
> > > userspace where it fails, *if* it fails?
> > 
> > Good idea, I think maybe ERANGE.
> 
> For those two, right?
> 
>         /* If there isn't space for a token */
>         if (set_tok && size < 8)
>                 return -EINVAL;
> 
>         if (addr && addr <= 0xFFFFFFFF)
>                 return -EINVAL;
> 
> They are kinda range-checking of sorts. A wider range but still
> similar... 

I was thinking ERANGE would be for the 4GB limit. This is the weird
32-bit limiting thing, so if someone hit it they could look up in the
docs what is going on. The size-big-enough-for-a-token check could be
ENOSPC? Then each reason could have a different error code.
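
Concretely, the two checks might then become something like (a sketch
of the idea, not the final patch):

        /* If there isn't space for a token */
        if (set_tok && size < 8)
                return -ENOSPC;

        /* Shadow stacks must live above 4GB (see MAP_ABOVE4G) */
        if (addr && addr <= 0xFFFFFFFF)
                return -ERANGE;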


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-10 20:19         ` Edgecombe, Rick P
@ 2023-03-10 20:26           ` Borislav Petkov
  0 siblings, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-10 20:26 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Yang, Weijiang, Lutomirski, Andy,
	jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz, debug,
	linux-doc, x86, andrew.cooper3, john.allen, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, Mar 10, 2023 at 08:19:46PM +0000, Edgecombe, Rick P wrote:
> I was thinking ERANGE would be for the 4GB limit. This is the weird 32
> bit limiting thing. So if someone hit it they could look up in the docs
> what is going on. The size-big-enough-for-a-token check could be
> ENOSPC? Then each reason could have a different error code.

Yah, makes perfect sense to me.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-10 20:00                 ` H.J. Lu
@ 2023-03-10 20:27                   ` Edgecombe, Rick P
  2023-03-10 20:43                     ` H.J. Lu
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-10 20:27 UTC (permalink / raw)
  To: hjl.tools
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, dave.hansen, kirill.shutemov,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	bp, oleg, andrew.cooper3, akpm, Lutomirski, Andy, jamorris, joao,
	arnd, Schimpe, Christina, mike.kravetz, x86, Yang, Weijiang,
	debug, pavel, john.allen, linux-doc, tglx, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2023-03-10 at 12:00 -0800, H.J. Lu wrote:
> > > So it does:
> > > 1. Enable shadow stack
> > > 2. Call elf libs checking functions
> > > 3. If all good, lock shadow stack. Else, disable shadow stack.
> > > 4. Return from elf checking functions and if shstk is enabled,
> > > don't
> > > underflow because it was enabled in step 1 and we have return
> > > addresses
> > > from 2 on the shadow stack
> > > 
> > > I'm wondering if this can't be improved in glibc to look like:
> > > 1. Check elf libs, and record it somewhere
> > > 2. Wait until just the right spot
> > > 3. If all good, enable and lock shadow stack.
> > 
> > I will try it out.
> > 
> 
> Currently glibc enables shadow stack as early as possible.  There
> are only a few places where a function call in glibc never returns.
> We could enable shadow stack just before calling main instead, but
> then quite a few code paths would run without shadow stack
> protection.  Is this an issue?

Thanks for checking. Hmm, does the loader get attacked?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-10 20:27                   ` Edgecombe, Rick P
@ 2023-03-10 20:43                     ` H.J. Lu
  2023-03-10 21:01                       ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: H.J. Lu @ 2023-03-10 20:43 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, dave.hansen, kirill.shutemov,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	bp, oleg, andrew.cooper3, akpm, Lutomirski, Andy, jamorris, joao,
	arnd, Schimpe, Christina, mike.kravetz, x86, Yang, Weijiang,
	debug, pavel, john.allen, linux-doc, tglx, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, Mar 10, 2023 at 12:27 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2023-03-10 at 12:00 -0800, H.J. Lu wrote:
> > > > So it does:
> > > > 1. Enable shadow stack
> > > > 2. Call elf libs checking functions
> > > > 3. If all good, lock shadow stack. Else, disable shadow stack.
> > > > 4. Return from elf checking functions and if shstk is enabled,
> > > > don't
> > > > underflow because it was enabled in step 1 and we have return
> > > > addresses
> > > > from 2 on the shadow stack
> > > >
> > > > I'm wondering if this can't be improved in glibc to look like:
> > > > 1. Check elf libs, and record it somewhere
> > > > 2. Wait until just the right spot
> > > > 3. If all good, enable and lock shadow stack.
> > >
> > > I will try it out.
> > >
> >
> > Currently glibc enables shadow stack as early as possible.  There
> > are only a few places where a function call in glibc never returns.
> > We could enable shadow stack just before calling main instead, but
> > then quite a few code paths would run without shadow stack
> > protection.  Is this an issue?
>
> Thanks for checking. Hmm, does the loader get attacked?

Not that I know of.  But there is user code from .init_array
and .preinit_array that executes before main.  In theory,
an attack can happen before main.
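
For example, a constructor like this runs before main() via
.init_array (a self-contained illustration, not glibc code):

	#include <stdio.h>

	/* The compiler places this in .init_array; it runs before
	 * main(), so if shadow stack were only enabled right before
	 * main(), returns from here would be unprotected. */
	__attribute__((constructor))
	static void early_init(void)
	{
		puts("running before main");
	}

	int main(void)
	{
		return 0;
	}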

-- 
H.J.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-10  0:14           ` Edgecombe, Rick P
@ 2023-03-10 21:00             ` Deepak Gupta
  2023-03-10 21:43               ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Deepak Gupta @ 2023-03-10 21:00 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, szabolcs.nagy,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, pavel, arnd, al.grant, tglx, mike.kravetz, x86, Schimpe,
	Christina, jamorris, john.allen, linux-doc, nd, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm

On Fri, Mar 10, 2023 at 12:14:01AM +0000, Edgecombe, Rick P wrote:
>On Thu, 2023-03-09 at 13:08 -0800, Deepak Gupta wrote:
>> On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P wrote:
>> > On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote:
>> > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote:
>> > > > The 02/27/2023 14:29, Rick Edgecombe wrote:
>> > > > > Previously, a new PROT_SHADOW_STACK was attempted,
>> > > >
>> > > > ...
>> > > > > So rather than repurpose two existing syscalls (mmap,
>> > > > > madvise)
>> > > > > that don't
>> > > > > quite fit, just implement a new map_shadow_stack syscall to
>> > > > > allow
>> > > > > userspace to map and setup new shadow stacks in one step.
>> > > > > While
>> > > > > ucontext
>> > > > > is the primary motivator, userspace may have other unforeseen
>> > > > > reasons to
>> > > > > setup it's own shadow stacks using the WRSS instruction.
>> > > > > Towards
>> > > > > this
>> > > > > provide a flag so that stacks can be optionally setup
>> > > > > securely
>> > > > > for the
>> > > > > common case of ucontext without enabling WRSS. Or potentially
>> > > > > have the
>> > > > > kernel set up the shadow stack in some new way.
>> > > >
>> > > > ...
>> > > > > The following example demonstrates how to create a new shadow
>> > > > > stack with
>> > > > > map_shadow_stack:
>> > > > > void *shstk = map_shadow_stack(addr, stack_size,
>> > > > > SHADOW_STACK_SET_TOKEN);
>> > > >
>> > > > i think
>> > > >
>> > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);
>> > > >
>> > > > could do the same with less disruption to users (new syscalls
>> > > > are harder to deal with than new flags). it would do the
>> > > > guard page and initial token setup too (there is no flag for
>> > > > it but could be squeezed in).
>> > >
>> > > Discussion on this topic in v6
>> > >
>> >
>> >
>https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/
>> > >
>> > > Again I know earlier CET patches had protection flag and somehow
>> > > due
>> > > to pushback
>> > > on mailing list,
>> > >  it was adopted to go for special syscall because no one else
>> > > had shadow stack.
>> > >
>> > > Seeing a response from Szabolcs, I am assuming arm64 would also
>> > > want
>> > > to follow
>> > > using mmap to manufacture shadow stack. For reference RFC patches
>> > > for
>> > > risc-v shadow stack,
>> > > use a new protection flag = PROT_SHADOWSTACK.
>> > >
>> >
>> >
>https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
>> > >
>> > > I know earlier discussion had been that we let this go and do a
>> > > re-
>> > > factor later as other
>> > > arch support trickle in. But as I thought more on this and I
>> > > think it
>> > > may just be
>> > > messy from user mode point of view as well to have cognition of
>> > > two
>> > > different ways of
>> > > creating shadow stack. One would be special syscall (in current
>> > > libc)
>> > > and another `mmap`
>> > > (whenever future re-factor happens)
>> > >
>> > > If it's not too late, it would be more wise to take `mmap`
>> > > approach rather than special `syscall` approach.
>> >
>> > There is sort of two things intermixed here when we talk about a
>> > PROT_SHADOW_STACK.
>> >
>> > One is: what is the interface for specifying how the shadow stack
>> > should be provisioned with data? Right now there are two ways
>> > supported, all zero or with an X86 shadow stack restore token at
>> > the
>> > end. Then there was already some conversation about a third type.
>> > In
>> > which case the question would be is using mmap MAP_ flags the right
>> > place for this? How many types of initialization will be needed in
>> > the
>> > end and what is the overlap between the architectures?
>>
>> First of all, arches can choose to have token at the bottom or not.
>>
>> Token serve following purposes
>>   - It allows one to put desired value in shadow stack pointer in
>> safe/secure manner.
>>     Note: x86 doesn't provide any opcode encoding to value in SSP
>> register. So having
>>     a token is kind of a necessity because x86 doesn't easily allow
>> writing shadow stack.
>>
>>   - A token at the bottom acts marker / barrier and can be useful in
>> debugging
>>
>>   - If (and a big *if*) we ever reach a point in future where return
>> address is only pushed
>>     on shadow stack (x86 should have motivation to do this because
>> less uops on call/ret),
>>     a token at the bottom (bottom means lower address) is ensuring
>> sure shot way of getting
>>     a fault when exhausted.
>>
>> Current RISCV zisslpcfi proposal doesn't define CPU based tokens
>> because it's RISC.
>> It allows mechanisms using which software can define formatting of
>> token for itself.
>> Not sure of what ARM is doing.
>
>Ok, so riscv doesn't need to have the kernel write the token, but x86
>does.
>
>>
>> Now coming to the point of all zero v/s shadow stack token.
>> Why not always have token at the bottom?
>
>With WRSS you can setup the shadow stack however you want. So the user
>would then have to take care to erase the token if they didn't want it.
>Not the end of the world, but kind of clunky if there is no reason for
>it.

Yes, but the kernel always assumes the user is going to use the
token. It's up to the user to decide whether they want to use the
restore token or not. If they have WRSS capability, the security
posture is diluted anyway. An attacker clever enough to re-use an
`RSTORSSP` present in the address space to restore using the
kernel-prepared token can be clever enough to use WRSS as well.

Always placing the token makes shadow stack creation simpler for the
kernel. This point holds irrespective of whether a system call or
mmap is used.

>
>>
>> In case of x86, Why need for two ways and why not always have a token
>> at the bottom.
>> The way x86 is going, user mode is responsible for establishing
>> shadow stack and thus
>> whenever shadow stack is created then if x86 kernel implementation
>> always place a token
>> at the base/bottom.
>
>There was also some discussion recently of adding a token AND an end of
>stack marker, as a potential solution for backtracing in ucontext
>stacks. In this case it could cause an ABI break to just start adding
>the end of stack marker where the token was, and so would require a new
>map_shadow_stack flag.

Was it discussed why the restore token itself can't be used as the
marker for the end of stack (if we assume there is always going to be
one at the bottom)? It's a unique value: an address pointing to
itself.
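
For reference, a sketch of the token as written in this series (along
the lines of create_rstor_token()): the value stored just below the
new SSP encodes that address plus a mode bit, so it is
self-referential modulo the 8-byte offset:

	unsigned long token_addr = ssp - 8;  /* token sits just below new SSP */
	unsigned long token_val = ssp | 1;   /* new SSP value + 64-bit mode bit */
	write_user_shstk_64((u64 __user *)token_addr, token_val);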

>
>>
>> Now user mode can do following:--
>>   - If it has access to WRSS, it can sure go ahead and create a token
>> of its choosing and
>>     overwrite kernel created token. and then do RSTORSSP on it's own
>> created token.
>>
>>   - If it doesn't have access to WRSS (and dont need to create its
>> own token), it can do
>>     RSTORSSP on this. As soon as it does, no other thread in process
>> can restore to it.
>>     On `fork`, you get the same un-restorable token.
>>
>> So why not always have a token at the bottom.
>> This is my plan for riscv implementation as well (to have a token at
>> the bottom)
>>
>> >
>> > The other thing is: should shadow stack memory creation be tightly
>> > controlled? For example in x86 we limit this to anonymous memory,
>> > etc.
>> > Some reasons for this are x86 specific, but some are not. So if we
>> > disallow most of the options why allow the interface to take them?
>> > And
>> > then you are in the position of carefully maintaining a list of
>> > not-
>> > allowed options instead letting a list of allowed options sit
>> > there.
>>
>> I am new to linux kernel and thus may be not able to follow the
>> argument of
>> limiting to anonymous memory.
>>
>> Why is limiting it to anonymous memory a problem. IIRC, ARM's
>> PROT_MTE is applicable
>> only to anonymous memory. I can probably find few more examples.
>
>Oh I see, they have a special arch VMA flag VM_MTE_ALLOWED that only
>gets set if all the rules are followed. Then PROT_MTE can only be set
>on that to set VM_MTE. That is kind of nice because certain other
>special situations can choose to support it.

That's because MTE is different. It allows assigning tags to existing
virtual memory, so one needs to know whether a given memory region
can have tags assigned.
>
>It does take another arch vma flag though. For x86 I guess I would need
>to figure out how to squeeze VM_SHADOW_STACK into other flags to have a
>free flag to use the same method. It also only supports mprotect() and
>shadow stack would only want to support mmap(). And you still have the
>initialization stuff to plumb through. Yea, I think the PROT_MTE is a
>good thing to consider, but it's not super obvious to me how similar
>the logic would be for shadow stack.

I don't think you need another VMA flag. Memory tagging allows adding
tags to existing virtual memory. That's why having `mprotect` makes
sense for MTE. In the shadow stack case, there is no requirement to
change a shadow stack to regular memory or vice-versa.

All that needs to change is `mmap`; `mprotect` should fail. The
syscall approach gives that benefit by default because there is no
protection flag for shadow stack.

I was giving an example that any feature which gives new meaning to
virtual memory has been able to work with the existing memory mapping
APIs without needing a new system call (regardless of whether you're
dealing with anonymous memory).

>
>The question I'm asking though is, not "can mmap code and rules be
>changed to enforce the required limitations?". I think it is yes. But
>the question is "why is that plumbing better than a new syscall?". I
>guess to get a better idea, the mmap solution would need to get POCed.
>I had half done this at one point, but abandoned the approach.
>
>For your question about why limit it, the special x86 case is the
>Dirty=1,Write=0 PTE bit combination for shadow stacks. So for shadow
>stack you could have some confusion about whether a PTE is actually
>dirty for writeback, etc. I wouldn't say it's known to be impossible to
>do MAP_SHARED, but it has not been fully analyzed enough to know what
>the changes would be. There were some solvable concrete issues that
>tipped the scale as well. It was also not expected to be a common
>usage, if at all.

I am not sure the confusion around D=1,W=0 is completely taken away
by the syscall approach either. It'll always be there; one can only
do things to minimize the chances.

In the syscall approach, the syscall makes sure that

`flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G`

This could just as easily be checked in an arch-specific landing
function for mmap.

Additionally, if you always have the token at the base, you don't
need that ABI between user and kernel.
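
A hypothetical arch-specific mmap hook making that same guarantee
might look like this (a sketch only; PROT_SHADOWSTACK is not part of
this series):

	/* In an arch mmap breakout, if a PROT_SHADOWSTACK existed */
	if (prot & PROT_SHADOWSTACK) {
		unsigned long required = MAP_ANONYMOUS | MAP_PRIVATE;

		if ((flags & required) != required)
			return -EINVAL;
	}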


>
>The non-x86, general reasons for it, are for a smaller benefit. It
>blocks a lot of ways shadow stack memory could be written to. Like say
>you have a memory mapped writable file, and you also map it shadow
>stack. So it has better security properties depending on what your
>threat model is.

I wouldn't say any architecture should allow such primitives. It kind
of defeats the purpose of shadow stack. Yes, if some sort of secure
memory is needed, there may be new ISA extensions for that.

>
>>
>> Eventually syscall will also go ahead and use memory management code
>> to
>> perform mapping. So I didn't understand the reasoning here. The way
>> syscall
>> can limit it to anonymous memory, why mmap can't do the same if it
>> sees
>> PROT_SHADOWSTACK.
>>
>> >
>> > The only benefit I've heard is that it saves creating a new
>> > syscall,
>> > but it also saves several MAP_ flags. That, and that the RFC for
>> > riscv
>> > did a PROT_SHADOW_STACK to start. So, yes, two people asked the
>> > same
>> > question, but I'm still not seeing any benefits. Can you give the
>> > pros
>> > and cons please?
>>
>> Again the way syscall will limit it to anonymous memory, Why mmap
>> can't do same?
>> There is precedence for it (like PROT_MTE is applicable only to
>> anonymous memory)
>>
>> So if it can be done, then why introduce a new syscall?
>>
>> >
>> > BTW, in glibc map_shadow_stack is called from arch code. So I think
>> > userspace wise, for this to affect other architectures there would
>> > need
>> > to be some code that could do things generically, with somehow the
>> > shadow stack pivot abstracted but the shadow stack allocation not.
>>
>> Agreed, yes it can be done in a way where it won't put tax on other
>> architectures.
>>
>> But what about fragmentation within x86. Will x86 always choose to
>> use system call
>> method map shadow stack. If future re-factor results in x86 also use
>> `mmap` method.
>> Isn't it a mess for x86 glibc to figure out what to do; whether to
>> use system call
>> or `mmap`?
>>
>
>Ok, so this is the downside I guess. What happens if we want to support
>the other types of memory in the future and end up using mmap for this?
>Then we have 15-20 lines of extra syscall wrapping code to maintain to
>support legacy.
>
>For the mmap solution, we have the downside of using extra MAP_ flags,
>and *some* amount of currently unknown vm_flag and address range logic,
>plus mmap arch breakouts to add to core MM. Like I said earlier, you
>would need to POC it out to see how bad that looks and get some core MM
>feedback on the new type of MAP flag usage. But, syscalls being pretty
>straightforward, it would probably be *some* amount of added complexity
>_now_ to support something that might happen in the future. I'm not
>seeing either one as a landslide win.
>
>It's kind of an eternal software design philosophical question, isn't
>it? How much work should you do to prepare for things that might be
>needed in the future? From what I've seen the balance in the kernel
>seems to be to try not to paint yourself in to an ABI corner, but
>otherwise let the kernel evolve naturally in response to real usages.
>If anyone wants to correct this, please do. But otherwise I think the
>new syscall is aligned with that.
>
>TBH, you are making me wonder if I'm missing something. It seems you
>strongly don't prefer this approach, but I'm not hearing any huge
>potential negative impacts. And you also say it won't tax the riscv
>implementation. Is this just something just smells bad here? Or it
>would shrink the riscv series?

No, you're not missing anything. It's just the weirdness of adding a
system call which enforces certain MAP_XX flags and is pretty much a
mapping API, plus the difference between architectures in how they
will create shadow stacks. And if x86 chooses to use `mmap` in the
future, then there is ugliness in user mode to decide which method to
choose.

And yes, you got it right: to some extent there is my own selfishness
playing out here as well, to reduce the riscv patches.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 28/41] x86: Introduce userspace API for shadow stack
  2023-03-10 20:43                     ` H.J. Lu
@ 2023-03-10 21:01                       ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-10 21:01 UTC (permalink / raw)
  To: hjl.tools
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, andrew.cooper3, oleg, akpm, Lutomirski, Andy, bp, joao,
	jamorris, Schimpe, Christina, debug, arnd, Yang, Weijiang, x86,
	tglx, mike.kravetz, john.allen, rppt, linux-doc, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2023-03-10 at 12:43 -0800, H.J. Lu wrote:
> On Fri, Mar 10, 2023 at 12:27 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Fri, 2023-03-10 at 12:00 -0800, H.J. Lu wrote:
> > > > > So it does:
> > > > > 1. Enable shadow stack
> > > > > 2. Call elf libs checking functions
> > > > > 3. If all good, lock shadow stack. Else, disable shadow
> > > > > stack.
> > > > > 4. Return from elf checking functions and if shstk is
> > > > > enabled,
> > > > > don't
> > > > > underflow because it was enabled in step 1 and we have return
> > > > > addresses
> > > > > from 2 on the shadow stack
> > > > > 
> > > > > I'm wondering if this can't be improved in glibc to look
> > > > > like:
> > > > > 1. Check elf libs, and record it somewhere
> > > > > 2. Wait until just the right spot
> > > > > 3. If all good, enable and lock shadow stack.
> > > > 
> > > > I will try it out.
> > > > 
> > > 
> > > Currently glibc enables shadow stack as early as possible.
> > > There are only a few places where a function call in glibc
> > > never returns.
> > > We could enable shadow stack just before calling main instead,
> > > but then quite a few code paths would run without shadow stack
> > > protection.  Is this an issue?
> > 
> > Thanks for checking. Hmm, does the loader get attacked?
> 
> Not that I know of.  But there is user code from .init_array
> and .preinit_array that executes before main.  In theory,
> an attack can happen before main.

Hmm, it would be nice to not add any startup overhead to non-shadow
stack binaries. I guess it's a tradeoff. Might be worth asking around.

But you can't just enable shadow stack before any user code runs? It
would have to go something like:
1. Execute init code
2. Check elf libs
3. Enable SHSTK

Or what if you only did the enable-disable dance if the execing binary
itself has shadow stack? If it doesn't have shadow stack, the elf libs
won't change the decision.



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-10 21:00             ` Deepak Gupta
@ 2023-03-10 21:43               ` Edgecombe, Rick P
  2023-03-16 20:07                 ` Deepak Gupta
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-10 21:43 UTC (permalink / raw)
  To: debug
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, szabolcs.nagy,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, tglx, pavel, x86, mike.kravetz, Schimpe,
	Christina, al.grant, nd, john.allen, linux-doc, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm

On Fri, 2023-03-10 at 13:00 -0800, Deepak Gupta wrote:
> On Fri, Mar 10, 2023 at 12:14:01AM +0000, Edgecombe, Rick P wrote:
> > On Thu, 2023-03-09 at 13:08 -0800, Deepak Gupta wrote:
> > > On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P
> > > wrote:
> > > > On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote:
> > > > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy
> > > > > wrote:
> > > > > > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > > > > > Previously, a new PROT_SHADOW_STACK was attempted,
> > > > > > 
> > > > > > ...
> > > > > > > So rather than repurpose two existing syscalls (mmap,
> > > > > > > madvise)
> > > > > > > that don't
> > > > > > > quite fit, just implement a new map_shadow_stack syscall
> > > > > > > to
> > > > > > > allow
> > > > > > > userspace to map and setup new shadow stacks in one step.
> > > > > > > While
> > > > > > > ucontext
> > > > > > > is the primary motivator, userspace may have other
> > > > > > > unforeseen
> > > > > > > reasons to
> > > > > > > setup it's own shadow stacks using the WRSS instruction.
> > > > > > > Towards
> > > > > > > this
> > > > > > > provide a flag so that stacks can be optionally setup
> > > > > > > securely
> > > > > > > for the
> > > > > > > common case of ucontext without enabling WRSS. Or
> > > > > > > potentially
> > > > > > > have the
> > > > > > > kernel set up the shadow stack in some new way.
> > > > > > 
> > > > > > ...
> > > > > > > The following example demonstrates how to create a new
> > > > > > > shadow
> > > > > > > stack with
> > > > > > > map_shadow_stack:
> > > > > > > void *shstk = map_shadow_stack(addr, stack_size,
> > > > > > > SHADOW_STACK_SET_TOKEN);
> > > > > > 
> > > > > > i think
> > > > > > 
> > > > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1,
> > > > > > 0);
> > > > > > 
> > > > > > could do the same with less disruption to users (new
> > > > > > syscalls
> > > > > > are harder to deal with than new flags). it would do the
> > > > > > guard page and initial token setup too (there is no flag
> > > > > > for
> > > > > > it but could be squeezed in).
> > > > > 
> > > > > Discussion on this topic in v6
> > > > > 
> > > > 
> > > > 
> > 
> > 
https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/
> > > > > 
> > > > > Again I know earlier CET patches had protection flag and
> > > > > somehow
> > > > > due
> > > > > to pushback
> > > > > on mailing list,
> > > > >  it was adopted to go for special syscall because no one else
> > > > > had shadow stack.
> > > > > 
> > > > > Seeing a response from Szabolcs, I am assuming arm64 would
> > > > > also
> > > > > want
> > > > > to follow
> > > > > using mmap to manufacture shadow stack. For reference RFC
> > > > > patches
> > > > > for
> > > > > risc-v shadow stack,
> > > > > use a new protection flag = PROT_SHADOWSTACK.
> > > > > 
> > > > 
> > > > 
> > 
> > 
https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
> > > > > 
> > > > > I know earlier discussion had been that we let this go and do
> > > > > a
> > > > > re-
> > > > > factor later as other
> > > > > arch support trickle in. But as I thought more on this and I
> > > > > think it
> > > > > may just be
> > > > > messy from user mode point of view as well to have cognition
> > > > > of
> > > > > two
> > > > > different ways of
> > > > > creating shadow stack. One would be special syscall (in
> > > > > current
> > > > > libc)
> > > > > and another `mmap`
> > > > > (whenever future re-factor happens)
> > > > > 
> > > > > If it's not too late, it would be more wise to take `mmap`
> > > > > approach rather than special `syscall` approach.
> > > > 
> > > > There is sort of two things intermixed here when we talk about
> > > > a
> > > > PROT_SHADOW_STACK.
> > > > 
> > > > One is: what is the interface for specifying how the shadow
> > > > stack
> > > > should be provisioned with data? Right now there are two ways
> > > > supported, all zero or with an X86 shadow stack restore token
> > > > at
> > > > the
> > > > end. Then there was already some conversation about a third
> > > > type.
> > > > In
> > > > which case the question would be is using mmap MAP_ flags the
> > > > right
> > > > place for this? How many types of initialization will be needed
> > > > in
> > > > the
> > > > end and what is the overlap between the architectures?
> > > 
> > > First of all, arches can choose to have token at the bottom or
> > > not.
> > > 
> > > Token serve following purposes
> > >   - It allows one to put desired value in shadow stack pointer in
> > > safe/secure manner.
> > >     Note: x86 doesn't provide any opcode encoding to value in SSP
> > > register. So having
> > >     a token is kind of a necessity because x86 doesn't easily
> > > allow
> > > writing shadow stack.
> > > 
> > >   - A token at the bottom acts marker / barrier and can be useful
> > > in
> > > debugging
> > > 
> > >   - If (and a big *if*) we ever reach a point in future where
> > > return
> > > address is only pushed
> > >     on shadow stack (x86 should have motivation to do this
> > > because
> > > less uops on call/ret),
> > >     a token at the bottom (bottom means lower address) is
> > > ensuring
> > > sure shot way of getting
> > >     a fault when exhausted.
> > > 
> > > Current RISCV zisslpcfi proposal doesn't define CPU based tokens
> > > because it's RISC.
> > > It allows mechanisms using which software can define formatting
> > > of
> > > token for itself.
> > > Not sure of what ARM is doing.
> > 
> > Ok, so riscv doesn't need to have the kernel write the token, but
> > x86
> > does.
> > 
> > > 
> > > Now coming to the point of all zero v/s shadow stack token.
> > > Why not always have token at the bottom?
> > 
> > With WRSS you can setup the shadow stack however you want. So the
> > user
> > would then have to take care to erase the token if they didn't want
> > it.
> > Not the end of the world, but kind of clunky if there is no reason
> > for
> > it.
> 
> Yes, but the kernel always assumes the user is going to use the
> token. It's up to the user to decide whether they want to use the
> restore token or not. If they have WRSS capability, the security
> posture is diluted anyway. An attacker clever enough to re-use an
> `RSTORSSP` present in the address space to restore using the
> kernel-prepared token can be clever enough to use WRSS as well.
> 
> Always placing the token makes shadow stack creation simpler for the
> kernel. This point holds irrespective of whether a system call or
> mmap is used.

Think about something like CRIU restoring the shadow stack, or other
special cases like that. Userspace can always overwrite the token,
but this involves some amount of extra work (extra writes, faulting
in the page earlier, etc). It is clunky, if only negligibly worse.

> 
> > 
> > > 
> > > In case of x86, Why need for two ways and why not always have a
> > > token
> > > at the bottom.
> > > The way x86 is going, user mode is responsible for establishing
> > > shadow stack and thus
> > > whenever shadow stack is created then if x86 kernel
> > > implementation
> > > always place a token
> > > at the base/bottom.
> > 
> > There was also some discussion recently of adding a token AND an
> > end of
> > stack marker, as a potential solution for backtracing in ucontext
> > stacks. In this case it could cause an ABI break to just start
> > adding
> > the end of stack marker where the token was, and so would require a
> > new
> > map_shadow_stack flag.
> 
> Was it discussed why the restore token itself can't be used as the
> marker for the end of stack (if we assume there is always going to
> be one at the bottom)? It's a unique value: an address pointing to
> itself.

I thought the same thing at first, but it gets clobbered during the
pivot and push.

> 
> > 
> > > 
> > > Now user mode can do following:--
> > >   - If it has access to WRSS, it can sure go ahead and create a
> > > token
> > > of its choosing and
> > >     overwrite kernel created token. and then do RSTORSSP on it's
> > > own
> > > created token.
> > > 
> > >   - If it doesn't have access to WRSS (and dont need to create
> > > its
> > > own token), it can do
> > >     RSTORSSP on this. As soon as it does, no other thread in
> > > process
> > > can restore to it.
> > >     On `fork`, you get the same un-restorable token.
> > > 
> > > So why not always have a token at the bottom.
> > > This is my plan for riscv implementation as well (to have a token
> > > at
> > > the bottom)
> > > 
> > > > 
> > > > The other thing is: should shadow stack memory creation be
> > > > tightly
> > > > controlled? For example in x86 we limit this to anonymous
> > > > memory,
> > > > etc.
> > > > Some reasons for this are x86 specific, but some are not. So if
> > > > we
> > > > disallow most of the options why allow the interface to take
> > > > them?
> > > > And
> > > > then you are in the position of carefully maintaining a list of
> > > > not-
> > > > allowed options instead letting a list of allowed options sit
> > > > there.
> > > 
> > > I am new to linux kernel and thus may be not able to follow the
> > > argument of
> > > limiting to anonymous memory.
> > > 
> > > Why is limiting it to anonymous memory a problem. IIRC, ARM's
> > > PROT_MTE is applicable
> > > only to anonymous memory. I can probably find few more examples.
> > 
> > Oh I see, they have a special arch VMA flag VM_MTE_ALLOWED that
> > only
> > gets set if all the rules are followed. Then PROT_MTE can only be
> > set
> > on that to set VM_MTE. That is kind of nice because certain other
> > special situations can choose to support it.
> 
> That's because MTE is different. It allows assigning tags to
> existing virtual memory, so one needs to know whether a given memory
> region can have tags assigned.
> 
> > 
> > It does take another arch vma flag though. For x86 I guess I would
> > need
> > to figure out how to squeeze VM_SHADOW_STACK into other flags to
> > have a
> > free flag to use the same method. It also only supports mprotect()
> > and
> > shadow stack would only want to support mmap(). And you still have
> > the
> > initialization stuff to plumb through. Yea, I think the PROT_MTE is
> > a
> > good thing to consider, but it's not super obvious to me how
> > similar
> > the logic would be for shadow stack.
> 
> I don't think you need another VMA flag. Memory tagging allows
> adding tags to existing virtual memory.

...need another VMA flag to use the existing mmap arch breakouts in the
same way as VM_MTE. Of course changing mmap makes other solutions
possible.

>  That's why having `mprotect` makes sense for MTE.
> In the shadow stack case, there is no requirement to change a shadow
> stack to regular memory or vice-versa.

uffd needs mprotect internals. You might take a look at it in regard
to your VM_WRITE/mprotect blocking approach for riscv. I was imagining
that, even if mmap were the syscall, mprotect() would not be blocked
in the x86 case at least. The mprotect() blocking is a separate thing
from the syscall, right?

> 
> All that needs to change is `mmap`; `mprotect` should fail. The
> syscall approach gives that benefit by default because there is no
> protection flag for shadow stack.
> 
> I was giving an example that any feature which gives new meaning to
> virtual memory has been able to work with the existing memory
> mapping APIs without needing a new system call (regardless of
> whether you're dealing with anonymous memory).
> 
> > 
> > The question I'm asking though is, not "can mmap code and rules be
> > changed to enforce the required limitations?". I think it is yes.
> > But
> > the question is "why is that plumbing better than a new syscall?".
> > I
> > guess to get a better idea, the mmap solution would need to get
> > POCed.
> > I had half done this at one point, but abandoned the approach.
> > 
> > For your question about why limit it, the special x86 case is the
> > Dirty=1,Write=0 PTE bit combination for shadow stacks. So for
> > shadow
> > stack you could have some confusion about whether a PTE is actually
> > dirty for writeback, etc. I wouldn't say it's known to be
> > impossible to
> > do MAP_SHARED, but it has not been fully analyzed enough to know
> > what
> > the changes would be. There were some solvable concrete issues that
> > tipped the scale as well. It was also not expected to be a common
> > usage, if at all.
> 
> I am not sure the confusion around D=1,W=0 is completely taken away
> by the syscall approach either. It'll always be there; one can only
> do things to minimize the chances.
> 
> In the syscall approach, the syscall makes sure that
> 
> `flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G`
> 
> This could just as easily be checked in an arch-specific landing
> function for mmap.

Right, this is why I listed two types of things in the mix here. The
memory features supported, and what the syscall is. You asked why limit
the memory features, so that is the explanation.

> 
> 
> Additionally, if you always have the token at the base, you don't
> need that ABI between user and kernel.
> 
> 
> > 
> > The non-x86, general reasons for it, are for a smaller benefit. It
> > blocks a lot of ways shadow stack memory could be written to. Like
> > say
> > you have a memory mapped writable file, and you also map it shadow
> > stack. So it has better security properties depending on what your
> > threat model is.
> 
> I wouldn't say any architecture should allow such primitives. It
> kind of defeats the purpose of shadow stack. Yes, if some sort of
> secure memory is needed, there may be new ISA extensions for that.

Yea, seems reasonable to prevent this regardless of the extra x86
reasons, if that is what you are saying. It depends on people's threat
models (as always in security).

> 
> > 
> > > 
> > > Eventually syscall will also go ahead and use memory management
> > > code
> > > to
> > > perform mapping. So I didn't understand the reasoning here. The
> > > way
> > > syscall
> > > can limit it to anonymous memory, why mmap can't do the same if
> > > it
> > > sees
> > > PROT_SHADOWSTACK.
> > > 
> > > > 
> > > > The only benefit I've heard is that it saves creating a new
> > > > syscall,
> > > > but it also saves several MAP_ flags. That, and that the RFC
> > > > for
> > > > riscv
> > > > did a PROT_SHADOW_STACK to start. So, yes, two people asked the
> > > > same
> > > > question, but I'm still not seeing any benefits. Can you give
> > > > the
> > > > pros
> > > > and cons please?
> > > 
> > > Again the way syscall will limit it to anonymous memory, Why mmap
> > > can't do same?
> > > There is precedence for it (like PROT_MTE is applicable only to
> > > anonymous memory)
> > > 
> > > So if it can be done, then why introduce a new syscall?
> > > 
> > > > 
> > > > BTW, in glibc map_shadow_stack is called from arch code. So I
> > > > think
> > > > userspace wise, for this to affect other architectures there
> > > > would
> > > > need
> > > > to be some code that could do things generically, with somehow
> > > > the
> > > > shadow stack pivot abstracted but the shadow stack allocation
> > > > not.
> > > 
> > > Agreed, yes it can be done in a way where it won't put tax on
> > > other
> > > architectures.
> > > 
> > > But what about fragmentation within x86. Will x86 always choose
> > > to
> > > use system call
> > > method map shadow stack. If future re-factor results in x86 also
> > > use
> > > `mmap` method.
> > > Isn't it a mess for x86 glibc to figure out what to do; whether
> > > to
> > > use system call
> > > or `mmap`?
> > > 
> > 
> > Ok, so this is the downside I guess. What happens if we want to
> > support
> > the other types of memory in the future and end up using mmap for
> > this?
> > Then we have 15-20 lines of extra syscall wrapping code to maintain
> > to
> > support legacy.
> > 
> > For the mmap solution, we have the downside of using extra MAP_
> > flags,
> > and *some* amount of currently unknown vm_flag and address range
> > logic,
> > plus mmap arch breakouts to add to core MM. Like I said earlier,
> > you
> > would need to POC it out to see how bad that looks and get some
> > core MM
> > feedback on the new type of MAP flag usage. But, syscalls being
> > pretty
> > straightforward, it would probably be *some* amount of added
> > complexity
> > _now_ to support something that might happen in the future. I'm not
> > seeing either one as a landslide win.
> > 
> > It's kind of an eternal software design philosophical question,
> > isn't
> > it? How much work should you do to prepare for things that might be
> > needed in the future? From what I've seen the balance in the kernel
> > seems to be to try not to paint yourself in to an ABI corner, but
> > otherwise let the kernel evolve naturally in response to real
> > usages.
> > If anyone wants to correct this, please do. But otherwise I think
> > the
> > new syscall is aligned with that.
> > 
> > TBH, you are making me wonder if I'm missing something. It seems
> > you
> > strongly don't prefer this approach, but I'm not hearing any huge
> > potential negative impacts. And you also say it won't tax the riscv
> > implementation. Is this just something just smells bad here? Or it
> > would shrink the riscv series?
> 
> No, you're not missing anything. It's just the weirdness of adding a
> system call which enforces certain MAP_XX flags and is pretty much a
> mapping API, plus the difference between architectures in how they
> will create shadow stacks. And if x86 chooses to use `mmap` in the
> future, then there is ugliness in user mode to decide which method
> to choose.

Ok, I think I will leave it given it's entirely in arch/x86. It just
got some special error codes in the other thread today too.

> 
> And yes, you got it right: to some extent there is my own
> selfishness playing out here as well, to reduce the riscv patches.
> 

Feel free to join the map_shadow_stack party. :)

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 38/41] x86/fpu: Add helper for initing features
  2023-02-27 22:29 ` [PATCH v7 38/41] x86/fpu: Add helper for initing features Rick Edgecombe
@ 2023-03-11 12:54   ` Borislav Petkov
  2023-03-13  2:45     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-11 12:54 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug

On Mon, Feb 27, 2023 at 02:29:54PM -0800, Rick Edgecombe wrote:
> Subject: Re: [PATCH v7 38/41] x86/fpu: Add helper for initing features

"initializing"

> If an xfeature is saved in a buffer, the xfeature's bit will be set in
> xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
> is in it's init state. In this case the xfeature buffer address cannot

"its"

> be retrieved with get_xsave_addr().
> 
> Future patches will need to handle the case of writing to an xfeature
> that may not be saved. So provide helpers to init an xfeature in an
> xsave buffer.
> 
> This could of course be done directly by reaching into the xsave buffer,
> however this would not be robust against future changes to optimize the
> xsave buffer by compacting it. In that case the xsave buffer would need
> to be re-arranged as well. So the logic properly belongs encapsulated
> in a helper where the logic can be unified.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> v2:
>  - New patch
> ---
>  arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++-------
>  arch/x86/kernel/fpu/xstate.h |  6 ++++
>  2 files changed, 53 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 13a80521dd51..3ff80be0a441 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -934,6 +934,24 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
>  	return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
>  }
>  
> +static int xsave_buffer_access_checks(int xfeature_nr)

Function name needs a verb.

> +{
> +	/*
> +	 * Do we even *have* xsave state?
> +	 */

That comment is superfluous.

> +	if (!boot_cpu_has(X86_FEATURE_XSAVE))

check_for_deprecated_apis: WARNING: arch/x86/kernel/fpu/xstate.c:942: Do not use boot_cpu_has() - use cpu_feature_enabled() instead.

> +		return 1;
> +
> +	/*
> +	 * We should not ever be requesting features that we

Please use passive voice in your commit message: no "we" or "I", etc,
and describe your changes in imperative mood.

> +	 * have not enabled.
> +	 */
> +	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  /*
>   * Given the xsave area and a state inside, this function returns the
>   * address of the state.
> @@ -954,17 +972,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
>   */
>  void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
>  {
> -	/*
> -	 * Do we even *have* xsave state?
> -	 */
> -	if (!boot_cpu_has(X86_FEATURE_XSAVE))
> -		return NULL;
> -
> -	/*
> -	 * We should not ever be requesting features that we
> -	 * have not enabled.
> -	 */
> -	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
> +	if (xsave_buffer_access_checks(xfeature_nr))
>  		return NULL;
>  
>  	/*
> @@ -984,6 +992,34 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
>  	return __raw_xsave_addr(xsave, xfeature_nr);
>  }
>  
> +/*
> + * Given the xsave area and a state inside, this function
> + * initializes an xfeature in the buffer.

s/this function initializes/initialize/

> + *
> + * get_xsave_addr() will return NULL if the feature bit is
> + * not present in the header. This function will make it so
> + * the xfeature buffer address is ready to be retrieved by
> + * get_xsave_addr().

So users of get_xsave_addr() would have to know that they would need to
call init_xfeature()?

I think the better approach would be:

void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr, bool init)

and then that @init controls whether get_xsave_addr() should init the
buffer.

And then you don't have to have a bunch of small functions here and
there and know when to call what but get_xsave_addr() would simply DTRT.
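
Something like this, perhaps, glossing over the compacted-buffer
details (an untested sketch of the suggestion):

	void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr, bool init)
	{
		if (xsave_buffer_access_checks(xfeature_nr))
			return NULL;

		/*
		 * A feature in its init state is not present in the
		 * buffer; optionally mark it present so the address
		 * below is valid.
		 */
		if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr))) {
			if (!init)
				return NULL;
			xsave->header.xfeatures |= BIT_ULL(xfeature_nr);
		}

		return __raw_xsave_addr(xsave, xfeature_nr);
	}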

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 39/41] x86: Add PTRACE interface for shadow stack
  2023-02-27 22:29 ` [PATCH v7 39/41] x86: Add PTRACE interface for shadow stack Rick Edgecombe
@ 2023-03-11 15:06   ` Borislav Petkov
  2023-03-13  2:53     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-11 15:06 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Yu-cheng Yu

On Mon, Feb 27, 2023 at 02:29:55PM -0800, Rick Edgecombe wrote:
> The only downside to not having a generic supervisor xfeature regset,
> is that apps need to be enlightened of any new supervisor xfeature
> exposed this way (i.e. they can't try to have generic save/restore
> logic). But maybe that is a good thing, because they have to think
> through each new xfeature instead of encountering issues when new a new

Remove the first "new".

> supervisor xfeature was added.
> 
> By adding a shadow stack regset, it also has the effect of including the
> shadow stack state in a core dump, which could be useful for debugging.
> 
> The shadow stack specific xstate includes the SSP, and the shadow stack
> and WRSS enablement status. Enabling shadow stack or wrss in the kernel
						       ^^^^

"WRSS"

> involves more than just flipping the bit. The kernel is made aware that
> it has to do extra things when cloning or handling signals. That logic
> is triggered off of separate feature enablement state kept in the task
> struct. So the flipping on HW shadow stack enforcement without notifying
> the kernel to change its behavior would severely limit what an application
> could do without crashing, and the results would depend on kernel
> internal implementation details. There is also no known use for controlling
> this state via prtace today. So only expose the SSP, which is something

Unknown word [prtace] in commit message.
Suggestions: ['ptrace'

> that userspace already has indirect control over.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

I think your SOB should come last:

...
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Pls check whole set.


> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +int ssp_active(struct task_struct *target, const struct user_regset *regset)
> +{
> +	if (target->thread.features & ARCH_SHSTK_SHSTK)
> +		return regset->n;
> +
> +	return 0;
> +}
> +
> +int ssp_get(struct task_struct *target, const struct user_regset *regset,
> +	    struct membuf to)
> +{
> +	struct fpu *fpu = &target->thread.fpu;
> +	struct cet_user_state *cetregs;
> +
> +	if (!boot_cpu_has(X86_FEATURE_USER_SHSTK))

check_for_deprecated_apis: WARNING: arch/x86/kernel/fpu/regset.c:193: Do not use boot_cpu_has() - use cpu_feature_enabled() instead.

Check your whole set pls.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 40/41] x86/shstk: Add ARCH_SHSTK_UNLOCK
  2023-02-27 22:29 ` [PATCH v7 40/41] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
@ 2023-03-11 15:11   ` Borislav Petkov
  2023-03-13  3:04     ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-11 15:11 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, Mike Rapoport

On Mon, Feb 27, 2023 at 02:29:56PM -0800, Rick Edgecombe wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Userspace loaders may lock features before a CRIU restore operation has
> the chance to set them to whatever state is required by the process
> being restored. Allow a way for CRIU to unlock features. Add it as an
> arch_prctl() like the other shadow stack operations, but restrict it being
> called by the ptrace arch_pctl() interface.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>

That tag is kinda implicit here. Unless he doesn't ACK his own patch.
:-P

> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> [Merged into recent API changes, added commit log and docs]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

...

> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 2faf9b45ac72..3197ff824809 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -451,9 +451,14 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long features)
>  		return 0;
>  	}
>  
> -	/* Don't allow via ptrace */
> -	if (task != current)
> +	/* Only allow via ptrace */
> +	if (task != current) {

Is that the only case? task != current means ptrace and there's no other
way to do this from userspace?

Isn't there some flag which says that task is ptraced? I think we should
check that one too...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 38/41] x86/fpu: Add helper for initing features
  2023-03-11 12:54   ` Borislav Petkov
@ 2023-03-13  2:45     ` Edgecombe, Rick P
  2023-03-13 11:03       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-13  2:45 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Sat, 2023-03-11 at 13:54 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:54PM -0800, Rick Edgecombe wrote:
> > Subject: Re: [PATCH v7 38/41] x86/fpu: Add helper for initing
> > features
> 
> "initializing"

Sure.

> 
> > If an xfeature is saved in a buffer, the xfeature's bit will be set
> > in
> > xsave->header.xfeatures. The CPU may opt to not save the xfeature
> > if it
> > is in it's init state. In this case the xfeature buffer address
> > cannot
> 
> "its"

I clearly need to be better about it's and its.

> 
> > be retrieved with get_xsave_addr().
> > 
> > Future patches will need to handle the case of writing to an
> > xfeature
> > that may not be saved. So provide helpers to init an xfeature in an
> > xsave buffer.
> > 
> > This could of course be done directly by reaching into the xsave
> > buffer,
> > however this would not be robust against future changes to optimize
> > the
> > xsave buffer by compacting it. In that case the xsave buffer would
> > need
> > to be re-arranged as well. So the logic properly belongs
> > encapsulated
> > in a helper where the logic can be unified.
> > 
> > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> > Tested-by: John Allen <john.allen@amd.com>
> > Tested-by: Kees Cook <keescook@chromium.org>
> > Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > ---
> > v2:
> >  - New patch
> > ---
> >  arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++---
> > ----
> >  arch/x86/kernel/fpu/xstate.h |  6 ++++
> >  2 files changed, 53 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/fpu/xstate.c
> > b/arch/x86/kernel/fpu/xstate.c
> > index 13a80521dd51..3ff80be0a441 100644
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -934,6 +934,24 @@ static void *__raw_xsave_addr(struct
> > xregs_state *xsave, int xfeature_nr)
> >  	return (void *)xsave + xfeature_get_offset(xcomp_bv,
> > xfeature_nr);
> >  }
> >  
> > +static int xsave_buffer_access_checks(int xfeature_nr)
> 
> Function name needs a verb.

Right.

> 
> > +{
> > +	/*
> > +	 * Do we even *have* xsave state?
> > +	 */
> 
> That comment is superfluous.
> 
> > +	if (!boot_cpu_has(X86_FEATURE_XSAVE))
> 
> check_for_deprecated_apis: WARNING: arch/x86/kernel/fpu/xstate.c:942:
> Do not use boot_cpu_has() - use cpu_feature_enabled() instead.
> 
> > +		return 1;
> > +
> > +	/*
> > +	 * We should not ever be requesting features that we
> 
> Please use passive voice in your commit message: no "we" or "I", etc,
> and describe your changes in imperative mood.

These two are from the existing code. Basically they get extracted into
a new function.

> 
> > +	 * have not enabled.
> > +	 */
> > +	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
> > +		return 1;
> > +
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Given the xsave area and a state inside, this function returns
> > the
> >   * address of the state.
> > @@ -954,17 +972,7 @@ static void *__raw_xsave_addr(struct
> > xregs_state *xsave, int xfeature_nr)
> >   */
> >  void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
> >  {
> > -	/*
> > -	 * Do we even *have* xsave state?
> > -	 */
> > -	if (!boot_cpu_has(X86_FEATURE_XSAVE))
> > -		return NULL;
> > -
> > -	/*
> > -	 * We should not ever be requesting features that we
> > -	 * have not enabled.
> > -	 */
> > -	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
> > +	if (xsave_buffer_access_checks(xfeature_nr))
> >  		return NULL;
> >  
> >  	/*
> > @@ -984,6 +992,34 @@ void *get_xsave_addr(struct xregs_state
> > *xsave, int xfeature_nr)
> >  	return __raw_xsave_addr(xsave, xfeature_nr);
> >  }
> >  
> > +/*
> > + * Given the xsave area and a state inside, this function
> > + * initializes an xfeature in the buffer.
> 
> s/this function initializes/initialize/

Sure.

> 
> > + *
> > + * get_xsave_addr() will return NULL if the feature bit is
> > + * not present in the header. This function will make it so
> > + * the xfeature buffer address is ready to be retrieved by
> > + * get_xsave_addr().
> 
> So users of get_xsave_addr() would have to know that they would need
> to
> call init_xfeature()?

That is the situation today. FWIW both of these functions are limited
to the FPU internals, so I would think it's not too unreasonable an
assumption.

> 
> I think the better approach would be:
> 
> void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr, bool
> init)
> 
> and then that @init controls whether get_xsave_addr() should init the
> buffer.
> 
> And then you don't have to have a bunch of small functions here and
> there and know when to call what but get_xsave_addr() would simply
> DTRT.

It would have to actually copy the init state to the buffer
from init_fpstate, because otherwise the caller couldn't know
if get_xsave_addr() was returning valid data or some old data in the
buffer. And I guess the `init` flag means to initialize it only if it
is in the init state, not to overwrite the current state with the init
state.

I did it up, and it makes the caller code cleaner. But I'm not sure
what to think of it. Is this not mixing two operations together? Today
get_xsave_addr() pretty much just gets a buffer offset with some
checks. Now it would compute the offset and also silently go off and
change the buffer.

I looked at this fpu code originally and thought I could add some
useful abstractions, but this failed. I came away wondering if this was
just an area with so many special cases and details that abstractions
just added confusion. I'm just bringing this up because the other
option is to just do this in the regset code:
xsave->header.xfeatures |= BIT_ULL(XFEATURE_CET_USER);

Let me know if you think it would be better to just open code it.

> 
> Thx.
> 

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 39/41] x86: Add PTRACE interface for shadow stack
  2023-03-11 15:06   ` Borislav Petkov
@ 2023-03-13  2:53     ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-13  2:53 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Sat, 2023-03-11 at 16:06 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:55PM -0800, Rick Edgecombe wrote:
[...]
> > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> > Tested-by: John Allen <john.allen@amd.com>
> > Tested-by: Kees Cook <keescook@chromium.org>
> > Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> I think your SOB should come last:

Right on the commit log typos, and yeah, this is screwed up. I think
Dave re-ordered the SOBs already.

> 
> ...
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> Pls check whole set.
> 
> 
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +int ssp_active(struct task_struct *target, const struct
> > user_regset *regset)
> > +{
> > +     if (target->thread.features & ARCH_SHSTK_SHSTK)
> > +             return regset->n;
> > +
> > +     return 0;
> > +}
> > +
> > +int ssp_get(struct task_struct *target, const struct user_regset
> > *regset,
> > +         struct membuf to)
> > +{
> > +     struct fpu *fpu = &target->thread.fpu;
> > +     struct cet_user_state *cetregs;
> > +
> > +     if (!boot_cpu_has(X86_FEATURE_USER_SHSTK))
> 
> check_for_deprecated_apis: WARNING: arch/x86/kernel/fpu/regset.c:193:
> Do not use boot_cpu_has() - use cpu_feature_enabled() instead.
> 
> Check your whole set pls.

Ok. I think the other case is in "x86/fpu: Add helper for initing
features", where the code was moved.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 40/41] x86/shstk: Add ARCH_SHSTK_UNLOCK
  2023-03-11 15:11   ` Borislav Petkov
@ 2023-03-13  3:04     ` Edgecombe, Rick P
  2023-03-13 11:05       ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-13  3:04 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, rppt, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Sat, 2023-03-11 at 16:11 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:56PM -0800, Rick Edgecombe wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Userspace loaders may lock features before a CRIU restore operation
> > has
> > the chance to set them to whatever state is required by the process
> > being restored. Allow a way for CRIU to unlock features. Add it as
> > an
> > arch_prctl() like the other shadow stack operations, but restrict
> > it to being
> > called via the ptrace arch_prctl() interface.
> > 
> > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> > Tested-by: John Allen <john.allen@amd.com>
> > Tested-by: Kees Cook <keescook@chromium.org>
> > Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> 
> That tag is kinda implicit here. Unless he doesn't ACK his own patch.
> :-P

Uhh, right. This was me mindlessly adding his ack to all the patches in
the series.

> 
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > [Merged into recent API changes, added commit log and docs]
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ...
> 
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index 2faf9b45ac72..3197ff824809 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -451,9 +451,14 @@ long shstk_prctl(struct task_struct *task, int
> > option, unsigned long features)
> >                return 0;
> >        }
> >   
> > -     /* Don't allow via ptrace */
> > -     if (task != current)
> > +     /* Only allow via ptrace */
> > +     if (task != current) {
> 
> Is that the only case? task != current means ptrace and there's no
> other
> way to do this from userspace?

Not that I could see...

> 
> Isn't there some flag which says that task is ptraced? I think we
> should
> check that one too...

This is how the other arch_prctl()s handle it (if they do handle it,
some don't). So I would think it would be nice to keep all the logic
the same.

I guess the flag might work based on the assumption that if the task is
being ptraced, the arch_prctl() couldn't be coming from anywhere else.
Maybe it should get a nicely named helper that they could all use, with
whatever the best logic turns out to be documented in a comment.

Would this maybe be better as a future cleanup that did the change for
them all? 
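
Something like this, maybe (completely untested sketch, name made up;
assumes checking task->ptrace is enough to identify the ptrace path):

	/*
	 * Hypothetical helper: treat the request as coming via ptrace
	 * only if the target is a different task that is actually being
	 * traced. Illustrative only, not an existing kernel API.
	 */
	static inline bool shstk_prctl_via_ptrace(struct task_struct *task)
	{
		return task != current && (task->ptrace & PT_PTRACED);
	}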

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 38/41] x86/fpu: Add helper for initing features
  2023-03-13  2:45     ` Edgecombe, Rick P
@ 2023-03-13 11:03       ` Borislav Petkov
  2023-03-13 16:10         ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-13 11:03 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	akpm, debug, jamorris, john.allen, rppt, andrew.cooper3, mingo,
	corbet, linux-kernel, linux-api, gorcunov

On Mon, Mar 13, 2023 at 02:45:08AM +0000, Edgecombe, Rick P wrote:
> These two are from the existing code. Basically they get extracted into
> a new function.

I know but you can fix them while at it.

> I did it up, and it makes the caller code cleaner. But I'm not sure
> what to think of it. Is this not mixing two operations together? Today
> get_xsave_addr() pretty much just gets a buffer offset with some
> checks. Now it would compute the offset and also silently go off and
> changes the buffer.

Ok, so why don't you write the call site this way instead:

        cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
        if (!cetregs) {
                if (xfeature_saved(xsave, XFEATURE_CET_USER)) {
                        WARN(1, "something's wrong with this buffer");
                        return ...;
                }

                /* Not saved, initialize it */
                init_xfeature(xsave, XFEATURE_CET_USER);
        }

        cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
        if (!cetregs) {
                WARN(1, "WTF");
                return -ENODEV;
        }

Now it is clear what happens and it is a common code pattern of trying
to get something and initializing it if it wasn't initialized yet, and
then retrying...

Hmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 40/41] x86/shstk: Add ARCH_SHSTK_UNLOCK
  2023-03-13  3:04     ` Edgecombe, Rick P
@ 2023-03-13 11:05       ` Borislav Petkov
  0 siblings, 0 replies; 159+ messages in thread
From: Borislav Petkov @ 2023-03-13 11:05 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, rppt, kcc,
	linux-arch, pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz,
	x86, akpm, debug, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On Mon, Mar 13, 2023 at 03:04:10AM +0000, Edgecombe, Rick P wrote:
> This is how the other arch_prctl()s handle it (if they do handle it,
> some don't). So I would think it would be nice to keep all the logic
> the same.
> 
> I guess the flag might work based on the assumption that if the task is
> being ptraced, the arch_prctl() couldn't be coming from anywhere else.
> Maybe it should get a nicely named helper that they could all use and
> whatever best logic could be commented.
> 
> Would this maybe be better as a future cleanup that did the change for
> them all?

Yeah, I'm just being overly paranoid.

Because if there's another way to unlock that feature, then this whole
"overhead" we're doing is for nothing.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 38/41] x86/fpu: Add helper for initing features
  2023-03-13 11:03       ` Borislav Petkov
@ 2023-03-13 16:10         ` Edgecombe, Rick P
  2023-03-13 17:10           ` Borislav Petkov
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-13 16:10 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Yang, Weijiang, Lutomirski, Andy,
	jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz, debug,
	linux-doc, x86, andrew.cooper3, john.allen, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2023-03-13 at 12:03 +0100, Borislav Petkov wrote:
> On Mon, Mar 13, 2023 at 02:45:08AM +0000, Edgecombe, Rick P wrote:
> > These two are from the existing code. Basically they get extracted
> > into
> > a new function.
> 
> I know but you can fix them while at it.

Ok.

> 
> > I did it up, and it makes the caller code cleaner. But I'm not sure
> > what to think of it. Is this not mixing two operations together?
> > Today
> > get_xsave_addr() pretty much just gets a buffer offset with some
> > checks. Now it would compute the offset and also silently go off
> > and
> > changes the buffer.
> 
> Ok, so why don't you write the call site this way instead:
> 
>         cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
>         if (!cetregs) {
>                 if (xfeature_saved(xsave, XFEATURE_CET_USER)) {
>                         WARN(1, "something's wrong with this buffer");
>                         return ...;
>                 }
> 
>                 /* Not saved, initialize it */
>                 init_xfeature(xsave, XFEATURE_CET_USER);
>         }
> 
>         cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
>         if (!cetregs) {
>                 WARN(1, "WTF");
>                 return -ENODEV;
>         }
> 
> Now it is clear what happens and it is a common code pattern of
> trying
> to get something and initializing it if it wasn't initialized yet,
> and
> then retrying...
> 
> Hmm?

This seems more clear. I'm sorry for the noise here though, because
this has made me realize that the initing logic should never be hit. We
used to support the full CET_U state in ptrace, but then dropped it to
just the SSP and only allowed it when shadow stack is active. This
means that CET_U will always have at least the CET_SHSTK_EN bit set and
so not be in the init state. So this can probably just warn and bail if
it sees an init state.

Unless the extra logic seems more robust? But it is always nice when
the chance comes to drop a patch out of this thing...

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 38/41] x86/fpu: Add helper for initing features
  2023-03-13 16:10         ` Edgecombe, Rick P
@ 2023-03-13 17:10           ` Borislav Petkov
  2023-03-13 23:31             ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Borislav Petkov @ 2023-03-13 17:10 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Yang, Weijiang, Lutomirski, Andy,
	jamorris, arnd, tglx, Schimpe, Christina, mike.kravetz, debug,
	linux-doc, x86, andrew.cooper3, john.allen, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, Mar 13, 2023 at 04:10:14PM +0000, Edgecombe, Rick P wrote:
> This seems more clear. I'm sorry for the noise here though, because
> this has made me realize that the initing logic should never be hit. We
> used to support the full CET_U state in ptrace, but then dropped it to
> just the SSP and only allowed it when shadow stack is active.

Right, you do check that at function entry.

> This means that CET_U will always have at least the CET_SHSTK_EN bit
> set and so not be in the init state. So this can probably just warn
> and bail if it sees an init state.

I don't mind the additional checks as this is a security thing so
sanity checks are good, especially if they're cheap.

And you don't need to reinit the buffer - just scream loudly when get_xsave_addr()
returns NULL.
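
IOW, something like this at the call site (untested sketch):

        cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
        if (WARN_ON(!cetregs))
                return -ENODEV;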

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 38/41] x86/fpu: Add helper for initing features
  2023-03-13 17:10           ` Borislav Petkov
@ 2023-03-13 23:31             ` Edgecombe, Rick P
  0 siblings, 0 replies; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-13 23:31 UTC (permalink / raw)
  To: bp
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, akpm, Lutomirski, Andy, jamorris, arnd,
	tglx, Schimpe, Christina, mike.kravetz, debug, Yang, Weijiang,
	x86, andrew.cooper3, john.allen, linux-doc, rppt, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2023-03-13 at 18:10 +0100, Borislav Petkov wrote:
> > This means that CET_U will always have at least the CET_SHSTK_EN
> > bit
> > set and so not be in the init state. So this can probably just warn
> > and bail if it sees an init state.
> 
> I don't mind the additional checks as this is a security thing so
> sanity checks are good, especially if they're cheap.
> 
> And you don't need to reinit the buffer - just scream loudly when
> get_xsave_addr()
> returns NULL.

Ok, will do this instead in "x86: Add PTRACE interface for shadow
stack".

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-09 18:55     ` Deepak Gupta
  2023-03-09 19:39       ` Edgecombe, Rick P
@ 2023-03-14  7:19       ` Mike Rapoport
  2023-03-16 19:30         ` Deepak Gupta
  1 sibling, 1 reply; 159+ messages in thread
From: Mike Rapoport @ 2023-03-14  7:19 UTC (permalink / raw)
  To: Deepak Gupta
  Cc: Szabolcs Nagy, Rick Edgecombe, x86, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, linux-kernel, linux-doc, linux-mm,
	linux-arch, linux-api, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, nd, al.grant

Hi,

On Thu, Mar 09, 2023 at 10:55:11AM -0800, Deepak Gupta wrote:
> On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote:
> > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > Previously, a new PROT_SHADOW_STACK was attempted,
> > ...
> > > So rather than repurpose two existing syscalls (mmap, madvise) that don't
> > > quite fit, just implement a new map_shadow_stack syscall to allow
> > > userspace to map and setup new shadow stacks in one step. While ucontext
> > > is the primary motivator, userspace may have other unforeseen reasons to
> > > setup its own shadow stacks using the WRSS instruction. Towards this
> > > provide a flag so that stacks can be optionally setup securely for the
> > > common case of ucontext without enabling WRSS. Or potentially have the
> > > kernel set up the shadow stack in some new way.
> > ...
> > > The following example demonstrates how to create a new shadow stack with
> > > map_shadow_stack:
> > > void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
> > 
> > i think
> > 
> > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);
> > 
> > could do the same with less disruption to users (new syscalls
> > are harder to deal with than new flags). it would do the
> > guard page and initial token setup too (there is no flag for
> > it but could be squeezed in).
> 
> Discussion on this topic in v6
> https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/
> 
> Again I know earlier CET patches had protection flag and somehow due to pushback
> on mailing list, it was adopted to go for special syscall because no one else
> had shadow stack.
> 
> Seeing a response from Szabolcs, I am assuming arm64 would also want to follow
> using mmap to manufacture shadow stack. For reference RFC patches for risc-v shadow stack,
> use a new protection flag = PROT_SHADOWSTACK.
> https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
> 
> I know earlier discussion had been that we let this go and do a re-factor later as other
> arch support trickles in. But as I thought more on this, I think it may just be
> messy from the user mode point of view as well to be cognizant of two different ways of
> creating shadow stack. One would be special syscall (in current libc) and another `mmap`
> (whenever future re-factor happens)
> 
> If it's not too late, it would be more wise to take `mmap`
> approach rather than special `syscall` approach.
 
I disagree. 

Having shadow stack flags for mmap() adds unnecessary complexity to the
core-mm, while having a dedicated syscall hides all the details in the
architecture specific code.

Another reason is that a dedicated system call allows for better
extensibility if/when we'd need to update the way the shadow stack VMA
is created.

As for userspace convenience, special code for creating the shadow
stack is required anyway, and it wouldn't matter much whether that code
used mmap(NEW_FLAG) or map_shadow_stack().
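
For illustration, the two call shapes under discussion would look
roughly like this (MAP_SHADOW_STACK is the proposed mmap flag, not an
existing one, and map_shadow_stack() would go through syscall(2) until
a libc wrapper exists):

	/* dedicated syscall flavour */
	shstk = (void *)syscall(__NR_map_shadow_stack, 0, size,
				SHADOW_STACK_SET_TOKEN);

	/* hypothetical mmap flavour */
	shstk = mmap(NULL, size, PROT_READ,
		     MAP_ANONYMOUS | MAP_SHADOW_STACK, -1, 0);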

> > most of the mmap features need not be available (EINVAL) when
> > MAP_SHADOW_STACK is specified.
> > 
> > the main drawback is running out of mmap flags so extension
> > is limited. (but the new syscall has limitations too).

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-14  7:19       ` Mike Rapoport
@ 2023-03-16 19:30         ` Deepak Gupta
  2023-03-20 11:35           ` Szabolcs Nagy
  0 siblings, 1 reply; 159+ messages in thread
From: Deepak Gupta @ 2023-03-16 19:30 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Szabolcs Nagy, Rick Edgecombe, x86, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, linux-kernel, linux-doc, linux-mm,
	linux-arch, linux-api, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, nd, al.grant

On Tue, Mar 14, 2023 at 12:19 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi,
>
> On Thu, Mar 09, 2023 at 10:55:11AM -0800, Deepak Gupta wrote:
> > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote:
> > > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > > Previously, a new PROT_SHADOW_STACK was attempted,
> > > ...
> > > > So rather than repurpose two existing syscalls (mmap, madvise) that don't
> > > > quite fit, just implement a new map_shadow_stack syscall to allow
> > > > userspace to map and setup new shadow stacks in one step. While ucontext
> > > > is the primary motivator, userspace may have other unforeseen reasons to
> > > > setup its own shadow stacks using the WRSS instruction. Towards this
> > > > provide a flag so that stacks can be optionally setup securely for the
> > > > common case of ucontext without enabling WRSS. Or potentially have the
> > > > kernel set up the shadow stack in some new way.
> > > ...
> > > > The following example demonstrates how to create a new shadow stack with
> > > > map_shadow_stack:
> > > > void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
> > >
> > > i think
> > >
> > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0);
> > >
> > > could do the same with less disruption to users (new syscalls
> > > are harder to deal with than new flags). it would do the
> > > guard page and initial token setup too (there is no flag for
> > > it but could be squeezed in).
> >
> > Discussion on this topic in v6
> > https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/
> >
> > Again I know earlier CET patches had protection flag and somehow due to pushback
> > on mailing list, it was adopted to go for special syscall because no one else
> > had shadow stack.
> >
> > Seeing a response from Szabolcs, I am assuming arm64 would also want to follow
> > using mmap to manufacture shadow stack. For reference RFC patches for risc-v shadow stack,
> > use a new protection flag = PROT_SHADOWSTACK.
> > https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
> >
> > I know earlier discussion had been that we let this go and do a re-factor later as other
> > arch support trickles in. But as I thought more on this, I think it may just be
> > messy from the user mode point of view as well to be cognizant of two different ways of
> > creating shadow stack. One would be special syscall (in current libc) and another `mmap`
> > (whenever future re-factor happens)
> >
> > If it's not too late, it would be more wise to take `mmap`
> > approach rather than special `syscall` approach.
>
> I disagree.
>
> Having shadow stack flags for mmap() adds unnecessary complexity to the
> core-mm, while having a dedicated syscall hides all the details in the
> architecture specific code.

Again, reiterating: it would've made sense if only x86 had a shadow
stack. aarch64 has announced support for guarded control stacks, and a
risc-v spec to support shadow stacks is in development.

So there will be shadow stack related flows in these arches.

>
> Another reason to use a dedicated system call allows for better
> extensibility if/when we'd need to update the way shadow stack VMA is
> created.

I see two valid points here:
    - Shadow stack doesn't need conversion into different memory types
      (which is usually the case for address ranges created by mmap).
      So the page permissions on a shadow stack are static and not
      mutable.

    - Future feature additions (if any are needed) at the time of
      shadow stack creation would avoid a future tax on mmap.

I'll think more about this.

>
> As for the userspace convenience, it is anyway required to add special
> code for creating the shadow stack and it wouldn't matter if that code
> would use mmap(NEW_FLAG) or map_shadow_stack().

Yes, *strictly* from a userspace convenience standpoint, it doesn't matter which option is used.

>
> > > most of the mmap features need not be available (EINVAL) when
> > > MAP_SHADOW_STACK is specified.
> > >
> > > the main drawback is running out of mmap flags so extension
> > > is limited. (but the new syscall has limitations too).
>
> --
> Sincerely yours,
> Mike.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-10 21:43               ` Edgecombe, Rick P
@ 2023-03-16 20:07                 ` Deepak Gupta
  0 siblings, 0 replies; 159+ messages in thread
From: Deepak Gupta @ 2023-03-16 20:07 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, szabolcs.nagy,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, tglx, pavel, x86, mike.kravetz, Schimpe,
	Christina, al.grant, nd, john.allen, linux-doc, rppt,
	andrew.cooper3, mingo, corbet, linux-kernel, linux-api, gorcunov,
	akpm

On Fri, Mar 10, 2023 at 1:43 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2023-03-10 at 13:00 -0800, Deepak Gupta wrote:
> > On Fri, Mar 10, 2023 at 12:14:01AM +0000, Edgecombe, Rick P wrote:
> > > On Thu, 2023-03-09 at 13:08 -0800, Deepak Gupta wrote:
> > > > On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P
> > > > wrote:
> > > > > On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote:
> > > > > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy
> > > > > > wrote:
> > > > > > > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > > > > > > Previously, a new PROT_SHADOW_STACK was attempted,
> > > > > > >
> > > > > > > ...
> > > > > > > > So rather than repurpose two existing syscalls (mmap,
> > > > > > > > madvise)
> > > > > > > > that don't
> > > > > > > > quite fit, just implement a new map_shadow_stack syscall
> > > > > > > > to
> > > > > > > > allow
> > > > > > > > userspace to map and setup new shadow stacks in one step.
> > > > > > > > While
> > > > > > > > ucontext
> > > > > > > > is the primary motivator, userspace may have other
> > > > > > > > unforeseen
> > > > > > > > reasons to
> > > > > > > > setup its own shadow stacks using the WRSS instruction.
> > > > > > > > Towards
> > > > > > > > this
> > > > > > > > provide a flag so that stacks can be optionally setup
> > > > > > > > securely
> > > > > > > > for the
> > > > > > > > common case of ucontext without enabling WRSS. Or
> > > > > > > > potentially
> > > > > > > > have the
> > > > > > > > kernel set up the shadow stack in some new way.
> > > > > > >
> > > > > > > ...
> > > > > > > > The following example demonstrates how to create a new
> > > > > > > > shadow
> > > > > > > > stack with
> > > > > > > > map_shadow_stack:
> > > > > > > > void *shstk = map_shadow_stack(addr, stack_size,
> > > > > > > > SHADOW_STACK_SET_TOKEN);
> > > > > > >
> > > > > > > i think
> > > > > > >
> > > > > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1,
> > > > > > > 0);
> > > > > > >
> > > > > > > could do the same with less disruption to users (new
> > > > > > > syscalls
> > > > > > > are harder to deal with than new flags). it would do the
> > > > > > > guard page and initial token setup too (there is no flag
> > > > > > > for
> > > > > > > it but could be squeezed in).
> > > > > >
> > > > > > Discussion on this topic in v6
> > > > > >
> > > > >
> > > > >
> > >
> > >
> https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/
> > > > > >
> > > > > > Again I know earlier CET patches had protection flag and
> > > > > > somehow
> > > > > > due
> > > > > > to pushback
> > > > > > on mailing list,
> > > > > >  it was adopted to go for special syscall because no one else
> > > > > > had shadow stack.
> > > > > >
> > > > > > Seeing a response from Szabolcs, I am assuming arm64 would
> > > > > > also
> > > > > > want
> > > > > > to follow
> > > > > > using mmap to manufacture shadow stack. For reference RFC
> > > > > > patches
> > > > > > for
> > > > > > risc-v shadow stack,
> > > > > > use a new protection flag = PROT_SHADOWSTACK.
> > > > > >
> > > > >
> > > > >
> > >
> > >
> https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
> > > > > >
> > > > > > I know earlier discussion had been that we let this go and do
> > > > > > a
> > > > > > re-
> > > > > > factor later as other
> > > > > > arch support trickles in. But as I thought more on this, I
> > > > > > think it
> > > > > > may just be
> > > > > > messy from the user mode point of view as well to be cognizant
> > > > > > of
> > > > > > two
> > > > > > different ways of
> > > > > > creating shadow stack. One would be special syscall (in
> > > > > > current
> > > > > > libc)
> > > > > > and another `mmap`
> > > > > > (whenever future re-factor happens)
> > > > > >
> > > > > > If it's not too late, it would be more wise to take `mmap`
> > > > > > approach rather than special `syscall` approach.
> > > > >
> > > > > There are sort of two things intermixed here when we talk about
> > > > > a
> > > > > PROT_SHADOW_STACK.
> > > > >
> > > > > One is: what is the interface for specifying how the shadow
> > > > > stack
> > > > > should be provisioned with data? Right now there are two ways
> > > > > supported, all zero or with an X86 shadow stack restore token
> > > > > at
> > > > > the
> > > > > end. Then there was already some conversation about a third
> > > > > type.
> > > > > In
> > > > > which case the question would be is using mmap MAP_ flags the
> > > > > right
> > > > > place for this? How many types of initialization will be needed
> > > > > in
> > > > > the
> > > > > end and what is the overlap between the architectures?
> > > >
> > > > First of all, arches can choose to have token at the bottom or
> > > > not.
> > > >
> > > > The token serves the following purposes:
> > > >   - It allows one to put a desired value in the shadow stack pointer
> > > > in a safe/secure manner.
> > > >     Note: x86 doesn't provide any opcode encoding to load a value into the SSP
> > > > register. So having
> > > >     a token is kind of a necessity because x86 doesn't easily
> > > > allow
> > > > writing shadow stack.
> > > >
> > > >   - A token at the bottom acts as a marker / barrier and can be useful
> > > > in
> > > > debugging
> > > >
> > > >   - If (and a big *if*) we ever reach a point in the future where
> > > > the return
> > > > address is only pushed
> > > >     on the shadow stack (x86 should have motivation to do this
> > > > because of
> > > > fewer uops on call/ret),
> > > >     a token at the bottom (bottom means lower address) is a
> > > > sure-shot way
> > > > of ensuring
> > > >     a fault when exhausted.
> > > >
> > > > Current RISCV zisslpcfi proposal doesn't define CPU based tokens
> > > > because it's RISC.
> > > > It provides mechanisms with which software can define the
> > > > formatting of the
> > > > token for itself.
> > > > Not sure what ARM is doing.
> > >
> > > Ok, so riscv doesn't need to have the kernel write the token, but
> > > x86
> > > does.
> > >
> > > >
> > > > Now coming to the point of all zero v/s shadow stack token.
> > > > Why not always have a token at the bottom?
> > >
> > > With WRSS you can setup the shadow stack however you want. So the
> > > user
> > > would then have to take care to erase the token if they didn't want
> > > it.
> > > Not the end of the world, but kind of clunky if there is no reason
> > > for
> > > it.
> >
> > Yes, but the kernel always assumes the user is going to use the token.
> > It's up to the user
> > to decide whether they want to use the restore token or not. If
> > they have WRSS capability,
> > the security posture is diluted anyway. An attacker who is clever
> > enough to
> > re-use an `RSTORSSP` present in the address space to restore using the
> > kernel-prepared token should
> > anyway be clever enough to use WRSS as well.
> >
> > It kind of makes shadow stack creation simpler for the kernel to
> > always place the token.
> > This point holds irrespective of whether a system call or mmap is used.
>
> Think about something like CRIU restoring the shadow stack, or other
> special cases like that. Userspace can always overwrite the token, but
> this involves some amount of extra work (extra writes, faulting in the
> page earlier, etc). It is clunky, even if only negligibly worse.

Faulting in the page earlier because the kernel is writing the token at the base?

>
> >
> > >
> > > >
> > > > In the case of x86, why the need for two ways, and why not always
> > > > have a token
> > > > at the bottom?
> > > > The way x86 is going, user mode is responsible for establishing
> > > > the shadow stack, so
> > > > the x86 kernel implementation could always place a token
> > > > at the base/bottom whenever a shadow stack is created.
> > >
> > > There was also some discussion recently of adding a token AND an
> > > end of
> > > stack marker, as a potential solution for backtracing in ucontext
> > > stacks. In this case it could cause an ABI break to just start
> > > adding
> > > the end of stack marker where the token was, and so would require a
> > > new
> > > map_shadow_stack flag.
> >
> > Was it discussed why the restore token itself can't be used as the
> > marker for the
> > end of stack (if we assume there is always going to be one at the
> > bottom)?
> > It's a unique value: an address pointing to itself.
>
> I thought the same thing at first, but it gets clobbered during the
> pivot and push.

Aah, I remember. It was changed from the savessp/rstorssp pair to
rstorssp/saveprevssp to follow the `make before break` model.

>
> >
> > >
> > > >
> > > > Now user mode can do the following:
> > > >   - If it has access to WRSS, it can go ahead and create a
> > > > token
> > > > of its choosing,
> > > >     overwrite the kernel-created token, and then do RSTORSSP on
> > > > its own
> > > > created token.
> > > >
> > > >   - If it doesn't have access to WRSS (and doesn't need to create
> > > > its
> > > > own token), it can do
> > > >     RSTORSSP on this. As soon as it does, no other thread in the
> > > > process
> > > >     can restore to it.
> > > >     On `fork`, you get the same un-restorable token.
> > > >
> > > > So why not always have a token at the bottom?
> > > > This is my plan for the riscv implementation as well (to have a
> > > > token at
> > > > the bottom).
> > > >
> > > > >
> > > > > The other thing is: should shadow stack memory creation be
> > > > > tightly
> > > > > controlled? For example in x86 we limit this to anonymous
> > > > > memory,
> > > > > etc.
> > > > > Some reasons for this are x86 specific, but some are not. So if
> > > > > we
> > > > > disallow most of the options why allow the interface to take
> > > > > them?
> > > > > And
> > > > > then you are in the position of carefully maintaining a list of
> > > > > not-
> > > > > allowed options instead letting a list of allowed options sit
> > > > > there.
> > > >
> > > > I am new to the linux kernel and thus may not be able to follow
> > > > the argument for
> > > > limiting this to anonymous memory.
> > > >
> > > > Why is limiting it to anonymous memory a problem? IIRC, ARM's
> > > > PROT_MTE is applicable
> > > > only to anonymous memory. I can probably find a few more examples.
> > >
> > > Oh I see, they have a special arch VMA flag VM_MTE_ALLOWED that
> > > only
> > > gets set if all the rules are followed. Then PROT_MTE can only be
> > > set
> > > on that to set VM_MTE. That is kind of nice because certain other
> > > special situations can choose to support it.
> >
> > That's because MTE is different. It allows assigning tags to existing
> > virtual memory. So one needs to know whether a memory range can have
> > tags assigned.
> >
> > >
> > > It does take another arch vma flag though. For x86 I guess I would
> > > need
> > > to figure out how to squeeze VM_SHADOW_STACK into other flags to
> > > have a
> > > free flag to use the same method. It also only supports mprotect()
> > > and
> > > shadow stack would only want to support mmap(). And you still have
> > > the
> > > initialization stuff to plumb through. Yea, I think the PROT_MTE is
> > > a
> > > good thing to consider, but it's not super obvious to me how
> > > similar
> > > the logic would be for shadow stack.
> >
> > I don't think you need another VMA flag. Memory tagging allows adding
> > tags
> > to existing virtual memory.
>
> ...need another VMA flag to use the existing mmap arch breakouts in the
> same way as VM_MTE. Of course changing mmap makes other solutions
> possible.
>
> >  That's why having `mprotect` makes sense for MTE.
> > In the shadow stack case, there is no requirement to change a shadow
> > stack
> > to regular memory or vice versa.
>
> uffd needs mprotect internals. You might take a look at it in regards
> to your VM_WRITE/mprotect blocking approach for riscv. I was imagining,
> even if mmap was the syscall, mprotect() would not be blocked in the
> x86 case at least. The mprotect() blocking is a separate thing from the
> syscall, right?

Yes, mprotect blocking is a different thing.
VM_XXX flags are not exposed to mprotect (or any memory mapping API).
PROT_XXX flags are. On riscv, in my current plan, if mprotect or mmap
specifies PROT_WRITE (no PROT_READ),
it'll be mapped to `VM_READ | VM_WRITE` in the vma flags (this is to
make sure we don't break compat with existing user code which has
been using only PROT_WRITE).

If PROT_SHADOWSTACK (the new protection flag) is specified, it'll be
mapped to just `VM_WRITE` in the vma flags.
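
Roughly, the intended mapping would be (illustrative sketch only, not
actual riscv code; PROT_SHADOWSTACK is the flag from the riscv RFC):

	/* sketch: riscv prot -> vma flag translation described above */
	if (prot & PROT_SHADOWSTACK)
		vm_flags = VM_WRITE;		/* shadow stack encoding */
	else if ((prot & PROT_WRITE) && !(prot & PROT_READ))
		vm_flags = VM_READ | VM_WRITE;	/* PROT_WRITE-only compat */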

Yes, I am aware of uffd. I intend to handle it the same way I am
handling fork for riscv shadow stack.
The core-mm write-protect path checks VM_WRITE: if it is specified,
it'll convert the PTE encodings to read-only.
uffd for regular memory should work as-is. If someone were
monitoring shadow stack memory, the following could occur:

1) A write happens on shadow stack memory, a store page fault would occur.
2) A read happens, this would be allowed.
3) A shadow stack load / store happens, a store access fault would occur.

Cases 1 and 3 are reported to the kernel, and it can make sure the uffd
monitor is notified about them.

Was there a specific concern here with respect to uffd and x86 shadow stack?

>
> >
> > All that needs to change is `mmap`; `mprotect` should fail. The
> > syscall
> > approach gives that benefit by default because there is no protection
> > flag
> > for shadow stack.
> >
> > I was giving an example that any feature which gives new meaning to
> > virtual memory
> > has been able to work with the existing memory mapping APIs without
> > the need for a new
> > system call (including when dealing with anonymous memory).
> >
> > >
> > > The question I'm asking though is, not "can mmap code and rules be
> > > changed to enforce the required limitations?". I think it is yes.
> > > But
> > > the question is "why is that plumbing better than a new syscall?".
> > > I
> > > guess to get a better idea, the mmap solution would need to get
> > > POCed.
> > > I had half done this at one point, but abandoned the approach.
> > >
> > > For your question about why limit it, the special x86 case is the
> > > Dirty=1,Write=0 PTE bit combination for shadow stacks. So for
> > > shadow
> > > stack you could have some confusion about whether a PTE is actually
> > > dirty for writeback, etc. I wouldn't say it's known to be
> > > impossible to
> > > do MAP_SHARED, but it has not been fully analyzed enough to know
> > > what
> > > the changes would be. There were some solvable concrete issues that
> > > tipped the scale as well. It was also not expected to be a common
> > > usage, if at all.
> >
> > I am not sure how the confusion around D=1,W=0 is completely taken
> > away by the
> > syscall approach. It'll always be there. One can only do things to
> > minimize
> > the chances.
> >
> > In the case of the syscall approach, the syscall makes sure that
> >
> > `flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G`
> >
> > This can be easily checked in an arch-specific landing function for
> > mmap.
>
> Right, this is why I listed two types of things in the mix here. The
> memory features supported, and what the syscall is. You asked why limit
> the memory features, so that is the explanation.
>
> >
> >
> > Additionally, if you always have the token at the base, you don't
> > need that ABI
> > between user and kernel.
> >
> >
> > >
> > > The non-x86, general reasons for it are a smaller benefit. It
> > > blocks a lot of ways shadow stack memory could be written to. Like
> > > say
> > > you have a memory mapped writable file, and you also map it shadow
> > > stack. So it has better security properties depending on what your
> > > threat model is.
> >
> > I wouldn't say any architecture should allow such primitives. It kind
> > of defeats
> > the purpose of shadow stack. Yes, if some sort of secure memory is
> > needed, there may
> > be new ISA extensions for that.
>
> Yea, seems reasonable to prevent this regardless of the extra x86
> reasons, if that is what you are saying. It depends on people's threat
> models (as always in security).
>
> >
> > >
> > > >
> > > > Eventually syscall will also go ahead and use memory management
> > > > code
> > > > to
> > > > perform mapping. So I didn't understand the reasoning here. The
> > > > way
> > > > syscall
> > > > can limit it to anonymous memory, why mmap can't do the same if
> > > > it
> > > > sees
> > > > PROT_SHADOWSTACK.
> > > >
> > > > >
> > > > > The only benefit I've heard is that it saves creating a new
> > > > > syscall,
> > > > > but it also saves several MAP_ flags. That, and that the RFC
> > > > > for
> > > > > riscv
> > > > > did a PROT_SHADOW_STACK to start. So, yes, two people asked the
> > > > > same
> > > > > question, but I'm still not seeing any benefits. Can you give
> > > > > the
> > > > > pros
> > > > > and cons please?
> > > >
> > > > Again, the same way the syscall will limit it to anonymous memory,
> > > > why can't mmap do the same?
> > > > There is precedent for it (like PROT_MTE being applicable only to
> > > > anonymous memory).
> > > >
> > > > So if it can be done, then why introduce a new syscall?
> > > >
> > > > >
> > > > > BTW, in glibc map_shadow_stack is called from arch code. So I
> > > > > think
> > > > > userspace wise, for this to affect other architectures there
> > > > > would
> > > > > need
> > > > > to be some code that could do things generically, with somehow
> > > > > the
> > > > > shadow stack pivot abstracted but the shadow stack allocation
> > > > > not.
> > > >
> > > > Agreed, yes it can be done in a way where it won't put a tax on
> > > > other
> > > > architectures.
> > > >
> > > > But what about fragmentation within x86? Will x86 always choose
> > > > to
> > > > use the system call
> > > > method to map shadow stacks? If a future re-factor results in x86
> > > > also using the
> > > > `mmap` method,
> > > > isn't it a mess for x86 glibc to figure out what to do; whether
> > > > to
> > > > use the system call
> > > > or `mmap`?
> > > >
> > >
> > > Ok, so this is the downside I guess. What happens if we want to
> > > support
> > > the other types of memory in the future and end up using mmap for
> > > this?
> > > Then we have 15-20 lines of extra syscall wrapping code to maintain
> > > to
> > > support legacy.
> > >
> > > For the mmap solution, we have the downside of using extra MAP_
> > > flags,
> > > and *some* amount of currently unknown vm_flag and address range
> > > logic,
> > > plus mmap arch breakouts to add to core MM. Like I said earlier,
> > > you
> > > would need to POC it out to see how bad that looks and get some
> > > core MM
> > > feedback on the new type of MAP flag usage. But, syscalls being
> > > pretty
> > > straightforward, it would probably be *some* amount of added
> > > complexity
> > > _now_ to support something that might happen in the future. I'm not
> > > seeing either one as a landslide win.
> > >
> > > It's kind of an eternal software design philosophical question,
> > > isn't
> > > it? How much work should you do to prepare for things that might be
> > > needed in the future? From what I've seen the balance in the kernel
> > > seems to be to try not to paint yourself in to an ABI corner, but
> > > otherwise let the kernel evolve naturally in response to real
> > > usages.
> > > If anyone wants to correct this, please do. But otherwise I think
> > > the
> > > new syscall is aligned with that.
> > >
> > > TBH, you are making me wonder if I'm missing something. It seems
> > > you
> > > strongly don't prefer this approach, but I'm not hearing any huge
> > > potential negative impacts. And you also say it won't tax the riscv
> > > implementation. Is it just that something smells bad here? Or that
> > > it would shrink the riscv series?
> >
> > No, you're not missing anything. It's just the weirdness of adding a
> > system call
> > which enforces certain MAP_XX flags and is pretty much a mapping API.
> > And the difference between architectures in how they will create shadow
> > stacks. Plus,
> > if x86 chooses to use `mmap` in the future, then there is ugliness in
> > user mode having to
> > decide which method to choose.
>
> Ok, I think I will leave it given it's entirely in arch/x86. It just
> got some special error codes in the other thread today too.
>
> >
> > And yes you got it right, to some extent there is my own selfishness
> > playing out
> > as well here to reduce riscv patches.
> >
>
> Feel free to join the map_shadow_stack party. :)


I am warming up to it :-)

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory
  2023-02-27 22:29 ` [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
  2023-03-06 13:10   ` Borislav Petkov
@ 2023-03-17 17:05   ` Deepak Gupta
  1 sibling, 0 replies; 159+ messages in thread
From: Deepak Gupta @ 2023-03-17 17:05 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david

On Mon, Feb 27, 2023 at 2:31 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
> diff --git a/mm/gup.c b/mm/gup.c
> index eab18ba045db..e7c7bcc0e268 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -978,7 +978,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>                 return -EFAULT;
>
>         if (write) {
> -               if (!(vm_flags & VM_WRITE)) {
> +               if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {

I think I missed this in the review.
`VM_SHADOW_STACK` is an x86-specific vma flag representing a shadow stack VMA.
Since this is arch-agnostic code, can we instead have an
`is_arch_shadow_stack_vma()` helper which consumes the vma flags and
returns true for a shadow stack VMA? That allows different architectures
to choose whatever vma flag encoding represents a shadow stack.
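
Something like this, for example (hypothetical sketch; the generic
fallback could live in a header and arches could override it):

	/* Generic fallback: matches the current x86-only encoding */
	static inline bool is_arch_shadow_stack_vma(vm_flags_t vm_flags)
	{
		return !!(vm_flags & VM_SHADOW_STACK);
	}

and the check above would become:

	if (!(vm_flags & VM_WRITE) || is_arch_shadow_stack_vma(vm_flags))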


>                         if (!(gup_flags & FOLL_FORCE))
>                                 return -EFAULT;
>                         /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 21/41] mm: Add guard pages around a shadow stack.
  2023-02-27 22:29 ` [PATCH v7 21/41] mm: Add guard pages around a shadow stack Rick Edgecombe
  2023-03-06  8:08   ` Borislav Petkov
@ 2023-03-17 17:09   ` Deepak Gupta
  1 sibling, 0 replies; 159+ messages in thread
From: Deepak Gupta @ 2023-03-17 17:09 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, Yu-cheng Yu

On Mon, Feb 27, 2023 at 2:31 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> The architecture of shadow stack constrains the ability of userspace to
> move the shadow stack pointer (SSP) in order to prevent corrupting or
> switching to other shadow stacks. RSTORSSP can move the SSP to
> different shadow stacks, but it requires a specially placed token in order
> to do this. However, the architecture does not prevent incrementing the
> stack pointer to wander onto an adjacent shadow stack. To prevent this in
> software, enforce guard pages at the beginning of shadow stack vmas, such
> that there will always be a gap between adjacent shadow stacks.
>
> Make the gap big enough so that no userspace SSP-changing operation
> (besides RSTORSSP) can move the SSP from one stack to the next. The
> SSP can be incremented or decremented by CALL, RET, and INCSSP. CALL
> and RET can move the SSP by a maximum of 8 bytes, at which point the
> shadow stack would be accessed.
>
> The INCSSP instruction can also increment the shadow stack pointer. It
> is the shadow stack analog of an instruction like:
>
>         addq    $0x80, %rsp
>
> However, there is one important difference between an ADD on %rsp and
> INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
> of the first and last elements that were "popped". It can be thought of
> as acting like this:
>
> READ_ONCE(ssp);       // read+discard top element on stack
> ssp += nr_to_pop * 8; // move the shadow stack
> READ_ONCE(ssp-8);     // read+discard last popped stack element
>
> The maximum distance INCSSP can move the SSP before it reads memory
> is 2040 bytes. Therefore a single page gap will be enough to prevent
> any operation from shifting the SSP to an adjacent stack, since the
> SSP would have to land in the gap at least once, causing a fault.
>
> This could be accomplished by using VM_GROWSDOWN, but this has a
> downside: the behavior would allow shadow stacks to grow, which is
> unneeded and would be a strange departure from how regular stacks work.
>
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
>
> ---
> v5:
>  - Fix typo in commit log
>
> v4:
>  - Drop references to 32 bit instructions
>  - Switch to generic code to drop __weak (Peterz)
>
> v2:
>  - Use __weak instead of #ifdef (Dave Hansen)
>  - Only have start gap on shadow stack (Andy Luto)
>  - Create stack_guard_start_gap() to not duplicate code
>    in an arch version of vm_start_gap() (Dave Hansen)
>  - Improve commit log partly with verbiage from (Dave Hansen)
>
> Yu-cheng v25:
>  - Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.
> ---
>  include/linux/mm.h | 31 ++++++++++++++++++++++++++-----
>  1 file changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 097544afb1aa..6a093daced88 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3107,15 +3107,36 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
>         return mtree_load(&mm->mm_mt, addr);
>  }
>
> +static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
> +{
> +       if (vma->vm_flags & VM_GROWSDOWN)
> +               return stack_guard_gap;
> +
> +       /*
> +        * Shadow stack pointer is moved by CALL, RET, and INCSSPQ.
> +        * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
> +        * and touches the first and the last element in the range, which
> +        * triggers a page fault if the range is not in a shadow stack.
> +        * Because of this, creating 4-KB guard pages around a shadow
> +        * stack prevents these instructions from going beyond.
> +        *
> +        * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
> +        * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
> +        */
> +       if (vma->vm_flags & VM_SHADOW_STACK)
> +               return PAGE_SIZE;

This is an arch-agnostic header file. Can we remove `VM_SHADOW_STACK`
from here and instead have an `arch_is_shadow_stack()` helper which
consumes the vma flags and returns true or false? This allows different
architectures to choose their own encoding of vma flags to represent a
shadow stack.
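
E.g., a sketch (the helper name is hypothetical; its default
implementation could test the x86 flag so x86 behavior is unchanged):

	if (arch_is_shadow_stack(vma->vm_flags))
		return PAGE_SIZE;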

> +
> +       return 0;
> +}
> +
>  static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
>  {
> +       unsigned long gap = stack_guard_start_gap(vma);
>         unsigned long vm_start = vma->vm_start;
>
> -       if (vma->vm_flags & VM_GROWSDOWN) {
> -               vm_start -= stack_guard_gap;
> -               if (vm_start > vma->vm_start)
> -                       vm_start = 0;
> -       }
> +       vm_start -= gap;
> +       if (vm_start > vma->vm_start)
> +               vm_start = 0;
>         return vm_start;
>  }
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-02-27 22:29 ` [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
  2023-03-06 13:01   ` Borislav Petkov
  2023-03-07 10:42   ` David Hildenbrand
@ 2023-03-17 17:12   ` Deepak Gupta
  2023-03-17 17:16     ` Dave Hansen
  2 siblings, 1 reply; 159+ messages in thread
From: Deepak Gupta @ 2023-03-17 17:12 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, Yu-cheng Yu

On Mon, Feb 27, 2023 at 2:31 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which require some core mm changes to function
> properly.
>
> Account shadow stack pages to stack memory. Do this by adding a
> VM_SHADOW_STACK check in is_stack_mapping().
>
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
>
> ---
> v7:
>  - Change is_stack_mapping() to know about VM_SHADOW_STACK so the
>    additions in vm_stat_account() can be dropped. (David Hildenbrand)
>
> v3:
>  - Remove unneeded VM_SHADOW_STACK check in accountable_mapping()
>    (Kirill)
>
> v2:
>  - Remove is_shadow_stack_mapping() and just change it to directly
>    bitwise-AND VM_SHADOW_STACK.
>
> Yu-cheng v26:
>  - Remove redundant #ifdef CONFIG_MMU.
>
> Yu-cheng v25:
>  - Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().
> ---
>  mm/internal.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 7920a8b7982e..1d13d5580f64 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -491,14 +491,14 @@ static inline bool is_exec_mapping(vm_flags_t flags)
>  }
>
>  /*
> - * Stack area - automatically grows in one direction
> + * Stack area
>   *
> - * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous:
> - * do_mmap() forbids all other combinations.
> + * VM_GROWSUP, VM_GROWSDOWN VMAs are always private
> + * anonymous. do_mmap() forbids all other combinations.
>   */
>  static inline bool is_stack_mapping(vm_flags_t flags)
>  {
> -       return (flags & VM_STACK) == VM_STACK;
> +       return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK);

Same comment here. `VM_SHADOW_STACK` is an x86 specific way of
encoding a shadow stack.
Instead let's have a proxy here which allows architectures to have
their own encodings to represent a shadow stack.
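
For example (a sketch, reusing the hypothetical arch_is_shadow_stack()
proxy from my earlier replies):

	static inline bool is_stack_mapping(vm_flags_t flags)
	{
		return ((flags & VM_STACK) == VM_STACK) ||
		       arch_is_shadow_stack(flags);
	}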

>  }
>
>  /*
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-03-17 17:12   ` Deepak Gupta
@ 2023-03-17 17:16     ` Dave Hansen
  2023-03-17 17:28       ` Deepak Gupta
  0 siblings, 1 reply; 159+ messages in thread
From: Dave Hansen @ 2023-03-17 17:16 UTC (permalink / raw)
  To: Deepak Gupta, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, Yu-cheng Yu

On 3/17/23 10:12, Deepak Gupta wrote:
>>  /*
>> - * Stack area - automatically grows in one direction
>> + * Stack area
>>   *
>> - * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous:
>> - * do_mmap() forbids all other combinations.
>> + * VM_GROWSUP, VM_GROWSDOWN VMAs are always private
>> + * anonymous. do_mmap() forbids all other combinations.
>>   */
>>  static inline bool is_stack_mapping(vm_flags_t flags)
>>  {
>> -       return (flags & VM_STACK) == VM_STACK;
>> +       return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK);
> Same comment here. `VM_SHADOW_STACK` is an x86 specific way of
> encoding a shadow stack.
> Instead let's have a proxy here which allows architectures to have
> their own encodings to represent a shadow stack.

This doesn't _preclude_ another architecture from coming along and doing
that, right?  I'd just prefer that shadow stack architecture #2 comes
along and refactors this in precisely the way _they_ need it.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-03-17 17:16     ` Dave Hansen
@ 2023-03-17 17:28       ` Deepak Gupta
  2023-03-17 17:42         ` Edgecombe, Rick P
  0 siblings, 1 reply; 159+ messages in thread
From: Deepak Gupta @ 2023-03-17 17:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, Yu-cheng Yu

On Fri, Mar 17, 2023 at 10:16 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/17/23 10:12, Deepak Gupta wrote:
> >>  /*
> >> - * Stack area - automatically grows in one direction
> >> + * Stack area
> >>   *
> >> - * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous:
> >> - * do_mmap() forbids all other combinations.
> >> + * VM_GROWSUP, VM_GROWSDOWN VMAs are always private
> >> + * anonymous. do_mmap() forbids all other combinations.
> >>   */
> >>  static inline bool is_stack_mapping(vm_flags_t flags)
> >>  {
> >> -       return (flags & VM_STACK) == VM_STACK;
> >> +       return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK);
> > Same comment here. `VM_SHADOW_STACK` is an x86 specific way of
> > encoding a shadow stack.
> > Instead let's have a proxy here which allows architectures to have
> > their own encodings to represent a shadow stack.
>
> This doesn't _preclude_ another architecture from coming along and doing
> that, right?  I'd just prefer that shadow stack architecture #2 comes
> along and refactors this in precisely the way _they_ need it.

There are two issues here:

 - Encoding of the shadow stack: another arch can choose a different
   encoding, and yes, another architecture can come in and refactor
   this. But a lot of thought and work went into keeping the x86 shadow
   stack implementation from impacting the arch-agnostic parts of the
   kernel, so why let it creep in here?

 - VM_SHADOW_STACK comes out of the VM_HIGH_ARCH_XX bit positions,
   which makes it arch specific.

Even if a refactor takes care of the first issue, the second one still
stands; it's better to keep this out of arch-agnostic code.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-03-17 17:28       ` Deepak Gupta
@ 2023-03-17 17:42         ` Edgecombe, Rick P
  2023-03-17 19:26           ` Deepak Gupta
  0 siblings, 1 reply; 159+ messages in thread
From: Edgecombe, Rick P @ 2023-03-17 17:42 UTC (permalink / raw)
  To: debug, Hansen, Dave
  Cc: david, bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, pavel, Lutomirski, Andy,
	linux-doc, arnd, tglx, Schimpe, Christina, mike.kravetz, x86,
	Yang, Weijiang, jamorris, john.allen, rppt, andrew.cooper3,
	mingo, corbet, linux-kernel, linux-api, gorcunov, akpm

On Fri, 2023-03-17 at 10:28 -0700, Deepak Gupta wrote:
> On Fri, Mar 17, 2023 at 10:16 AM Dave Hansen <dave.hansen@intel.com>
> wrote:
> > 
> > On 3/17/23 10:12, Deepak Gupta wrote:
> > > >   /*
> > > > - * Stack area - automatically grows in one direction
> > > > + * Stack area
> > > >    *
> > > > - * VM_GROWSUP / VM_GROWSDOWN VMAs are always private
> > > > anonymous:
> > > > - * do_mmap() forbids all other combinations.
> > > > + * VM_GROWSUP, VM_GROWSDOWN VMAs are always private
> > > > + * anonymous. do_mmap() forbids all other combinations.
> > > >    */
> > > >   static inline bool is_stack_mapping(vm_flags_t flags)
> > > >   {
> > > > -       return (flags & VM_STACK) == VM_STACK;
> > > > +       return ((flags & VM_STACK) == VM_STACK) || (flags &
> > > > VM_SHADOW_STACK);
> > > 
> > > Same comment here. `VM_SHADOW_STACK` is an x86 specific way of
> > > encoding a shadow stack.
> > > Instead let's have a proxy here which allows architectures to
> > > have
> > > their own encodings to represent a shadow stack.
> > 
> > This doesn't _preclude_ another architecture from coming along and
> > doing
> > that, right?  I'd just prefer that shadow stack architecture #2
> > comes
> > along and refactors this in precisely the way _they_ need it.
> 
> There are two issues here:
>
>  - Encoding of the shadow stack: another arch can choose a different
>    encoding, and yes, another architecture can come in and refactor
>    this. But a lot of thought and work went into keeping the x86
>    shadow stack implementation from impacting the arch-agnostic parts
>    of the kernel, so why let it creep in here?
>
>  - VM_SHADOW_STACK comes out of the VM_HIGH_ARCH_XX bit positions,
>    which makes it arch specific.
> 
> 

VM_SHADOW_STACK is defined like this (trimmed for clarity):
#ifdef CONFIG_X86_USER_SHADOW_STACK
# define VM_SHADOW_STACK	VM_HIGH_ARCH_5
#else
# define VM_SHADOW_STACK	VM_NONE
#endif

Also, we actually had an is_shadow_stack_mapping(vma) in the past, but
it was dropped based on other feedback. I think it is too soon to say
whether other implementations will end up with a similar vma flag, so
adding a helper back now would be premature refactoring. If they don't,
though, a helper like that seems like a reasonable solution.
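
For reference, it looked roughly like this (a sketch reconstructed from
the v2 changelog above, not necessarily the exact code that was dropped):

	static inline bool is_shadow_stack_mapping(vm_flags_t vm_flags)
	{
		return vm_flags & VM_SHADOW_STACK;
	}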



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting
  2023-03-17 17:42         ` Edgecombe, Rick P
@ 2023-03-17 19:26           ` Deepak Gupta
  0 siblings, 0 replies; 159+ messages in thread
From: Deepak Gupta @ 2023-03-17 19:26 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Hansen, Dave, david, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Yu, Yu-cheng, dave.hansen,
	kirill.shutemov, Eranian, Stephane, linux-mm, fweimer,
	nadav.amit, jannh, dethoma, kcc, linux-arch, bp, oleg, hjl.tools,
	pavel, Lutomirski, Andy, linux-doc, arnd, tglx, Schimpe,
	Christina, mike.kravetz, x86, Yang, Weijiang, jamorris,
	john.allen, rppt, andrew.cooper3, mingo, corbet, linux-kernel,
	linux-api, gorcunov, akpm

On Fri, Mar 17, 2023 at 10:42 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2023-03-17 at 10:28 -0700, Deepak Gupta wrote:
> > On Fri, Mar 17, 2023 at 10:16 AM Dave Hansen <dave.hansen@intel.com>
> > wrote:
> > >
> > > On 3/17/23 10:12, Deepak Gupta wrote:
> > > > >   /*
> > > > > - * Stack area - automatically grows in one direction
> > > > > + * Stack area
> > > > >    *
> > > > > - * VM_GROWSUP / VM_GROWSDOWN VMAs are always private
> > > > > anonymous:
> > > > > - * do_mmap() forbids all other combinations.
> > > > > + * VM_GROWSUP, VM_GROWSDOWN VMAs are always private
> > > > > + * anonymous. do_mmap() forbids all other combinations.
> > > > >    */
> > > > >   static inline bool is_stack_mapping(vm_flags_t flags)
> > > > >   {
> > > > > -       return (flags & VM_STACK) == VM_STACK;
> > > > > +       return ((flags & VM_STACK) == VM_STACK) || (flags &
> > > > > VM_SHADOW_STACK);
> > > >
> > > > Same comment here. `VM_SHADOW_STACK` is an x86 specific way of
> > > > encoding a shadow stack.
> > > > Instead let's have a proxy here which allows architectures to
> > > > have
> > > > their own encodings to represent a shadow stack.
> > >
> > > This doesn't _preclude_ another architecture from coming along and
> > > doing
> > > that, right?  I'd just prefer that shadow stack architecture #2
> > > comes
> > > along and refactors this in precisely the way _they_ need it.
> >
> > There are two issues here:
> >
> >  - Encoding of the shadow stack: another arch can choose a different
> >    encoding, and yes, another architecture can come in and refactor
> >    this. But a lot of thought and work went into keeping the x86
> >    shadow stack implementation from impacting the arch-agnostic
> >    parts of the kernel, so why let it creep in here?
> >
> >  - VM_SHADOW_STACK comes out of the VM_HIGH_ARCH_XX bit positions,
> >    which makes it arch specific.
> >
> >
>
> VM_SHADOW_STACK is defined like this (trimmed for clarity):
> #ifdef CONFIG_X86_USER_SHADOW_STACK
> # define VM_SHADOW_STACK        VM_HIGH_ARCH_5
> #else
> # define VM_SHADOW_STACK        VM_NONE
> #endif

Ok.

>
> Also, we actually had an is_shadow_stack_mapping(vma) in the past, but
> it was dropped based on other feedback.

Looks like I was late to the party. IMHO, that was the right approach.

>
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall
  2023-03-16 19:30         ` Deepak Gupta
@ 2023-03-20 11:35           ` Szabolcs Nagy
  0 siblings, 0 replies; 159+ messages in thread
From: Szabolcs Nagy @ 2023-03-20 11:35 UTC (permalink / raw)
  To: Deepak Gupta, Mike Rapoport
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, nd, al.grant

The 03/16/2023 12:30, Deepak Gupta wrote:
> On Tue, Mar 14, 2023 at 12:19 AM Mike Rapoport <rppt@kernel.org> wrote:
> > As for userspace convenience, special code is required anyway for
> > creating the shadow stack, and it wouldn't matter whether that code
> > used mmap(NEW_FLAG) or map_shadow_stack().
> 
> Yes, *strictly* from a userspace-convenience standpoint, it doesn't matter which option.

Everybody seems to assume that the new syscall only matters for the
code allocating the shadow stack.

There are tools like strace, seccomp, etc. that need to learn about the
new syscall, and anything that is built on top of them as well. Libc
API interposers like AddressSanitizer also need to learn about the
related new libc APIs (if there are any, which will be another long
debate on the userspace side, delaying the usability of shadow stacks
even more). Such tools already know about mmap and can often handle new
flags without much change.

I agree that too much special logic in mmap is not ideal, and that
using an mmap flag limits future extensions of both the mmap and shadow
stack mapping functionality. But I disagree that a new syscall is
generally easy for userspace to deal with. In this case the cost seems
acceptable to me, but it is not free at all.
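
For context, the allocation call under discussion looks roughly like
this from userspace (a sketch: the SHADOW_STACK_SET_TOKEN flag is from
this series, while the syscall number shown is the one mainline
eventually assigned, so treat it as an assumption for this v7 posting):

	#include <stddef.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef __NR_map_shadow_stack
	#define __NR_map_shadow_stack 453	/* assumed; check your headers */
	#endif
	#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)	/* write a restore token */

	/*
	 * Ask the kernel for a shadow stack mapping; the kernel picks
	 * the placement and writes a restore token near the top.
	 */
	static void *alloc_shstk(size_t size)
	{
		return (void *)syscall(__NR_map_shadow_stack, 0, size,
				       SHADOW_STACK_SET_TOKEN);
	}

Every one of the tools above has to learn that this call creates a
mapping, whereas a new mmap flag would mostly come along for free.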


^ permalink raw reply	[flat|nested] 159+ messages in thread

end of thread, other threads:[~2023-03-20 11:36 UTC | newest]

Thread overview: 159+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-27 22:29 [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description Rick Edgecombe
2023-03-01 14:21   ` Szabolcs Nagy
2023-03-01 14:38     ` Szabolcs Nagy
2023-03-01 18:07     ` Edgecombe, Rick P
2023-03-01 18:32       ` Edgecombe, Rick P
2023-03-02 16:34         ` szabolcs.nagy
2023-03-03 22:35           ` Edgecombe, Rick P
2023-03-06 16:20             ` szabolcs.nagy
2023-03-06 16:31               ` Florian Weimer
2023-03-06 18:08                 ` Edgecombe, Rick P
2023-03-07 13:03                   ` szabolcs.nagy
2023-03-07 14:00                     ` Florian Weimer
2023-03-07 16:14                       ` Szabolcs Nagy
2023-03-06 18:05               ` Edgecombe, Rick P
2023-03-06 20:31                 ` Liang, Kan
2023-03-02 16:14       ` szabolcs.nagy
2023-03-02 21:17         ` Edgecombe, Rick P
2023-03-03 16:30           ` szabolcs.nagy
2023-03-03 16:57             ` H.J. Lu
2023-03-03 17:39               ` szabolcs.nagy
2023-03-03 17:50                 ` H.J. Lu
2023-03-03 17:41             ` Edgecombe, Rick P
2023-02-27 22:29 ` [PATCH v7 02/41] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 03/41] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 04/41] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 05/41] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 06/41] x86/fpu: Add helper for modifying xstate Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 07/41] x86: Move control protection handler to separate file Rick Edgecombe
2023-03-01 15:38   ` Borislav Petkov
2023-02-27 22:29 ` [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler Rick Edgecombe
2023-03-01 18:06   ` Borislav Petkov
2023-03-01 18:14     ` Edgecombe, Rick P
2023-03-01 18:37       ` Borislav Petkov
2023-02-27 22:29 ` [PATCH v7 09/41] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 10/41] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 11/41] mm: Introduce pte_mkwrite_kernel() Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 12/41] s390/mm: Introduce pmd_mkwrite_kernel() Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 13/41] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
2023-03-01  7:03   ` Christophe Leroy
2023-03-01  8:16     ` David Hildenbrand
2023-03-02 12:19   ` Borislav Petkov
2023-02-27 22:29 ` [PATCH v7 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY Rick Edgecombe
2023-03-02 12:48   ` Borislav Petkov
2023-03-02 17:01     ` Edgecombe, Rick P
2023-02-27 22:29 ` [PATCH v7 15/41] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 16/41] x86/mm: Start actually marking _PAGE_SAVED_DIRTY Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 17/41] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 18/41] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 19/41] x86/mm: Check shadow stack page fault errors Rick Edgecombe
2023-03-03 14:00   ` Borislav Petkov
2023-03-03 14:39     ` Dave Hansen
2023-02-27 22:29 ` [PATCH v7 20/41] x86/mm: Teach pte_mkwrite() about stack memory Rick Edgecombe
2023-03-03 15:37   ` Borislav Petkov
2023-02-27 22:29 ` [PATCH v7 21/41] mm: Add guard pages around a shadow stack Rick Edgecombe
2023-03-06  8:08   ` Borislav Petkov
2023-03-07  1:29     ` Edgecombe, Rick P
2023-03-07 10:32       ` Borislav Petkov
2023-03-07 10:44         ` David Hildenbrand
2023-03-08 22:48           ` Edgecombe, Rick P
2023-03-17 17:09   ` Deepak Gupta
2023-02-27 22:29 ` [PATCH v7 22/41] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
2023-03-06 13:01   ` Borislav Petkov
2023-03-06 18:11     ` Edgecombe, Rick P
2023-03-06 18:16       ` Borislav Petkov
2023-03-07 10:42   ` David Hildenbrand
2023-03-17 17:12   ` Deepak Gupta
2023-03-17 17:16     ` Dave Hansen
2023-03-17 17:28       ` Deepak Gupta
2023-03-17 17:42         ` Edgecombe, Rick P
2023-03-17 19:26           ` Deepak Gupta
2023-02-27 22:29 ` [PATCH v7 23/41] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 24/41] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
2023-03-06 13:10   ` Borislav Petkov
2023-03-06 18:15     ` Andy Lutomirski
2023-03-06 18:33       ` Edgecombe, Rick P
2023-03-06 18:57         ` Andy Lutomirski
2023-03-07  1:47           ` Edgecombe, Rick P
2023-03-17 17:05   ` Deepak Gupta
2023-02-27 22:29 ` [PATCH v7 25/41] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
2023-03-06 18:09   ` Borislav Petkov
2023-03-07  1:10     ` Edgecombe, Rick P
2023-02-27 22:29 ` [PATCH v7 26/41] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
2023-03-08  8:53   ` Borislav Petkov
2023-03-08 23:36     ` Edgecombe, Rick P
2023-02-27 22:29 ` [PATCH v7 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot Rick Edgecombe
2023-02-27 22:54   ` Kees Cook
2023-03-08  9:23   ` Borislav Petkov
2023-03-08 23:35     ` Edgecombe, Rick P
2023-02-27 22:29 ` [PATCH v7 28/41] x86: Introduce userspace API for shadow stack Rick Edgecombe
2023-03-08 10:27   ` Borislav Petkov
2023-03-08 23:32     ` Edgecombe, Rick P
2023-03-09 12:57       ` Borislav Petkov
2023-03-09 16:56         ` Edgecombe, Rick P
2023-03-09 23:51           ` Borislav Petkov
2023-03-10  1:13             ` Edgecombe, Rick P
2023-03-10  2:03               ` H.J. Lu
2023-03-10 20:00                 ` H.J. Lu
2023-03-10 20:27                   ` Edgecombe, Rick P
2023-03-10 20:43                     ` H.J. Lu
2023-03-10 21:01                       ` Edgecombe, Rick P
2023-03-10 11:40               ` Borislav Petkov
2023-02-27 22:29 ` [PATCH v7 29/41] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 30/41] x86/shstk: Handle thread shadow stack Rick Edgecombe
2023-03-02 17:34   ` Szabolcs Nagy
2023-03-02 21:48     ` Edgecombe, Rick P
2023-03-08 15:26   ` Borislav Petkov
2023-03-08 20:03     ` Edgecombe, Rick P
2023-03-09 14:12       ` Borislav Petkov
2023-03-09 16:59         ` Edgecombe, Rick P
2023-03-09 17:04           ` Borislav Petkov
2023-03-09 20:29             ` Edgecombe, Rick P
2023-02-27 22:29 ` [PATCH v7 31/41] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
2023-03-09 16:48   ` Borislav Petkov
2023-03-09 17:03     ` Edgecombe, Rick P
2023-03-09 17:22       ` Borislav Petkov
2023-02-27 22:29 ` [PATCH v7 32/41] x86/shstk: Handle signals for shadow stack Rick Edgecombe
2023-03-09 17:02   ` Borislav Petkov
2023-03-09 17:16     ` Edgecombe, Rick P
2023-03-09 23:35       ` Borislav Petkov
2023-02-27 22:29 ` [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
2023-03-02 17:22   ` Szabolcs Nagy
2023-03-02 21:21     ` Edgecombe, Rick P
2023-03-09 18:55     ` Deepak Gupta
2023-03-09 19:39       ` Edgecombe, Rick P
2023-03-09 21:08         ` Deepak Gupta
2023-03-10  0:14           ` Edgecombe, Rick P
2023-03-10 21:00             ` Deepak Gupta
2023-03-10 21:43               ` Edgecombe, Rick P
2023-03-16 20:07                 ` Deepak Gupta
2023-03-14  7:19       ` Mike Rapoport
2023-03-16 19:30         ` Deepak Gupta
2023-03-20 11:35           ` Szabolcs Nagy
2023-03-10 16:11   ` Borislav Petkov
2023-03-10 17:12     ` Edgecombe, Rick P
2023-03-10 20:05       ` Borislav Petkov
2023-03-10 20:19         ` Edgecombe, Rick P
2023-03-10 20:26           ` Borislav Petkov
2023-02-27 22:29 ` [PATCH v7 34/41] x86/shstk: Support WRSS for userspace Rick Edgecombe
2023-03-10 16:44   ` Borislav Petkov
2023-03-10 17:16     ` Edgecombe, Rick P
2023-02-27 22:29 ` [PATCH v7 35/41] x86: Expose thread features in /proc/$PID/status Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 36/41] x86/shstk: Wire in shadow stack interface Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 37/41] selftests/x86: Add shadow stack test Rick Edgecombe
2023-02-27 22:29 ` [PATCH v7 38/41] x86/fpu: Add helper for initing features Rick Edgecombe
2023-03-11 12:54   ` Borislav Petkov
2023-03-13  2:45     ` Edgecombe, Rick P
2023-03-13 11:03       ` Borislav Petkov
2023-03-13 16:10         ` Edgecombe, Rick P
2023-03-13 17:10           ` Borislav Petkov
2023-03-13 23:31             ` Edgecombe, Rick P
2023-02-27 22:29 ` [PATCH v7 39/41] x86: Add PTRACE interface for shadow stack Rick Edgecombe
2023-03-11 15:06   ` Borislav Petkov
2023-03-13  2:53     ` Edgecombe, Rick P
2023-02-27 22:29 ` [PATCH v7 40/41] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
2023-03-11 15:11   ` Borislav Petkov
2023-03-13  3:04     ` Edgecombe, Rick P
2023-03-13 11:05       ` Borislav Petkov
2023-02-27 22:29 ` [PATCH v7 41/41] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe

This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for NNTP
newsgroup(s).