linux-kernel.vger.kernel.org archive mirror
* [PATCH v9 00/42] Shadow stacks for userspace
@ 2023-06-13  0:10 Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Rick Edgecombe
                   ` (43 more replies)
  0 siblings, 44 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe

Hi,

This series implements Shadow Stacks for userspace using x86's Control-flow 
Enforcement Technology (CET). CET consists of two related security
features: shadow stacks and indirect branch tracking. This series
implements just the shadow stack part of this feature, and just for
userspace.

The main use case for shadow stack is providing protection against return 
oriented programming attacks. It works by maintaining a secondary (shadow) 
stack using a special memory type that has protections against 
modification. When executing a CALL instruction, the processor pushes the 
return address to both the normal stack and to the special permission 
shadow stack. Upon RET, the processor pops the shadow stack copy and 
compares it to the normal stack copy. For more details, see the
cover letter from v1 [0].
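As a mental model, the CALL/RET checks can be sketched in plain C. This is purely illustrative: sim_call()/sim_ret() are made-up names, and on real hardware the check happens in the CPU, which raises a #CP fault on mismatch rather than returning a sentinel.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH 64

struct sim_cpu {
	unsigned long stack[DEPTH];	/* normal stack (attacker-writable)   */
	unsigned long shadow[DEPTH];	/* shadow stack (protected memory)    */
	size_t sp, ssp;
};

static void sim_call(struct sim_cpu *c, unsigned long ret_addr)
{
	c->stack[c->sp++] = ret_addr;	/* push to the normal stack */
	c->shadow[c->ssp++] = ret_addr;	/* and to the shadow stack  */
}

/* Returns the return address, or 0 to model a #CP fault on mismatch. */
static unsigned long sim_ret(struct sim_cpu *c)
{
	unsigned long addr = c->stack[--c->sp];
	unsigned long shstk_addr = c->shadow[--c->ssp];

	return addr == shstk_addr ? addr : 0;
}
```

A ROP attack that overwrites a return address on the normal stack no longer matches the shadow copy, so the "fault" path is taken.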

Shadow Stack was rejected by Linus for 6.4 [1][2]. This is a new version 
that addresses his concerns. In the months since the series was queued in 
tip, some non-critical things also turned up that I was planning to do as
fast follow-ups. Since we are doing a re-spin anyway, I thought to just
include them in the initial series (see 3 and 4). Also, the whole
series has been re-ordered, after some comments from Linus prompted some 
reflection.

Most of the patches are the same, so I’ll specifically list the patches 
with changes to help focus review.

1. Redo of the pte_mkwrite() refactoring patches
------------------------------------------------
The point of these patches was to make pte_mkwrite() take a VMA. 
Unfortunately the original version of this refactor had a bug. Linus 
suggested an alternate way of doing the refactor that would be less error 
prone. It sounds like this will be used by the riscv and possibly arm 
shadow stack features. It would be great to collect some Reviewed-by tags
on these from anyone else who would depend on them.

Changed (and renamed) patches:
	mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
	mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
	mm: Make pte_mkwrite() take a VMA

2. SavedDirty overhaul
----------------------
The Shadow Stack PTEs are defined by the HW as Write=0,Dirty=1. Since 
Linux usually creates PTEs like that when it write-protects dirty memory, 
the series introduced a SavedDirty software bit. When a PTE gets
write-protected, the Dirty bit shifts to the SavedDirty bit so the PTE
won't look like shadow stack. When it is made writable again, the
opposite happens.

In the previously queued version, the SavedDirty bit was only used when 
shadow stack was configured and available on the CPU. But this created two 
versions of the Dirty bit behavior on x86, adding complexity. Linus 
objected to this, and also the use of conditional control flow logic 
instead of bit math. After some trial and error, I ended up with something 
that tries to incorporate the feedback, but with some adjustments on the 
specifics. I would like some feedback on the maintainability tradeoffs 
taken and some scrutiny on the correctness as well.


First of all, switching to bit math for the SavedDirty dance really seems
to be a big improvement. The conditional part is now branchless and
isolated to two functions. Descriptions of the other changes follow; they
are a little less of a clear win to me.
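The branchless shift can be sketched like this. The bit positions and exact expressions here are illustrative stand-ins, not the real definitions; the actual helpers live in the _PAGE_SAVED_DIRTY patches in this series.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions; the real ones are in
 * arch/x86/include/asm/pgtable_types.h. */
#define BIT_RW		1
#define BIT_DIRTY	6
#define BIT_SAVED_DIRTY	11	/* software bit */

/* Branchless: when Write=0, shift Dirty into SavedDirty. */
static uint64_t mksaveddirty_shift(uint64_t v)
{
	uint64_t cond = (~v >> BIT_RW) & 1;	/* 1 iff Write=0 */

	v |= ((v >> BIT_DIRTY) & cond) << BIT_SAVED_DIRTY;
	v &= ~(cond << BIT_DIRTY);
	return v;
}

/* Inverse: when Write=1, shift SavedDirty back into Dirty. */
static uint64_t clear_saveddirty_shift(uint64_t v)
{
	uint64_t cond = (v >> BIT_RW) & 1;	/* 1 iff Write=1 */

	v |= ((v >> BIT_SAVED_DIRTY) & cond) << BIT_DIRTY;
	v &= ~(cond << BIT_SAVED_DIRTY);
	return v;
}
```

The `cond` mask makes the whole thing straight-line bit math: when the condition is false, both the OR and the AND-NOT are no-ops.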


One thing that came up in this discussion was the performance impact of 
the CMPXCHG loop added to ptep_set_wrprotect() in order to atomically do 
the shift of Dirty to SavedDirty when a live PTE is being write-protected. 
The concern was that this could impact the performance of fork() where it 
is used to write-protect memory in the parent MM.

Linus had suggested optimizing the single-threaded fork case to offset
the concerns of a performance impact. However, trying to stress this case,
I was unable to concoct a microbenchmark to show any slowdown of the LOCK 
CMPXCHG loop vs the original LOCK AND. Hypothetically the CMPXCHG could 
scale worse, but since I couldn’t actually entice it to show any slowdown, 
I was thinking that the original worries might have been misplaced and we 
could get away with the unconditional CMPXCHG loop.
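The shape of that loop, sketched with C11 atomics rather than the kernel's actual cmpxchg helpers (names and bit positions are illustrative, reused from the sketch conventions above):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define BIT_RW		1
#define BIT_DIRTY	6
#define BIT_SAVED_DIRTY	11

/* Atomically clear Write and shift Dirty->SavedDirty, retrying if the
 * PTE changes underneath us (e.g. HW setting Dirty on a live PTE). */
static void wrprotect_atomic(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_load(pte);
	uint64_t new;

	do {
		new = old & ~(1ULL << BIT_RW);		/* clear Write    */
		new |= ((new >> BIT_DIRTY) & 1) << BIT_SAVED_DIRTY;
		new &= ~(1ULL << BIT_DIRTY);		/* Dirty -> Saved */
	} while (!atomic_compare_exchange_weak(pte, &old, new));
}
```

The retry loop is what distinguishes this from the original single LOCK AND: the whole read-modify-write must appear atomic so a concurrent hardware Dirty-bit set is never lost.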

As Dave previously mentioned, the other wrinkle in all of this is that on
CPUs that don't support shadow stack, a CPU can rarely set Dirty=1 on a
PTE with Write=0. So the kernel logic needs to be robust to
Write=0,Dirty=1 PTEs in non-shadow stack cases on these CPUs, and should
also be able to handle Write=0,Dirty=1,SavedDirty=1 PTEs. Similarly, a
KNL (Xeon Phi platform) erratum can result in Dirty=1 bits getting set on
Present=0 PTEs.

In order to make the core-MM logic work correctly with shadow stack,
pte_write() needs to also return true for Write=0,Dirty=1 memory. Since
those older platforms can create Write=0,Dirty=1 PTEs despite the kernel's
efforts to not create any itself, the kernel can't have this shadow stack
pte_write() logic when running on them. So the kernel can only check for
shadow stack memory in pte_write() when shadow stack is supported by the
CPU. Also, the kernel needs to make sure not to trigger any warnings when
it sees a Write=0,Dirty=1 PTE in an unexpected place, so those warnings
are also only enabled when shadow stack is supported on the CPU. These
checks are isolated to a single place in pte_shstk().
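A rough sketch of the resulting gating (cpu_has_shstk is a stand-in for the real CPU feature check, and the bit positions are again illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT_RW		1
#define BIT_DIRTY	6

static bool cpu_has_shstk;	/* stand-in for the CPU feature check */

/* Write=0,Dirty=1 is the shadow stack encoding, but it is only trusted
 * on CPUs with shadow stack support, since older CPUs can set that
 * combination erroneously. */
static bool pte_shstk(uint64_t pte)
{
	if (!cpu_has_shstk)
		return false;
	return !(pte & (1ULL << BIT_RW)) && (pte & (1ULL << BIT_DIRTY));
}

static bool pte_write(uint64_t pte)
{
	return (pte & (1ULL << BIT_RW)) || pte_shstk(pte);
}
```

On a pre-shadow stack CPU, a stray Write=0,Dirty=1 PTE is simply treated as read-only; on a shadow stack CPU, the same encoding reads as writable shadow stack memory.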

So in the end, we can shift the Dirty bit around unconditionally, but the
kernel's behavior around the Dirty bit still needs to adjust depending on
whether the CPU actually supports shadow stack.

Refactoring the SavedDirty<->Dirty setting logic into bit math made it a
lot easier to compile the SavedDirty bit out if needed. Basically it can
be removed with only two checks in
mksaveddirty_shift()/clear_saveddirty_shift() (see “x86/mm: Introduce
_PAGE_SAVED_DIRTY”).

That all makes me wonder if it would still be better to disable SavedDirty
when shadow stack is not supported. One aspect of making SavedDirty
unconditional was to unify the sets of rules to reason about. But due to
the behavior of the pre-shadow stack CPUs, this also comes at the cost of
more runtime behaviors to worry about. The other point brought up was the
increased testing from having SavedDirty used more widely. But since we
have to turn off the warnings and other logic, this testing isn't fully
happening on these older platforms either. So I'm not sure if the
unconditional SavedDirty is really a win or not. I think it is, slightly.

So in this version SavedDirty is turned on universally for x86. Even for 
32 bit, which, while seeming a bit silly, allows there to be only one 
version of the ptep_set_wrprotect() logic.

Changed patches:
	x86/mm: Start actually marking _PAGE_SAVED_DIRTY
	x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
	x86/mm: Introduce _PAGE_SAVED_DIRTY
	mm: Warn on shadow stack memory in wrong vma
	x86/mm: Warn if create Write=0,Dirty=1 with raw prot

3. Shadow stack protections enhancements
----------------------------------------
The shadow stack signal frame format uses a special shadow stack frame 
pattern that should not occur naturally in order to avoid forgery on 
sigreturn. Two patches are added to strengthen the forgery checks. These 
could have been squashed into the signal patch, but I thought leaving them 
separate might make review easier for those familiar with the last series. 
I was waffling on whether to postpone them to minimize changes from the
previously queued version to v9. In the end, since the series was already
getting re-spun, I thought the extra protections were worth starting
with.

The new mmap maple tree code needs to be taught specifically about 
VM_SHADOW_STACK, instead of just relying on vm_start_gap() like the old RB 
stuff, so that is added as well to retain the start guard gap with maple
tree.

Added/changes patches:
	x86/shstk: Check that SSP is aligned on sigreturn
	x86/shstk: Check that signal frame is shadow stack mem
	mm: Add guard pages around a shadow stack

4. Selftest enhancements
------------------------
A few miscellaneous selftest enhancements that accumulated since the old 
series landed in tip. Added a test for the shadow stack guard gap and the 
shadow stack ptrace interface. Also fixed a race that caused the uffd test 
to sometimes hang.

Changed patches:
	selftests/x86: Add shadow stack test

Since some of the changes were extensive in the modified patches, I 
dropped some review tags. But I left testing tags, testers please retest.

Thanks,

Rick

[0] https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/lkml/CAHk-=wiuVXTfgapmjYQvrEDzn3naF2oYnHuky+feEJSj_G_yFQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAHk-=wiB0wy6oXOsPtYU4DSbqJAY8z5iNBKdjdOp2LP23khUoA@mail.gmail.com/

Mike Rapoport (1):
  x86/shstk: Add ARCH_SHSTK_UNLOCK

Rick Edgecombe (38):
  mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
  mm: Make pte_mkwrite() take a VMA
  x86/shstk: Add Kconfig option for shadow stack
  x86/traps: Move control protection handler to separate file
  x86/cpufeatures: Add CPU feature flags for shadow stacks
  x86/mm: Move pmd_write(), pud_write() up in the file
  x86/mm: Introduce _PAGE_SAVED_DIRTY
  x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
  x86/mm: Start actually marking _PAGE_SAVED_DIRTY
  x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  x86/mm: Check shadow stack page fault errors
  mm: Add guard pages around a shadow stack.
  mm: Warn on shadow stack memory in wrong vma
  x86/mm: Warn if create Write=0,Dirty=1 with raw prot
  mm/mmap: Add shadow stack pages to memory accounting
  x86/mm: Introduce MAP_ABOVE4G
  x86/mm: Teach pte_mkwrite() about stack memory
  mm: Don't allow write GUPs to shadow stack memory
  Documentation/x86: Add CET shadow stack description
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86/fpu: Add helper for modifying xstate
  x86: Introduce userspace API for shadow stack
  x86/shstk: Add user control-protection fault handler
  x86/shstk: Add user-mode shadow stack support
  x86/shstk: Handle thread shadow stack
  x86/shstk: Introduce routines modifying shstk
  x86/shstk: Handle signals for shadow stack
  x86/shstk: Check that SSP is aligned on sigreturn
  x86/shstk: Check that signal frame is shadow stack mem
  x86/shstk: Introduce map_shadow_stack syscall
  x86/shstk: Support WRSS for userspace
  x86: Expose thread features in /proc/$PID/status
  x86/shstk: Wire in shadow stack interface
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  selftests/x86: Add shadow stack test
  x86: Add PTRACE interface for shadow stack
  x86/shstk: Add ARCH_SHSTK_STATUS

Yu-cheng Yu (3):
  mm: Re-introduce vm_flags to do_mmap()
  mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  mm: Introduce VM_SHADOW_STACK for shadow stack memory

 Documentation/arch/x86/index.rst              |   1 +
 Documentation/arch/x86/shstk.rst              | 179 ++++
 Documentation/filesystems/proc.rst            |   1 +
 Documentation/mm/arch_pgtable_helpers.rst     |  12 +-
 arch/Kconfig                                  |   3 +
 arch/alpha/include/asm/pgtable.h              |   2 +-
 arch/arc/include/asm/hugepage.h               |   2 +-
 arch/arc/include/asm/pgtable-bits-arcv2.h     |   2 +-
 arch/arm/include/asm/pgtable-3level.h         |   2 +-
 arch/arm/include/asm/pgtable.h                |   2 +-
 arch/arm/kernel/signal.c                      |   2 +-
 arch/arm64/include/asm/pgtable.h              |   4 +-
 arch/arm64/kernel/signal.c                    |   2 +-
 arch/arm64/kernel/signal32.c                  |   2 +-
 arch/arm64/mm/trans_pgd.c                     |   4 +-
 arch/csky/include/asm/pgtable.h               |   2 +-
 arch/hexagon/include/asm/pgtable.h            |   2 +-
 arch/ia64/include/asm/pgtable.h               |   2 +-
 arch/loongarch/include/asm/pgtable.h          |   4 +-
 arch/m68k/include/asm/mcf_pgtable.h           |   2 +-
 arch/m68k/include/asm/motorola_pgtable.h      |   2 +-
 arch/m68k/include/asm/sun3_pgtable.h          |   2 +-
 arch/microblaze/include/asm/pgtable.h         |   2 +-
 arch/mips/include/asm/pgtable.h               |   6 +-
 arch/nios2/include/asm/pgtable.h              |   2 +-
 arch/openrisc/include/asm/pgtable.h           |   2 +-
 arch/parisc/include/asm/pgtable.h             |   2 +-
 arch/powerpc/include/asm/book3s/32/pgtable.h  |   2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   4 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h  |   4 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  |   4 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h  |   2 +-
 arch/riscv/include/asm/pgtable.h              |   6 +-
 arch/s390/include/asm/hugetlb.h               |   2 +-
 arch/s390/include/asm/pgtable.h               |   4 +-
 arch/s390/mm/pageattr.c                       |   4 +-
 arch/sh/include/asm/pgtable_32.h              |   4 +-
 arch/sparc/include/asm/pgtable_32.h           |   2 +-
 arch/sparc/include/asm/pgtable_64.h           |   6 +-
 arch/sparc/kernel/signal32.c                  |   2 +-
 arch/sparc/kernel/signal_64.c                 |   2 +-
 arch/um/include/asm/pgtable.h                 |   2 +-
 arch/x86/Kconfig                              |  24 +
 arch/x86/Kconfig.assembler                    |   5 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/include/asm/disabled-features.h      |  16 +-
 arch/x86/include/asm/fpu/api.h                |   9 +
 arch/x86/include/asm/fpu/regset.h             |   7 +-
 arch/x86/include/asm/fpu/sched.h              |   3 +-
 arch/x86/include/asm/fpu/types.h              |  16 +-
 arch/x86/include/asm/fpu/xstate.h             |   6 +-
 arch/x86/include/asm/idtentry.h               |   2 +-
 arch/x86/include/asm/mmu_context.h            |   2 +
 arch/x86/include/asm/pgtable.h                | 302 +++++-
 arch/x86/include/asm/pgtable_types.h          |  46 +-
 arch/x86/include/asm/processor.h              |   8 +
 arch/x86/include/asm/shstk.h                  |  38 +
 arch/x86/include/asm/special_insns.h          |  13 +
 arch/x86/include/asm/tlbflush.h               |   3 +-
 arch/x86/include/asm/trap_pf.h                |   2 +
 arch/x86/include/asm/traps.h                  |  12 +
 arch/x86/include/uapi/asm/mman.h              |   4 +
 arch/x86/include/uapi/asm/prctl.h             |  12 +
 arch/x86/kernel/Makefile                      |   4 +
 arch/x86/kernel/cet.c                         | 152 +++
 arch/x86/kernel/cpu/common.c                  |  35 +-
 arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
 arch/x86/kernel/cpu/proc.c                    |  23 +
 arch/x86/kernel/fpu/core.c                    |  54 +-
 arch/x86/kernel/fpu/regset.c                  |  81 ++
 arch/x86/kernel/fpu/xstate.c                  |  90 +-
 arch/x86/kernel/idt.c                         |   2 +-
 arch/x86/kernel/process.c                     |  21 +-
 arch/x86/kernel/process_64.c                  |   8 +
 arch/x86/kernel/ptrace.c                      |  12 +
 arch/x86/kernel/shstk.c                       | 529 +++++++++++
 arch/x86/kernel/signal.c                      |   1 +
 arch/x86/kernel/signal_32.c                   |   2 +-
 arch/x86/kernel/signal_64.c                   |   8 +-
 arch/x86/kernel/sys_x86_64.c                  |   6 +-
 arch/x86/kernel/traps.c                       |  87 --
 arch/x86/mm/fault.c                           |  22 +
 arch/x86/mm/pat/set_memory.c                  |   4 +-
 arch/x86/mm/pgtable.c                         |  40 +
 arch/x86/xen/enlighten_pv.c                   |   2 +-
 arch/x86/xen/mmu_pv.c                         |   2 +-
 arch/x86/xen/xen-asm.S                        |   2 +-
 arch/xtensa/include/asm/pgtable.h             |   2 +-
 fs/aio.c                                      |   2 +-
 fs/proc/array.c                               |   6 +
 fs/proc/task_mmu.c                            |   3 +
 include/asm-generic/hugetlb.h                 |   2 +-
 include/linux/mm.h                            |  67 +-
 include/linux/mman.h                          |   4 +
 include/linux/pgtable.h                       |  28 +
 include/linux/proc_fs.h                       |   2 +
 include/linux/syscalls.h                      |   1 +
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/asm-generic/unistd.h             |   2 +-
 include/uapi/linux/elf.h                      |   2 +
 ipc/shm.c                                     |   2 +-
 kernel/sys_ni.c                               |   1 +
 mm/debug_vm_pgtable.c                         |  12 +-
 mm/gup.c                                      |   2 +-
 mm/huge_memory.c                              |  11 +-
 mm/internal.h                                 |   4 +-
 mm/memory.c                                   |   5 +-
 mm/migrate.c                                  |   2 +-
 mm/migrate_device.c                           |   2 +-
 mm/mmap.c                                     |  14 +-
 mm/mprotect.c                                 |   2 +-
 mm/nommu.c                                    |   4 +-
 mm/userfaultfd.c                              |   2 +-
 mm/util.c                                     |   2 +-
 tools/testing/selftests/x86/Makefile          |   2 +-
 .../testing/selftests/x86/test_shadow_stack.c | 884 ++++++++++++++++++
 117 files changed, 2789 insertions(+), 307 deletions(-)
 create mode 100644 Documentation/arch/x86/shstk.rst
 create mode 100644 arch/x86/include/asm/shstk.h
 create mode 100644 arch/x86/kernel/cet.c
 create mode 100644 arch/x86/kernel/shstk.c
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

-- 
2.34.1



* [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  7:19   ` Geert Uytterhoeven
                     ` (4 more replies)
  2023-06-13  0:10 ` [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma() Rick Edgecombe
                   ` (42 subsequent siblings)
  43 siblings, 5 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, linux-ia64, loongarch, linux-m68k,
	Michal Simek, Dinh Nguyen, linux-mips, openrisc, linux-parisc,
	linuxppc-dev, linux-riscv, linux-s390, linux-sh, sparclinux,
	linux-um, Linus Torvalds

The x86 Shadow stack feature includes a new type of memory called shadow
stack. This shadow stack memory has some unusual properties, which require
some core mm changes to function properly.

One of these unusual properties is that shadow stack memory is writable,
but only in limited ways. These limits are applied via a specific PTE
bit combination. Nevertheless, the memory is writable, and core mm code
will need to apply the writable permissions in the typical paths that
call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
that the x86 implementation of it can know whether to create regular
writable memory or shadow stack memory.

But there are a couple of challenges to this. Modifying the signatures of
each arch pte_mkwrite() implementation would be error prone because some
are generated with macros and would need to be re-implemented. Also, some
pte_mkwrite() callers operate on kernel memory without a VMA.

So this can be done in a three step process. First pte_mkwrite() can be
renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
added that just calls pte_mkwrite_novma(). Next callers without a VMA can
be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
can be changed to take/pass a VMA.

Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and
adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same
pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(),
create a new arch config HAS_HUGE_PAGE that can be used to tell if
pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the
compiler would not be able to find pmd_mkwrite_novma().
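The wrapper pattern looks roughly like this, using a toy pte_t and bit definition just to show the shape (the real generic wrapper is added to include/linux/pgtable.h by this patch):

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for the kernel's types, purely illustrative. */
typedef struct { uint64_t val; } pte_t;
#define PAGE_WRITE (1ULL << 1)

/* Step 1: each arch's existing implementation, renamed to _novma(). */
static pte_t pte_mkwrite_novma(pte_t pte)
{
	pte.val |= PAGE_WRITE;
	return pte;
}

/* Generic wrapper: behavior is unchanged for now. Step 3 later makes
 * this take a VMA so x86 can choose between regular writable memory
 * and shadow stack memory. */
static pte_t pte_mkwrite(pte_t pte)
{
	return pte_mkwrite_novma(pte);
}
```

Since the wrapper just forwards to the renamed arch helper, callers see no behavior change until the later VMA patch lands.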

No functional change.

Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-csky@vger.kernel.org
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linux-m68k@lists.linux-m68k.org
Cc: Michal Simek <monstr@monstr.eu>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: linux-mips@vger.kernel.org
Cc: openrisc@lists.librecores.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/
---
Hi non-x86 arch maintainers,

x86 has a feature that allows for the creation of a special type of
writable memory (shadow stack) that is only writable in limited specific
ways. Previously, changes were proposed to core MM code to teach it to
decide when to create normally writable memory or the special shadow stack
writable memory, but David Hildenbrand suggested[0] to change
pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
moved into x86 code. Later Linus suggested a less error-prone way[1] to go
about this after the first attempt had a bug.

Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
changes. So that is why you are seeing some patches out of a big x86
series pop up in your arch mailing list. There is no functional change.
After this refactor, the shadow stack series goes on to use the arch
helpers to push arch memory details inside arch/x86 and other arches
with upcoming shadow stack features.

Testing was just 0-day build testing.

Hopefully that is enough context. Thanks!

[0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redhat.com/
[1] https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/
---
 Documentation/mm/arch_pgtable_helpers.rst    |  6 ++++++
 arch/Kconfig                                 |  3 +++
 arch/alpha/include/asm/pgtable.h             |  2 +-
 arch/arc/include/asm/hugepage.h              |  2 +-
 arch/arc/include/asm/pgtable-bits-arcv2.h    |  2 +-
 arch/arm/include/asm/pgtable-3level.h        |  2 +-
 arch/arm/include/asm/pgtable.h               |  2 +-
 arch/arm64/include/asm/pgtable.h             |  4 ++--
 arch/csky/include/asm/pgtable.h              |  2 +-
 arch/hexagon/include/asm/pgtable.h           |  2 +-
 arch/ia64/include/asm/pgtable.h              |  2 +-
 arch/loongarch/include/asm/pgtable.h         |  4 ++--
 arch/m68k/include/asm/mcf_pgtable.h          |  2 +-
 arch/m68k/include/asm/motorola_pgtable.h     |  2 +-
 arch/m68k/include/asm/sun3_pgtable.h         |  2 +-
 arch/microblaze/include/asm/pgtable.h        |  2 +-
 arch/mips/include/asm/pgtable.h              |  6 +++---
 arch/nios2/include/asm/pgtable.h             |  2 +-
 arch/openrisc/include/asm/pgtable.h          |  2 +-
 arch/parisc/include/asm/pgtable.h            |  2 +-
 arch/powerpc/include/asm/book3s/32/pgtable.h |  2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h |  4 ++--
 arch/powerpc/include/asm/nohash/32/pgtable.h |  4 ++--
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |  4 ++--
 arch/powerpc/include/asm/nohash/64/pgtable.h |  2 +-
 arch/riscv/include/asm/pgtable.h             |  6 +++---
 arch/s390/include/asm/hugetlb.h              |  2 +-
 arch/s390/include/asm/pgtable.h              |  4 ++--
 arch/sh/include/asm/pgtable_32.h             |  4 ++--
 arch/sparc/include/asm/pgtable_32.h          |  2 +-
 arch/sparc/include/asm/pgtable_64.h          |  6 +++---
 arch/um/include/asm/pgtable.h                |  2 +-
 arch/x86/include/asm/pgtable.h               |  4 ++--
 arch/xtensa/include/asm/pgtable.h            |  2 +-
 include/asm-generic/hugetlb.h                |  2 +-
 include/linux/pgtable.h                      | 14 ++++++++++++++
 36 files changed, 70 insertions(+), 47 deletions(-)

diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index af3891f895b0..69ce1f2aa4d1 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -48,6 +48,9 @@ PTE Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pte_mkwrite               | Creates a writable PTE                           |
 +---------------------------+--------------------------------------------------+
+| pte_mkwrite_novma         | Creates a writable PTE, of the conventional type |
+|                           | of writable.                                     |
++---------------------------+--------------------------------------------------+
 | pte_wrprotect             | Creates a write protected PTE                    |
 +---------------------------+--------------------------------------------------+
 | pte_mkspecial             | Creates a special PTE                            |
@@ -120,6 +123,9 @@ PMD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pmd_mkwrite               | Creates a writable PMD                           |
 +---------------------------+--------------------------------------------------+
+| pmd_mkwrite_novma         | Creates a writable PMD, of the conventional type |
+|                           | of writable.                                     |
++---------------------------+--------------------------------------------------+
 | pmd_wrprotect             | Creates a write protected PMD                    |
 +---------------------------+--------------------------------------------------+
 | pmd_mkspecial             | Creates a special PMD                            |
diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..3bc11c9a2ac1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -919,6 +919,9 @@ config HAVE_ARCH_HUGE_VMALLOC
 config ARCH_WANT_HUGE_PMD_SHARE
 	bool
 
+config HAS_HUGE_PAGE
+	def_bool HAVE_ARCH_HUGE_VMAP || TRANSPARENT_HUGEPAGE || HUGETLBFS
+
 config HAVE_ARCH_SOFT_DIRTY
 	bool
 
diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index ba43cb841d19..af1a13ab3320 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -256,7 +256,7 @@ extern inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED;
 extern inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) |= _PAGE_FOW; return pte; }
 extern inline pte_t pte_mkclean(pte_t pte)	{ pte_val(pte) &= ~(__DIRTY_BITS); return pte; }
 extern inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~(__ACCESS_BITS); return pte; }
-extern inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) &= ~_PAGE_FOW; return pte; }
+extern inline pte_t pte_mkwrite_novma(pte_t pte){ pte_val(pte) &= ~_PAGE_FOW; return pte; }
 extern inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= __DIRTY_BITS; return pte; }
 extern inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= __ACCESS_BITS; return pte; }
 
diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
index 5001b796fb8d..ef8d4166370c 100644
--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -21,7 +21,7 @@ static inline pmd_t pte_pmd(pte_t pte)
 }
 
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite_novma(pmd)	pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
diff --git a/arch/arc/include/asm/pgtable-bits-arcv2.h b/arch/arc/include/asm/pgtable-bits-arcv2.h
index 6e9f8ca6d6a1..5c073d9f41c2 100644
--- a/arch/arc/include/asm/pgtable-bits-arcv2.h
+++ b/arch/arc/include/asm/pgtable-bits-arcv2.h
@@ -87,7 +87,7 @@
 
 PTE_BIT_FUNC(mknotpresent,     &= ~(_PAGE_PRESENT));
 PTE_BIT_FUNC(wrprotect,	&= ~(_PAGE_WRITE));
-PTE_BIT_FUNC(mkwrite,	|= (_PAGE_WRITE));
+PTE_BIT_FUNC(mkwrite_novma,	|= (_PAGE_WRITE));
 PTE_BIT_FUNC(mkclean,	&= ~(_PAGE_DIRTY));
 PTE_BIT_FUNC(mkdirty,	|= (_PAGE_DIRTY));
 PTE_BIT_FUNC(mkold,	&= ~(_PAGE_ACCESSED));
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 106049791500..71c3add6417f 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -202,7 +202,7 @@ static inline pmd_t pmd_##fn(pmd_t pmd) { pmd_val(pmd) op; return pmd; }
 
 PMD_BIT_FUNC(wrprotect,	|= L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkold,	&= ~PMD_SECT_AF);
-PMD_BIT_FUNC(mkwrite,   &= ~L_PMD_SECT_RDONLY);
+PMD_BIT_FUNC(mkwrite_novma,   &= ~L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkdirty,   |= L_PMD_SECT_DIRTY);
 PMD_BIT_FUNC(mkclean,   &= ~L_PMD_SECT_DIRTY);
 PMD_BIT_FUNC(mkyoung,   |= PMD_SECT_AF);
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index a58ccbb406ad..f37ba2472eae 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -227,7 +227,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
 	return set_pte_bit(pte, __pgprot(L_PTE_RDONLY));
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return clear_pte_bit(pte, __pgprot(L_PTE_RDONLY));
 }
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0bd18de9fd97..7a3d62cb9bee 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -180,7 +180,7 @@ static inline pmd_t set_pmd_bit(pmd_t pmd, pgprot_t prot)
 	return pmd;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
 	pte = clear_pte_bit(pte, __pgprot(PTE_RDONLY));
@@ -487,7 +487,7 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #define pmd_cont(pmd)		pte_cont(pmd_pte(pmd))
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite_novma(pmd)	pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)))
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index d4042495febc..aa0cce4fc02f 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -176,7 +176,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	if (pte_val(pte) & _PAGE_MODIFIED)
diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h
index 59393613d086..fc2d2d83368d 100644
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -300,7 +300,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
 }
 
 /* pte_mkwrite - mark page as writable */
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	return pte;
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 21c97e31a28a..f80aba7cad99 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -268,7 +268,7 @@ ia64_phys_addr_valid (unsigned long addr)
  * access rights:
  */
 #define pte_wrprotect(pte)	(__pte(pte_val(pte) & ~_PAGE_AR_RW))
-#define pte_mkwrite(pte)	(__pte(pte_val(pte) | _PAGE_AR_RW))
+#define pte_mkwrite_novma(pte)	(__pte(pte_val(pte) | _PAGE_AR_RW))
 #define pte_mkold(pte)		(__pte(pte_val(pte) & ~_PAGE_A))
 #define pte_mkyoung(pte)	(__pte(pte_val(pte) | _PAGE_A))
 #define pte_mkclean(pte)	(__pte(pte_val(pte) & ~_PAGE_D))
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index d28fb9dbec59..8245cf367b31 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -390,7 +390,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	if (pte_val(pte) & _PAGE_MODIFIED)
@@ -490,7 +490,7 @@ static inline int pmd_write(pmd_t pmd)
 	return !!(pmd_val(pmd) & _PAGE_WRITE);
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
 	pmd_val(pmd) |= _PAGE_WRITE;
 	if (pmd_val(pmd) & _PAGE_MODIFIED)
diff --git a/arch/m68k/include/asm/mcf_pgtable.h b/arch/m68k/include/asm/mcf_pgtable.h
index d97fbb812f63..42ebea0488e3 100644
--- a/arch/m68k/include/asm/mcf_pgtable.h
+++ b/arch/m68k/include/asm/mcf_pgtable.h
@@ -211,7 +211,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= CF_PAGE_WRITABLE;
 	return pte;
diff --git a/arch/m68k/include/asm/motorola_pgtable.h b/arch/m68k/include/asm/motorola_pgtable.h
index ec0dc19ab834..ba28ca4d219a 100644
--- a/arch/m68k/include/asm/motorola_pgtable.h
+++ b/arch/m68k/include/asm/motorola_pgtable.h
@@ -155,7 +155,7 @@ static inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED;
 static inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) |= _PAGE_RONLY; return pte; }
 static inline pte_t pte_mkclean(pte_t pte)	{ pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~_PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) &= ~_PAGE_RONLY; return pte; }
+static inline pte_t pte_mkwrite_novma(pte_t pte){ pte_val(pte) &= ~_PAGE_RONLY; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mknocache(pte_t pte)
diff --git a/arch/m68k/include/asm/sun3_pgtable.h b/arch/m68k/include/asm/sun3_pgtable.h
index e582b0484a55..4114eaff7404 100644
--- a/arch/m68k/include/asm/sun3_pgtable.h
+++ b/arch/m68k/include/asm/sun3_pgtable.h
@@ -143,7 +143,7 @@ static inline int pte_young(pte_t pte)		{ return pte_val(pte) & SUN3_PAGE_ACCESS
 static inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) &= ~SUN3_PAGE_WRITEABLE; return pte; }
 static inline pte_t pte_mkclean(pte_t pte)	{ pte_val(pte) &= ~SUN3_PAGE_MODIFIED; return pte; }
 static inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~SUN3_PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_WRITEABLE; return pte; }
+static inline pte_t pte_mkwrite_novma(pte_t pte){ pte_val(pte) |= SUN3_PAGE_WRITEABLE; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_MODIFIED; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mknocache(pte_t pte)	{ pte_val(pte) |= SUN3_PAGE_NOCACHE; return pte; }
diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/include/asm/pgtable.h
index d1b8272abcd9..9108b33a7886 100644
--- a/arch/microblaze/include/asm/pgtable.h
+++ b/arch/microblaze/include/asm/pgtable.h
@@ -266,7 +266,7 @@ static inline pte_t pte_mkread(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_USER; return pte; }
 static inline pte_t pte_mkexec(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_USER | _PAGE_EXEC; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) \
+static inline pte_t pte_mkwrite_novma(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 574fa14ac8b2..40a54fd6e48d 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -309,7 +309,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte.pte_low |= _PAGE_WRITE;
 	if (pte.pte_low & _PAGE_MODIFIED) {
@@ -364,7 +364,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	if (pte_val(pte) & _PAGE_MODIFIED)
@@ -627,7 +627,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 	return pmd;
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
 	pmd_val(pmd) |= _PAGE_WRITE;
 	if (pmd_val(pmd) & _PAGE_MODIFIED)
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 0f5c2564e9f5..cf1ffbc1a121 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -129,7 +129,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	return pte;
diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h
index 3eb9b9555d0d..828820c74fc5 100644
--- a/arch/openrisc/include/asm/pgtable.h
+++ b/arch/openrisc/include/asm/pgtable.h
@@ -250,7 +250,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	return pte;
diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
index e715df5385d6..79d1cef2fd7c 100644
--- a/arch/parisc/include/asm/pgtable.h
+++ b/arch/parisc/include/asm/pgtable.h
@@ -331,7 +331,7 @@ static inline pte_t pte_mkold(pte_t pte)	{ pte_val(pte) &= ~_PAGE_ACCESSED; retu
 static inline pte_t pte_wrprotect(pte_t pte)	{ pte_val(pte) &= ~_PAGE_WRITE; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)	{ pte_val(pte) |= _PAGE_WRITE; return pte; }
+static inline pte_t pte_mkwrite_novma(pte_t pte)	{ pte_val(pte) |= _PAGE_WRITE; return pte; }
 static inline pte_t pte_mkspecial(pte_t pte)	{ pte_val(pte) |= _PAGE_SPECIAL; return pte; }
 
 /*
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 7bf1fe7297c6..67dfb674a4c1 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -498,7 +498,7 @@ static inline pte_t pte_mkpte(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_RW);
 }
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 4acc9690f599..0328d917494a 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -600,7 +600,7 @@ static inline pte_t pte_mkexec(pte_t pte)
 	return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC));
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	/*
 	 * write implies read, hence set both
@@ -1071,7 +1071,7 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite_novma(pmd)	pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)))
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #define pmd_soft_dirty(pmd)    pte_soft_dirty(pmd_pte(pmd))
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index fec56d965f00..33213b31fcbb 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -170,8 +170,8 @@ void unmap_kernel_page(unsigned long va);
 #define pte_clear(mm, addr, ptep) \
 	do { pte_update(mm, addr, ptep, ~0, 0, 0); } while (0)
 
-#ifndef pte_mkwrite
-static inline pte_t pte_mkwrite(pte_t pte)
+#ifndef pte_mkwrite_novma
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_RW);
 }
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 1a89ebdc3acc..21f681ee535a 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -101,12 +101,12 @@ static inline int pte_write(pte_t pte)
 
 #define pte_write pte_write
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) & ~_PAGE_RO);
 }
 
-#define pte_mkwrite pte_mkwrite
+#define pte_mkwrite_novma pte_mkwrite_novma
 
 static inline bool pte_user(pte_t pte)
 {
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 287e25864ffa..abe4fd82721e 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -85,7 +85,7 @@
 #ifndef __ASSEMBLY__
 /* pte_clear moved to later in this file */
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_RW);
 }
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 2258b27173b0..b38faec98154 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -379,7 +379,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
 
 /* static inline pte_t pte_mkread(pte_t pte) */
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_WRITE);
 }
@@ -665,9 +665,9 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 	return pte_pmd(pte_mkyoung(pmd_pte(pmd)));
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
-	return pte_pmd(pte_mkwrite(pmd_pte(pmd)));
+	return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)));
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index ccdbccfde148..f07267875a19 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -104,7 +104,7 @@ static inline int huge_pte_dirty(pte_t pte)
 
 static inline pte_t huge_pte_mkwrite(pte_t pte)
 {
-	return pte_mkwrite(pte);
+	return pte_mkwrite_novma(pte);
 }
 
 static inline pte_t huge_pte_mkdirty(pte_t pte)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 6822a11c2c8a..699406036f30 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1005,7 +1005,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
 	return set_pte_bit(pte, __pgprot(_PAGE_PROTECT));
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte = set_pte_bit(pte, __pgprot(_PAGE_WRITE));
 	if (pte_val(pte) & _PAGE_DIRTY)
@@ -1488,7 +1488,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 	return set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_PROTECT));
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
 	pmd = set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_WRITE));
 	if (pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY)
diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h
index 21952b094650..165b4fd08152 100644
--- a/arch/sh/include/asm/pgtable_32.h
+++ b/arch/sh/include/asm/pgtable_32.h
@@ -359,11 +359,11 @@ static inline pte_t pte_##fn(pte_t pte) { pte.pte_##h op; return pte; }
  * kernel permissions), we attempt to couple them a bit more sanely here.
  */
 PTE_BIT_FUNC(high, wrprotect, &= ~(_PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE));
-PTE_BIT_FUNC(high, mkwrite, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
+PTE_BIT_FUNC(high, mkwrite_novma, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
 PTE_BIT_FUNC(high, mkhuge, |= _PAGE_SZHUGE);
 #else
 PTE_BIT_FUNC(low, wrprotect, &= ~_PAGE_RW);
-PTE_BIT_FUNC(low, mkwrite, |= _PAGE_RW);
+PTE_BIT_FUNC(low, mkwrite_novma, |= _PAGE_RW);
 PTE_BIT_FUNC(low, mkhuge, |= _PAGE_SZHUGE);
 #endif
 
diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index d4330e3c57a6..a2d909446539 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -241,7 +241,7 @@ static inline pte_t pte_mkold(pte_t pte)
 	return __pte(pte_val(pte) & ~SRMMU_REF);
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | SRMMU_WRITE);
 }
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 5563efa1a19f..4dd4f6cdc670 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -517,7 +517,7 @@ static inline pte_t pte_mkclean(pte_t pte)
 	return __pte(val);
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	unsigned long val = pte_val(pte), mask;
 
@@ -772,11 +772,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 	return __pmd(pte_val(pte));
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
 	pte_t pte = __pte(pmd_val(pmd));
 
-	pte = pte_mkwrite(pte);
+	pte = pte_mkwrite_novma(pte);
 
 	return __pmd(pte_val(pte));
 }
diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
index a70d1618eb35..46f59a8bc812 100644
--- a/arch/um/include/asm/pgtable.h
+++ b/arch/um/include/asm/pgtable.h
@@ -207,7 +207,7 @@ static inline pte_t pte_mkyoung(pte_t pte)
 	return(pte);
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	if (unlikely(pte_get_bits(pte,  _PAGE_RW)))
 		return pte;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 15ae4d6ba476..112e6060eafa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -352,7 +352,7 @@ static inline pte_t pte_mkyoung(pte_t pte)
 	return pte_set_flags(pte, _PAGE_ACCESSED);
 }
 
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_RW);
 }
@@ -453,7 +453,7 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_ACCESSED);
 }
 
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
index fc7a14884c6c..27e3ae38a5de 100644
--- a/arch/xtensa/include/asm/pgtable.h
+++ b/arch/xtensa/include/asm/pgtable.h
@@ -262,7 +262,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
 	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)
 	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 	{ pte_val(pte) |= _PAGE_WRITABLE; return pte; }
 
 #define pgprot_noncached(prot) \
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index d7f6335d3999..4da02798a00b 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -22,7 +22,7 @@ static inline unsigned long huge_pte_dirty(pte_t pte)
 
 static inline pte_t huge_pte_mkwrite(pte_t pte)
 {
-	return pte_mkwrite(pte);
+	return pte_mkwrite_novma(pte);
 }
 
 #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..ae271a307584 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -507,6 +507,20 @@ extern pud_t pudp_huge_clear_flush(struct vm_area_struct *vma,
 			      pud_t *pudp);
 #endif
 
+#ifndef pte_mkwrite
+static inline pte_t pte_mkwrite(pte_t pte)
+{
+	return pte_mkwrite_novma(pte);
+}
+#endif
+
+#if defined(CONFIG_HAS_HUGE_PAGE) && !defined(pmd_mkwrite)
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_mkwrite_novma(pmd);
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 struct mm_struct;
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  7:44   ` Mike Rapoport
  2023-06-13 12:27   ` David Hildenbrand
  2023-06-13  0:10 ` [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
                   ` (41 subsequent siblings)
  43 siblings, 2 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, linux-arm-kernel, linux-s390, xen-devel

The x86 Shadow stack feature includes a new type of memory called shadow
stack. This shadow stack memory has some unusual properties, which requires
some core mm changes to function properly.

One of these unusual properties is that shadow stack memory is writable,
but only in limited ways. These limits are applied via a specific PTE
bit combination. Nevertheless, the memory is writable, and core mm code
will need to apply the writable permissions in the typical paths that
call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
that the x86 implementation of it can know whether to create regular
writable memory or shadow stack memory.

But there are a couple of challenges to this. Modifying the signatures of
each arch pte_mkwrite() implementation would be error prone because some
are generated with macros and would need to be re-implemented. Also, some
pte_mkwrite() callers operate on kernel memory without a VMA.

So this can be done in a three step process. First pte_mkwrite() can be
renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
added that just calls pte_mkwrite_novma(). Next callers without a VMA can
be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
can be changed to take/pass a VMA.

Previous patches have done the first step, so next move the callers that
don't have a VMA to pte_mkwrite_novma(). Also do the same for
pmd_mkwrite(). This will be ok for the shadow stack feature, as these
callers are on kernel memory which will not need to be made shadow stack,
and the other architectures only currently support one type of memory
in pte_mkwrite()

Cc: linux-doc@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
Hi Non-x86 Arch’s,

x86 has a feature that allows for the creation of a special type of
writable memory (shadow stack) that is only writable in limited specific
ways. Previously, changes were proposed to core MM code to teach it to
decide when to create normally writable memory or the special shadow stack
writable memory, but David Hildenbrand suggested[0] to change
pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
moved into x86 code. Later Linus suggested a less error-prone way[1] to go
about this after the first attempt had a bug.

Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
changes. So that is why you are seeing some patches out of a big x86
series pop up in your arch mailing list. There is no functional change.
After this refactor, the shadow stack series goes on to use the arch
helpers to push arch memory details inside arch/x86 and other arch's
with upcoming shadow stack features.

Testing was just 0-day build testing.

Hopefully that is enough context. Thanks!

[0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redhat.com/
[1] https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/
---
 arch/arm64/mm/trans_pgd.c | 4 ++--
 arch/s390/mm/pageattr.c   | 4 ++--
 arch/x86/xen/mmu_pv.c     | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 4ea2eefbc053..a01493f3a06f 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -40,7 +40,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 * read only (code, rodata). Clear the RDONLY bit from
 		 * the temporary mappings we use during restore.
 		 */
-		set_pte(dst_ptep, pte_mkwrite(pte));
+		set_pte(dst_ptep, pte_mkwrite_novma(pte));
 	} else if (debug_pagealloc_enabled() && !pte_none(pte)) {
 		/*
 		 * debug_pagealloc will removed the PTE_VALID bit if
@@ -53,7 +53,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 */
 		BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
+		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
 	}
 }
 
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 5ba3bd8a7b12..6931d484d8a7 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -97,7 +97,7 @@ static int walk_pte_level(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		if (flags & SET_MEMORY_RO)
 			new = pte_wrprotect(new);
 		else if (flags & SET_MEMORY_RW)
-			new = pte_mkwrite(pte_mkdirty(new));
+			new = pte_mkwrite_novma(pte_mkdirty(new));
 		if (flags & SET_MEMORY_NX)
 			new = set_pte_bit(new, __pgprot(_PAGE_NOEXEC));
 		else if (flags & SET_MEMORY_X)
@@ -155,7 +155,7 @@ static void modify_pmd_page(pmd_t *pmdp, unsigned long addr,
 	if (flags & SET_MEMORY_RO)
 		new = pmd_wrprotect(new);
 	else if (flags & SET_MEMORY_RW)
-		new = pmd_mkwrite(pmd_mkdirty(new));
+		new = pmd_mkwrite_novma(pmd_mkdirty(new));
 	if (flags & SET_MEMORY_NX)
 		new = set_pmd_bit(new, __pgprot(_SEGMENT_ENTRY_NOEXEC));
 	else if (flags & SET_MEMORY_X)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index b3b8d289b9ab..63fced067057 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -150,7 +150,7 @@ void make_lowmem_page_readwrite(void *vaddr)
 	if (pte == NULL)
 		return;		/* vaddr missing */
 
-	ptev = pte_mkwrite(*pte);
+	ptev = pte_mkwrite_novma(*pte);
 
 	if (HYPERVISOR_update_va_mapping(address, ptev, 0))
 		BUG();
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma() Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  7:42   ` Mike Rapoport
  2023-06-13 12:28   ` David Hildenbrand
  2023-06-13  0:10 ` [PATCH v9 04/42] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
                   ` (40 subsequent siblings)
  43 siblings, 2 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe

The x86 Shadow stack feature includes a new type of memory called shadow
stack. This shadow stack memory has some unusual properties, which requires
some core mm changes to function properly.

One of these unusual properties is that shadow stack memory is writable,
but only in limited ways. These limits are applied via a specific PTE
bit combination. Nevertheless, the memory is writable, and core mm code
will need to apply the writable permissions in the typical paths that
call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
that the x86 implementation of it can know whether to create regular
writable memory or shadow stack memory.

But there are a couple of challenges to this. Modifying the signatures of
each arch pte_mkwrite() implementation would be error prone because some
are generated with macros and would need to be re-implemented. Also, some
pte_mkwrite() callers operate on kernel memory without a VMA.

So this can be done in a three step process. First pte_mkwrite() can be
renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
added that just calls pte_mkwrite_novma(). Next callers without a VMA can
be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
can be changed to take/pass a VMA.

In a previous patches, pte_mkwrite() was renamed pte_mkwrite_novma() and
callers that don't have a VMA were changed to use pte_mkwrite_novma(). So
now change pte_mkwrite() to take a VMA and change the remaining callers to
pass a VMA. Apply the same changes for pmd_mkwrite().

No functional change.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 Documentation/mm/arch_pgtable_helpers.rst |  6 ++++--
 include/linux/mm.h                        |  2 +-
 include/linux/pgtable.h                   |  4 ++--
 mm/debug_vm_pgtable.c                     | 12 ++++++------
 mm/huge_memory.c                          | 10 +++++-----
 mm/memory.c                               |  4 ++--
 mm/migrate.c                              |  2 +-
 mm/migrate_device.c                       |  2 +-
 mm/mprotect.c                             |  2 +-
 mm/userfaultfd.c                          |  2 +-
 10 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index 69ce1f2aa4d1..c82e3ee20e51 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -46,7 +46,8 @@ PTE Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pte_mkclean               | Creates a clean PTE                              |
 +---------------------------+--------------------------------------------------+
-| pte_mkwrite               | Creates a writable PTE                           |
+| pte_mkwrite               | Creates a writable PTE of the type specified by  |
+|                           | the VMA.                                         |
 +---------------------------+--------------------------------------------------+
 | pte_mkwrite_novma         | Creates a writable PTE, of the conventional type |
 |                           | of writable.                                     |
@@ -121,7 +122,8 @@ PMD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pmd_mkclean               | Creates a clean PMD                              |
 +---------------------------+--------------------------------------------------+
-| pmd_mkwrite               | Creates a writable PMD                           |
+| pmd_mkwrite               | Creates a writable PMD of the type specified by  |
+|                           | the VMA.                                         |
 +---------------------------+--------------------------------------------------+
 | pmd_mkwrite_novma         | Creates a writable PMD, of the conventional type |
 |                           | of writable.                                     |
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..43701bf223d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1284,7 +1284,7 @@ void free_compound_page(struct page *page);
 static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
+		pte = pte_mkwrite(pte, vma);
 	return pte;
 }
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index ae271a307584..0f3cf726812a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -508,14 +508,14 @@ extern pud_t pudp_huge_clear_flush(struct vm_area_struct *vma,
 #endif
 
 #ifndef pte_mkwrite
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	return pte_mkwrite_novma(pte);
 }
 #endif
 
 #if defined(CONFIG_HAS_HUGE_PAGE) && !defined(pmd_mkwrite)
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	return pmd_mkwrite_novma(pmd);
 }
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index c54177aabebd..107e293904d3 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -109,10 +109,10 @@ static void __init pte_basic_tests(struct pgtable_debug_args *args, int idx)
 	WARN_ON(!pte_same(pte, pte));
 	WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte))));
 	WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte))));
-	WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte))));
+	WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte), args->vma)));
 	WARN_ON(pte_young(pte_mkold(pte_mkyoung(pte))));
 	WARN_ON(pte_dirty(pte_mkclean(pte_mkdirty(pte))));
-	WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte))));
+	WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte, args->vma))));
 	WARN_ON(pte_dirty(pte_wrprotect(pte_mkclean(pte))));
 	WARN_ON(!pte_dirty(pte_wrprotect(pte_mkdirty(pte))));
 }
@@ -153,7 +153,7 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
 	pte = pte_mkclean(pte);
 	set_pte_at(args->mm, args->vaddr, args->ptep, pte);
 	flush_dcache_page(page);
-	pte = pte_mkwrite(pte);
+	pte = pte_mkwrite(pte, args->vma);
 	pte = pte_mkdirty(pte);
 	ptep_set_access_flags(args->vma, args->vaddr, args->ptep, pte, 1);
 	pte = ptep_get(args->ptep);
@@ -199,10 +199,10 @@ static void __init pmd_basic_tests(struct pgtable_debug_args *args, int idx)
 	WARN_ON(!pmd_same(pmd, pmd));
 	WARN_ON(!pmd_young(pmd_mkyoung(pmd_mkold(pmd))));
 	WARN_ON(!pmd_dirty(pmd_mkdirty(pmd_mkclean(pmd))));
-	WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd))));
+	WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd), args->vma)));
 	WARN_ON(pmd_young(pmd_mkold(pmd_mkyoung(pmd))));
 	WARN_ON(pmd_dirty(pmd_mkclean(pmd_mkdirty(pmd))));
-	WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd))));
+	WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd, args->vma))));
 	WARN_ON(pmd_dirty(pmd_wrprotect(pmd_mkclean(pmd))));
 	WARN_ON(!pmd_dirty(pmd_wrprotect(pmd_mkdirty(pmd))));
 	/*
@@ -253,7 +253,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
 	pmd = pmd_mkclean(pmd);
 	set_pmd_at(args->mm, vaddr, args->pmdp, pmd);
 	flush_dcache_page(page);
-	pmd = pmd_mkwrite(pmd);
+	pmd = pmd_mkwrite(pmd, args->vma);
 	pmd = pmd_mkdirty(pmd);
 	pmdp_set_access_flags(args->vma, vaddr, args->pmdp, pmd, 1);
 	pmd = READ_ONCE(*args->pmdp);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..37dd56b7b3d1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -551,7 +551,7 @@ __setup("transparent_hugepage=", setup_transparent_hugepage);
 pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
-		pmd = pmd_mkwrite(pmd);
+		pmd = pmd_mkwrite(pmd, vma);
 	return pmd;
 }
 
@@ -1572,7 +1572,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
 	pmd = pmd_mkyoung(pmd);
 	if (writable)
-		pmd = pmd_mkwrite(pmd);
+		pmd = pmd_mkwrite(pmd, vma);
 	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
 	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 	spin_unlock(vmf->ptl);
@@ -1924,7 +1924,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	/* See change_pte_range(). */
 	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
 	    can_change_pmd_writable(vma, addr, entry))
-		entry = pmd_mkwrite(entry);
+		entry = pmd_mkwrite(entry, vma);
 
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
@@ -2234,7 +2234,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		} else {
 			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
 			if (write)
-				entry = pte_mkwrite(entry);
+				entry = pte_mkwrite(entry, vma);
 			if (anon_exclusive)
 				SetPageAnonExclusive(page + i);
 			if (!young)
@@ -3271,7 +3271,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
-		pmde = pmd_mkwrite(pmde);
+		pmde = pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_mkuffd_wp(pmde);
 	if (!is_migration_entry_young(entry))
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..c1b6fe944c20 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4100,7 +4100,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	entry = mk_pte(&folio->page, vma->vm_page_prot);
 	entry = pte_sw_mkyoung(entry);
 	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
+		entry = pte_mkwrite(pte_mkdirty(entry), vma);
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
@@ -4796,7 +4796,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
 	if (writable)
-		pte = pte_mkwrite(pte);
+		pte = pte_mkwrite(pte, vma);
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index 01cac26a3127..8b46b722f1a4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -219,7 +219,7 @@ static bool remove_migration_pte(struct folio *folio,
 		if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
 			pte = pte_mkdirty(pte);
 		if (is_writable_migration_entry(entry))
-			pte = pte_mkwrite(pte);
+			pte = pte_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
 			pte = pte_mkuffd_wp(pte);
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d30c9de60b0d..df3f5e9d5f76 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -646,7 +646,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		}
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (vma->vm_flags & VM_WRITE)
-			entry = pte_mkwrite(pte_mkdirty(entry));
+			entry = pte_mkwrite(pte_mkdirty(entry), vma);
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..afdb6723782e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -198,7 +198,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
 			    !pte_write(ptent) &&
 			    can_change_pte_writable(vma, addr, ptent))
-				ptent = pte_mkwrite(ptent);
+				ptent = pte_mkwrite(ptent, vma);
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			if (pte_needs_flush(oldpte, ptent))
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e97a0b4889fc..6dea7f57026e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -72,7 +72,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
 	if (page_in_cache && !vm_shared)
 		writable = false;
 	if (writable)
-		_dst_pte = pte_mkwrite(_dst_pte);
+		_dst_pte = pte_mkwrite(_dst_pte, dst_vma);
 	if (flags & MFILL_ATOMIC_WP)
 		_dst_pte = pte_mkuffd_wp(_dst_pte);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 04/42] mm: Re-introduce vm_flags to do_mmap()
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (2 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-14  8:49   ` David Hildenbrand
  2023-06-14 23:30   ` Mark Brown
  2023-06-13  0:10 ` [PATCH v9 05/42] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
                   ` (39 subsequent siblings)
  43 siblings, 2 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Peter Collingbourne, Pengfei Xu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

There were no remaining callers passing vm_flags to do_mmap(), so vm_flags
was removed from the function's parameters by:

    commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now.  Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap().  Thus, re-introduce vm_flags to do_mmap().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 fs/aio.c           |  2 +-
 include/linux/mm.h |  3 ++-
 ipc/shm.c          |  2 +-
 mm/mmap.c          | 10 +++++-----
 mm/nommu.c         |  4 ++--
 mm/util.c          |  2 +-
 6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index b0b17bd098bb..4a7576989719 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -558,7 +558,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
 				 PROT_READ | PROT_WRITE,
-				 MAP_SHARED, 0, &unused, NULL);
+				 MAP_SHARED, 0, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43701bf223d3..9ec20cbb20c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3133,7 +3133,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf);
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
 			 bool downgrade);
diff --git a/ipc/shm.c b/ipc/shm.c
index 60e45e7045d4..576a543b7cff 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1662,7 +1662,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 13678edaa22c..afdf5f78432b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1221,11 +1221,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
  */
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
-			unsigned long flags, unsigned long pgoff,
-			unsigned long *populate, struct list_head *uf)
+			unsigned long flags, vm_flags_t vm_flags,
+			unsigned long pgoff, unsigned long *populate,
+			struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
-	vm_flags_t vm_flags;
 	int pkey = 0;
 
 	validate_mm(mm);
@@ -1286,7 +1286,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
@@ -2903,7 +2903,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 
 	file = get_file(vma->vm_file);
 	ret = do_mmap(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+			prot, flags, 0, pgoff, &populate, NULL);
 	fput(file);
 out:
 	mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index f670d9979a26..138826c4a872 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1002,6 +1002,7 @@ unsigned long do_mmap(struct file *file,
 			unsigned long len,
 			unsigned long prot,
 			unsigned long flags,
+			vm_flags_t vm_flags,
 			unsigned long pgoff,
 			unsigned long *populate,
 			struct list_head *uf)
@@ -1009,7 +1010,6 @@ unsigned long do_mmap(struct file *file,
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	struct rb_node *rb;
-	vm_flags_t vm_flags;
 	unsigned long capabilities, result;
 	int ret;
 	VMA_ITERATOR(vmi, current->mm, 0);
@@ -1029,7 +1029,7 @@ unsigned long do_mmap(struct file *file,
 
 	/* we've determined that we can make the mapping, now translate what we
 	 * now know into VMA flags */
-	vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+	vm_flags |= determine_vm_flags(file, prot, flags, capabilities);
 
 
 	/* we're going to need to record the mapping */
diff --git a/mm/util.c b/mm/util.c
index dd12b9531ac4..8e7fc6cacab4 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -540,7 +540,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	if (!ret) {
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
-		ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
 			      &uf);
 		mmap_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 05/42] mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (3 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 04/42] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-14  8:50   ` David Hildenbrand
  2023-06-13  0:10 ` [PATCH v9 06/42] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
                   ` (38 subsequent siblings)
  43 siblings, 1 reply; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Axel Rasmussen, Peter Xu, Pengfei Xu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

Future patches will introduce a new VM flag VM_SHADOW_STACK that will be
VM_HIGH_ARCH_BIT_5. VM_HIGH_ARCH_BIT_0 through VM_HIGH_ARCH_BIT_4 are
bits 32-36, and bit 37 is the unrelated VM_UFFD_MINOR_BIT. For the sake
of order, make all VM_HIGH_ARCH_BITs stay together by moving
VM_UFFD_MINOR_BIT from 37 to 38. This will allow VM_SHADOW_STACK to be
introduced as 37.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Peter Xu <peterx@redhat.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9ec20cbb20c1..6f52c1e7c640 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -370,7 +370,7 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT	37
+# define VM_UFFD_MINOR_BIT	38
 # define VM_UFFD_MINOR		BIT(VM_UFFD_MINOR_BIT)	/* UFFD minor faults */
 #else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 # define VM_UFFD_MINOR		VM_NONE
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 06/42] x86/shstk: Add Kconfig option for shadow stack
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (4 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 05/42] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 07/42] x86/traps: Move control protection handler to separate file Rick Edgecombe
                   ` (37 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

Shadow stack provides protection for applications against function return
address corruption. It is active when the processor supports it, the
kernel has CONFIG_X86_SHADOW_STACK enabled, and the application is built
for the feature. This is only implemented for the 64-bit kernel. When it
is enabled, legacy non-shadow stack applications continue to work, but
without protection.

Since there is another feature that utilizes CET (Kernel IBT) that will
share implementation with shadow stacks, create CONFIG_CET to signify
that at least one CET feature is configured.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig           | 24 ++++++++++++++++++++++++
 arch/x86/Kconfig.assembler |  5 +++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..ce460d6b4e25 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1852,6 +1852,11 @@ config CC_HAS_IBT
 		  (CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
 		  $(as-instr,endbr64)
 
+config X86_CET
+	def_bool n
+	help
+	  CET features configured (Shadow stack or IBT)
+
 config X86_KERNEL_IBT
 	prompt "Indirect Branch Tracking"
 	def_bool y
@@ -1859,6 +1864,7 @@ config X86_KERNEL_IBT
 	# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
 	depends on !LD_IS_LLD || LLD_VERSION >= 140000
 	select OBJTOOL
+	select X86_CET
 	help
 	  Build the kernel with support for Indirect Branch Tracking, a
 	  hardware support course-grain forward-edge Control Flow Integrity
@@ -1952,6 +1958,24 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config X86_USER_SHADOW_STACK
+	bool "X86 userspace shadow stack"
+	depends on AS_WRUSS
+	depends on X86_64
+	select ARCH_USES_HIGH_VMA_FLAGS
+	select X86_CET
+	help
+	  Shadow stack protection is a hardware feature that detects function
+	  return address corruption.  This helps mitigate ROP attacks.
+	  Applications must be enabled to use it, and old userspace does not
+	  get protection "for free".
+
+	  CPUs supporting shadow stacks were first released in 2020.
+
+	  See Documentation/x86/shstk.rst for more information.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index b88f784cb02e..8ad41da301e5 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -24,3 +24,8 @@ config AS_GFNI
 	def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2)
 	help
 	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
+config AS_WRUSS
+	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
+	help
+	  Supported by binutils >= 2.31 and LLVM integrated assembler
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 07/42] x86/traps: Move control protection handler to separate file
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (5 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 06/42] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 08/42] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
                   ` (36 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

Today the control protection handler is defined in traps.c and used only
for the kernel IBT feature. To reduce ifdeffery, move it to its own file.
In future patches, functionality will be added to make this handler also
handle user shadow stack faults. So name the file cet.c.

No functional change.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/Makefile |  2 ++
 arch/x86/kernel/cet.c    | 76 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c  | 75 ---------------------------------------
 3 files changed, 78 insertions(+), 75 deletions(-)
 create mode 100644 arch/x86/kernel/cet.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 4070a01c11b7..abee0564b750 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -145,6 +145,8 @@ obj-$(CONFIG_CFI_CLANG)			+= cfi.o
 
 obj-$(CONFIG_CALL_THUNKS)		+= callthunks.o
 
+obj-$(CONFIG_X86_CET)			+= cet.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..7ad22b705b64
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/ptrace.h>
+#include <asm/bugs.h>
+#include <asm/traps.h>
+
+static __ro_after_init bool ibt_fatal = true;
+
+extern void ibt_selftest_ip(void); /* code label defined in asm below */
+
+enum cp_error_code {
+	CP_EC        = (1 << 15) - 1,
+
+	CP_RET       = 1,
+	CP_IRET      = 2,
+	CP_ENDBR     = 3,
+	CP_RSTRORSSP = 4,
+	CP_SETSSBSY  = 5,
+
+	CP_ENCL	     = 1 << 15,
+};
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
+		pr_err("Unexpected #CP\n");
+		BUG();
+	}
+
+	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+		return;
+
+	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
+		regs->ax = 0;
+		return;
+	}
+
+	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
+	if (!ibt_fatal) {
+		printk(KERN_DEFAULT CUT_HERE);
+		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+		return;
+	}
+	BUG();
+}
+
+/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
+noinline bool ibt_selftest(void)
+{
+	unsigned long ret;
+
+	asm ("	lea ibt_selftest_ip(%%rip), %%rax\n\t"
+	     ANNOTATE_RETPOLINE_SAFE
+	     "	jmp *%%rax\n\t"
+	     "ibt_selftest_ip:\n\t"
+	     UNWIND_HINT_FUNC
+	     ANNOTATE_NOENDBR
+	     "	nop\n\t"
+
+	     : "=a" (ret) : : "memory");
+
+	return !ret;
+}
+
+static int __init ibt_setup(char *str)
+{
+	if (!strcmp(str, "off"))
+		setup_clear_cpu_cap(X86_FEATURE_IBT);
+
+	if (!strcmp(str, "warn"))
+		ibt_fatal = false;
+
+	return 1;
+}
+
+__setup("ibt=", ibt_setup);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 58b1f208eff5..6f666dfa97de 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -213,81 +213,6 @@ DEFINE_IDTENTRY(exc_overflow)
 	do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
 }
 
-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
-enum cp_error_code {
-	CP_EC        = (1 << 15) - 1,
-
-	CP_RET       = 1,
-	CP_IRET      = 2,
-	CP_ENDBR     = 3,
-	CP_RSTRORSSP = 4,
-	CP_SETSSBSY  = 5,
-
-	CP_ENCL	     = 1 << 15,
-};
-
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
-{
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
-	}
-
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
-		return;
-
-	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
-		regs->ax = 0;
-		return;
-	}
-
-	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
-	if (!ibt_fatal) {
-		printk(KERN_DEFAULT CUT_HERE);
-		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
-		return;
-	}
-	BUG();
-}
-
-/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
-noinline bool ibt_selftest(void)
-{
-	unsigned long ret;
-
-	asm ("	lea ibt_selftest_ip(%%rip), %%rax\n\t"
-	     ANNOTATE_RETPOLINE_SAFE
-	     "	jmp *%%rax\n\t"
-	     "ibt_selftest_ip:\n\t"
-	     UNWIND_HINT_FUNC
-	     ANNOTATE_NOENDBR
-	     "	nop\n\t"
-
-	     : "=a" (ret) : : "memory");
-
-	return !ret;
-}
-
-static int __init ibt_setup(char *str)
-{
-	if (!strcmp(str, "off"))
-		setup_clear_cpu_cap(X86_FEATURE_IBT);
-
-	if (!strcmp(str, "warn"))
-		ibt_fatal = false;
-
-	return 1;
-}
-
-__setup("ibt=", ibt_setup);
-
-#endif /* CONFIG_X86_KERNEL_IBT */
-
 #ifdef CONFIG_X86_F00F_BUG
 void handle_invalid_op(struct pt_regs *regs)
 #else
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 08/42] x86/cpufeatures: Add CPU feature flags for shadow stacks
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (6 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 07/42] x86/traps: Move control protection handler to separate file Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 09/42] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
                   ` (35 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

The Control-Flow Enforcement Technology contains two related features,
one of which is Shadow Stacks. Future patches will utilize this feature
for shadow stack support in KVM, so add CPU feature flags for Shadow
Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).

To protect shadow stack state from malicious modification, the registers
are only accessible in supervisor mode. This implementation
context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
on XSAVES.

The shadow stack feature, enumerated by the CPUID bit described above,
encompasses both supervisor and userspace support for shadow stack. In
near future patches, only userspace shadow stack will be enabled. In
expectation of future supervisor shadow stack support, create a software
CPU capability to enumerate kernel utilization of userspace shadow stack
support. This user shadow stack bit should depend on the HW "shstk"
capability and that logic will be implemented in future patches.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/cpufeatures.h       | 2 ++
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 arch/x86/kernel/cpu/cpuid-deps.c         | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cb8ca46213be..d7215c8b7923 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -308,6 +308,7 @@
 #define X86_FEATURE_MSR_TSX_CTRL	(11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
 #define X86_FEATURE_SMBA		(11*32+21) /* "" Slow Memory Bandwidth Allocation */
 #define X86_FEATURE_BMEC		(11*32+22) /* "" Bandwidth Monitoring Event Configuration */
+#define X86_FEATURE_USER_SHSTK		(11*32+23) /* Shadow stack support for user mode applications */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
@@ -380,6 +381,7 @@
 #define X86_FEATURE_OSPKE		(16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_WAITPKG		(16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
 #define X86_FEATURE_AVX512_VBMI2	(16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK		(16*32+ 7) /* "" Shadow stack */
 #define X86_FEATURE_GFNI		(16*32+ 8) /* Galois Field New Instructions */
 #define X86_FEATURE_VAES		(16*32+ 9) /* Vector AES */
 #define X86_FEATURE_VPCLMULQDQ		(16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index fafe9be7a6f4..b9c7eae2e70f 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
 # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
 #endif
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define DISABLE_USER_SHSTK	0
+#else
+#define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -120,7 +126,7 @@
 #define DISABLED_MASK9	(DISABLE_SGX)
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
-			 DISABLE_CALL_DEPTH_TRACKING)
+			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 #define DISABLED_MASK12	(DISABLE_LAM)
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index f6748c8bd647..e462c1d3800a 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -81,6 +81,7 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVES    },
 	{ X86_FEATURE_XFD,			X86_FEATURE_XGETBV1   },
 	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XFD       },
+	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
 	{}
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 09/42] x86/mm: Move pmd_write(), pud_write() up in the file
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (7 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 08/42] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY Rick Edgecombe
                   ` (34 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

To prepare the introduction of _PAGE_SAVED_DIRTY, move pmd_write() and
pud_write() up in the file, so that they can be used by other
helpers below.  No functional changes.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 112e6060eafa..768ee46782c9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -160,6 +160,18 @@ static inline int pte_write(pte_t pte)
 	return pte_flags(pte) & _PAGE_RW;
 }
 
+#define pmd_write pmd_write
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define pud_write pud_write
+static inline int pud_write(pud_t pud)
+{
+	return pud_flags(pud) & _PAGE_RW;
+}
+
 static inline int pte_huge(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_PSE;
@@ -1120,12 +1132,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
 
-#define pmd_write pmd_write
-static inline int pmd_write(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_RW;
-}
-
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long addr,
 				       pmd_t *pmdp)
@@ -1155,12 +1161,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-#define pud_write pud_write
-static inline int pud_write(pud_t pud)
-{
-	return pud_flags(pud) & _PAGE_RW;
-}
-
 #ifndef pmdp_establish
 #define pmdp_establish pmdp_establish
 static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (8 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 09/42] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13 16:01   ` Edgecombe, Rick P
  2023-06-13 17:58   ` Linus Torvalds
  2023-06-13  0:10 ` [PATCH v9 11/42] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY Rick Edgecombe
                   ` (33 subsequent siblings)
  43 siblings, 2 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

Some OSes have a greater dependence on software available bits in PTEs than
Linux. That left the hardware architects looking for a way to represent a
new memory type (shadow stack) within the existing bits. They chose to
repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
shadow stack memory, Linux should avoid creating memory with this PTE bit
combination unless it intends for it to be shadow stack.

The reason it's lightly used is that Dirty=1 is normally set by HW
_before_ a write. A write with a Write=0 PTE would typically only generate
a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
supports shadow stacks will no longer exhibit this oddity.

So that leaves Write=0,Dirty=1 PTEs created in software. To avoid
inadvertently creating shadow stack memory, in places where Linux normally
creates Write=0,Dirty=1, it can use the software-defined _PAGE_SAVED_DIRTY
in place of the hardware _PAGE_DIRTY. In other words, whenever Linux needs
to create Write=0,Dirty=1, it instead creates Write=0,SavedDirty=1 except
for shadow stack, which is Write=0,Dirty=1.

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_SAVED_DIRTY. For 32 bit, the same bit as
_PAGE_BIT_UFFD_WP is used, since userfaultfd is not supported on 32
bit. This leaves one unused software bit on 32 bit (_PAGE_BIT_SOFT_DIRTY,
as this is also not supported on 32 bit).

Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to
actually begin creating _PAGE_SAVED_DIRTY PTEs will follow once other
pieces are in place.

Since this SavedDirty shifting is done for all x86 CPUs, this leaves
the possibility for the hardware oddity to still create Write=0,Dirty=1
PTEs in rare cases. Since these CPUs also don't support shadow stack, this
will be harmless as it was before the introduction of SavedDirty.

Implement the shifting logic to be branchless. Embed the logic of whether
to do the shifting (including checking the Write bits) so that it can be
called by future callers that would otherwise need additional branching
logic. This efficiency allows the logic of when to do the shifting to be
centralized, making the code easier to reason about.
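The branchless shifting described above can be illustrated with a small
userspace model. This is a hypothetical sketch, not kernel code: the TOY_*
macros and function names are invented here, with the bit positions
(RW=1, Dirty=6, SavedDirty=58) taken from the patch below.

```c
#include <stdint.h>

/* Toy userspace model of the SavedDirty shifting; positions mirror
 * _PAGE_BIT_RW = 1, _PAGE_BIT_DIRTY = 6, _PAGE_BIT_SAVED_DIRTY = 58. */
#define TOY_RW          (1ULL << 1)
#define TOY_DIRTY       (1ULL << 6)
#define TOY_SAVED_DIRTY (1ULL << 58)

/* Dirty -> SavedDirty, only when Write=0 (mirrors mksaveddirty_shift()) */
static uint64_t toy_mksaveddirty(uint64_t v)
{
	uint64_t cond = !(v & TOY_RW);	/* 1 iff Write=0 */

	v |= ((v >> 6) & cond) << 58;	/* copy Dirty into SavedDirty */
	v &= ~(cond << 6);		/* and clear Dirty */
	return v;
}

/* SavedDirty -> Dirty, only when Write=1 (mirrors clear_saveddirty_shift()) */
static uint64_t toy_clear_saveddirty(uint64_t v)
{
	uint64_t cond = !!(v & TOY_RW);	/* 1 iff Write=1 */

	v |= ((v >> 58) & cond) << 6;	/* copy SavedDirty into Dirty */
	v &= ~(cond << 58);		/* and clear SavedDirty */
	return v;
}
```

The condition is folded into the bit math: `cond` is 0 or 1, so the OR and
AND-NOT terms become no-ops when no shift is needed, with no branch taken.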

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
v9:
 - Use bit shifting instead of conditionals (Linus)
 - Make saved dirty bit unconditional (Linus)
 - Add 32 bit support to make it extra unconditional
 - Don't re-order PAGE flags (Dave)
---
 arch/x86/include/asm/pgtable.h       | 83 ++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_types.h | 38 +++++++++++--
 arch/x86/include/asm/tlbflush.h      |  3 +-
 3 files changed, 119 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 768ee46782c9..a95f872c7429 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -301,6 +301,53 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+/*
+ * Write protection operations can result in Dirty=1,Write=0 PTEs. But in the
+ * case of X86_FEATURE_USER_SHSTK, these PTEs denote shadow stack memory. So
+ * when creating dirty, write-protected memory, a software bit is used:
+ * _PAGE_BIT_SAVED_DIRTY. The following functions take a PTE and transition the
+ * Dirty bit to SavedDirty, and vice versa.
+ *
+ * This shifting is only done if needed. In the case of shifting
+ * Dirty->SavedDirty, the condition is if the PTE is Write=0. In the case of
+ * shifting SavedDirty->Dirty, the condition is Write=1.
+ */
+static inline unsigned long mksaveddirty_shift(unsigned long v)
+{
+	unsigned long cond = !(v & (1 << _PAGE_BIT_RW));
+
+	v |= ((v >> _PAGE_BIT_DIRTY) & cond) << _PAGE_BIT_SAVED_DIRTY;
+	v &= ~(cond << _PAGE_BIT_DIRTY);
+
+	return v;
+}
+
+static inline unsigned long clear_saveddirty_shift(unsigned long v)
+{
+	unsigned long cond = !!(v & (1 << _PAGE_BIT_RW));
+
+	v |= ((v >> _PAGE_BIT_SAVED_DIRTY) & cond) << _PAGE_BIT_DIRTY;
+	v &= ~(cond << _PAGE_BIT_SAVED_DIRTY);
+
+	return v;
+}
+
+static inline pte_t pte_mksaveddirty(pte_t pte)
+{
+	pteval_t v = native_pte_val(pte);
+
+	v = mksaveddirty_shift(v);
+	return native_make_pte(v);
+}
+
+static inline pte_t pte_clear_saveddirty(pte_t pte)
+{
+	pteval_t v = native_pte_val(pte);
+
+	v = clear_saveddirty_shift(v);
+	return native_make_pte(v);
+}
+
 static inline pte_t pte_wrprotect(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_RW);
@@ -413,6 +460,24 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+/* See comments above mksaveddirty_shift() */
+static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	v = mksaveddirty_shift(v);
+	return native_make_pmd(v);
+}
+
+/* See comments above mksaveddirty_shift() */
+static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	v = clear_saveddirty_shift(v);
+	return native_make_pmd(v);
+}
+
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_RW);
@@ -484,6 +549,24 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
 	return native_make_pud(v & ~clear);
 }
 
+/* See comments above mksaveddirty_shift() */
+static inline pud_t pud_mksaveddirty(pud_t pud)
+{
+	pudval_t v = native_pud_val(pud);
+
+	v = mksaveddirty_shift(v);
+	return native_make_pud(v);
+}
+
+/* See comments above mksaveddirty_shift() */
+static inline pud_t pud_clear_saveddirty(pud_t pud)
+{
+	pudval_t v = native_pud_val(pud);
+
+	v = clear_saveddirty_shift(v);
+	return native_make_pud(v);
+}
+
 static inline pud_t pud_mkold(pud_t pud)
 {
 	return pud_clear_flags(pud, _PAGE_ACCESSED);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 447d4bee25c4..ee6f8e57e115 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -34,6 +35,13 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+#ifdef CONFIG_X86_64
+#define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW5 /* Saved Dirty bit */
+#else
+/* Shared with _PAGE_BIT_UFFD_WP which is not supported on 32 bit */
+#define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW2 /* Saved Dirty bit */
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -117,6 +125,22 @@
 #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The hardware requires shadow stack to be Write=0,Dirty=1. However,
+ * there are valid cases where the kernel might create read-only PTEs that
+ * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty tracking). In
+ * this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-dirty bit,
+ * to avoid wrongly creating "shadow stack" PTEs. Such PTEs have
+ * (Write=0,SavedDirty=1,Dirty=0) set.
+ */
+#ifdef CONFIG_X86_64
+#define _PAGE_SAVED_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_SAVED_DIRTY)
+#else
+#define _PAGE_SAVED_DIRTY	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_SAVED_DIRTY)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*
@@ -125,9 +149,9 @@
  * instance, and is *not* included in this mask since
  * pte_modify() does modify it.
  */
-#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
-			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
-			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC |  \
+#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		     \
+			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_BITS | \
+			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC |	     \
 			 _PAGE_UFFD_WP)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
@@ -188,10 +212,16 @@ enum page_cache_mode {
 
 #define __PAGE_KERNEL		 (__PP|__RW|   0|___A|__NX|___D|   0|___G)
 #define __PAGE_KERNEL_EXEC	 (__PP|__RW|   0|___A|   0|___D|   0|___G)
+
+/*
+ * Page tables need to have Write=1 in order for any lower PTEs to be
+ * writable. This includes shadow stack memory (Write=0, Dirty=1).
+ */
 #define _KERNPG_TABLE_NOENC	 (__PP|__RW|   0|___A|   0|___D|   0|   0)
 #define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
+
 #define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
 #define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 75bfaa421030..965659d2c965 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -293,7 +293,8 @@ static inline bool pte_flags_need_flush(unsigned long oldflags,
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
 					_PAGE_ACCESSED;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
-					_PAGE_SOFTW3 | _PAGE_SOFTW4;
+					_PAGE_SOFTW3 | _PAGE_SOFTW4 |
+					_PAGE_SAVED_DIRTY;
 	const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
 			  _PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
 			  _PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 11/42] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (9 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13 18:01   ` Linus Torvalds
  2023-06-13  0:10 ` [PATCH v9 12/42] x86/mm: Start actually marking _PAGE_SAVED_DIRTY Rick Edgecombe
                   ` (32 subsequent siblings)
  43 siblings, 1 reply; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

When shadow stack is in use, Write=0,Dirty=1 PTEs are preserved for
shadow stack. Copy-on-write PTEs then have Write=0,SavedDirty=1.

When a PTE goes from Write=1,Dirty=1 to Write=0,SavedDirty=1, it could
become a transient shadow stack PTE in two cases:

1. Some processors can start a write but end up seeing a Write=0 PTE by
   the time they get to the Dirty bit, creating a transient shadow stack
   PTE. However, this will not occur on processors supporting shadow
   stack, and a TLB flush is not necessary.

2. When _PAGE_DIRTY is replaced with _PAGE_SAVED_DIRTY non-atomically, a
   transient shadow stack PTE can be created as a result.

Prevent the second case when doing a write protection and Dirty->SavedDirty
shift at the same time with a CMPXCHG loop. The first case needs no such
handling because, as noted above, it cannot occur on processors that
support shadow stack.

Note, in the PAE case CMPXCHG will need to operate on 8 bytes, but
try_cmpxchg() will not use CMPXCHG8B, so it cannot operate on a full PAE
PTE. However, the existing logic is not operating on a full 8 byte region
either, and relies on the fact that the Write bit is in the first 4
bytes when doing the clear_bit(). Since the Dirty, SavedDirty, and Write
bits are all in the first 4 bytes, casting to a long will be similar to
the existing behavior, which also casts to a long.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue. Jann Horn provided the CMPXCHG solution.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
v9:
 - Use bit shifting helpers that don't need any extra conditional
   logic. (Linus)
 - Always do the SavedDirty shifting (Linus)
---
 arch/x86/include/asm/pgtable.h | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a95f872c7429..99b54ab0a919 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1189,7 +1189,17 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
-	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
+	/*
+	 * Avoid accidentally creating shadow stack PTEs
+	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
+	 * the hardware setting Dirty=1.
+	 */
+	pte_t old_pte, new_pte;
+
+	old_pte = READ_ONCE(*ptep);
+	do {
+		new_pte = pte_wrprotect(old_pte);
+	} while (!try_cmpxchg((long *)&ptep->pte, (long *)&old_pte, *(long *)&new_pte));
 }
 
 #define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
@@ -1241,7 +1251,17 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
-	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
+	/*
+	 * Avoid accidentally creating shadow stack PTEs
+	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
+	 * the hardware setting Dirty=1.
+	 */
+	pmd_t old_pmd, new_pmd;
+
+	old_pmd = READ_ONCE(*pmdp);
+	do {
+		new_pmd = pmd_wrprotect(old_pmd);
+	} while (!try_cmpxchg((long *)pmdp, (long *)&old_pmd, *(long *)&new_pmd));
 }
 
 #ifndef pmdp_establish
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 12/42] x86/mm: Start actually marking _PAGE_SAVED_DIRTY
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (10 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 11/42] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 13/42] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
                   ` (31 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

The recently introduced _PAGE_SAVED_DIRTY should be used instead of the
HW Dirty bit whenever a PTE is Write=0, in order to not inadvertently
create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply
the same changes to pmd and pud. Since there is no x86 version of
pte_mkwrite() to hold this arch-specific logic, create one. Add it to
arch/x86/mm/pgtable.c instead of arch/x86/include/asm/pgtable.h, as future patches
will require it to live in pgtable.c and it will make the diff easier
for reviewers.

Since CPUs without shadow stack support could create Write=0,Dirty=1
PTEs, only return true for pte_shstk() if the CPU also supports shadow
stack. This prevents such HW-created PTEs from showing up as writable
via pte_write().
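The pte_shstk()/pte_write() interaction can be sketched with toy
predicates. This is a hypothetical userspace model, not the kernel code:
the TOY_* names and the cpu_has_shstk parameter (standing in for
cpu_feature_enabled(X86_FEATURE_SHSTK)) are invented for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_RW    (1ULL << 1)
#define TOY_DIRTY (1ULL << 6)

/* The Write=0,Dirty=1 encoding only counts as shadow stack when the CPU
 * actually supports shadow stack (mirrors pte_shstk()). */
static bool toy_pte_shstk(uint64_t pte, bool cpu_has_shstk)
{
	return cpu_has_shstk &&
	       (pte & (TOY_RW | TOY_DIRTY)) == TOY_DIRTY;
}

/* Shadow stack pages are logically writable despite _PAGE_RW being clear
 * (mirrors pte_write()). */
static bool toy_pte_write(uint64_t pte, bool cpu_has_shstk)
{
	return (pte & TOY_RW) || toy_pte_shstk(pte, cpu_has_shstk);
}
```

On a CPU without shadow stack, the same Write=0,Dirty=1 value is just an
oddity and is reported as not writable.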

For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which
was not necessarily created with any of the existing PTE bit helpers.
That means that it can return a pte_t with Write=0,Dirty=1, a shadow
stack PTE, when it did not intend to create one.

Modify it to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To avoid
creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid:
1. Marking Write=0 PTEs Dirty=1
2. Marking Dirty=1 PTEs Write=0

The first case cannot happen as the existing behavior of pte_modify() is to
filter out any Dirty bit passed in newprot. Handle the second case by
shifting _PAGE_DIRTY=1 to _PAGE_SAVED_DIRTY=1 if the PTE was write
protected by the pte_modify() call. Apply the same changes to pmd_modify().
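The pte_modify() tail can be sketched as follows. This is a hedged toy
model: toy_modify_tail() and the TOY_* bits are invented names, and the
shifts are written with explicit branches for readability, whereas the
patch uses the branchless pte_mksaveddirty()/pte_clear_saveddirty()
helpers keyed off the old Write bit.

```c
#include <stdint.h>

#define TOY_RW          (1ULL << 1)
#define TOY_DIRTY       (1ULL << 6)
#define TOY_SAVED_DIRTY (1ULL << 58)

/* Which shift to apply depends on the *old* Write bit: a freshly
 * write-protected Dirty PTE becomes SavedDirty instead of a transient
 * shadow stack encoding, while a pre-existing shadow stack PTE
 * (Write=0,Dirty=1) is left alone. */
static uint64_t toy_modify_tail(uint64_t oldval, uint64_t v)
{
	if (oldval & TOY_RW) {		/* mksaveddirty path */
		if (!(v & TOY_RW) && (v & TOY_DIRTY)) {
			v &= ~TOY_DIRTY;
			v |= TOY_SAVED_DIRTY;
		}
	} else {			/* clear_saveddirty path */
		if ((v & TOY_RW) && (v & TOY_SAVED_DIRTY)) {
			v &= ~TOY_SAVED_DIRTY;
			v |= TOY_DIRTY;
		}
	}
	return v;
}
```

Case 1 (marking Write=0 PTEs Dirty=1) never reaches this point because
_PAGE_CHG_MASK already filters the Dirty bit out of newprot.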

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
v9:
 - Use bit shifting helpers that don't need any extra conditional
   logic. (Linus)
 - Handle the harmless Write=0->Write=1 pte_modify() case with the
   shifting helpers.
 - Don't ever return true for pte_shstk() if the CPU does not support
   shadow stack.
---
 arch/x86/include/asm/pgtable.h | 151 ++++++++++++++++++++++++++++-----
 arch/x86/mm/pgtable.c          |  14 +++
 2 files changed, 144 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 99b54ab0a919..d8724f5b1202 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -124,9 +124,15 @@ extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+	return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+	       (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pte_young(pte_t pte)
@@ -134,9 +140,16 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
+{
+	return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+	       (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
+	       (_PAGE_DIRTY | _PAGE_PSE);
 }
 
 #define pmd_young pmd_young
@@ -145,9 +158,9 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -157,13 +170,21 @@ static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
 }
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
 }
 
 #define pud_write pud_write
@@ -350,7 +371,14 @@ static inline pte_t pte_clear_saveddirty(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_RW);
+	pte = pte_clear_flags(pte, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
+	 * dirty value to the software bit, if present.
+	 */
+	return pte_mksaveddirty(pte);
 }
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
@@ -388,7 +416,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -403,7 +431,16 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pte = pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+
+	return pte_mksaveddirty(pte);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_RW);
+
+	return pte_set_flags(pte, _PAGE_DIRTY);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -416,6 +453,10 @@ static inline pte_t pte_mkwrite_novma(pte_t pte)
 	return pte_set_flags(pte, _PAGE_RW);
 }
 
+struct vm_area_struct;
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);
+#define pte_mkwrite pte_mkwrite
+
 static inline pte_t pte_mkhuge(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_PSE);
@@ -480,7 +521,14 @@ static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_RW);
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	return pmd_mksaveddirty(pmd);
 }
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
@@ -507,12 +555,21 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+
+	return pmd_mksaveddirty(pmd);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -535,6 +592,9 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
 
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+#define pmd_mkwrite pmd_mkwrite
+
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
 {
 	pudval_t v = native_pud_val(pud);
@@ -574,17 +634,26 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_RW);
+	pud = pud_clear_flags(pud, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	return pud_mksaveddirty(pud);
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pud = pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+
+	return pud_mksaveddirty(pud);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -604,7 +673,9 @@ static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_RW);
+	pud = pud_set_flags(pud, _PAGE_RW);
+
+	return pud_clear_saveddirty(pud);
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
@@ -721,6 +792,7 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	pteval_t val = pte_val(pte), oldval = val;
+	pte_t pte_result;
 
 	/*
 	 * Chop off the NX bit (if present), and add the NX portion of
@@ -729,17 +801,54 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	val &= _PAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
 	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
-	return __pte(val);
+
+	pte_result = __pte(val);
+
+	/*
+	 * To avoid creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid:
+	 *  1. Marking Write=0 PTEs Dirty=1
+	 *  2. Marking Dirty=1 PTEs Write=0
+	 *
+	 * The first case cannot happen because the _PAGE_CHG_MASK will filter
+	 * out any Dirty bit passed in newprot. Handle the second case by
+	 * going through the mksaveddirty exercise. Only do this if the old
+	 * value was Write=1 to avoid doing this on Shadow Stack PTEs.
+	 */
+	if (oldval & _PAGE_RW)
+		pte_result = pte_mksaveddirty(pte_result);
+	else
+		pte_result = pte_clear_saveddirty(pte_result);
+
+	return pte_result;
 }
 
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
 	pmdval_t val = pmd_val(pmd), oldval = val;
+	pmd_t pmd_result;
 
-	val &= _HPAGE_CHG_MASK;
+	val &= (_HPAGE_CHG_MASK & ~_PAGE_DIRTY);
 	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
 	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
-	return __pmd(val);
+
+	pmd_result = __pmd(val);
+
+	/*
+	 * To avoid creating Write=0,Dirty=1 PMDs, pmd_modify() needs to avoid:
+	 *  1. Marking Write=0 PMDs Dirty=1
+	 *  2. Marking Dirty=1 PMDs Write=0
+	 *
+	 * The first case cannot happen because the (_HPAGE_CHG_MASK &
+	 * ~_PAGE_DIRTY) masking filters out any Dirty bit passed in newprot.
+	 * Handle the second case by going through the mksaveddirty exercise.
+	 * Only do this if the old value was Write=1 to avoid doing this on
+	 * Shadow Stack PMDs.
+	 */
+	if (oldval & _PAGE_RW)
+		pmd_result = pmd_mksaveddirty(pmd_result);
+	else
+		pmd_result = pmd_clear_saveddirty(pmd_result);
+
+	return pmd_result;
 }
 
 /*
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e4f499eb0f29..0ad2c62ac0a8 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -880,3 +880,17 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #endif /* CONFIG_X86_64 */
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	pte = pte_mkwrite_novma(pte);
+
+	return pte_clear_saveddirty(pte);
+}
+
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	pmd = pmd_mkwrite_novma(pmd);
+
+	return pmd_clear_saveddirty(pmd);
+}
-- 
2.34.1



* [PATCH v9 13/42] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (11 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 12/42] x86/mm: Start actually marking _PAGE_SAVED_DIRTY Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 14/42] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
                   ` (30 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages.

In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1,
because if the Dirty bit is not already set, the CPU has to set it when
the memory is first written to, which creates additional work for the
CPU. So when HW dirty tracking is not needed, the traditional wisdom was
to simply set the Dirty bit up front. However, this was never really
helpful for read-only kernel memory.

When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
supervisor Write=0,Dirty=1 (shadow stack) memory. The kernel does not set
IA32_S_CET.SH_STK_EN, so avoiding kernel Write=0,Dirty=1 memory is not
strictly needed for any
functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
any functional benefit either, so to reduce ambiguity between shadow stack
and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/pgtable_types.h | 8 +++++---
 arch/x86/mm/pat/set_memory.c         | 4 ++--
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ee6f8e57e115..26f07d6d5758 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -222,10 +222,12 @@ enum page_cache_mode {
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
 
-#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
+#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
+#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
+#define __PAGE_KERNEL		 (__PP|__RW|   0|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_EXEC	 (__PP|__RW|   0|___A|   0|___D|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
-#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_LARGE	 (__PP|__RW|   0|___A|__NX|___D|_PSE|___G)
 #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
 #define __PAGE_KERNEL_WP	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 7159cf787613..fc627acfe40e 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2073,12 +2073,12 @@ int set_memory_nx(unsigned long addr, int numpages)
 
 int set_memory_ro(unsigned long addr, int numpages)
 {
-	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
 }
 
 int set_memory_rox(unsigned long addr, int numpages)
 {
-	pgprot_t clr = __pgprot(_PAGE_RW);
+	pgprot_t clr = __pgprot(_PAGE_RW | _PAGE_DIRTY);
 
 	if (__supported_pte_mask & _PAGE_NX)
 		clr.pgprot |= _PAGE_NX;
-- 
2.34.1



* [PATCH v9 14/42] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (12 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 13/42] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-14  8:50   ` David Hildenbrand
  2023-06-14 23:31   ` Mark Brown
  2023-06-13  0:10 ` [PATCH v9 15/42] x86/mm: Check shadow stack page fault errors Rick Edgecombe
                   ` (29 subsequent siblings)
  43 siblings, 2 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

New hardware extensions implement support for shadow stack memory, such
as x86 Control-flow Enforcement Technology (CET). Add a new VM flag to
identify these areas, for example, to be used to properly indicate shadow
stack PTEs to the hardware.

Shadow stack VMA creation will be tightly controlled and limited to
anonymous memory, both to make the implementation simpler and because
that is all that is required. The solution will rely on pte_mkwrite() to
create the shadow stack PTEs, so vm_get_page_prot() will not need to
learn how to create shadow stack memory. For this reason, document that
VM_SHADOW_STACK should not be mixed with VM_SHARED.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 Documentation/filesystems/proc.rst | 1 +
 fs/proc/task_mmu.c                 | 3 +++
 include/linux/mm.h                 | 8 ++++++++
 3 files changed, 12 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 7897a7dafcbc..6ccb57089a06 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -566,6 +566,7 @@ encoded manner. The codes are the following:
     mt    arm64 MTE allocation tags are enabled
     um    userfaultfd missing tracking
     uw    userfaultfd wr-protect tracking
+    ss    shadow stack page
     ==    =======================================
 
 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 420510f6a545..38b19a757281 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+		[ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f52c1e7c640..fb17cbd531ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -319,11 +319,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -339,6 +341,12 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+# define VM_SHADOW_STACK	VM_HIGH_ARCH_5 /* Should not be set with VM_SHARED */
+#else
+# define VM_SHADOW_STACK	VM_NONE
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
-- 
2.34.1



* [PATCH v9 15/42] x86/mm: Check shadow stack page fault errors
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (13 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 14/42] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 16/42] mm: Add guard pages around a shadow stack Rick Edgecombe
                   ` (28 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

The CPU performs "shadow stack accesses" when it expects to encounter
shadow stack mappings. These accesses can be implicit (via CALL/RET
instructions) or explicit (instructions like WRSS).

Shadow stack accesses to shadow-stack mappings can result in faults in
normal, valid operation just like regular accesses to regular mappings.
Shadow stacks need some of the same features, such as delayed allocation,
swap and copy-on-write, and the kernel needs to use faults to implement
those features.

The architecture has concepts of both shadow stack reads and shadow stack
writes. Any shadow stack access to non-shadow stack memory will generate
a fault with the shadow stack error code bit set.

This means that, unlike normal write protection, the fault handler needs
to create a type of memory that can be written to (with instructions that
generate shadow stack writes), even to fulfill a read access. So in the
case of COW memory, the COW needs to take place even with a shadow stack
read. Otherwise the page will be left (shadow stack) writable in
userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
for shadow stack accesses, even if the access was a shadow stack read.

For the purpose of making this clearer, consider the following example.
If a process has a shadow stack, and forks, the shadow stack PTEs will
become read-only due to COW. If the CPU in one process performs a shadow
stack read access to the shadow stack, for example executing a RET and
causing the CPU to read the shadow stack copy of the return address, then
in order for the fault to be resolved the PTE will need to be set with
shadow stack permissions. But then the memory would be changeable from
userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
COW, otherwise the shared page would be changeable from both processes.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or when a shadow stack access occurs to a non-shadow-stack
mapping. Generate errors for these invalid shadow stack accesses as well.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/trap_pf.h |  2 ++
 arch/x86/mm/fault.c            | 22 ++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..afa524325e55 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -11,6 +11,7 @@
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
  *   bit 5 ==				1: protection keys block access
+ *   bit 6 ==				1: shadow stack access fault
  *   bit 15 ==				1: SGX MMU page-fault
  */
 enum x86_pf_error_code {
@@ -20,6 +21,7 @@ enum x86_pf_error_code {
 	X86_PF_RSVD	=		1 << 3,
 	X86_PF_INSTR	=		1 << 4,
 	X86_PF_PK	=		1 << 5,
+	X86_PF_SHSTK	=		1 << 6,
 	X86_PF_SGX	=		1 << 15,
 };
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e4399983c50c..fe68119ce2cc 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1118,8 +1118,22 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
+	/*
+	 * Shadow stack accesses (PF_SHSTK=1) are only permitted to
+	 * shadow stack VMAs. All other accesses result in an error.
+	 */
+	if (error_code & X86_PF_SHSTK) {
+		if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
+			return 1;
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return 1;
+		return 0;
+	}
+
 	if (error_code & X86_PF_WRITE) {
 		/* write, present and write, not present: */
+		if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
+			return 1;
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
 			return 1;
 		return 0;
@@ -1311,6 +1325,14 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
+	/*
+	 * Read-only permissions can not be expressed in shadow stack PTEs.
+	 * Treat all shadow stack accesses as WRITE faults. This ensures
+	 * that the MM will prepare everything (e.g., break COW) such that
+	 * maybe_mkwrite() can create a proper shadow stack PTE.
+	 */
+	if (error_code & X86_PF_SHSTK)
+		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_INSTR)
-- 
2.34.1



* [PATCH v9 16/42] mm: Add guard pages around a shadow stack.
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (14 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 15/42] x86/mm: Check shadow stack page fault errors Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-14 23:34   ` Mark Brown
  2023-06-22 18:21   ` Matthew Wilcox
  2023-06-13  0:10 ` [PATCH v9 17/42] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
                   ` (27 subsequent siblings)
  43 siblings, 2 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

The architecture of shadow stack constrains the ability of userspace to
move the shadow stack pointer (SSP) in order to prevent corrupting or
switching to other shadow stacks. The RSTORSSP instruction can move the
SSP to different shadow stacks, but it requires a specially placed token
in order to do this. However, the architecture does not prevent
incrementing the stack pointer to wander onto an adjacent shadow stack. To
prevent this in software, enforce guard pages at the beginning of shadow
stack VMAs, such that there will always be a gap between adjacent shadow
stacks.

Make the gap big enough so that no userspace SSP-changing operation
(besides RSTORSSP) can move the SSP from one stack to the next. The
SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and
RET can move the SSP by a maximum of 8 bytes, at which point the shadow
stack would be accessed.

The INCSSP instruction can also increment the shadow stack pointer. It
is the shadow stack analog of an instruction like:

        addq    $0x80, %rsp

However, there is one important difference between an ADD on %rsp and
INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
of the first and last elements that were "popped". It can be thought of
as acting like this:

READ_ONCE(ssp);       // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8);     // read+discard last popped stack element

The maximum distance INCSSP can move the SSP is 2040 bytes, before it
would read the memory. Therefore, a single page gap will be enough to
prevent any operation from shifting the SSP to an adjacent stack, since
it would have to land in the gap at least once, causing a fault.

This could be accomplished by using VM_GROWSDOWN, but it has a downside:
it would allow shadow stacks to grow, which is unneeded and would create
a strange difference from how most regular stacks work.

In the maple tree code, there is some logic for retrying the unmapped
area search if a guard gap is violated. This retry should happen for
shadow stack guard gap violations as well. The logic currently only
checks VM_GROWSDOWN for start gaps. Since shadow stacks also have a
start gap, create a new define, VM_STARTGAP_FLAGS, to hold all the VM
flag bits that have start gaps, and make mmap use it.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
v9:
 - Add logic needed to still have guard gaps with maple tree.
---
 include/linux/mm.h | 54 ++++++++++++++++++++++++++++++++++++++++------
 mm/mmap.c          |  4 ++--
 2 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fb17cbd531ac..535c58d3b2e4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -342,7 +342,36 @@ extern unsigned int kobjsize(const void *objp);
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
-# define VM_SHADOW_STACK	VM_HIGH_ARCH_5 /* Should not be set with VM_SHARED */
+/*
+ * This flag should not be set with VM_SHARED because of lack of support
+ * in core mm. It will also get a guard page. This helps userspace protect
+ * itself from attacks. The reasoning is as follows:
+ *
+ * The shadow stack pointer (SSP) is moved by CALL, RET, and INCSSPQ. The
+ * INCSSP instruction can increment the shadow stack pointer. It is the
+ * shadow stack analog of an instruction like:
+ *
+ *   addq $0x80, %rsp
+ *
+ * However, there is one important difference between an ADD on %rsp
+ * and INCSSP. In addition to modifying SSP, INCSSP also reads from the
+ * memory of the first and last elements that were "popped". It can be
+ * thought of as acting like this:
+ *
+ * READ_ONCE(ssp);       // read+discard top element on stack
+ * ssp += nr_to_pop * 8; // move the shadow stack
+ * READ_ONCE(ssp-8);     // read+discard last popped stack element
+ *
+ * The maximum distance INCSSP can move the SSP is 2040 bytes, before
+ * it would read the memory. Therefore a single page gap will be enough
+ * to prevent any operation from shifting the SSP to an adjacent stack,
+ * since it would have to land in the gap at least once, causing a
+ * fault.
+ *
+ * Prevent using INCSSP to move the SSP between shadow stacks by
+ * having a PAGE_SIZE guard gap.
+ */
+# define VM_SHADOW_STACK	VM_HIGH_ARCH_5
 #else
 # define VM_SHADOW_STACK	VM_NONE
 #endif
@@ -405,6 +434,8 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
 
+#define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK)
+
 #ifdef CONFIG_STACK_GROWSUP
 #define VM_STACK	VM_GROWSUP
 #else
@@ -3235,15 +3266,26 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
 	return mtree_load(&mm->mm_mt, addr);
 }
 
+static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_GROWSDOWN)
+		return stack_guard_gap;
+
+	/* See reasoning around the VM_SHADOW_STACK definition */
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return PAGE_SIZE;
+
+	return 0;
+}
+
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
+	unsigned long gap = stack_guard_start_gap(vma);
 	unsigned long vm_start = vma->vm_start;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
+	vm_start -= gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
 	return vm_start;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index afdf5f78432b..d4793600a8d4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1570,7 +1570,7 @@ static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 	gap = mas.index;
 	gap += (info->align_offset - gap) & info->align_mask;
 	tmp = mas_next(&mas, ULONG_MAX);
-	if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */
+	if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
 		if (vm_start_gap(tmp) < gap + length - 1) {
 			low_limit = tmp->vm_end;
 			mas_reset(&mas);
@@ -1622,7 +1622,7 @@ static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
 	gap -= (gap - info->align_offset) & info->align_mask;
 	gap_end = mas.last;
 	tmp = mas_next(&mas, ULONG_MAX);
-	if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */
+	if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
 		if (vm_start_gap(tmp) <= gap_end) {
 			high_limit = vm_start_gap(tmp);
 			mas_reset(&mas);
-- 
2.34.1



* [PATCH v9 17/42] mm: Warn on shadow stack memory in wrong vma
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (15 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 16/42] mm: Add guard pages around a shadow stack Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-14 23:35   ` Mark Brown
  2023-06-13  0:10 ` [PATCH v9 18/42] x86/mm: Warn if create Write=0,Dirty=1 with raw prot Rick Edgecombe
                   ` (26 subsequent siblings)
  43 siblings, 1 reply; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
treated as shadow stack by the CPU, but this combination used to be
created by the kernel on x86. Previous patches changed the kernel to avoid
creating these PTEs unless they are for shadow stack memory. In case any
missed corners of the kernel are still creating PTEs like this for
non-shadow stack memory, and to catch any re-introductions of the logic,
warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
stack VMAs when they are being zapped. This won't catch transient cases
but should have decent coverage.

In order to check if a PTE is shadow stack in core mm code, add two arch
breakouts arch_check_zapped_pte/pmd(). This will allow shadow stack
specific code to be kept in arch/x86.

Only do the check if shadow stack is supported by the CPU and configured,
because in rare cases older CPUs may set Dirty=1 on a Write=0 PTE. This
check is handled in pte_shstk()/pmd_shstk().

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
v9:
 - Add comments about not doing the check on non-shadow stack CPUs
---
 arch/x86/include/asm/pgtable.h |  6 ++++++
 arch/x86/mm/pgtable.c          | 20 ++++++++++++++++++++
 include/linux/pgtable.h        | 14 ++++++++++++++
 mm/huge_memory.c               |  1 +
 mm/memory.c                    |  1 +
 5 files changed, 42 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index d8724f5b1202..89cfa93d0ad6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1664,6 +1664,12 @@ static inline bool arch_has_hw_pte_young(void)
 	return true;
 }
 
+#define arch_check_zapped_pte arch_check_zapped_pte
+void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);
+
+#define arch_check_zapped_pmd arch_check_zapped_pmd
+void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd);
+
 #ifdef CONFIG_XEN_PV
 #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young
 static inline bool arch_has_hw_nonleaf_pmd_young(void)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 0ad2c62ac0a8..101e721d74aa 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -894,3 +894,23 @@ pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 
 	return pmd_clear_saveddirty(pmd);
 }
+
+void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
+{
+	/*
+	 * Hardware before shadow stack can (rarely) set Dirty=1
+	 * on a Write=0 PTE. So the below condition
+	 * only indicates a software bug when shadow stack is
+	 * supported by the HW. This checking is covered in
+	 * pte_shstk().
+	 */
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+			pte_shstk(pte));
+}
+
+void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd)
+{
+	/* See note in arch_check_zapped_pte() */
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+			pmd_shstk(pmd));
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0f3cf726812a..feb1fd2c814f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -291,6 +291,20 @@ static inline bool arch_has_hw_pte_young(void)
 }
 #endif
 
+#ifndef arch_check_zapped_pte
+static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
+					 pte_t pte)
+{
+}
+#endif
+
+#ifndef arch_check_zapped_pmd
+static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
+					 pmd_t pmd)
+{
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 37dd56b7b3d1..c3cc20c1b26c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1681,6 +1681,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 */
 	orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
 						tlb->fullmm);
+	arch_check_zapped_pmd(vma, orig_pmd);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	if (vma_is_special_huge(vma)) {
 		if (arch_needs_pgtable_deposit())
diff --git a/mm/memory.c b/mm/memory.c
index c1b6fe944c20..40c0b233b61d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1412,6 +1412,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			arch_check_zapped_pte(vma, ptent);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
 						      ptent);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 18/42] x86/mm: Warn if create Write=0,Dirty=1 with raw prot
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (16 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 17/42] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 19/42] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
                   ` (25 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

When user shadow stack is in use, Write=0,Dirty=1 is treated by the CPU as
shadow stack memory. So for shadow stack memory this bit combination is
valid, but when Dirty=1,Write=1 (conventionally writable) memory is being
write protected, the kernel has been taught to transition the Dirty=1
bit to SavedDirty=1, to avoid inadvertently creating shadow stack
memory. It does this inside pte_wrprotect() because it knows the PTE is
not intended to be a writable shadow stack entry; it is supposed to be
write protected.

However, when a PTE is created by a raw prot using mk_pte(), mk_pte()
can't know whether to adjust Dirty=1 to SavedDirty=1. It can't
distinguish between the caller intending to create a shadow stack PTE or
needing the SavedDirty shift.

The kernel has been updated to not do this, and so Write=0,Dirty=1
memory should only be created by the pte_mkfoo() helpers. Add a warning
to make sure no new mk_pte() callers start doing this, as, for example,
set_memory_rox() did.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
v9:
 - Always do the check since 32 bit now supports SavedDirty
---
 arch/x86/include/asm/pgtable.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 89cfa93d0ad6..5383f7282f89 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1032,7 +1032,14 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
  * (Currently stuck as a macro because of indirect forward reference
  * to linux/mm.h:page_to_nid())
  */
-#define mk_pte(page, pgprot)   pfn_pte(page_to_pfn(page), (pgprot))
+#define mk_pte(page, pgprot)						  \
+({									  \
+	pgprot_t __pgprot = pgprot;					  \
+									  \
+	WARN_ON_ONCE((pgprot_val(__pgprot) & (_PAGE_DIRTY | _PAGE_RW)) == \
+		    _PAGE_DIRTY);					  \
+	pfn_pte(page_to_pfn(page), __pgprot);				  \
+})
 
 static inline int pmd_bad(pmd_t pmd)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 19/42] mm/mmap: Add shadow stack pages to memory accounting
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (17 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 18/42] x86/mm: Warn if create Write=0,Dirty=1 with raw prot Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 20/42] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
                   ` (24 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 mm/internal.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 68410c6d97ac..dd2ded32d3d5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -535,14 +535,14 @@ static inline bool is_exec_mapping(vm_flags_t flags)
 }
 
 /*
- * Stack area - automatically grows in one direction
+ * Stack area (including shadow stacks)
  *
  * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous:
  * do_mmap() forbids all other combinations.
  */
 static inline bool is_stack_mapping(vm_flags_t flags)
 {
-	return (flags & VM_STACK) == VM_STACK;
+	return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK);
 }
 
 /*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 20/42] x86/mm: Introduce MAP_ABOVE4G
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (18 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 19/42] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 21/42] x86/mm: Teach pte_mkwrite() about stack memory Rick Edgecombe
                   ` (23 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One of the properties is that the shadow stack pointer (SSP), which is a
CPU register that points to the shadow stack like the stack pointer points
to the stack, can't be pointing outside of the 32 bit address space when
the CPU is executing in 32 bit mode. It is desirable to prevent executing
in 32 bit mode when shadow stack is enabled because the kernel can't easily
support 32 bit signals.

On x86 it is possible to transition to 32 bit mode without any special
interaction with the kernel, by doing a "far call" to a 32 bit segment.
So the shadow stack implementation can use this address space behavior
as a feature, by enforcing that shadow stack memory is always mapped
outside of the 32 bit address space. This way userspace will trigger a
general protection fault which will in turn trigger a segfault if it
tries to transition to 32 bit mode with shadow stack enabled.

This provides a clean error-generating border for the user if they
attempt to do 32 bit mode shadow stack, rather than leaving the kernel in
a half working state for userspace to be surprised by.

So to allow future shadow stack enabling patches to map shadow stacks
out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior
is pretty much like MAP_32BIT, except that it has the opposite address
range. There are a few differences though.

If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the
MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit
syscall.

Since the default search behavior is top down, the normal kaslr base can
be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add its
own randomization in the bottom up case.

For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G
both are potentially valid, so both are used. In the bottomup search
path, the default behavior is already consistent with MAP_ABOVE4G since
mmap base should be above 4GB.

Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB.
So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode
with shadow stack enabled would usually segfault anyway. This is already
a pretty decent guard rail. But the addition of MAP_ABOVE4G is a small
amount of complexity spent to make it more complete.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/uapi/asm/mman.h | 1 +
 arch/x86/kernel/sys_x86_64.c     | 6 +++++-
 include/linux/mman.h             | 4 ++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 775dbd3aff73..5a0256e73f1e 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_MMAN_H
 
 #define MAP_32BIT	0x40		/* only give out 32bit addresses */
+#define MAP_ABOVE4G	0x80		/* only map above 4GB */
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 #define arch_calc_vm_prot_bits(prot, key) (		\
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 8cc653ffdccd..c783aeb37dce 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -193,7 +193,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
-	info.low_limit = PAGE_SIZE;
+	if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
+		info.low_limit = SZ_4G;
+	else
+		info.low_limit = PAGE_SIZE;
+
 	info.high_limit = get_mmap_base(0);
 
 	/*
diff --git a/include/linux/mman.h b/include/linux/mman.h
index cee1e4b566d8..40d94411d492 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -15,6 +15,9 @@
 #ifndef MAP_32BIT
 #define MAP_32BIT 0
 #endif
+#ifndef MAP_ABOVE4G
+#define MAP_ABOVE4G 0
+#endif
 #ifndef MAP_HUGE_2MB
 #define MAP_HUGE_2MB 0
 #endif
@@ -50,6 +53,7 @@
 		| MAP_STACK \
 		| MAP_HUGETLB \
 		| MAP_32BIT \
+		| MAP_ABOVE4G \
 		| MAP_HUGE_2MB \
 		| MAP_HUGE_1GB)
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 21/42] x86/mm: Teach pte_mkwrite() about stack memory
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (19 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 20/42] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 22/42] mm: Don't allow write GUPs to shadow " Rick Edgecombe
                   ` (22 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So
when it is made writable with pte_mkwrite(), it should create shadow
stack memory, not conventionally writable memory. Now that all the places
where shadow stack memory might be created pass a VMA into pte_mkwrite(),
it can know when it should do this.

So make pte_mkwrite() create shadow stack memory when the VMA has the
VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite().

This requires referencing VM_SHADOW_STACK in these functions, which are
currently defined in pgtable.h. However, mm.h (where VM_SHADOW_STACK is
located) can't be pulled in without causing problems for files that
reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c, where
they can safely reference VM_SHADOW_STACK.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Deepak Gupta <debug@rivosinc.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/mm/pgtable.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 101e721d74aa..c4b222d3b1b4 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -883,6 +883,9 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return pte_mkwrite_shstk(pte);
+
 	pte = pte_mkwrite_novma(pte);
 
 	return pte_clear_saveddirty(pte);
@@ -890,6 +893,9 @@ pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
 
 pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return pmd_mkwrite_shstk(pmd);
+
 	pmd = pmd_mkwrite_novma(pmd);
 
 	return pmd_clear_saveddirty(pmd);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 22/42] mm: Don't allow write GUPs to shadow stack memory
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (20 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 21/42] x86/mm: Teach pte_mkwrite() about stack memory Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description Rick Edgecombe
                   ` (21 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

The x86 Control-flow Enforcement Technology (CET) feature includes a
new type of memory called shadow stack. This shadow stack memory has
some unusual properties, which require some core mm changes to
function properly.

In userspace, shadow stack memory is writable only in very specific,
controlled ways. However, since userspace can, even in the limited
ways, modify shadow stack contents, the kernel treats it as writable
memory. As a result, without additional work there would remain many
ways for userspace to trigger the kernel to write arbitrary data to
shadow stacks via get_user_pages(, FOLL_WRITE) based operations. To
help userspace protect their shadow stacks, make this a little less
exposed by blocking writable get_user_pages() operations for shadow
stack VMAs.

Still allow FOLL_FORCE to write through shadow stack protections, as it
does for read-only protections. This is required for debugging use
cases.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/pgtable.h | 5 +++++
 mm/gup.c                       | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5383f7282f89..fce35f5d4a4e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1630,6 +1630,11 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
 {
 	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
 
+	/*
+	 * Write=0,Dirty=1 PTEs are shadow stack, which the kernel
+	 * shouldn't generally allow access to, but since they
+	 * are already Write=0, the below logic covers both cases.
+	 */
 	if (write)
 		need_pte_bits |= _PAGE_RW;
 
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..cc0dd5267509 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -978,7 +978,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		return -EFAULT;
 
 	if (write) {
-		if (!(vm_flags & VM_WRITE)) {
+		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
 			/* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (21 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 22/42] mm: Don't allow write GUPs to shadow " Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13 11:55   ` Mark Brown
  2023-07-18 19:32   ` Szabolcs Nagy
  2023-06-13  0:10 ` [PATCH v9 24/42] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
                   ` (20 subsequent siblings)
  43 siblings, 2 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

Introduce a new document on Control-flow Enforcement Technology (CET).

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 Documentation/arch/x86/index.rst |   1 +
 Documentation/arch/x86/shstk.rst | 169 +++++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)
 create mode 100644 Documentation/arch/x86/shstk.rst

diff --git a/Documentation/arch/x86/index.rst b/Documentation/arch/x86/index.rst
index c73d133fd37c..8ac64d7de4dc 100644
--- a/Documentation/arch/x86/index.rst
+++ b/Documentation/arch/x86/index.rst
@@ -22,6 +22,7 @@ x86-specific Documentation
    mtrr
    pat
    intel-hfi
+   shstk
    iommu
    intel_txt
    amd-memory-encryption
diff --git a/Documentation/arch/x86/shstk.rst b/Documentation/arch/x86/shstk.rst
new file mode 100644
index 000000000000..f09afa504ec0
--- /dev/null
+++ b/Documentation/arch/x86/shstk.rst
@@ -0,0 +1,169 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Control-flow Enforcement Technology (CET) Shadow Stack
+======================================================
+
+CET Background
+==============
+
+Control-flow Enforcement Technology (CET) covers several related x86 processor
+features that provide protection against control flow hijacking attacks. CET
+can protect both applications and the kernel.
+
+CET introduces shadow stack and indirect branch tracking (IBT). A shadow stack
+is a secondary stack allocated from memory which cannot be directly modified by
+applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. IBT verifies indirect CALL/JMP targets are intended
+as marked by the compiler with 'ENDBR' opcodes. Not all CPUs have both Shadow
+Stack and Indirect Branch Tracking. Today in the 64-bit kernel, only userspace
+shadow stack and kernel IBT are supported.
+
+Requirements to use Shadow Stack
+================================
+
+To use userspace shadow stack you need HW that supports it, a kernel
+configured with it and userspace libraries compiled with it.
+
+The kernel Kconfig option is X86_USER_SHADOW_STACK.  When compiled in, shadow
+stacks can be disabled at runtime with the kernel parameter: nousershstk.
+
+To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or later
+are required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET. "user_shstk" means that userspace shadow stack is supported on the current
+kernel and HW.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF note and can be verified
+from readelf/llvm-readelf output::
+
+    readelf -n <application> | grep -a SHSTK
+        properties: x86 feature: SHSTK
+
+The kernel does not process these application markers directly. Applications
+or loaders must enable CET features using the interface described in section 4.
+Typically this would be done in the dynamic loader or static runtime objects,
+as is the case in GLIBC.
+
+Enabling arch_prctl()'s
+=======================
+
+ELF features should be enabled by the loader using the below arch_prctl()s.
+They are only supported in 64 bit user applications. These operate on the
+features on a per-thread basis. The enablement status is inherited on clone,
+so if the feature is enabled on the first thread, it will propagate to all
+the threads in an app.
+
+arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
+    Enable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
+    Disable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
+    Lock in features at their current enabled or disabled status. 'features'
+    is a mask of all features to lock. All bits set are processed, unset bits
+    are ignored. The mask is ORed with the existing value. So any feature bits
+    set here cannot be enabled or disabled afterwards.
+
+The return values are as follows. On success, return 0. On error, errno can
+be::
+
+        -EPERM if any of the passed features are locked.
+        -ENOTSUPP if the feature is not supported by the hardware or
+         kernel.
+        -EINVAL for invalid arguments (non-existent feature, etc.)
+
+The supported feature bits are::
+
+    ARCH_SHSTK_SHSTK - Shadow stack
+    ARCH_SHSTK_WRSS  - WRSS
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc Status
+===========
+To check if an application is actually running with shadow stack, the
+user can read /proc/$PID/status. It will report "wrss" or "shstk"
+depending on what is enabled. The lines look like this::
+
+    x86_Thread_features: shstk wrss
+    x86_Thread_features_locked: shstk wrss
+
+Implementation of the Shadow Stack
+==================================
+
+Shadow Stack Size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. In the case
+of the clone3 syscall, there is a stack size passed in and shadow stack
+uses this instead of the rlimit.
+
+Signal
+------
+
+The main program and its signal handlers use the same shadow stack. Because
+the shadow stack stores only return addresses, a large shadow stack covers
+the condition that both the program stack and the signal alternate stack run
+out.
+
+When a signal happens, the old pre-signal state is pushed on the stack. When
+shadow stack is enabled, the shadow stack specific state is pushed onto the
+shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
+in a special format with bit 63 set. On sigreturn this old SSP token is
+verified and restored by the kernel. The kernel will also push the normal
+restorer address to the shadow stack to help userspace avoid a shadow stack
+violation on the sigreturn path that goes through the restorer.
+
+So the shadow stack signal frame format is as follows::
+
+    |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
+                    (bit 63 set to 1)
+    |        ...| - Other state may be added in the future
+
+
+32 bit ABI signals are not supported in shadow stack processes. Linux prevents
+32 bit execution while shadow stack is enabled by allocating shadow stacks
+outside of the 32 bit address space. When execution enters 32 bit mode, either
+via far call or returning to userspace, a #GP is generated by the hardware,
+which will be delivered to the process as a segfault. When transitioning to
+userspace, the registers' state will be as if the userspace ip being returned
+to caused the segfault.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread. New shadow stack creation behaves like mmap() with respect
+to ASLR behavior. Similarly, on thread exit the thread's shadow stack is
+disabled.
+
+Exec
+----
+
+On exec, shadow stack features are disabled by the kernel, at which point
+userspace can choose to re-enable or lock them.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 24/42] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (22 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 25/42] x86/fpu: Add helper for modifying xstate Rick Edgecombe
                   ` (19 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

Shadow stack register state can be managed with XSAVE. The registers
can logically be separated into two groups:
        * Registers controlling user-mode operation
        * Registers controlling kernel-mode operation

The architecture has two new XSAVE state components: one for each of
those groups of registers. This lets an OS manage them separately if
it chooses. Future patches for host userspace and KVM guests will only
utilize the user-mode registers, so only configure XSAVE to save
user-mode registers. This state will add 16 bytes to the xsave buffer
size.

Future patches will use the user-mode XSAVE area to save guest user-mode
CET state. However, VMCS includes new fields for guest CET supervisor
states. KVM can use these to save and restore guest supervisor state, so
host supervisor XSAVE support is not required.

Adding this exacerbates the already unwieldy if statement in
check_xstate_against_struct() that handles warning about unimplemented
xfeatures. So refactor these checks by having XCHECK_SZ() set a bool when
it actually checks the xfeature. This ends up exceeding 80 chars, but was
better on balance than other options explored. Pass the bool as pointer to
make it clear that XCHECK_SZ() can change the variable.

While configuring user-mode XSAVE, clarify kernel-mode registers are not
managed by XSAVE by defining the xfeature in
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED, like is done for XFEATURE_MASK_PT.
This serves more of a documentation as code purpose, and functionally,
only enables a few safety checks.

Both XSAVE state components are supervisor states, even the state
controlling user-mode operation. This is a departure from earlier features
like protection keys where the PKRU state is a normal user
(non-supervisor) state. Having the user state be supervisor-managed
ensures there is no direct, unprivileged access to it, making it harder
for an attacker to subvert CET.

To facilitate this privileged access, define the two user-mode CET MSRs,
and the bits defined in those MSRs relevant to future shadow stack
enablement patches.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/fpu/types.h  | 16 +++++-
 arch/x86/include/asm/fpu/xstate.h |  6 ++-
 arch/x86/kernel/fpu/xstate.c      | 90 +++++++++++++++----------------
 3 files changed, 61 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 7f6d858ff47a..eb810074f1e7 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -115,8 +115,8 @@ enum xfeature {
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 	XFEATURE_PKRU,
 	XFEATURE_PASID,
-	XFEATURE_RSRVD_COMP_11,
-	XFEATURE_RSRVD_COMP_12,
+	XFEATURE_CET_USER,
+	XFEATURE_CET_KERNEL_UNUSED,
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
@@ -138,6 +138,8 @@ enum xfeature {
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
+#define XFEATURE_MASK_CET_USER		(1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL_UNUSED)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
 #define XFEATURE_MASK_XTILE_CFG		(1 << XFEATURE_XTILE_CFG)
 #define XFEATURE_MASK_XTILE_DATA	(1 << XFEATURE_XTILE_DATA)
@@ -252,6 +254,16 @@ struct pkru_state {
 	u32				pad;
 } __packed;
 
+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+	/* user control-flow settings */
+	u64 user_cet;
+	/* user shadow stack pointer */
+	u64 user_ssp;
+};
+
 /*
  * State component 15: Architectural LBR configuration state.
  * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cd3dd170e23a..d4427b88ee12 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -50,7 +50,8 @@
 #define XFEATURE_MASK_USER_DYNAMIC	XFEATURE_MASK_XTILE_DATA
 
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+					    XFEATURE_MASK_CET_USER)
 
 /*
  * A supervisor state component may not always contain valuable information,
@@ -77,7 +78,8 @@
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+					      XFEATURE_MASK_CET_KERNEL)
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 0bab497c9436..4fa4751912d9 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -39,26 +39,26 @@
  */
 static const char *xfeature_names[] =
 {
-	"x87 floating point registers"	,
-	"SSE registers"			,
-	"AVX registers"			,
-	"MPX bounds registers"		,
-	"MPX CSR"			,
-	"AVX-512 opmask"		,
-	"AVX-512 Hi256"			,
-	"AVX-512 ZMM_Hi256"		,
-	"Processor Trace (unused)"	,
+	"x87 floating point registers",
+	"SSE registers",
+	"AVX registers",
+	"MPX bounds registers",
+	"MPX CSR",
+	"AVX-512 opmask",
+	"AVX-512 Hi256",
+	"AVX-512 ZMM_Hi256",
+	"Processor Trace (unused)",
 	"Protection Keys User registers",
 	"PASID state",
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"AMX Tile config"		,
-	"AMX Tile data"			,
-	"unknown xstate feature"	,
+	"Control-flow User registers",
+	"Control-flow Kernel registers (unused)",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"AMX Tile config",
+	"AMX Tile data",
+	"unknown xstate feature",
 };
 
 static unsigned short xsave_cpuid_features[] __initdata = {
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
 	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
 	[XFEATURE_PKRU]				= X86_FEATURE_PKU,
 	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
+	[XFEATURE_CET_USER]			= X86_FEATURE_SHSTK,
 	[XFEATURE_XTILE_CFG]			= X86_FEATURE_AMX_TILE,
 	[XFEATURE_XTILE_DATA]			= X86_FEATURE_AMX_TILE,
 };
@@ -276,6 +277,7 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
+	print_xstate_feature(XFEATURE_MASK_CET_USER);
 	print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
 	print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
@@ -344,6 +346,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
 	 XFEATURE_MASK_BNDREGS |		\
 	 XFEATURE_MASK_BNDCSR |			\
 	 XFEATURE_MASK_PASID |			\
+	 XFEATURE_MASK_CET_USER |		\
 	 XFEATURE_MASK_XTILE)
 
 /*
@@ -446,14 +449,15 @@ static void __init __xstate_dump_leaves(void)
 	}									\
 } while (0)
 
-#define XCHECK_SZ(sz, nr, nr_macro, __struct) do {			\
-	if ((nr == nr_macro) &&						\
-	    WARN_ONCE(sz != sizeof(__struct),				\
-		"%s: struct is %zu bytes, cpu state %d bytes\n",	\
-		__stringify(nr_macro), sizeof(__struct), sz)) {		\
+#define XCHECK_SZ(sz, nr, __struct) ({					\
+	if (WARN_ONCE(sz != sizeof(__struct),				\
+	    "[%s]: struct is %zu bytes, cpu state %d bytes\n",		\
+	    xfeature_names[nr], sizeof(__struct), sz)) {		\
 		__xstate_dump_leaves();					\
 	}								\
-} while (0)
+	true;								\
+})
+
 
 /**
  * check_xtile_data_against_struct - Check tile data state size.
@@ -527,36 +531,28 @@ static bool __init check_xstate_against_struct(int nr)
 	 * Ask the CPU for the size of the state.
 	 */
 	int sz = xfeature_size(nr);
+
 	/*
 	 * Match each CPU state with the corresponding software
 	 * structure.
 	 */
-	XCHECK_SZ(sz, nr, XFEATURE_YMM,       struct ymmh_struct);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDREGS,   struct mpx_bndreg_state);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDCSR,    struct mpx_bndcsr_state);
-	XCHECK_SZ(sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
-	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
-	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
-	XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
-
-	/* The tile data size varies between implementations. */
-	if (nr == XFEATURE_XTILE_DATA)
-		check_xtile_data_against_struct(sz);
-
-	/*
-	 * Make *SURE* to add any feature numbers in below if
-	 * there are "holes" in the xsave state component
-	 * numbers.
-	 */
-	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX) ||
-	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
-	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
+	switch (nr) {
+	case XFEATURE_YMM:	  return XCHECK_SZ(sz, nr, struct ymmh_struct);
+	case XFEATURE_BNDREGS:	  return XCHECK_SZ(sz, nr, struct mpx_bndreg_state);
+	case XFEATURE_BNDCSR:	  return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state);
+	case XFEATURE_OPMASK:	  return XCHECK_SZ(sz, nr, struct avx_512_opmask_state);
+	case XFEATURE_ZMM_Hi256:  return XCHECK_SZ(sz, nr, struct avx_512_zmm_uppers_state);
+	case XFEATURE_Hi16_ZMM:	  return XCHECK_SZ(sz, nr, struct avx_512_hi16_state);
+	case XFEATURE_PKRU:	  return XCHECK_SZ(sz, nr, struct pkru_state);
+	case XFEATURE_PASID:	  return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
+	case XFEATURE_XTILE_CFG:  return XCHECK_SZ(sz, nr, struct xtile_cfg);
+	case XFEATURE_CET_USER:	  return XCHECK_SZ(sz, nr, struct cet_user_state);
+	case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
+	default:
 		XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
 		return false;
 	}
+
 	return true;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 25/42] x86/fpu: Add helper for modifying xstate
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (23 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 24/42] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 26/42] x86: Introduce userspace API for shadow stack Rick Edgecombe
                   ` (18 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

Just like user xfeatures, supervisor xfeatures can be active in the
registers or present in the task FPU buffer. If the registers are
active, the registers can be modified directly. If the registers are
not active, the modification must be performed on the task FPU buffer.

When the state is not active, the kernel could perform modifications
directly to the buffer. But in order for it to do that, it needs
to know where in the buffer the specific state it wants to modify is
located. Doing this is not robust against optimizations that compact
the FPU buffer, as each access would require computing where in the
buffer it is.

The easiest way to modify supervisor xfeature data is to force-restore
the registers and write directly to the MSRs. Often this is just fine
anyway, as the registers need to be restored before returning to
userspace. Do this for now, leaving buffer-writing optimizations for the
future.

Add a new function, fpregs_lock_and_load(), that can simultaneously call
fpregs_lock() and do this restore. Also perform some extra sanity checks
in this function, since it will be used in non-FPU-focused code.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/fpu/api.h |  9 +++++++++
 arch/x86/kernel/fpu/core.c     | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 503a577814b2..aadc6893dcaa 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -82,6 +82,15 @@ static inline void fpregs_unlock(void)
 		preempt_enable();
 }
 
+/*
+ * FPU state gets lazily restored before returning to userspace. So when in the
+ * kernel, the valid FPU state may be kept in the buffer. This function will force
+ * restore all the fpu state to the registers early if needed, and lock them from
+ * being automatically saved/restored. Then FPU state can be modified safely in the
+ * registers, before unlocking with fpregs_unlock().
+ */
+void fpregs_lock_and_load(void);
+
 #ifdef CONFIG_X86_DEBUG_FPU
 extern void fpregs_assert_state_consistent(void);
 #else
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index caf33486dc5e..f851558b673f 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -753,6 +753,24 @@ void switch_fpu_return(void)
 }
 EXPORT_SYMBOL_GPL(switch_fpu_return);
 
+void fpregs_lock_and_load(void)
+{
+	/*
+	 * fpregs_lock() only disables preemption (mostly). So modifying state
+	 * in an interrupt could screw up some in progress fpregs operation.
+	 * Warn about it.
+	 */
+	WARN_ON_ONCE(!irq_fpu_usable());
+	WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+	fpregs_lock();
+
+	fpregs_assert_state_consistent();
+
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		fpregs_restore_userregs();
+}
+
 #ifdef CONFIG_X86_DEBUG_FPU
 /*
  * If current FPU state according to its tracking (loaded FPU context on this
-- 
2.34.1



* [PATCH v9 26/42] x86: Introduce userspace API for shadow stack
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (24 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 25/42] x86/fpu: Add helper for modifying xstate Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 27/42] x86/shstk: Add user control-protection fault handler Rick Edgecombe
                   ` (17 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

Add three new arch_prctl() handles:

 - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified
   feature. Returns 0 on success or a negative value on error.

 - ARCH_SHSTK_LOCK prevents future disabling or enabling of the
   specified feature. Returns 0 on success or a negative value
   on error.

The features are handled per-thread and inherited over fork(2)/clone(2),
but reset on exec().

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/processor.h  |  6 +++++
 arch/x86/include/asm/shstk.h      | 21 +++++++++++++++
 arch/x86/include/uapi/asm/prctl.h |  6 +++++
 arch/x86/kernel/Makefile          |  2 ++
 arch/x86/kernel/process_64.c      |  6 +++++
 arch/x86/kernel/shstk.c           | 44 +++++++++++++++++++++++++++++++
 6 files changed, 85 insertions(+)
 create mode 100644 arch/x86/include/asm/shstk.h
 create mode 100644 arch/x86/kernel/shstk.c

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a1e4fa58b357..407d5551b6a7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -28,6 +28,7 @@ struct vm86;
 #include <asm/unwind_hints.h>
 #include <asm/vmxfeatures.h>
 #include <asm/vdso/processor.h>
+#include <asm/shstk.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -475,6 +476,11 @@ struct thread_struct {
 	 */
 	u32			pkru;
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	unsigned long		features;
+	unsigned long		features_locked;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
new file mode 100644
index 000000000000..ec753809f074
--- /dev/null
+++ b/arch/x86/include/asm/shstk.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHSTK_H
+#define _ASM_X86_SHSTK_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+long shstk_prctl(struct task_struct *task, int option, unsigned long features);
+void reset_thread_features(void);
+#else
+static inline long shstk_prctl(struct task_struct *task, int option,
+			       unsigned long arg2) { return -EINVAL; }
+static inline void reset_thread_features(void) {}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_SHSTK_H */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index e8d7ebbca1a4..1cd44ecc9ce0 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -23,9 +23,15 @@
 #define ARCH_MAP_VDSO_32		0x2002
 #define ARCH_MAP_VDSO_64		0x2003
 
+/* Don't use 0x3001-0x3004 because of old glibcs */
+
 #define ARCH_GET_UNTAG_MASK		0x4001
 #define ARCH_ENABLE_TAGGED_ADDR		0x4002
 #define ARCH_GET_MAX_TAG_BITS		0x4003
 #define ARCH_FORCE_TAGGED_SVA		0x4004
 
+#define ARCH_SHSTK_ENABLE		0x5001
+#define ARCH_SHSTK_DISABLE		0x5002
+#define ARCH_SHSTK_LOCK			0x5003
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index abee0564b750..6b6bf47652ee 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -147,6 +147,8 @@ obj-$(CONFIG_CALL_THUNKS)		+= callthunks.o
 
 obj-$(CONFIG_X86_CET)			+= cet.o
 
+obj-$(CONFIG_X86_USER_SHADOW_STACK)	+= shstk.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d181c16a2f6..0f89aa0186d1 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -515,6 +515,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 		load_gs_index(__USER_DS);
 	}
 
+	reset_thread_features();
+
 	loadsegment(fs, 0);
 	loadsegment(es, _ds);
 	loadsegment(ds, _ds);
@@ -894,6 +896,10 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 		else
 			return put_user(LAM_U57_BITS, (unsigned long __user *)arg2);
 #endif
+	case ARCH_SHSTK_ENABLE:
+	case ARCH_SHSTK_DISABLE:
+	case ARCH_SHSTK_LOCK:
+		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
new file mode 100644
index 000000000000..41ed6552e0a5
--- /dev/null
+++ b/arch/x86/kernel/shstk.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * shstk.c - Intel shadow stack support
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Yu-cheng Yu <yu-cheng.yu@intel.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <asm/prctl.h>
+
+void reset_thread_features(void)
+{
+	current->thread.features = 0;
+	current->thread.features_locked = 0;
+}
+
+long shstk_prctl(struct task_struct *task, int option, unsigned long features)
+{
+	if (option == ARCH_SHSTK_LOCK) {
+		task->thread.features_locked |= features;
+		return 0;
+	}
+
+	/* Don't allow via ptrace */
+	if (task != current)
+		return -EINVAL;
+
+	/* Do not allow to change locked features */
+	if (features & task->thread.features_locked)
+		return -EPERM;
+
+	/* Only support enabling/disabling one feature at a time. */
+	if (hweight_long(features) > 1)
+		return -EINVAL;
+
+	if (option == ARCH_SHSTK_DISABLE) {
+		return -EINVAL;
+	}
+
+	/* Handle ARCH_SHSTK_ENABLE */
+	return -EINVAL;
+}
-- 
2.34.1



* [PATCH v9 27/42] x86/shstk: Add user control-protection fault handler
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (25 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 26/42] x86: Introduce userspace API for shadow stack Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 28/42] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
                   ` (16 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, it is raised when the return address for a RET instruction
differs from the copy on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT faults. Refactor this fault handler into separate user and kernel
handlers, like the page fault handler. Add a control-protection handler
for usermode. To avoid ifdeffery, put them both in a new file cet.c, which
is compiled in the case of either of the two CET features supported in the
kernel: kernel IBT or user mode shadow stack. Move some static inline
functions from traps.c into a header so they can be used in cet.c.

Opportunistically fix a comment in the kernel IBT part of the fault
handler that is on the end of the line instead of preceding it.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when the feature
is missing. This unifies the behavior with the new shadow stack code, and
also prevents the kernel from crashing in this situation, which is
potentially recoverable.

The control-protection fault handler works in a similar way to the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/arm/kernel/signal.c                 |  2 +-
 arch/arm64/kernel/signal.c               |  2 +-
 arch/arm64/kernel/signal32.c             |  2 +-
 arch/sparc/kernel/signal32.c             |  2 +-
 arch/sparc/kernel/signal_64.c            |  2 +-
 arch/x86/include/asm/disabled-features.h |  8 +-
 arch/x86/include/asm/idtentry.h          |  2 +-
 arch/x86/include/asm/traps.h             | 12 +++
 arch/x86/kernel/cet.c                    | 94 +++++++++++++++++++++---
 arch/x86/kernel/idt.c                    |  2 +-
 arch/x86/kernel/signal_32.c              |  2 +-
 arch/x86/kernel/signal_64.c              |  2 +-
 arch/x86/kernel/traps.c                  | 12 ---
 arch/x86/xen/enlighten_pv.c              |  2 +-
 arch/x86/xen/xen-asm.S                   |  2 +-
 include/uapi/asm-generic/siginfo.h       |  3 +-
 16 files changed, 117 insertions(+), 34 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 2cfc810d0a5b..06d31731c8ed 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1343,7 +1343,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index b9c7eae2e70f..702d93fdd10e 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -111,6 +111,12 @@
 #define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
 #endif
 
+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT	0
+#else
+#define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -134,7 +140,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK18	0
+#define DISABLED_MASK18	(DISABLE_IBT)
 #define DISABLED_MASK19	0
 #define DISABLED_MASK20	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b241af4ce9b4..61e0e6301f09 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -614,7 +614,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..75e0dabf0c45 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,16 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 				      struct stack_info *info);
 #endif
 
+static inline void cond_local_irq_enable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_enable();
+}
+
+static inline void cond_local_irq_disable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_disable();
+}
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 7ad22b705b64..cc10d8be9d74 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -4,10 +4,6 @@
 #include <asm/bugs.h>
 #include <asm/traps.h>
 
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
 enum cp_error_code {
 	CP_EC        = (1 << 15) - 1,
 
@@ -20,15 +16,80 @@ enum cp_error_code {
 	CP_ENCL	     = 1 << 15,
 };
 
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+static const char cp_err[][10] = {
+	[0] = "unknown",
+	[1] = "near ret",
+	[2] = "far/iret",
+	[3] = "endbranch",
+	[4] = "rstorssp",
+	[5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
+	unsigned int cpec = error_code & CP_EC;
+
+	if (cpec >= ARRAY_SIZE(cp_err))
+		cpec = 0;
+	return cp_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+	WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+		  user_mode(regs) ? "user mode" : "kernel mode",
+		  cp_err_string(error_code));
+}
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/*
+	 * An exception was just taken from userspace. Since interrupts are disabled
+	 * here, no scheduling should have messed with the registers yet and they
+	 * will be whatever is live in userspace. So read the SSP before enabling
+	 * interrupts so locking the fpregs to do it later is not required.
+	 */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 cp_err_string(error_code),
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
 	}
 
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+
+static __ro_after_init bool ibt_fatal = true;
+
+/* code label defined in asm below */
+extern void ibt_selftest_ip(void);
+
+static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	if ((error_code & CP_EC) != CP_ENDBR) {
+		do_unexpected_cp(regs, error_code);
 		return;
+	}
 
 	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
 		regs->ax = 0;
@@ -74,3 +135,18 @@ static int __init ibt_setup(char *str)
 }
 
 __setup("ibt=", ibt_setup);
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (user_mode(regs)) {
+		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+			do_user_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	} else {
+		if (cpu_feature_enabled(X86_FEATURE_IBT))
+			do_kernel_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	}
+}
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 9027fc088f97..c12624bc82a3 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -402,7 +402,7 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 13a1e6083837..0e808c72bf7e 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -403,7 +403,7 @@ void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6f666dfa97de..f358350624b2 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -77,18 +77,6 @@
 
 DECLARE_BITMAP(system_vectors, NR_VECTORS);
 
-static inline void cond_local_irq_enable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_enable();
-}
-
-static inline void cond_local_irq_disable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_disable();
-}
-
 __always_inline int is_valid_bugaddr(unsigned long addr)
 {
 	if (addr < TASK_SIZE_MAX)
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 093b78c8bbec..5a034a994682 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -640,7 +640,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 08f1ceb9eb81..9e5e68008785 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 28/42] x86/shstk: Add user-mode shadow stack support
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (26 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 27/42] x86/shstk: Add user control-protection fault handler Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-27 17:20   ` Mark Brown
  2023-06-13  0:10 ` [PATCH v9 29/42] x86/shstk: Handle thread shadow stack Rick Edgecombe
                   ` (15 subsequent siblings)
  43 siblings, 1 reply; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
and has a fixed size of min(RLIMIT_STACK, 4GB).

Keep the task's shadow stack address and size in thread_struct. This will
be copied when cloning new threads, but needs to be cleared during exec,
so add a function to do this.

A 32-bit shadow stack is not expected to have many users, and supporting
it would complicate the signal implementation. So do not support IA32
emulation or x32.
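
The sizing rule can be modeled in plain userspace C. This is a sketch
(PAGE_SIZE, the PAGE_ALIGN macro, and the helper name are stand-ins for
the kernel's versions, not the kernel code itself):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096ULL
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define SZ_4G		(4ULL << 30)

/*
 * Userspace model of the kernel's adjust_shstk_size(): a non-zero
 * requested size is simply page-aligned; zero means "pick a default"
 * of min(RLIMIT_STACK, 4 GB).
 */
static uint64_t model_shstk_size(uint64_t requested, uint64_t rlimit_stack)
{
	if (requested)
		return PAGE_ALIGN(requested);
	return PAGE_ALIGN(rlimit_stack < SZ_4G ? rlimit_stack : SZ_4G);
}
```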

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/processor.h  |   2 +
 arch/x86/include/asm/shstk.h      |   7 ++
 arch/x86/include/uapi/asm/prctl.h |   3 +
 arch/x86/kernel/shstk.c           | 145 ++++++++++++++++++++++++++++++
 4 files changed, 157 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 407d5551b6a7..2a5ec5750ba7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -479,6 +479,8 @@ struct thread_struct {
 #ifdef CONFIG_X86_USER_SHADOW_STACK
 	unsigned long		features;
 	unsigned long		features_locked;
+
+	struct thread_shstk	shstk;
 #endif
 
 	/* Floating point and extended processor state */
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index ec753809f074..2b1f7c9b9995 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -8,12 +8,19 @@
 struct task_struct;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
+struct thread_shstk {
+	u64	base;
+	u64	size;
+};
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features);
 void reset_thread_features(void);
+void shstk_free(struct task_struct *p);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
+static inline void shstk_free(struct task_struct *p) {}
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 1cd44ecc9ce0..6a8e0e1bff4a 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -34,4 +34,7 @@
 #define ARCH_SHSTK_DISABLE		0x5002
 #define ARCH_SHSTK_LOCK			0x5003
 
+/* ARCH_SHSTK_ features bits */
+#define ARCH_SHSTK_SHSTK		(1ULL <<  0)
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 41ed6552e0a5..3cb85224d856 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -8,14 +8,159 @@
 
 #include <linux/sched.h>
 #include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <asm/msr.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/shstk.h>
+#include <asm/special_insns.h>
+#include <asm/fpu/api.h>
 #include <asm/prctl.h>
 
+static bool features_enabled(unsigned long features)
+{
+	return current->thread.features & features;
+}
+
+static void features_set(unsigned long features)
+{
+	current->thread.features |= features;
+}
+
+static void features_clr(unsigned long features)
+{
+	current->thread.features &= ~features;
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, unused;
+
+	mmap_write_lock(mm);
+	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+
+	mmap_write_unlock(mm);
+
+	return addr;
+}
+
+static unsigned long adjust_shstk_size(unsigned long size)
+{
+	if (size)
+		return PAGE_ALIGN(size);
+
+	return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+}
+
+static void unmap_shadow_stack(u64 base, u64 size)
+{
+	while (1) {
+		int r;
+
+		r = vm_munmap(base, size);
+
+		/*
+		 * vm_munmap() returns -EINTR when mmap_lock is held by
+		 * something else, and that lock should not be held for a
+		 * long time.  Retry in that case.
+		 */
+		if (r == -EINTR) {
+			cond_resched();
+			continue;
+		}
+
+		/*
+		 * For all other types of vm_munmap() failure, either the
+		 * system is out of memory or there is a bug.
+		 */
+		WARN_ON_ONCE(r);
+		break;
+	}
+}
+
+static int shstk_setup(void)
+{
+	struct thread_shstk *shstk = &current->thread.shstk;
+	unsigned long addr, size;
+
+	/* Already enabled */
+	if (features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	/* Also not supported for 32 bit and x32 */
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_32bit_syscall())
+		return -EOPNOTSUPP;
+
+	size = adjust_shstk_size(0);
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+	wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
+	fpregs_unlock();
+
+	shstk->base = addr;
+	shstk->size = size;
+	features_set(ARCH_SHSTK_SHSTK);
+
+	return 0;
+}
+
 void reset_thread_features(void)
 {
+	memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
 	current->thread.features = 0;
 	current->thread.features_locked = 0;
 }
 
+void shstk_free(struct task_struct *tsk)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !features_enabled(ARCH_SHSTK_SHSTK))
+		return;
+
+	if (!tsk->mm)
+		return;
+
+	unmap_shadow_stack(shstk->base, shstk->size);
+}
+
+static int shstk_disable(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	/* Already disabled? */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	fpregs_lock_and_load();
+	/* Disable WRSS too when disabling shadow stack */
+	wrmsrl(MSR_IA32_U_CET, 0);
+	wrmsrl(MSR_IA32_PL3_SSP, 0);
+	fpregs_unlock();
+
+	shstk_free(current);
+	features_clr(ARCH_SHSTK_SHSTK);
+
+	return 0;
+}
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_SHSTK_LOCK) {
-- 
2.34.1



* [PATCH v9 29/42] x86/shstk: Handle thread shadow stack
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (27 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 28/42] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 30/42] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
                   ` (14 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-CET case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as long as
the child doesn't return from the vfork()/CLONE_VFORK calling function and
sticks to a limited set of operations, the parent and child can share the
same stack.

For shadow stack, these scenarios present similar sharing problems. For the
CLONE_VM case, the child and the parent must have separate shadow stacks.
Instead of changing clone to take a shadow stack, have the kernel just
allocate one and switch to it.

Use stack_size passed from clone3() syscall for thread shadow stack size. A
compat-mode thread shadow stack size is further reduced to 1/4. This
allows more threads to run in a 32-bit address space. clone() does not
pass stack_size, which was only added in clone3(). In that case, use the
RLIMIT_STACK size, capped to 4 GB.

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing as long as it doesn't return from the vfork()
and overwrite up the shadow stack. The child can safely overwrite down
the shadow stack, as the parent can just overwrite this later. So CET does
not add any additional limitations for vfork().
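
The rules above reduce to a single flags test in
shstk_alloc_thread_stack(): only a non-vfork CLONE_VM child needs its
own shadow stack. A minimal userspace model (flag values copied from the
uapi headers; the helper name is invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define CLONE_VM	0x00000100UL
#define CLONE_VFORK	0x00004000UL

/*
 * True only for a CLONE_VM child that is not a vfork() child. A plain
 * fork() (no CLONE_VM) copies the whole mm, shadow stack mapping
 * included, so no new allocation is needed; a vfork() child shares the
 * parent's shadow stack just like it shares the normal stack.
 */
static bool needs_new_shstk(unsigned long clone_flags)
{
	return (clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM;
}
```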

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared in the
parent.

During this operation, the shadow stack pointer of the new thread needs
to be updated to point to the newly allocated shadow stack. Since the
ability to do this is confined to the FPU subsystem, change
fpu_clone() to take the new shadow stack pointer, and update it
internally inside the FPU subsystem. This part was suggested by Thomas
Gleixner.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/fpu/sched.h   |  3 ++-
 arch/x86/include/asm/mmu_context.h |  2 ++
 arch/x86/include/asm/shstk.h       |  5 ++++
 arch/x86/kernel/fpu/core.c         | 36 +++++++++++++++++++++++++-
 arch/x86/kernel/process.c          | 21 ++++++++++++++-
 arch/x86/kernel/shstk.c            | 41 ++++++++++++++++++++++++++++--
 6 files changed, 103 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index c2d6cd78ed0c..3c2903bbb456 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -11,7 +11,8 @@
 
 extern void save_fpregs_to_fpstate(struct fpu *fpu);
 extern void fpu__drop(struct fpu *fpu);
-extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal);
+extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+		      unsigned long shstk_addr);
 extern void fpu_flush_thread(void);
 
 /*
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 1d29dc791f5a..416901d406f8 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -186,6 +186,8 @@ do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		shstk_free(tsk);		\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 2b1f7c9b9995..d4a5c7b10cb5 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -15,11 +15,16 @@ struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long features);
 void reset_thread_features(void);
+unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+				       unsigned long stack_size);
 void shstk_free(struct task_struct *p);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
+static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
+						     unsigned long clone_flags,
+						     unsigned long stack_size) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index f851558b673f..aa4856b236b8 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -552,8 +552,36 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
 	}
 }
 
+/* A passed ssp of zero will not cause any update */
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	struct cet_user_state *xstate;
+
+	/* No ssp update needed. */
+	if (!ssp)
+		return 0;
+
+	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+				XFEATURE_CET_USER);
+
+	/*
+	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+	 * stack and the fpu state should be up to date since it was just copied
+	 * from the parent in fpu_clone(). So there must be a valid non-init CET
+	 * state location in the buffer.
+	 */
+	if (WARN_ON_ONCE(!xstate))
+		return 1;
+
+	xstate->user_ssp = (u64)ssp;
+#endif
+	return 0;
+}
+
 /* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+	      unsigned long ssp)
 {
 	struct fpu *src_fpu = &current->thread.fpu;
 	struct fpu *dst_fpu = &dst->thread.fpu;
@@ -613,6 +641,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
 	if (use_xsave())
 		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
 
+	/*
+	 * Update shadow stack pointer in case it changed during clone.
+	 */
+	if (update_fpu_shstk(dst, ssp))
+		return 1;
+
 	trace_x86_fpu_copy_src(src_fpu);
 	trace_x86_fpu_copy_dst(dst_fpu);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dac41a0072ea..3ab62ac98c2c 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -50,6 +50,7 @@
 #include <asm/unwind.h>
 #include <asm/tdx.h>
 #include <asm/mmu_context.h>
+#include <asm/shstk.h>
 
 #include "process.h"
 
@@ -121,6 +122,7 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
@@ -142,6 +144,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
+	unsigned long new_ssp;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -179,7 +182,16 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-	fpu_clone(p, clone_flags, args->fn);
+	/*
+	 * Allocate a new shadow stack for the thread if needed. If shadow
+	 * stack is disabled, new_ssp will remain 0, and fpu_clone() will
+	 * know not to update it.
+	 */
+	new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+	if (IS_ERR_VALUE(new_ssp))
+		return PTR_ERR((void *)new_ssp);
+
+	fpu_clone(p, clone_flags, args->fn, new_ssp);
 
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
@@ -225,6 +237,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
+	/*
+	 * If copy_thread() is failing, don't leak the shadow stack possibly
+	 * allocated in shstk_alloc_thread_stack() above.
+	 */
+	if (ret)
+		shstk_free(p);
+
 	return ret;
 }
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 3cb85224d856..bd9cdc3a7338 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -47,7 +47,7 @@ static unsigned long alloc_shstk(unsigned long size)
 	unsigned long addr, unused;
 
 	mmap_write_lock(mm);
-	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
 		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 
 	mmap_write_unlock(mm);
@@ -126,6 +126,37 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+				       unsigned long stack_size)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long addr, size;
+
+	/*
+	 * If shadow stack is not enabled on the new thread, skip any
+	 * switch to a new shadow stack.
+	 */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	/*
+	 * For CLONE_VM, except vfork, the child needs a separate shadow
+	 * stack.
+	 */
+	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+		return 0;
+
+	size = adjust_shstk_size(stack_size);
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
+	shstk->base = addr;
+	shstk->size = size;
+
+	return addr + size;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -134,7 +165,13 @@ void shstk_free(struct task_struct *tsk)
 	    !features_enabled(ARCH_SHSTK_SHSTK))
 		return;
 
-	if (!tsk->mm)
+	/*
+	 * When fork() with CLONE_VM fails, the child (tsk) already has a
+	 * shadow stack allocated, and exit_thread() calls this function to
+	 * free it.  In this case the parent (current) and the child share
+	 * the same mm struct.
+	 */
+	if (!tsk->mm || tsk->mm != current->mm)
 		return;
 
 	unmap_shadow_stack(shstk->base, shstk->size);
-- 
2.34.1



* [PATCH v9 30/42] x86/shstk: Introduce routines modifying shstk
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (28 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 29/42] x86/shstk: Handle thread shadow stack Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 31/42] x86/shstk: Handle signals for shadow stack Rick Edgecombe
                   ` (13 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

Shadow stacks are normally written to via CALL/RET or specific CET
instructions like RSTORSSP/SAVEPREVSSP. However, sometimes the kernel will
need to write to the shadow stack directly using the ring-0 only WRUSS
instruction.

A shadow stack restore token marks a restore point of the shadow stack, and
the address in a token must point directly above the token, which is within
the same shadow stack. This is distinctly different from other pointers
on the shadow stack, since those pointers point into executable code.

Introduce token setup and verify routines. Also introduce WRUSS, which is
a kernel-mode instruction but writes directly to user shadow stack.

In future patches that enable shadow stack to work with signals, the kernel
will need something to denote the point in the stack where sigreturn may be
called. This will prevent attackers calling sigreturn at arbitrary places
in the stack, in order to help prevent SROP attacks.

To do this, something that can only be written by the kernel needs to be
placed on the shadow stack. This can be accomplished by setting bit 63 in
the frame written to the shadow stack. Userspace return addresses can't
have this bit set as it is in the kernel range. It also can't be a valid
restore token.
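
The token layout described above can be illustrated with a small
userspace model. This is only a sketch: the kernel writes the token with
WRUSS via write_user_shstk_64(), while here we just compute its value,
and the helper name is invented:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the value created by create_rstor_token(): the token holds
 * the SSP it protects, and since the SSP must be 8-byte aligned, bit 0
 * is free to mark the token as 64-bit mode. Returns 0 for an unaligned
 * (invalid) SSP.
 */
static uint64_t rstor_token_value(uint64_t ssp)
{
	if (ssp & 7)
		return 0;
	return ssp | 1;		/* BIT(0): 64-bit token */
}
```

In the kernel the token value is written one 8-byte frame below the SSP
(at ssp - SS_FRAME_SIZE) with write_user_shstk_64().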

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/special_insns.h | 13 +++++
 arch/x86/kernel/shstk.c              | 75 ++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index de48d1389936..d6cd9344f6c7 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -202,6 +202,19 @@ static inline void clwb(volatile void *__p)
 		: [pax] "a" (p));
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static inline int write_user_shstk_64(u64 __user *addr, u64 val)
+{
+	asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: [addr] "r" (addr), [val] "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EFAULT;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
 #define nop() asm volatile ("nop")
 
 static inline void serialize(void)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index bd9cdc3a7338..e22928c63ffc 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -25,6 +25,8 @@
 #include <asm/fpu/api.h>
 #include <asm/prctl.h>
 
+#define SS_FRAME_SIZE 8
+
 static bool features_enabled(unsigned long features)
 {
 	return current->thread.features & features;
@@ -40,6 +42,35 @@ static void features_clr(unsigned long features)
 	current->thread.features &= ~features;
 }
 
+/*
+ * Create a restore token on the shadow stack.  A token is always 8-byte
+ * and aligned to 8.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+	unsigned long addr;
+
+	/* Token must be aligned */
+	if (!IS_ALIGNED(ssp, 8))
+		return -EINVAL;
+
+	addr = ssp - SS_FRAME_SIZE;
+
+	/*
+	 * SSP is aligned, so the reserved bits and the mode bit are zero.
+	 * Just mark the token 64-bit.
+	 */
+	ssp |= BIT(0);
+
+	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+		return -EFAULT;
+
+	if (token_addr)
+		*token_addr = addr;
+
+	return 0;
+}
+
 static unsigned long alloc_shstk(unsigned long size)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
@@ -157,6 +188,50 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long cl
 	return addr + size;
 }
 
+static unsigned long get_user_shstk_addr(void)
+{
+	unsigned long long ssp;
+
+	fpregs_lock_and_load();
+
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	fpregs_unlock();
+
+	return ssp;
+}
+
+#define SHSTK_DATA_BIT BIT(63)
+
+static int put_shstk_data(u64 __user *addr, u64 data)
+{
+	if (WARN_ON_ONCE(data & SHSTK_DATA_BIT))
+		return -EINVAL;
+
+	/*
+	 * Mark the high bit so that the sigframe can't be processed as a
+	 * return address.
+	 */
+	if (write_user_shstk_64(addr, data | SHSTK_DATA_BIT))
+		return -EFAULT;
+	return 0;
+}
+
+static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
+{
+	unsigned long ldata;
+
+	if (unlikely(get_user(ldata, addr)))
+		return -EFAULT;
+
+	if (!(ldata & SHSTK_DATA_BIT))
+		return -EINVAL;
+
+	*data = ldata & ~SHSTK_DATA_BIT;
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
-- 
2.34.1



* [PATCH v9 31/42] x86/shstk: Handle signals for shadow stack
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (29 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 30/42] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 32/42] x86/shstk: Check that SSP is aligned on sigreturn Rick Edgecombe
                   ` (12 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

When a signal is handled, the context is pushed to the stack before
handling it. For shadow stacks, since the shadow stack only tracks return
addresses, there isn't any state that needs to be pushed. However, there
are still a few things that need to be done. These things are visible to
userspace and will become kernel ABI for shadow stacks.

One is to make sure the restorer address is written to shadow stack, since
the signal handler (if not changing ucontext) returns to the restorer, and
the restorer calls sigreturn. So add the restorer on the shadow stack
before handling the signal, so there is not a conflict when the signal
handler returns to the restorer.

The other thing to do is to place some type of checkable token on the
thread's shadow stack before handling the signal and check it during
sigreturn. This is an extra layer of protection to hamper attackers
calling sigreturn manually as in SROP-like attacks.

For this token the shadow stack data format defined earlier can be used.
Have the data pushed be the previous SSP. In the future the sigreturn
might want to return back to a different stack. Storing the SSP (instead
of a restore offset or something) allows for future functionality that
may want to restore to a different stack.

So, when handling a signal, push:
 - the old SSP, stored in the shadow stack data format
 - the restorer address below the restore token.

In sigreturn, verify SSP is stored in the data format and pop the shadow
stack.
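
The data format gives a simple push/verify pair. A userspace model of
the idea (a sketch with invented helper names; the real code writes
through WRUSS and also moves the SSP by one 8-byte frame):

```c
#include <assert.h>
#include <stdint.h>

#define SHSTK_DATA_BIT	(1ULL << 63)

/*
 * Value pushed for a signal: the old SSP with bit 63 set. Userspace
 * return addresses can never have this bit set, since it lies in the
 * kernel address range, so the frame cannot be consumed by a RET.
 */
static uint64_t sigframe_data(uint64_t old_ssp)
{
	return old_ssp | SHSTK_DATA_BIT;
}

/*
 * Sigreturn side: reject anything without the data bit (it could be a
 * forged return address, as in an SROP attack), otherwise strip the
 * bit to recover the SSP to restore. Returns 0 for a bad frame.
 */
static uint64_t sigframe_restore(uint64_t data)
{
	if (!(data & SHSTK_DATA_BIT))
		return 0;
	return data & ~SHSTK_DATA_BIT;
}
```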

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/shstk.h |  5 ++
 arch/x86/kernel/shstk.c      | 95 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/signal.c     |  1 +
 arch/x86/kernel/signal_64.c  |  6 +++
 4 files changed, 107 insertions(+)

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index d4a5c7b10cb5..ecb23a8ca47d 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct ksignal;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
 struct thread_shstk {
@@ -18,6 +19,8 @@ void reset_thread_features(void);
 unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
 				       unsigned long stack_size);
 void shstk_free(struct task_struct *p);
+int setup_signal_shadow_stack(struct ksignal *ksig);
+int restore_signal_shadow_stack(void);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
@@ -26,6 +29,8 @@ static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
 						     unsigned long clone_flags,
 						     unsigned long stack_size) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
+static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
+static inline int restore_signal_shadow_stack(void) { return 0; }
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index e22928c63ffc..f02e8ea4f1b5 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -232,6 +232,101 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
 	return 0;
 }
 
+static int shstk_push_sigframe(unsigned long *ssp)
+{
+	unsigned long target_ssp = *ssp;
+
+	/* Token must be aligned */
+	if (!IS_ALIGNED(target_ssp, 8))
+		return -EINVAL;
+
+	*ssp -= SS_FRAME_SIZE;
+	if (put_shstk_data((void __user *)*ssp, target_ssp))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int shstk_pop_sigframe(unsigned long *ssp)
+{
+	unsigned long token_addr;
+	int err;
+
+	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
+	if (unlikely(err))
+		return err;
+
+	/* Restore SSP aligned? */
+	if (unlikely(!IS_ALIGNED(token_addr, 8)))
+		return -EINVAL;
+
+	/* SSP in userspace? */
+	if (unlikely(token_addr >= TASK_SIZE_MAX))
+		return -EINVAL;
+
+	*ssp = token_addr;
+
+	return 0;
+}
+
+int setup_signal_shadow_stack(struct ksignal *ksig)
+{
+	void __user *restorer = ksig->ka.sa.sa_restorer;
+	unsigned long ssp;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	if (!restorer)
+		return -EINVAL;
+
+	ssp = get_user_shstk_addr();
+	if (unlikely(!ssp))
+		return -EINVAL;
+
+	err = shstk_push_sigframe(&ssp);
+	if (unlikely(err))
+		return err;
+
+	/* Push restorer address */
+	ssp -= SS_FRAME_SIZE;
+	err = write_user_shstk_64((u64 __user *)ssp, (u64)restorer);
+	if (unlikely(err))
+		return -EFAULT;
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
+
+	return 0;
+}
+
+int restore_signal_shadow_stack(void)
+{
+	unsigned long ssp;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	ssp = get_user_shstk_addr();
+	if (unlikely(!ssp))
+		return -EINVAL;
+
+	err = shstk_pop_sigframe(&ssp);
+	if (unlikely(err))
+		return err;
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 004cb30b7419..356253e85ce9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -40,6 +40,7 @@
 #include <asm/syscall.h>
 #include <asm/sigframe.h>
 #include <asm/signal.h>
+#include <asm/shstk.h>
 
 static inline int is_ia32_compat_frame(struct ksignal *ksig)
 {
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 0e808c72bf7e..cacf2ede6217 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -175,6 +175,9 @@ int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 	frame = get_sigframe(ksig, regs, sizeof(struct rt_sigframe), &fp);
 	uc_flags = frame_uc_flags(regs);
 
+	if (setup_signal_shadow_stack(ksig))
+		return -EFAULT;
+
 	if (!user_access_begin(frame, sizeof(*frame)))
 		return -EFAULT;
 
@@ -260,6 +263,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
 		goto badframe;
 
+	if (restore_signal_shadow_stack())
+		goto badframe;
+
 	if (restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 32/42] x86/shstk: Check that SSP is aligned on sigreturn
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (30 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 31/42] x86/shstk: Handle signals for shadow stack Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:10 ` [PATCH v9 33/42] x86/shstk: Check that signal frame is shadow stack mem Rick Edgecombe
                   ` (11 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe

The shadow stack signal frame is read by the kernel on sigreturn. It
relies on shadow stack memory protections to prevent forgeries of this
signal frame (which includes the pre-signal SSP). It also relies on the
shadow stack signal frame to have bit 63 set. Since this bit would not be
set via typical shadow stack operations, the kernel can assume it was a
value it placed there.

However, in order to support 32 bit shadow stack, the INCSSPD instruction
can increment the shadow stack pointer by 4 bytes. In this case the SSP
might point into a region spanning two 8 byte shadow stack frames, which
could confuse the checks described above.

Since the kernel only supports shadow stack in 64 bit, just check that
the SSP is 8 byte aligned in the sigreturn path.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
v9:
 - New patch
---
 arch/x86/kernel/shstk.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index f02e8ea4f1b5..a8705f7d966c 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -252,6 +252,9 @@ static int shstk_pop_sigframe(unsigned long *ssp)
 	unsigned long token_addr;
 	int err;
 
+	if (!IS_ALIGNED(*ssp, 8))
+		return -EINVAL;
+
 	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
 	if (unlikely(err))
 		return err;
-- 
2.34.1



* [PATCH v9 33/42] x86/shstk: Check that signal frame is shadow stack mem
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (31 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 32/42] x86/shstk: Check that SSP is aligned on sigreturn Rick Edgecombe
@ 2023-06-13  0:10 ` Rick Edgecombe
  2023-06-13  0:11 ` [PATCH v9 34/42] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
                   ` (10 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:10 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe

The shadow stack signal frame is read by the kernel on sigreturn. It
relies on shadow stack memory protections to prevent forgeries of this
signal frame (which included the pre-signal SSP). This behavior helps
userspace protect itself. However, using the INCSSP instruction userspace
can adjust the SSP to 8 bytes beyond the end of a shadow stack. INCSSP
performs shadow stack reads to make sure it doesn’t increment off of the
shadow stack, but on the end position it actually reads 8 bytes below the
new SSP.

For the shadow stack HW operations, this situation (INCSSP off the end
of a shadow stack by 8 bytes) would be fine. If a RET is executed, the
push to the shadow stack would fail to write to the shadow stack. If a
CALL is executed, the SSP will be incremented back onto the stack and the
return address will be written successfully to the very end. That is
expected behavior around shadow stack underflow.

However, the kernel doesn’t have a way to read shadow stack memory using
shadow stack accesses. WRUSS can write to shadow stack memory with a
shadow stack access which ensures the access is to shadow stack memory.
But unfortunately for this case, there is no equivalent instruction for
shadow stack reads. So when reading the shadow stack signal frames, the
kernel currently assumes the SSP is pointing to the shadow stack and uses
a normal read.

The SSP pointing to shadow stack memory will be true in most cases, but
as described above, it can be off by 8 bytes. So look up the VMA of the
shadow stack sigframe being read to verify it is shadow stack.

Since the SSP can only be beyond the shadow stack by 8 bytes, and
shadow stack memory is page aligned, this check only needs to be done
when this type of relative position to a page boundary is encountered.
So skip the extra work otherwise.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
v9:
 - New patch
---
 arch/x86/kernel/shstk.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index a8705f7d966c..50733a510446 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -249,15 +249,38 @@ static int shstk_push_sigframe(unsigned long *ssp)
 
 static int shstk_pop_sigframe(unsigned long *ssp)
 {
+	struct vm_area_struct *vma;
 	unsigned long token_addr;
-	int err;
+	bool need_to_check_vma;
+	int err = 1;
 
+	/*
+	 * It is possible for the SSP to be off the end of a shadow stack by 4
+	 * or 8 bytes. If the shadow stack is at the start of a page or 4 bytes
+	 * before it, it might be this case, so check that the address being
+	 * read is actually shadow stack.
+	 */
 	if (!IS_ALIGNED(*ssp, 8))
 		return -EINVAL;
 
+	need_to_check_vma = PAGE_ALIGN(*ssp) == *ssp;
+
+	if (need_to_check_vma)
+		mmap_read_lock_killable(current->mm);
+
 	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
 	if (unlikely(err))
-		return err;
+		goto out_err;
+
+	if (need_to_check_vma) {
+		vma = find_vma(current->mm, *ssp);
+		if (!vma || !(vma->vm_flags & VM_SHADOW_STACK)) {
+			err = -EFAULT;
+			goto out_err;
+		}
+
+		mmap_read_unlock(current->mm);
+	}
 
 	/* Restore SSP aligned? */
 	if (unlikely(!IS_ALIGNED(token_addr, 8)))
@@ -270,6 +293,10 @@ static int shstk_pop_sigframe(unsigned long *ssp)
 	*ssp = token_addr;
 
 	return 0;
+out_err:
+	if (need_to_check_vma)
+		mmap_read_unlock(current->mm);
+	return err;
 }
 
 int setup_signal_shadow_stack(struct ksignal *ksig)
-- 
2.34.1



* [PATCH v9 34/42] x86/shstk: Introduce map_shadow_stack syscall
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (32 preceding siblings ...)
  2023-06-13  0:10 ` [PATCH v9 33/42] x86/shstk: Check that signal frame is shadow stack mem Rick Edgecombe
@ 2023-06-13  0:11 ` Rick Edgecombe
  2023-06-13  0:11 ` [PATCH v9 35/42] x86/shstk: Support WRSS for userspace Rick Edgecombe
                   ` (9 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:11 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace to allocate and
pivot to userspace-managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be setup
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But, the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace: how can this special data be provisioned without
making the shadow stack generally writable?

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a couple of downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
   ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
   restore tokens being written into the middle of pre-used shadow stacks.
   It is ideal to prevent restore tokens being added at arbitrary
   locations, so the check was to make sure the shadow stack had never been
   written to.
3. It stood out from the rest of the madvise flags, as more of direct
   action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and setup new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
set up its own shadow stacks using the WRSS instruction. Towards this
end, provide a flag so that stacks can optionally be set up securely for
the common case of ucontext without enabling WRSS. Or potentially have
the kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 arch/x86/include/uapi/asm/mman.h       |  3 ++
 arch/x86/kernel/shstk.c                | 59 ++++++++++++++++++++++----
 include/linux/syscalls.h               |  1 +
 include/uapi/asm-generic/unistd.h      |  2 +-
 kernel/sys_ni.c                        |  1 +
 6 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..f65c671ce3b1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	64	map_shadow_stack	sys_map_shadow_stack
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 5a0256e73f1e..8148bdddbd2c 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -13,6 +13,9 @@
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
+/* Flags for map_shadow_stack(2) */
+#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)	/* Set up a restore token in the shadow stack */
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 50733a510446..04c37b33a625 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -17,6 +17,7 @@
 #include <linux/compat.h>
 #include <linux/sizes.h>
 #include <linux/user.h>
+#include <linux/syscalls.h>
 #include <asm/msr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/types.h>
@@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
 	return 0;
 }
 
-static unsigned long alloc_shstk(unsigned long size)
+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
+				 unsigned long token_offset, bool set_res_tok)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
 	struct mm_struct *mm = current->mm;
-	unsigned long addr, unused;
+	unsigned long mapped_addr, unused;
+
+	if (addr)
+		flags |= MAP_FIXED_NOREPLACE;
 
 	mmap_write_lock(mm);
-	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
-		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
-
+	mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+			      VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 
-	return addr;
+	if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
+		goto out;
+
+	if (create_rstor_token(mapped_addr + token_offset, NULL)) {
+		vm_munmap(mapped_addr, size);
+		return -EINVAL;
+	}
+
+out:
+	return mapped_addr;
 }
 
 static unsigned long adjust_shstk_size(unsigned long size)
@@ -134,7 +147,7 @@ static int shstk_setup(void)
 		return -EOPNOTSUPP;
 
 	size = adjust_shstk_size(0);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -178,7 +191,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long cl
 		return 0;
 
 	size = adjust_shstk_size(stack_size);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return addr;
 
@@ -398,6 +411,36 @@ static int shstk_disable(void)
 	return 0;
 }
 
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+	bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+	unsigned long aligned_size;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	if (flags & ~SHADOW_STACK_SET_TOKEN)
+		return -EINVAL;
+
+	/* If there isn't space for a token */
+	if (set_tok && size < 8)
+		return -ENOSPC;
+
+	if (addr && addr < SZ_4G)
+		return -ERANGE;
+
+	/*
+	 * An overflow would result in attempting to write the restore token
+	 * to the wrong location. Not catastrophic, but just return the right
+	 * error code and block it.
+	 */
+	aligned_size = PAGE_ALIGN(size);
+	if (aligned_size < size)
+		return -EOVERFLOW;
+
+	return alloc_shstk(addr, aligned_size, size, set_tok);
+}
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_SHSTK_LOCK) {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33a0ee3bcb2e..392dc11e3556 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1058,6 +1058,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b12940ec5926 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..cb9aebd34646 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -381,6 +381,7 @@ COND_SYSCALL(vm86old);
 COND_SYSCALL(modify_ldt);
 COND_SYSCALL(vm86);
 COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);
 
 /* s390 */
 COND_SYSCALL(s390_pci_mmio_read);
-- 
2.34.1



* [PATCH v9 35/42] x86/shstk: Support WRSS for userspace
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (33 preceding siblings ...)
  2023-06-13  0:11 ` [PATCH v9 34/42] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
@ 2023-06-13  0:11 ` Rick Edgecombe
  2023-06-13  0:11 ` [PATCH v9 36/42] x86: Expose thread features in /proc/$PID/status Rick Edgecombe
                   ` (8 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:11 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

For the current shadow stack implementation, shadow stack contents can't
easily be provisioned with arbitrary data. This property helps apps
protect themselves better, but also restricts any potential apps that may
want to do exotic things at the expense of a little security.

The x86 shadow stack feature introduces a new instruction, WRSS, which
can be enabled to write directly to shadow stack memory from userspace.
Allow it to get enabled via the prctl interface.

Only enable the userspace WRSS instruction, which allows writes to
userspace shadow stacks from userspace. Do not allow it to be enabled
independently of shadow stack, as HW does not support using WRSS when
shadow stack is disabled.

From a fault handler perspective, WRSS will behave very similarly to WRUSS,
which is treated like a user access from a #PF err code perspective.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/uapi/asm/prctl.h |  1 +
 arch/x86/kernel/shstk.c           | 43 ++++++++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 6a8e0e1bff4a..eedfde3b63be 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -36,5 +36,6 @@
 
 /* ARCH_SHSTK_ features bits */
 #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
+#define ARCH_SHSTK_WRSS			(1ULL <<  1)
 
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 04c37b33a625..ea0bf113f9cf 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -390,6 +390,47 @@ void shstk_free(struct task_struct *tsk)
 	unmap_shadow_stack(shstk->base, shstk->size);
 }
 
+static int wrss_control(bool enable)
+{
+	u64 msrval;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	/*
+	 * Only enable WRSS if shadow stack is enabled. If shadow stack is not
+	 * enabled, WRSS will already be disabled, so don't bother clearing it
+	 * when disabling.
+	 */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return -EPERM;
+
+	/* Already enabled/disabled? */
+	if (features_enabled(ARCH_SHSTK_WRSS) == enable)
+		return 0;
+
+	fpregs_lock_and_load();
+	rdmsrl(MSR_IA32_U_CET, msrval);
+
+	if (enable) {
+		features_set(ARCH_SHSTK_WRSS);
+		msrval |= CET_WRSS_EN;
+	} else {
+		features_clr(ARCH_SHSTK_WRSS);
+		if (!(msrval & CET_WRSS_EN))
+			goto unlock;
+
+		msrval &= ~CET_WRSS_EN;
+	}
+
+	wrmsrl(MSR_IA32_U_CET, msrval);
+
+unlock:
+	fpregs_unlock();
+
+	return 0;
+}
+
 static int shstk_disable(void)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
@@ -406,7 +447,7 @@ static int shstk_disable(void)
 	fpregs_unlock();
 
 	shstk_free(current);
-	features_clr(ARCH_SHSTK_SHSTK);
+	features_clr(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS);
 
 	return 0;
 }
-- 
2.34.1



* [PATCH v9 36/42] x86: Expose thread features in /proc/$PID/status
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (34 preceding siblings ...)
  2023-06-13  0:11 ` [PATCH v9 35/42] x86/shstk: Support WRSS for userspace Rick Edgecombe
@ 2023-06-13  0:11 ` Rick Edgecombe
  2023-06-13  0:11 ` [PATCH v9 37/42] x86/shstk: Wire in shadow stack interface Rick Edgecombe
                   ` (7 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:11 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

Applications and loaders can have logic to decide whether to enable
shadow stack. They usually don't report whether shadow stack has been
enabled or not, so there is no way to verify whether an application
actually is protected by shadow stack.

Add two lines in /proc/$PID/status to report enabled and locked features.

Since this involves referring to arch-specific defines in asm/prctl.h,
implement an arch breakout to emit the feature lines.
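With both features enabled and then locked, the new lines might look like this
(hypothetical values):

```
$ grep x86_Thread /proc/$PID/status
x86_Thread_features:		shstk wrss
x86_Thread_features_locked:	shstk wrss
```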

[Switched to CET, added to commit log]

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/cpu/proc.c | 23 +++++++++++++++++++++++
 fs/proc/array.c            |  6 ++++++
 include/linux/proc_fs.h    |  2 ++
 3 files changed, 31 insertions(+)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 099b6f0d96bd..31c0e68f6227 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -4,6 +4,8 @@
 #include <linux/string.h>
 #include <linux/seq_file.h>
 #include <linux/cpufreq.h>
+#include <asm/prctl.h>
+#include <linux/proc_fs.h>
 
 #include "cpu.h"
 
@@ -175,3 +177,24 @@ const struct seq_operations cpuinfo_op = {
 	.stop	= c_stop,
 	.show	= show_cpuinfo,
 };
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static void dump_x86_features(struct seq_file *m, unsigned long features)
+{
+	if (features & ARCH_SHSTK_SHSTK)
+		seq_puts(m, "shstk ");
+	if (features & ARCH_SHSTK_WRSS)
+		seq_puts(m, "wrss ");
+}
+
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task)
+{
+	seq_puts(m, "x86_Thread_features:\t");
+	dump_x86_features(m, task->thread.features);
+	seq_putc(m, '\n');
+
+	seq_puts(m, "x86_Thread_features_locked:\t");
+	dump_x86_features(m, task->thread.features_locked);
+	seq_putc(m, '\n');
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
diff --git a/fs/proc/array.c b/fs/proc/array.c
index d35bbf35a874..2c2efbe685d8 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -431,6 +431,11 @@ static inline void task_untag_mask(struct seq_file *m, struct mm_struct *mm)
 	seq_printf(m, "untag_mask:\t%#lx\n", mm_untag_mask(mm));
 }
 
+__weak void arch_proc_pid_thread_features(struct seq_file *m,
+					  struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -455,6 +460,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	arch_proc_pid_thread_features(m, task);
 	return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 0260f5ea98fe..80ff8e533cbd 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -158,6 +158,8 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task);
 #endif /* CONFIG_PROC_PID_ARCH_STATUS */
 
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task);
+
 #else /* CONFIG_PROC_FS */
 
 static inline void proc_root_init(void)
-- 
2.34.1



* [PATCH v9 37/42] x86/shstk: Wire in shadow stack interface
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (35 preceding siblings ...)
  2023-06-13  0:11 ` [PATCH v9 36/42] x86: Expose thread features in /proc/$PID/status Rick Edgecombe
@ 2023-06-13  0:11 ` Rick Edgecombe
  2023-06-13  0:11 ` [PATCH v9 38/42] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
                   ` (6 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:11 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

The kernel now has the main shadow stack functionality to support
applications. Wire in the WRSS and shadow stack enable/disable functions
into the existing shadow stack API skeleton.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/shstk.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index ea0bf113f9cf..d723cdc93474 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -502,9 +502,17 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 		return -EINVAL;
 
 	if (option == ARCH_SHSTK_DISABLE) {
+		if (features & ARCH_SHSTK_WRSS)
+			return wrss_control(false);
+		if (features & ARCH_SHSTK_SHSTK)
+			return shstk_disable();
 		return -EINVAL;
 	}
 
 	/* Handle ARCH_SHSTK_ENABLE */
+	if (features & ARCH_SHSTK_SHSTK)
+		return shstk_setup();
+	if (features & ARCH_SHSTK_WRSS)
+		return wrss_control(true);
 	return -EINVAL;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 38/42] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (36 preceding siblings ...)
  2023-06-13  0:11 ` [PATCH v9 37/42] x86/shstk: Wire in shadow stack interface Rick Edgecombe
@ 2023-06-13  0:11 ` Rick Edgecombe
  2023-06-13  0:11 ` [PATCH v9 39/42] selftests/x86: Add shadow stack test Rick Edgecombe
                   ` (5 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:11 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

Setting CR4.CET is a prerequisite for utilizing any CET features, most of
which also require setting MSRs.

Kernel IBT already enables the CET CR4 bit when it detects IBT HW support
and is configured with kernel IBT. However, future patches that enable
userspace shadow stack support will need the bit set as well. So change
the logic to enable it in either case.

Clear MSR_IA32_U_CET in cet_disable() so that it can't live to see
userspace in a new kexec-ed kernel that has CR4.CET set from kernel IBT.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/cpu/common.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 80710a68ef7d..3ea06b0b4570 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -601,27 +601,43 @@ __noendbr void ibt_restore(u64 save)
 
 static __always_inline void setup_cet(struct cpuinfo_x86 *c)
 {
-	u64 msr = CET_ENDBR_EN;
+	bool user_shstk, kernel_ibt;
 
-	if (!HAS_KERNEL_IBT ||
-	    !cpu_feature_enabled(X86_FEATURE_IBT))
+	if (!IS_ENABLED(CONFIG_X86_CET))
 		return;
 
-	wrmsrl(MSR_IA32_S_CET, msr);
+	kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
+	user_shstk = cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+		     IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK);
+
+	if (!kernel_ibt && !user_shstk)
+		return;
+
+	if (user_shstk)
+		set_cpu_cap(c, X86_FEATURE_USER_SHSTK);
+
+	if (kernel_ibt)
+		wrmsrl(MSR_IA32_S_CET, CET_ENDBR_EN);
+	else
+		wrmsrl(MSR_IA32_S_CET, 0);
+
 	cr4_set_bits(X86_CR4_CET);
 
-	if (!ibt_selftest()) {
+	if (kernel_ibt && !ibt_selftest()) {
 		pr_err("IBT selftest: Failed!\n");
 		wrmsrl(MSR_IA32_S_CET, 0);
 		setup_clear_cpu_cap(X86_FEATURE_IBT);
-		return;
 	}
 }
 
 __noendbr void cet_disable(void)
 {
-	if (cpu_feature_enabled(X86_FEATURE_IBT))
-		wrmsrl(MSR_IA32_S_CET, 0);
+	if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
+	      cpu_feature_enabled(X86_FEATURE_SHSTK)))
+		return;
+
+	wrmsrl(MSR_IA32_S_CET, 0);
+	wrmsrl(MSR_IA32_U_CET, 0);
 }
 
 /*
@@ -1483,6 +1499,9 @@ static void __init cpu_parse_early_param(void)
 	if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
 		setup_clear_cpu_cap(X86_FEATURE_XSAVES);
 
+	if (cmdline_find_option_bool(boot_command_line, "nousershstk"))
+		setup_clear_cpu_cap(X86_FEATURE_USER_SHSTK);
+
 	arglen = cmdline_find_option(boot_command_line, "clearcpuid", arg, sizeof(arg));
 	if (arglen <= 0)
 		return;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 39/42] selftests/x86: Add shadow stack test
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (37 preceding siblings ...)
  2023-06-13  0:11 ` [PATCH v9 38/42] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
@ 2023-06-13  0:11 ` Rick Edgecombe
  2023-06-13  0:11 ` [PATCH v9 40/42] x86: Add PTRACE interface for shadow stack Rick Edgecombe
                   ` (4 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:11 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

Add a simple selftest for exercising some shadow stack behavior:
 - map_shadow_stack syscall and pivot
 - Faulting in shadow stack memory
 - Handling shadow stack violations
 - GUP of shadow stack memory
 - mprotect() of shadow stack memory
 - Userfaultfd on shadow stack memory
 - 32 bit segmentation
 - Guard gap test
 - Ptrace test

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
v9:
 - Fix race in userfaultfd test
 - Add guard gap test
 - Add ptrace test
---
 tools/testing/selftests/x86/Makefile          |   2 +-
 .../testing/selftests/x86/test_shadow_stack.c | 884 ++++++++++++++++++
 2 files changed, 885 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 598135d3162b..7e8c937627dd 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
 TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
-			corrupt_xstate_header amx lam
+			corrupt_xstate_header amx lam test_shadow_stack
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
new file mode 100644
index 000000000000..5ab788dc1ce6
--- /dev/null
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -0,0 +1,884 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program tests basic kernel shadow stack support. It enables shadow
+ * stack manually via arch_prctl(), instead of relying on glibc. Its
+ * Makefile doesn't compile with shadow stack support, so it doesn't rely on
+ * any particular glibc. As a result it can't do any operations that require
+ * special glibc shadow stack support (longjmp(), swapcontext(), etc). Just
+ * stick to the basics and hope the compiler doesn't do anything strange.
+ */
+
+#define _GNU_SOURCE
+
+#include <sys/syscall.h>
+#include <asm/mman.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <x86intrin.h>
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#include <stdint.h>
+#include <signal.h>
+#include <pthread.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <setjmp.h>
+#include <sys/ptrace.h>
+#include <sys/signal.h>
+#include <linux/elf.h>
+
+/*
+ * Define the ABI constants if needed, so people can run the tests
+ * without building the headers.
+ */
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack	451
+
+#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)
+
+#define ARCH_SHSTK_ENABLE	0x5001
+#define ARCH_SHSTK_DISABLE	0x5002
+#define ARCH_SHSTK_LOCK		0x5003
+#define ARCH_SHSTK_UNLOCK	0x5004
+#define ARCH_SHSTK_STATUS	0x5005
+
+#define ARCH_SHSTK_SHSTK	(1ULL <<  0)
+#define ARCH_SHSTK_WRSS		(1ULL <<  1)
+
+#define NT_X86_SHSTK	0x204
+#endif
+
+#define SS_SIZE 0x200000
+#define PAGE_SIZE 0x1000
+
+#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
+int main(int argc, char *argv[])
+{
+	printf("[SKIP]\tCompiler does not support CET.\n");
+	return 0;
+}
+#else
+void write_shstk(unsigned long *addr, unsigned long val)
+{
+	asm volatile("wrssq %[val], (%[addr])\n"
+		     : "=m" (addr)
+		     : [addr] "r" (addr), [val] "r" (val));
+}
+
+static inline unsigned long __attribute__((always_inline)) get_ssp(void)
+{
+	unsigned long ret = 0;
+
+	asm volatile("xor %0, %0; rdsspq %0" : "=r" (ret));
+	return ret;
+}
+
+/*
+ * For use in inline enablement of shadow stack.
+ *
+ * The program can't return from the point where shadow stack gets enabled
+ * because there will be no address on the shadow stack. So it can't use
+ * syscall() for enablement, since it is a function.
+ *
+ * Based on code from nolibc.h. Keep a copy here because this can't pull in all
+ * of nolibc.h.
+ */
+#define ARCH_PRCTL(arg1, arg2)					\
+({								\
+	long _ret;						\
+	register long _num  asm("eax") = __NR_arch_prctl;	\
+	register long _arg1 asm("rdi") = (long)(arg1);		\
+	register long _arg2 asm("rsi") = (long)(arg2);		\
+								\
+	asm volatile (						\
+		"syscall\n"					\
+		: "=a"(_ret)					\
+		: "r"(_arg1), "r"(_arg2),			\
+		  "0"(_num)					\
+		: "rcx", "r11", "memory", "cc"			\
+	);							\
+	_ret;							\
+})
+
+void *create_shstk(void *addr)
+{
+	return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
+}
+
+void *create_normal_mem(void *addr)
+{
+	return mmap(addr, SS_SIZE, PROT_READ | PROT_WRITE,
+		    MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+}
+
+void free_shstk(void *shstk)
+{
+	munmap(shstk, SS_SIZE);
+}
+
+int reset_shstk(void *shstk)
+{
+	return madvise(shstk, SS_SIZE, MADV_DONTNEED);
+}
+
+void try_shstk(unsigned long new_ssp)
+{
+	unsigned long ssp;
+
+	printf("[INFO]\tnew_ssp = %lx, *new_ssp = %lx\n",
+	       new_ssp, *((unsigned long *)new_ssp));
+
+	ssp = get_ssp();
+	printf("[INFO]\tchanging ssp from %lx to %lx\n", ssp, new_ssp);
+
+	asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
+	asm volatile("saveprevssp");
+	printf("[INFO]\tssp is now %lx\n", get_ssp());
+
+	/* Switch back to original shadow stack */
+	ssp -= 8;
+	asm volatile("rstorssp (%0)\n":: "r" (ssp));
+	asm volatile("saveprevssp");
+}
+
+int test_shstk_pivot(void)
+{
+	void *shstk = create_shstk(0);
+
+	if (shstk == MAP_FAILED) {
+		printf("[FAIL]\tError creating shadow stack: %d\n", errno);
+		return 1;
+	}
+	try_shstk((unsigned long)shstk + SS_SIZE - 8);
+	free_shstk(shstk);
+
+	printf("[OK]\tShadow stack pivot\n");
+	return 0;
+}
+
+int test_shstk_faults(void)
+{
+	unsigned long *shstk = create_shstk(0);
+
+	/* Read the shadow stack; check that it's zero so the read isn't optimized out */
+	if (*shstk != 0)
+		goto err;
+
+	/* Wrss memory that was already read. */
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	/* Page out memory, so we can wrss it again. */
+	if (reset_shstk((void *)shstk))
+		goto err;
+
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	printf("[OK]\tShadow stack faults\n");
+	return 0;
+
+err:
+	return 1;
+}
+
+unsigned long saved_ssp;
+unsigned long saved_ssp_val;
+volatile bool segv_triggered;
+
+void __attribute__((noinline)) violate_ss(void)
+{
+	saved_ssp = get_ssp();
+	saved_ssp_val = *(unsigned long *)saved_ssp;
+
+	/* Corrupt shadow stack */
+	printf("[INFO]\tCorrupting shadow stack\n");
+	write_shstk((void *)saved_ssp, 0);
+}
+
+void segv_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tGenerated shadow stack violation successfully\n");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	write_shstk((void *)saved_ssp, saved_ssp_val);
+}
+
+int test_shstk_violation(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = segv_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* Make sure segv_triggered is set before violate_ss() */
+	asm volatile("" : : : "memory");
+
+	violate_ss();
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow stack violation test\n");
+
+	return !segv_triggered;
+}
+
+/* Gup test state */
+#define MAGIC_VAL 0x12345678
+bool is_shstk_access;
+void *shstk_ptr;
+int fd;
+
+void reset_test_shstk(void *addr)
+{
+	if (shstk_ptr)
+		free_shstk(shstk_ptr);
+	shstk_ptr = create_shstk(addr);
+}
+
+void test_access_fix_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tViolation from %s\n", is_shstk_access ? "shstk access" : "normal write");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	if (is_shstk_access) {
+		reset_test_shstk(shstk_ptr);
+		return;
+	}
+
+	free_shstk(shstk_ptr);
+	create_normal_mem(shstk_ptr);
+}
+
+bool test_shstk_access(void *ptr)
+{
+	is_shstk_access = true;
+	segv_triggered = false;
+	write_shstk(ptr, MAGIC_VAL);
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool test_write_access(void *ptr)
+{
+	is_shstk_access = false;
+	segv_triggered = false;
+	*(unsigned long *)ptr = MAGIC_VAL;
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool gup_write(void *ptr)
+{
+	unsigned long val;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (write(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+bool gup_read(void *ptr)
+{
+	unsigned long val;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (read(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+int test_gup(void)
+{
+	struct sigaction sa = {};
+	int status;
+	pid_t pid;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	fd = open("/proc/self/mem", O_RDWR);
+	if (fd == -1)
+		return 1;
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> write access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> write access success\n");
+
+	close(fd);
+
+	/* COW/gup test */
+	reset_test_shstk(0);
+	pid = fork();
+	if (!pid) {
+		fd = open("/proc/self/mem", O_RDWR);
+		if (fd == -1)
+			exit(1);
+
+		if (gup_write(shstk_ptr)) {
+			close(fd);
+			exit(1);
+		}
+		close(fd);
+		exit(0);
+	}
+	waitpid(pid, &status, 0);
+	if (WEXITSTATUS(status)) {
+		printf("[FAIL]\tWrite in child failed\n");
+		return 1;
+	}
+	if (*(unsigned long *)shstk_ptr == MAGIC_VAL) {
+		printf("[FAIL]\tWrite in child wrote through to shared memory\n");
+		return 1;
+	}
+
+	printf("[INFO]\tCow gup write -> write access success\n");
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow gup test\n");
+
+	return 0;
+}
+
+int test_mprotect(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* mprotect a shadow stack as read only */
+	reset_test_shstk(0);
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+		return 1;
+	}
+
+	/* try to wrss it and fail */
+	if (!test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to read-only memory succeeded\n");
+		return 1;
+	}
+
+	/*
+	 * The shadow stack was reset above to resolve the fault, make the new one
+	 * read-only.
+	 */
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+		return 1;
+	}
+
+	/* then back to writable */
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_WRITE | PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_WRITE) failed\n");
+		return 1;
+	}
+
+	/* then wrss to it and succeed */
+	if (test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to mprotect() writable memory failed\n");
+		return 1;
+	}
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tmprotect() test\n");
+
+	return 0;
+}
+
+char zero[4096];
+
+static void *uffd_thread(void *arg)
+{
+	struct uffdio_copy req;
+	int uffd = *(int *)arg;
+	struct uffd_msg msg;
+	int ret;
+
+	while (1) {
+		ret = read(uffd, &msg, sizeof(msg));
+		if (ret > 0)
+			break;
+		else if (errno == EAGAIN)
+			continue;
+		return (void *)1;
+	}
+
+	req.dst = msg.arg.pagefault.address;
+	req.src = (__u64)zero;
+	req.len = 4096;
+	req.mode = 0;
+
+	if (ioctl(uffd, UFFDIO_COPY, &req))
+		return (void *)1;
+
+	return (void *)0;
+}
+
+int test_userfaultfd(void)
+{
+	struct uffdio_register uffdio_register;
+	struct uffdio_api uffdio_api;
+	struct sigaction sa = {};
+	pthread_t thread;
+	void *res;
+	int uffd;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0) {
+		printf("[SKIP]\tUserfaultfd unavailable.\n");
+		return 0;
+	}
+
+	reset_test_shstk(0);
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+		goto err;
+
+	uffdio_register.range.start = (__u64)shstk_ptr;
+	uffdio_register.range.len = 4096;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		goto err;
+
+	if (pthread_create(&thread, NULL, &uffd_thread, &uffd))
+		goto err;
+
+	reset_shstk(shstk_ptr);
+	test_shstk_access(shstk_ptr);
+
+	if (pthread_join(thread, &res))
+		goto err;
+
+	if (test_shstk_access(shstk_ptr))
+		goto err;
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	if (!res)
+		printf("[OK]\tUserfaultfd test\n");
+	return !!res;
+err:
+	free_shstk(shstk_ptr);
+	close(uffd);
+	signal(SIGSEGV, SIG_DFL);
+	return 1;
+}
+
+/* Simple linked list for keeping track of mappings in test_guard_gap() */
+struct node {
+	struct node *next;
+	void *mapping;
+};
+
+/*
+ * This tests whether mmap will place other mappings in a shadow stack's guard
+ * gap. The steps are:
+ *   1. Find an empty place by mapping and unmapping something.
+ *   2. Map a shadow stack in the middle of the known empty area.
+ *   3. Map a bunch of PAGE_SIZE mappings. These will use the search-down
+ *      direction, filling any gaps until they encounter the shadow stack's
+ *      guard gap.
+ *   4. When a mapping lands below the shadow stack from step 2, all of
+ *      the above gaps are filled. The search-down algorithm will have
+ *      considered the shadow stack's gaps.
+ *   5. See if it landed in the gap.
+ */
+int test_guard_gap(void)
+{
+	void *free_area, *shstk, *test_map = (void *)0xFFFFFFFFFFFFFFFF;
+	struct node *head = NULL, *cur;
+
+	free_area = mmap(0, SS_SIZE * 3, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(free_area, SS_SIZE * 3);
+
+	shstk = create_shstk(free_area + SS_SIZE);
+	if (shstk == MAP_FAILED)
+		return 1;
+
+	while (test_map > shstk) {
+		test_map = mmap(0, PAGE_SIZE, PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (test_map == MAP_FAILED)
+			return 1;
+		cur = malloc(sizeof(*cur));
+		cur->mapping = test_map;
+
+		cur->next = head;
+		head = cur;
+	}
+
+	while (head) {
+		cur = head;
+		head = cur->next;
+		munmap(cur->mapping, PAGE_SIZE);
+		free(cur);
+	}
+
+	free_shstk(shstk);
+
+	if (shstk - test_map - PAGE_SIZE != PAGE_SIZE)
+		return 1;
+
+	printf("[OK]\tGuard gap test\n");
+
+	return 0;
+}
+
+/*
+ * It's too complicated to pull this out of the 32 bit headers while also
+ * getting the 64 bit ones needed above. Just define a copy here.
+ */
+#define __NR_compat_sigaction 67
+
+/*
+ * Register a 32 bit signal handler to get the 32 bit signal ABI. Make sure
+ * to push the registers that will get clobbered.
+ */
+int sigaction32(int signum, const struct sigaction *restrict act,
+		struct sigaction *restrict oldact)
+{
+	register long syscall_reg asm("eax") = __NR_compat_sigaction;
+	register long signum_reg asm("ebx") = signum;
+	register long act_reg asm("ecx") = (long)act;
+	register long oldact_reg asm("edx") = (long)oldact;
+	int ret = 0;
+
+	asm volatile ("int $0x80;"
+		      : "=a"(ret), "=m"(oldact)
+		      : "r"(syscall_reg), "r"(signum_reg), "r"(act_reg),
+			"r"(oldact_reg)
+		      : "r8", "r9", "r10", "r11"
+		     );
+
+	return ret;
+}
+
+sigjmp_buf jmp_buffer;
+
+void segv_gp_handler(int signum, siginfo_t *si, void *uc)
+{
+	segv_triggered = true;
+
+	/*
+	 * To work with old glibc, this can't rely on siglongjmp working with
+	 * shadow stack enabled, so disable shadow stack before siglongjmp().
+	 */
+	ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
+	siglongjmp(jmp_buffer, -1);
+}
+
+/*
+ * Transition to 32 bit mode and check that a #GP triggers a segfault.
+ */
+int test_32bit(void)
+{
+	struct sigaction sa = {};
+	struct sigaction *sa32;
+
+	/* Create sigaction in 32 bit address range */
+	sa32 = mmap(0, 4096, PROT_READ | PROT_WRITE,
+		    MAP_32BIT | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	sa32->sa_flags = SA_SIGINFO;
+
+	sa.sa_sigaction = segv_gp_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+
+	segv_triggered = false;
+
+	/* Make sure segv_triggered is set before triggering the #GP */
+	asm volatile("" : : : "memory");
+
+	/*
+	 * Set handler to somewhere in 32 bit address space
+	 */
+	sa32->sa_handler = (void *)sa32;
+	if (sigaction32(SIGUSR1, sa32, NULL))
+		return 1;
+
+	if (!sigsetjmp(jmp_buffer, 1))
+		raise(SIGUSR1);
+
+	if (segv_triggered)
+		printf("[OK]\t32 bit test\n");
+
+	return !segv_triggered;
+}
+
+void segv_handler_ptrace(int signum, siginfo_t *si, void *uc)
+{
+	/* The SSP adjustment caused a segfault. */
+	exit(0);
+}
+
+int test_ptrace(void)
+{
+	unsigned long saved_ssp, ssp = 0;
+	struct sigaction sa = {};
+	struct iovec iov;
+	int status;
+	int pid;
+
+	iov.iov_base = &ssp;
+	iov.iov_len = sizeof(ssp);
+
+	pid = fork();
+	if (!pid) {
+		ssp = get_ssp();
+
+		sa.sa_sigaction = segv_handler_ptrace;
+		sa.sa_flags = SA_SIGINFO;
+		if (sigaction(SIGSEGV, &sa, NULL))
+			return 1;
+
+		ptrace(PTRACE_TRACEME, NULL, NULL, NULL);
+		/*
+		 * The parent will tweak the SSP and return from this function
+		 * will #CP.
+		 */
+		raise(SIGTRAP);
+
+		exit(1);
+	}
+
+	while (waitpid(pid, &status, 0) != -1 && WSTOPSIG(status) != SIGTRAP);
+
+	if (ptrace(PTRACE_GETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tFailed to PTRACE_GETREGSET\n");
+		goto out_kill;
+	}
+
+	if (!ssp) {
+		printf("[INFO]\tPtrace child SSP was 0\n");
+		goto out_kill;
+	}
+
+	saved_ssp = ssp;
+
+	iov.iov_len = 0;
+	if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tToo small size accepted via PTRACE_SETREGSET\n");
+		goto out_kill;
+	}
+
+	iov.iov_len = sizeof(ssp) + 1;
+	if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tToo large size accepted via PTRACE_SETREGSET\n");
+		goto out_kill;
+	}
+
+	ssp += 1;
+	if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tUnaligned SSP written via PTRACE_SETREGSET\n");
+		goto out_kill;
+	}
+
+	ssp = 0xFFFFFFFFFFFF0000;
+	if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tKernel range SSP written via PTRACE_SETREGSET\n");
+		goto out_kill;
+	}
+
+	/*
+	 * Tweak the SSP so the child will #CP when it resumes and returns
+	 * from raise().
+	 */
+	ssp = saved_ssp + 8;
+	iov.iov_len = sizeof(ssp);
+	if (ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tFailed to PTRACE_SETREGSET\n");
+		goto out_kill;
+	}
+
+	if (ptrace(PTRACE_DETACH, pid, NULL, NULL)) {
+		printf("[INFO]\tFailed to PTRACE_DETACH\n");
+		goto out_kill;
+	}
+
+	waitpid(pid, &status, 0);
+	if (WEXITSTATUS(status))
+		return 1;
+
+	printf("[OK]\tPtrace test\n");
+	return 0;
+
+out_kill:
+	kill(pid, SIGKILL);
+	return 1;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = 0;
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+		printf("[SKIP]\tCould not enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+		printf("[SKIP]\tCould not re-enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS)) {
+		printf("[SKIP]\tCould not enable WRSS\n");
+		ret = 1;
+		goto out;
+	}
+
+	/* Should have succeeded if here, but this is a test, so double check. */
+	if (!get_ssp()) {
+		printf("[FAIL]\tShadow stack disabled\n");
+		return 1;
+	}
+
+	if (test_shstk_pivot()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack pivot\n");
+		goto out;
+	}
+
+	if (test_shstk_faults()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack fault test\n");
+		goto out;
+	}
+
+	if (test_shstk_violation()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack violation test\n");
+		goto out;
+	}
+
+	if (test_gup()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack gup test\n");
+		goto out;
+	}
+
+	if (test_mprotect()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack mprotect test\n");
+		goto out;
+	}
+
+	if (test_userfaultfd()) {
+		ret = 1;
+		printf("[FAIL]\tUserfaultfd test\n");
+		goto out;
+	}
+
+	if (test_guard_gap()) {
+		ret = 1;
+		printf("[FAIL]\tGuard gap test\n");
+		goto out;
+	}
+
+	if (test_ptrace()) {
+		ret = 1;
+		printf("[FAIL]\tptrace test\n");
+	}
+
+	if (test_32bit()) {
+		ret = 1;
+		printf("[FAIL]\t32 bit test\n");
+		goto out;
+	}
+
+	return ret;
+
+out:
+	/*
+	 * Disable shadow stack before the function returns, or there will be a
+	 * shadow stack violation.
+	 */
+	if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	return ret;
+}
+#endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 40/42] x86: Add PTRACE interface for shadow stack
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (38 preceding siblings ...)
  2023-06-13  0:11 ` [PATCH v9 39/42] selftests/x86: Add shadow stack test Rick Edgecombe
@ 2023-06-13  0:11 ` Rick Edgecombe
  2023-06-13  0:11 ` [PATCH v9 41/42] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
                   ` (3 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:11 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Yu-cheng Yu, Pengfei Xu

Some applications (like GDB) would like to tweak shadow stack state via
ptrace. This allows for existing functionality to continue to work for
seized shadow stack applications. Provide a regset interface for
manipulating the shadow stack pointer (SSP).

There is already ptrace functionality for accessing xstate, but this
does not include supervisor xfeatures. So there is not a completely
clear place for where to put the shadow stack state. Adding it to the
user xfeatures regset would complicate that code, as it currently shares
logic with signals which should not have supervisor features.

Don't add a general supervisor xfeature regset like the user one,
because it is better to maintain flexibility for other supervisor
xfeatures to define their own interface. For example, an xfeature may
decide not to expose all of its state to userspace, as is actually the
case for shadow stack ptrace functionality. A lot of enum values remain
to be used, so just put it in a dedicated shadow stack regset.

The only downside to not having a generic supervisor xfeature regset
is that apps need to be enlightened about any new supervisor xfeature
exposed this way (i.e. they can't try to have generic save/restore
logic). But maybe that is a good thing, because they have to think
through each new xfeature instead of encountering issues when a new
supervisor xfeature is added.

Adding a shadow stack regset also has the effect of including the
shadow stack state in a core dump, which could be useful for debugging.

The shadow stack specific xstate includes the SSP, and the shadow stack
and WRSS enablement status. Enabling shadow stack or WRSS in the kernel
involves more than just flipping the bit. The kernel is made aware that
it has to do extra things when cloning or handling signals. That logic
is triggered off of separate feature enablement state kept in the task
struct. So flipping on HW shadow stack enforcement without notifying
the kernel to change its behavior would severely limit what an application
could do without crashing, and the results would depend on kernel
internal implementation details. There is also no known use for controlling
this state via ptrace today. So only expose the SSP, which is something
that userspace already has indirect control over.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
v9:
 - Squash "Enforce only whole copies for ssp_set()" fix that previously
   was in tip.
---
 arch/x86/include/asm/fpu/regset.h |  7 +--
 arch/x86/kernel/fpu/regset.c      | 81 +++++++++++++++++++++++++++++++
 arch/x86/kernel/ptrace.c          | 12 +++++
 include/uapi/linux/elf.h          |  2 +
 4 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/regset.h b/arch/x86/include/asm/fpu/regset.h
index 4f928d6a367b..697b77e96025 100644
--- a/arch/x86/include/asm/fpu/regset.h
+++ b/arch/x86/include/asm/fpu/regset.h
@@ -7,11 +7,12 @@
 
 #include <linux/regset.h>
 
-extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active;
+extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active,
+				ssp_active;
 extern user_regset_get2_fn fpregs_get, xfpregs_get, fpregs_soft_get,
-				 xstateregs_get;
+				 xstateregs_get, ssp_get;
 extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
-				 xstateregs_set;
+				 xstateregs_set, ssp_set;
 
 /*
  * xstateregs_active == regset_fpregs_active. Please refer to the comment
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 6d056b68f4ed..6bc1eb2a21bd 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -8,6 +8,7 @@
 #include <asm/fpu/api.h>
 #include <asm/fpu/signal.h>
 #include <asm/fpu/regset.h>
+#include <asm/prctl.h>
 
 #include "context.h"
 #include "internal.h"
@@ -174,6 +175,86 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	return ret;
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+int ssp_active(struct task_struct *target, const struct user_regset *regset)
+{
+	if (target->thread.features & ARCH_SHSTK_SHSTK)
+		return regset->n;
+
+	return 0;
+}
+
+int ssp_get(struct task_struct *target, const struct user_regset *regset,
+	    struct membuf to)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct cet_user_state *cetregs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -ENODEV;
+
+	sync_fpstate(fpu);
+	cetregs = get_xsave_addr(&fpu->fpstate->regs.xsave, XFEATURE_CET_USER);
+	if (WARN_ON(!cetregs)) {
+		/*
+		 * This shouldn't ever be NULL because shadow stack was
+		 * verified to be enabled above. This means
+		 * MSR_IA32_U_CET.CET_SHSTK_EN should be 1 and so
+		 * XFEATURE_CET_USER should not be in the init state.
+		 */
+		return -ENODEV;
+	}
+
+	return membuf_write(&to, (unsigned long *)&cetregs->user_ssp,
+			    sizeof(cetregs->user_ssp));
+}
+
+int ssp_set(struct task_struct *target, const struct user_regset *regset,
+	    unsigned int pos, unsigned int count,
+	    const void *kbuf, const void __user *ubuf)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct xregs_state *xsave = &fpu->fpstate->regs.xsave;
+	struct cet_user_state *cetregs;
+	unsigned long user_ssp;
+	int r;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !ssp_active(target, regset))
+		return -ENODEV;
+
+	if (pos != 0 || count != sizeof(user_ssp))
+		return -EINVAL;
+
+	r = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_ssp, 0, -1);
+	if (r)
+		return r;
+
+	/*
+	 * Some kernel instructions (IRET, etc) can cause exceptions in the case
+	 * of disallowed CET register values. Just prevent invalid values.
+	 */
+	if (user_ssp >= TASK_SIZE_MAX || !IS_ALIGNED(user_ssp, 8))
+		return -EINVAL;
+
+	fpu_force_restore(fpu);
+
+	cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
+	if (WARN_ON(!cetregs)) {
+		/*
+		 * This shouldn't ever be NULL because shadow stack was
+		 * verified to be enabled above. This means
+		 * MSR_IA32_U_CET.CET_SHSTK_EN should be 1 and so
+		 * XFEATURE_CET_USER should not be in the init state.
+		 */
+		return -ENODEV;
+	}
+
+	cetregs->user_ssp = user_ssp;
+	return 0;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
 
 /*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index dfaa270a7cc9..095f04bdabdc 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -58,6 +58,7 @@ enum x86_regset_64 {
 	REGSET64_FP,
 	REGSET64_IOPERM,
 	REGSET64_XSTATE,
+	REGSET64_SSP,
 };
 
 #define REGSET_GENERAL \
@@ -1267,6 +1268,17 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
 		.active		= ioperm_active,
 		.regset_get	= ioperm_get
 	},
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	[REGSET64_SSP] = {
+		.core_note_type	= NT_X86_SHSTK,
+		.n		= 1,
+		.size		= sizeof(u64),
+		.align		= sizeof(u64),
+		.active		= ssp_active,
+		.regset_get	= ssp_get,
+		.set		= ssp_set
+	},
+#endif
 };
 
 static const struct user_regset_view user_x86_64_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index ac3da855fb19..fa1ceeae2596 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -406,6 +406,8 @@ typedef struct elf64_shdr {
 #define NT_386_TLS	0x200		/* i386 TLS slots (struct user_desc) */
 #define NT_386_IOPERM	0x201		/* x86 io permission bitmap (1=deny) */
 #define NT_X86_XSTATE	0x202		/* x86 extended state using xsave */
+/* Old binutils treats 0x203 as a CET state */
+#define NT_X86_SHSTK	0x204		/* x86 SHSTK state */
 #define NT_S390_HIGH_GPRS	0x300	/* s390 upper register halves */
 #define NT_S390_TIMER	0x301		/* s390 timer register */
 #define NT_S390_TODCMP	0x302		/* s390 TOD clock comparator register */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 41/42] x86/shstk: Add ARCH_SHSTK_UNLOCK
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (39 preceding siblings ...)
  2023-06-13  0:11 ` [PATCH v9 40/42] x86: Add PTRACE interface for shadow stack Rick Edgecombe
@ 2023-06-13  0:11 ` Rick Edgecombe
  2023-06-13  0:11 ` [PATCH v9 42/42] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe
                   ` (2 subsequent siblings)
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:11 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Mike Rapoport, Pengfei Xu

From: Mike Rapoport <rppt@linux.ibm.com>

Userspace loaders may lock features before a CRIU restore operation has
the chance to set them to whatever state is required by the process
being restored. Allow a way for CRIU to unlock features. Add it as an
arch_prctl() like the other shadow stack operations, but restrict it to
being called only via the ptrace arch_prctl() interface.

[Merged into recent API changes, added commit log and docs]

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 Documentation/arch/x86/shstk.rst  | 4 ++++
 arch/x86/include/uapi/asm/prctl.h | 1 +
 arch/x86/kernel/process_64.c      | 1 +
 arch/x86/kernel/shstk.c           | 9 +++++++--
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Documentation/arch/x86/shstk.rst b/Documentation/arch/x86/shstk.rst
index f09afa504ec0..f3553cc8c758 100644
--- a/Documentation/arch/x86/shstk.rst
+++ b/Documentation/arch/x86/shstk.rst
@@ -75,6 +75,10 @@ arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
     are ignored. The mask is ORed with the existing value. So any feature bits
     set here cannot be enabled or disabled afterwards.
 
+arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
+    Unlock features. 'features' is a mask of all features to unlock. All
+    bits set are processed, unset bits are ignored. Only works via ptrace.
+
 The return values are as follows. On success, return 0. On error, errno can
 be::
 
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index eedfde3b63be..3189c4a96468 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -33,6 +33,7 @@
 #define ARCH_SHSTK_ENABLE		0x5001
 #define ARCH_SHSTK_DISABLE		0x5002
 #define ARCH_SHSTK_LOCK			0x5003
+#define ARCH_SHSTK_UNLOCK		0x5004
 
 /* ARCH_SHSTK_ features bits */
 #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 0f89aa0186d1..e6db21c470aa 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -899,6 +899,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_SHSTK_ENABLE:
 	case ARCH_SHSTK_DISABLE:
 	case ARCH_SHSTK_LOCK:
+	case ARCH_SHSTK_UNLOCK:
 		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index d723cdc93474..d43b7a9c57ce 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -489,9 +489,14 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 		return 0;
 	}
 
-	/* Don't allow via ptrace */
-	if (task != current)
+	/* Only allow via ptrace */
+	if (task != current) {
+		if (option == ARCH_SHSTK_UNLOCK && IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
+			task->thread.features_locked &= ~features;
+			return 0;
+		}
 		return -EINVAL;
+	}
 
 	/* Do not allow to change locked features */
 	if (features & task->thread.features_locked)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH v9 42/42] x86/shstk: Add ARCH_SHSTK_STATUS
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (40 preceding siblings ...)
  2023-06-13  0:11 ` [PATCH v9 41/42] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
@ 2023-06-13  0:11 ` Rick Edgecombe
  2023-06-13  1:34 ` [PATCH v9 00/42] Shadow stacks for userspace Linus Torvalds
  2023-06-14 23:45 ` Mark Brown
  43 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-06-13  0:11 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie
  Cc: rick.p.edgecombe, Pengfei Xu

CRIU and GDB need to get the current shadow stack and WRSS enablement
status. This information is already available via /proc/pid/status, but
this is inconvenient for CRIU because it involves parsing the text output
in an area of the code where this is difficult. Provide a status
arch_prctl(), ARCH_SHSTK_STATUS, for retrieving the status. Have arg2 be a
userspace address, and make the new arch_prctl simply copy the features
out to userspace.

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 Documentation/arch/x86/shstk.rst  | 6 ++++++
 arch/x86/include/asm/shstk.h      | 2 +-
 arch/x86/include/uapi/asm/prctl.h | 1 +
 arch/x86/kernel/process_64.c      | 1 +
 arch/x86/kernel/shstk.c           | 8 +++++++-
 5 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/arch/x86/shstk.rst b/Documentation/arch/x86/shstk.rst
index f3553cc8c758..60260e809baf 100644
--- a/Documentation/arch/x86/shstk.rst
+++ b/Documentation/arch/x86/shstk.rst
@@ -79,6 +79,11 @@ arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
     Unlock features. 'features' is a mask of all features to unlock. All
     bits set are processed, unset bits are ignored. Only works via ptrace.
 
+arch_prctl(ARCH_SHSTK_STATUS, unsigned long addr)
+    Copy the currently enabled features to the address passed in addr. The
+    features are described using the bits passed into the others in
+    'features'.
+
 The return values are as follows. On success, return 0. On error, errno can
 be::
 
@@ -86,6 +91,7 @@ be::
         -ENOTSUPP if the feature is not supported by the hardware or
          kernel.
         -EINVAL arguments (non existing feature, etc)
+        -EFAULT if could not copy information back to userspace
 
 The feature's bits supported are::
 
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index ecb23a8ca47d..42fee8959df7 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -14,7 +14,7 @@ struct thread_shstk {
 	u64	size;
 };
 
-long shstk_prctl(struct task_struct *task, int option, unsigned long features);
+long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
 void reset_thread_features(void);
 unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
 				       unsigned long stack_size);
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 3189c4a96468..384e2cc6ac19 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -34,6 +34,7 @@
 #define ARCH_SHSTK_DISABLE		0x5002
 #define ARCH_SHSTK_LOCK			0x5003
 #define ARCH_SHSTK_UNLOCK		0x5004
+#define ARCH_SHSTK_STATUS		0x5005
 
 /* ARCH_SHSTK_ features bits */
 #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e6db21c470aa..33b268747bb7 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -900,6 +900,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_SHSTK_DISABLE:
 	case ARCH_SHSTK_LOCK:
 	case ARCH_SHSTK_UNLOCK:
+	case ARCH_SHSTK_STATUS:
 		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index d43b7a9c57ce..b26810c7cd1c 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -482,8 +482,14 @@ SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsi
 	return alloc_shstk(addr, aligned_size, size, set_tok);
 }
 
-long shstk_prctl(struct task_struct *task, int option, unsigned long features)
+long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
 {
+	unsigned long features = arg2;
+
+	if (option == ARCH_SHSTK_STATUS) {
+		return put_user(task->thread.features, (unsigned long __user *)arg2);
+	}
+
 	if (option == ARCH_SHSTK_LOCK) {
 		task->thread.features_locked |= features;
 		return 0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 00/42] Shadow stacks for userspace
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (41 preceding siblings ...)
  2023-06-13  0:11 ` [PATCH v9 42/42] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe
@ 2023-06-13  1:34 ` Linus Torvalds
  2023-06-13  3:12   ` Edgecombe, Rick P
  2023-06-14 23:45 ` Mark Brown
  43 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2023-06-13  1:34 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, broonie

On Mon, Jun 12, 2023 at 5:14 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> This series implements Shadow Stacks for userspace using x86's Control-flow
> Enforcement Technology (CET).

Do you have this in a git tree somewhere? For series with this many
patches, I find it easier to just do a "git fetch" and "gitk
..FETCH_HEAD" these days, and then reply by email on anything I find.

That's partly because it makes it really easy to zoom in on some
particular area (eg "let's look at just mm/ and the generic include
files")

                  Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 00/42] Shadow stacks for userspace
  2023-06-13  1:34 ` [PATCH v9 00/42] Shadow stacks for userspace Linus Torvalds
@ 2023-06-13  3:12   ` Edgecombe, Rick P
  2023-06-13 17:44     ` Linus Torvalds
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13  3:12 UTC (permalink / raw)
  To: Torvalds, Linus
  Cc: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, peterz, corbet,
	linux-kernel, jannh, dethoma, broonie, mike.kravetz, pavel, bp,
	rdunlap, linux-api, john.allen, arnd, jamorris, rppt,
	bsingharora, x86, oleg, fweimer, keescook, gorcunov,
	andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools, debug,
	linux-mm, Syromiatnikov, Eugene, Yang, Weijiang, linux-doc,
	dave.hansen, Eranian, Stephane

On Mon, 2023-06-12 at 18:34 -0700, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 5:14 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > This series implements Shadow Stacks for userspace using x86's
> > Control-flow
> > Enforcement Technology (CET).
> 
> Do you have this in a git tree somewhere? For series with this many
> patches, I find it easier to just do a "git fetch" and "gitk
> ..FETCH_HEAD" these days, and then reply by email on anything I find.
> 
> That's partly because it makes it really easy to zoom in on some
> particular area (eg "let's look at just mm/ and the generic include
> files")

Sure. I probably should have included that upfront. Here is a github
repo:
https://github.com/rpedgeco/linux/tree/user_shstk_v9

I went ahead and included the tags[0] from last time in case that's
useful, but unfortunately the github web interface is not very
conducive to viewing the tag-based segmentation of the series. If
having it in a korg repo would be useful, please let me know.

[0]
https://lore.kernel.org/lkml/4433c3595db23f7c779b69b222958151b69ddd70.camel@intel.com/

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-06-13  0:10 ` [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Rick Edgecombe
@ 2023-06-13  7:19   ` Geert Uytterhoeven
  2023-06-13 16:14     ` Edgecombe, Rick P
  2023-06-13  7:43   ` Mike Rapoport
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 151+ messages in thread
From: Geert Uytterhoeven @ 2023-06-13  7:19 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie, linux-alpha,
	linux-snps-arc, linux-arm-kernel, linux-csky, linux-hexagon,
	linux-ia64, loongarch, linux-m68k, Michal Simek, Dinh Nguyen,
	linux-mips, openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, Linus Torvalds

On Tue, Jun 13, 2023 at 2:13 AM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
> The x86 Shadow stack feature includes a new type of memory called shadow
> stack. This shadow stack memory has some unusual properties, which requires
> some core mm changes to function properly.
>
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
> that the x86 implementation of it can know whether to create regular
> writable memory or shadow stack memory.
>
> But there are a couple of challenges to this. Modifying the signatures of
> each arch pte_mkwrite() implementation would be error prone because some
> are generated with macros and would need to be re-implemented. Also, some
> pte_mkwrite() callers operate on kernel memory without a VMA.
>
> So this can be done in a three step process. First pte_mkwrite() can be
> renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
> added that just calls pte_mkwrite_novma(). Next callers without a VMA can
> be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
> can be changed to take/pass a VMA.
>
> Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and
> adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same
> pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(),
> create a new arch config HAS_HUGE_PAGE that can be used to tell if
> pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the
> compiler would not be able to find pmd_mkwrite_novma().
>
> No functional change.
>
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-snps-arc@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-csky@vger.kernel.org
> Cc: linux-hexagon@vger.kernel.org
> Cc: linux-ia64@vger.kernel.org
> Cc: loongarch@lists.linux.dev
> Cc: linux-m68k@lists.linux-m68k.org
> Cc: Michal Simek <monstr@monstr.eu>
> Cc: Dinh Nguyen <dinguyen@kernel.org>
> Cc: linux-mips@vger.kernel.org
> Cc: openrisc@lists.librecores.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-riscv@lists.infradead.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-um@lists.infradead.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/

>  arch/m68k/include/asm/mcf_pgtable.h          |  2 +-
>  arch/m68k/include/asm/motorola_pgtable.h     |  2 +-
>  arch/m68k/include/asm/sun3_pgtable.h         |  2 +-

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA
  2023-06-13  0:10 ` [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
@ 2023-06-13  7:42   ` Mike Rapoport
  2023-06-13 16:20     ` Edgecombe, Rick P
  2023-06-13 12:28   ` David Hildenbrand
  1 sibling, 1 reply; 151+ messages in thread
From: Mike Rapoport @ 2023-06-13  7:42 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie

On Mon, Jun 12, 2023 at 05:10:29PM -0700, Rick Edgecombe wrote:
> The x86 Shadow stack feature includes a new type of memory called shadow
> stack. This shadow stack memory has some unusual properties, which requires
> some core mm changes to function properly.
> 
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
> that the x86 implementation of it can know whether to create regular
> writable memory or shadow stack memory.

Nit:                             ^ mappings?
 
> But there are a couple of challenges to this. Modifying the signatures of
> each arch pte_mkwrite() implementation would be error prone because some
> are generated with macros and would need to be re-implemented. Also, some
> pte_mkwrite() callers operate on kernel memory without a VMA.
> 
> So this can be done in a three step process. First pte_mkwrite() can be
> renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
> added that just calls pte_mkwrite_novma(). Next callers without a VMA can
> be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
> can be changed to take/pass a VMA.
> 
> In a previous patches, pte_mkwrite() was renamed pte_mkwrite_novma() and
> callers that don't have a VMA were changed to use pte_mkwrite_novma(). So

Maybe
This is the third step so pte_mkwrite() was renamed to pte_mkwrite_novma()
and ...

> now change pte_mkwrite() to take a VMA and change the remaining callers to
> pass a VMA. Apply the same changes for pmd_mkwrite().
> 
> No functional change.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-06-13  0:10 ` [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Rick Edgecombe
  2023-06-13  7:19   ` Geert Uytterhoeven
@ 2023-06-13  7:43   ` Mike Rapoport
  2023-06-13 16:14     ` Edgecombe, Rick P
  2023-06-13 12:26   ` David Hildenbrand
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 151+ messages in thread
From: Mike Rapoport @ 2023-06-13  7:43 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie, linux-alpha,
	linux-snps-arc, linux-arm-kernel, linux-csky, linux-hexagon,
	linux-ia64, loongarch, linux-m68k, Michal Simek, Dinh Nguyen,
	linux-mips, openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, Linus Torvalds

On Mon, Jun 12, 2023 at 05:10:27PM -0700, Rick Edgecombe wrote:
> The x86 Shadow stack feature includes a new type of memory called shadow
> stack. This shadow stack memory has some unusual properties, which requires
> some core mm changes to function properly.
> 
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
> that the x86 implementation of it can know whether to create regular
> writable memory or shadow stack memory.

Nit:                            ^ mapping?

> But there are a couple of challenges to this. Modifying the signatures of
> each arch pte_mkwrite() implementation would be error prone because some
> are generated with macros and would need to be re-implemented. Also, some
> pte_mkwrite() callers operate on kernel memory without a VMA.
> 
> So this can be done in a three step process. First pte_mkwrite() can be
> renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
> added that just calls pte_mkwrite_novma(). Next callers without a VMA can
> be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
> can be changed to take/pass a VMA.
> 
> Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and
> adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same
> pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(),
> create a new arch config HAS_HUGE_PAGE that can be used to tell if
> pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the
> compiler would not be able to find pmd_mkwrite_novma().
> 
> No functional change.
> 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-snps-arc@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-csky@vger.kernel.org
> Cc: linux-hexagon@vger.kernel.org
> Cc: linux-ia64@vger.kernel.org
> Cc: loongarch@lists.linux.dev
> Cc: linux-m68k@lists.linux-m68k.org
> Cc: Michal Simek <monstr@monstr.eu>
> Cc: Dinh Nguyen <dinguyen@kernel.org>
> Cc: linux-mips@vger.kernel.org
> Cc: openrisc@lists.librecores.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-riscv@lists.infradead.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-um@lists.infradead.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/

Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
  2023-06-13  0:10 ` [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma() Rick Edgecombe
@ 2023-06-13  7:44   ` Mike Rapoport
  2023-06-13 16:19     ` Edgecombe, Rick P
  2023-06-13 12:27   ` David Hildenbrand
  1 sibling, 1 reply; 151+ messages in thread
From: Mike Rapoport @ 2023-06-13  7:44 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie, linux-arm-kernel,
	linux-s390, xen-devel

On Mon, Jun 12, 2023 at 05:10:28PM -0700, Rick Edgecombe wrote:
> The x86 Shadow stack feature includes a new type of memory called shadow
> stack. This shadow stack memory has some unusual properties, which requires
> some core mm changes to function properly.
> 
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
> that the x86 implementation of it can know whether to create regular
> writable memory or shadow stack memory.

Nit:                            ^ mappings?
 
> But there are a couple of challenges to this. Modifying the signatures of
> each arch pte_mkwrite() implementation would be error prone because some
> are generated with macros and would need to be re-implemented. Also, some
> pte_mkwrite() callers operate on kernel memory without a VMA.
> 
> So this can be done in a three step process. First pte_mkwrite() can be
> renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
> added that just calls pte_mkwrite_novma(). Next callers without a VMA can
> be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
> can be changed to take/pass a VMA.
> 
> Previous patches have done the first step, so next move the callers that
> don't have a VMA to pte_mkwrite_novma(). Also do the same for

I hear x86 maintainers asking to drop "previous patches" ;-)

Maybe
This is the second step of the conversion that moves the callers ...

> pmd_mkwrite(). This will be ok for the shadow stack feature, as these
> callers are on kernel memory which will not need to be made shadow stack,
> and the other architectures only currently support one type of memory
> in pte_mkwrite()
> 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-s390@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>

-- 
Sincerely yours,
Mike.


* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13  0:10 ` [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description Rick Edgecombe
@ 2023-06-13 11:55   ` Mark Brown
  2023-06-13 12:37     ` Florian Weimer
  2023-07-18 19:32   ` Szabolcs Nagy
  1 sibling, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-06-13 11:55 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, Yu-cheng Yu, Pengfei Xu


On Mon, Jun 12, 2023 at 05:10:49PM -0700, Rick Edgecombe wrote:

> +Enabling arch_prctl()'s
> +=======================
> +
> +Elf features should be enabled by the loader using the below arch_prctl's. They
> +are only supported in 64 bit user applications. These operate on the features
> +on a per-thread basis. The enablement status is inherited on clone, so if the
> +feature is enabled on the first thread, it will propagate to all the threads
> +in an app.

I appreciate it's very late in the development of this series but given
that there are very similar features on both arm64 and riscv would it
make sense to make these just regular prctl()s, arch_prctl() isn't used
on other architectures and it'd reduce the amount of arch specific work
that userspace needs to do if the interface is shared.

It should also be possible to support both interfaces for x86 I guess,
though that feels like asking for trouble.



* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-06-13  0:10 ` [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Rick Edgecombe
  2023-06-13  7:19   ` Geert Uytterhoeven
  2023-06-13  7:43   ` Mike Rapoport
@ 2023-06-13 12:26   ` David Hildenbrand
  2023-06-13 16:14     ` Edgecombe, Rick P
  2023-06-19  4:27   ` Helge Deller
  2023-07-14 22:57   ` Mark Brown
  4 siblings, 1 reply; 151+ messages in thread
From: David Hildenbrand @ 2023-06-13 12:26 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	debug, szabolcs.nagy, torvalds, broonie
  Cc: linux-alpha, linux-snps-arc, linux-arm-kernel, linux-csky,
	linux-hexagon, linux-ia64, loongarch, linux-m68k, Michal Simek,
	Dinh Nguyen, linux-mips, openrisc, linux-parisc, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-um,
	Linus Torvalds

On 13.06.23 02:10, Rick Edgecombe wrote:
> The x86 Shadow stack feature includes a new type of memory called shadow
> stack. This shadow stack memory has some unusual properties, which requires
> some core mm changes to function properly.
> 
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
> that the x86 implementation of it can know whether to create regular
> writable memory or shadow stack memory.
> 
> But there are a couple of challenges to this. Modifying the signatures of
> each arch pte_mkwrite() implementation would be error prone because some
> are generated with macros and would need to be re-implemented. Also, some
> pte_mkwrite() callers operate on kernel memory without a VMA.
> 
> So this can be done in a three step process. First pte_mkwrite() can be
> renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
> added that just calls pte_mkwrite_novma(). Next callers without a VMA can
> be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
> can be changed to take/pass a VMA.
> 
> Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and
> adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same
> pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(),
> create a new arch config HAS_HUGE_PAGE that can be used to tell if
> pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the
> compiler would not be able to find pmd_mkwrite_novma().
> 
> No functional change.
> 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-snps-arc@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-csky@vger.kernel.org
> Cc: linux-hexagon@vger.kernel.org
> Cc: linux-ia64@vger.kernel.org
> Cc: loongarch@lists.linux.dev
> Cc: linux-m68k@lists.linux-m68k.org
> Cc: Michal Simek <monstr@monstr.eu>
> Cc: Dinh Nguyen <dinguyen@kernel.org>
> Cc: linux-mips@vger.kernel.org
> Cc: openrisc@lists.librecores.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-riscv@lists.infradead.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-um@lists.infradead.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/
> ---
> Hi Non-x86 Arch’s,
> 
> x86 has a feature that allows for the creation of a special type of
> writable memory (shadow stack) that is only writable in limited specific
> ways. Previously, changes were proposed to core MM code to teach it to
> decide when to create normally writable memory or the special shadow stack
> writable memory, but David Hildenbrand suggested[0] to change
> pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
> moved into x86 code. Later Linus suggested a less error-prone way[1] to go
> about this after the first attempt had a bug.
> 
> Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
> changes. So that is why you are seeing some patches out of a big x86
> series pop up in your arch mailing list. There is no functional change.
> After this refactor, the shadow stack series goes on to use the arch
> helpers to push arch memory details inside arch/x86 and other arch's
> with upcoming shadow stack features.
> 
> Testing was just 0-day build testing.
> 
> Hopefully that is enough context. Thanks!
> 
> [0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redhat.com/
> [1] https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
  2023-06-13  0:10 ` [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma() Rick Edgecombe
  2023-06-13  7:44   ` Mike Rapoport
@ 2023-06-13 12:27   ` David Hildenbrand
  2023-06-13 16:20     ` Edgecombe, Rick P
  1 sibling, 1 reply; 151+ messages in thread
From: David Hildenbrand @ 2023-06-13 12:27 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	debug, szabolcs.nagy, torvalds, broonie
  Cc: linux-arm-kernel, linux-s390, xen-devel

On 13.06.23 02:10, Rick Edgecombe wrote:
> The x86 Shadow stack feature includes a new type of memory called shadow
> stack. This shadow stack memory has some unusual properties, which requires
> some core mm changes to function properly.
> 
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
> that the x86 implementation of it can know whether to create regular
> writable memory or shadow stack memory.
> 
> But there are a couple of challenges to this. Modifying the signatures of
> each arch pte_mkwrite() implementation would be error prone because some
> are generated with macros and would need to be re-implemented. Also, some
> pte_mkwrite() callers operate on kernel memory without a VMA.
> 
> So this can be done in a three step process. First pte_mkwrite() can be
> renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
> added that just calls pte_mkwrite_novma(). Next callers without a VMA can
> be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
> can be changed to take/pass a VMA.
> 
> Previous patches have done the first step, so next move the callers that
> don't have a VMA to pte_mkwrite_novma(). Also do the same for
> pmd_mkwrite(). This will be ok for the shadow stack feature, as these
> callers are on kernel memory which will not need to be made shadow stack,
> and the other architectures only currently support one type of memory
> in pte_mkwrite()
> 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-s390@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA
  2023-06-13  0:10 ` [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
  2023-06-13  7:42   ` Mike Rapoport
@ 2023-06-13 12:28   ` David Hildenbrand
  2023-06-13 16:21     ` Edgecombe, Rick P
  1 sibling, 1 reply; 151+ messages in thread
From: David Hildenbrand @ 2023-06-13 12:28 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	debug, szabolcs.nagy, torvalds, broonie

On 13.06.23 02:10, Rick Edgecombe wrote:
> The x86 Shadow stack feature includes a new type of memory called shadow
> stack. This shadow stack memory has some unusual properties, which requires
> some core mm changes to function properly.
> 
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
> that the x86 implementation of it can know whether to create regular
> writable memory or shadow stack memory.
> 
> But there are a couple of challenges to this. Modifying the signatures of
> each arch pte_mkwrite() implementation would be error prone because some
> are generated with macros and would need to be re-implemented. Also, some
> pte_mkwrite() callers operate on kernel memory without a VMA.
> 
> So this can be done in a three step process. First pte_mkwrite() can be
> renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
> added that just calls pte_mkwrite_novma(). Next callers without a VMA can
> be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
> can be changed to take/pass a VMA.
> 
> In previous patches, pte_mkwrite() was renamed pte_mkwrite_novma() and
> callers that don't have a VMA were changed to use pte_mkwrite_novma(). So
> now change pte_mkwrite() to take a VMA and change the remaining callers to
> pass a VMA. Apply the same changes for pmd_mkwrite().
> 
> No functional change.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

Hopefully we'll get this landed soon :)

-- 
Cheers,

David / dhildenb



* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13 11:55   ` Mark Brown
@ 2023-06-13 12:37     ` Florian Weimer
  2023-06-13 15:15       ` Mark Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Florian Weimer @ 2023-06-13 12:37 UTC (permalink / raw)
  To: Mark Brown
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H . J . Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Weijiang Yang, Kirill A . Shutemov,
	John Allen, kcc, eranian, rppt, jamorris, dethoma, akpm,
	Andrew.Cooper3, christina.schimpe, david, debug, szabolcs.nagy,
	torvalds, Yu-cheng Yu, Pengfei Xu

* Mark Brown:

> On Mon, Jun 12, 2023 at 05:10:49PM -0700, Rick Edgecombe wrote:
>
>> +Enabling arch_prctl()'s
>> +=======================
>> +
>> +Elf features should be enabled by the loader using the below arch_prctl's. They
>> +are only supported in 64 bit user applications. These operate on the features
>> +on a per-thread basis. The enablement status is inherited on clone, so if the
>> +feature is enabled on the first thread, it will propagate to all the threads
>> +in an app.
>
> I appreciate it's very late in the development of this series but given
> that there are very similar features on both arm64 and riscv would it
> make sense to make these just regular prctl()s, arch_prctl() isn't used
> on other architectures and it'd reduce the amount of arch specific work
> that userspace needs to do if the interface is shared.

Has the Arm feature been fully disclosed?

I would expect the integration with stack switching and unwinding
differs between architectures even if the core mechanism is similar.
It's probably tempting to handle shadow stack placement differently,
too.

Thanks,
Florian



* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13 12:37     ` Florian Weimer
@ 2023-06-13 15:15       ` Mark Brown
  2023-06-13 17:11         ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-06-13 15:15 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H . J . Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Weijiang Yang, Kirill A . Shutemov,
	John Allen, kcc, eranian, rppt, jamorris, dethoma, akpm,
	Andrew.Cooper3, christina.schimpe, david, debug, szabolcs.nagy,
	torvalds, Yu-cheng Yu, Pengfei Xu


On Tue, Jun 13, 2023 at 02:37:18PM +0200, Florian Weimer wrote:

> > I appreciate it's very late in the development of this series but given
> > that there are very similar features on both arm64 and riscv would it
> > make sense to make these just regular prctl()s, arch_prctl() isn't used
> > on other architectures and it'd reduce the amount of arch specific work
> > that userspace needs to do if the interface is shared.

> Has the Arm feature been fully disclosed?

Unfortunately no, it's not yet been folded into the ARM.  The system
registers and instructions are in the latest XML releases but that's not
the full story.

> I would expect the integration with stack switching and unwinding
> differs between architectures even if the core mechanism is similar.
> It's probably tempting to handle shadow stack placement differently,
> too.

Yeah, there's likely to be some differences (though given the amount of
discussion on the x86 implementation I'm trying to follow the decisions
there as much as reasonable on the basis that we should hopefully come
to the same conclusions).  It seemed worth mentioning as a needless
bump, OTOH I definitely don't see it as critical.



* Re: [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY
  2023-06-13  0:10 ` [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY Rick Edgecombe
@ 2023-06-13 16:01   ` Edgecombe, Rick P
  2023-06-13 17:58   ` Linus Torvalds
  1 sibling, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 16:01 UTC (permalink / raw)
  To: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Torvalds, Linus,
	peterz, corbet, linux-kernel, jannh, dethoma, broonie,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, rppt, bsingharora, x86, oleg, fweimer, keescook,
	gorcunov, andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools,
	debug, linux-mm, Syromiatnikov, Eugene, Yang, Weijiang,
	linux-doc, dave.hansen, Eranian, Stephane
  Cc: Yu, Yu-cheng, Xu, Pengfei

On Mon, 2023-06-12 at 17:10 -0700, Rick Edgecombe wrote:
> +#ifdef CONFIG_X86_64
> +#define _PAGE_SAVED_DIRTY      (_AT(pteval_t, 1) << _PAGE_BIT_SAVED_DIRTY)
> +#else
> +#define _PAGE_SAVED_DIRTY      (_AT(pteval_t, 0))
> +#endif

Argh, the !CONFIG_X86_64 case here needs to be dropped now.


* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-06-13  7:19   ` Geert Uytterhoeven
@ 2023-06-13 16:14     ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 16:14 UTC (permalink / raw)
  To: geert
  Cc: Schimpe, Christina, Yang, Weijiang, hjl.tools, x86, monstr, rppt,
	dave.hansen, linux-snps-arc, Torvalds, Linus, kirill.shutemov,
	linux-api, dinguyen, rdunlap, tglx, sparclinux, arnd, linux-ia64,
	Lutomirski, Andy, szabolcs.nagy, linux-kernel, linux-parisc,
	akpm, pavel, keescook, linuxppc-dev, gorcunov, andrew.cooper3,
	david, hpa, loongarch, peterz, linux-sh, nadav.amit, broonie,
	linux-m68k, linux-doc, openrisc, jamorris, mike.kravetz, debug,
	fweimer, kcc, linux-arch, mingo, linux-csky, linux-mips,
	john.allen, Eranian, Stephane, bsingharora, linux-alpha,
	linux-s390, linux-riscv, linux-um, linux-arm-kernel, torvalds,
	bp, corbet, linux-hexagon, dethoma, jannh, Syromiatnikov, Eugene,
	oleg, linux-mm

On Tue, 2023-06-13 at 09:19 +0200, Geert Uytterhoeven wrote:
> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Thanks!


* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-06-13  7:43   ` Mike Rapoport
@ 2023-06-13 16:14     ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 16:14 UTC (permalink / raw)
  To: rppt
  Cc: Schimpe, Christina, Yang, Weijiang, hjl.tools, x86, monstr,
	dave.hansen, linux-snps-arc, Torvalds, Linus, kirill.shutemov,
	linux-api, dinguyen, rdunlap, tglx, sparclinux, arnd, linux-ia64,
	Lutomirski, Andy, szabolcs.nagy, linux-kernel, linux-parisc,
	akpm, pavel, keescook, linuxppc-dev, gorcunov, andrew.cooper3,
	david, hpa, loongarch, peterz, linux-sh, nadav.amit, broonie,
	linux-m68k, linux-doc, openrisc, jamorris, mike.kravetz, debug,
	fweimer, kcc, linux-arch, mingo, linux-csky, linux-mips,
	john.allen, Eranian, Stephane, bsingharora, linux-alpha,
	linux-s390, linux-riscv, linux-um, linux-arm-kernel, torvalds,
	bp, corbet, linux-hexagon, dethoma, jannh, Syromiatnikov, Eugene,
	oleg, linux-mm

On Tue, 2023-06-13 at 10:43 +0300, Mike Rapoport wrote:
> On Mon, Jun 12, 2023 at 05:10:27PM -0700, Rick Edgecombe wrote:
> > The x86 Shadow stack feature includes a new type of memory called
> > shadow
> > stack. This shadow stack memory has some unusual properties, which
> > requires
> > some core mm changes to function properly.
> > 
> > One of these unusual properties is that shadow stack memory is
> > writable,
> > but only in limited ways. These limits are applied via a specific
> > PTE
> > bit combination. Nevertheless, the memory is writable, and core mm
> > code
> > will need to apply the writable permissions in the typical paths
> > that
> > call pte_mkwrite(). Future patches will make pte_mkwrite() take a
> > VMA, so
> > that the x86 implementation of it can know whether to create
> > regular
> > writable memory or shadow stack memory.
> 
> Nit:                            ^ mapping?

Hmm, sure.

> 
> > But there are a couple of challenges to this. Modifying the
> > signatures of
> > each arch pte_mkwrite() implementation would be error prone because
> > some
> > are generated with macros and would need to be re-implemented.
> > Also, some
> > pte_mkwrite() callers operate on kernel memory without a VMA.
> > 
> > So this can be done in a three step process. First pte_mkwrite()
> > can be
> > renamed to pte_mkwrite_novma() in each arch, with a generic
> > pte_mkwrite()
> > added that just calls pte_mkwrite_novma(). Next callers without a
> > VMA can
> > be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all
> > callers
> > can be changed to take/pass a VMA.
> > 
> > Start the process by renaming pte_mkwrite() to pte_mkwrite_novma()
> > and
> > adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same
> > pattern for pmd_mkwrite(). Since not all archs have a
> > pmd_mkwrite_novma(),
> > create a new arch config HAS_HUGE_PAGE that can be used to tell if
> > pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE
> > cases the
> > compiler would not be able to find pmd_mkwrite_novma().
> > 
> > No functional change.
> > 
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-alpha@vger.kernel.org
> > Cc: linux-snps-arc@lists.infradead.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-csky@vger.kernel.org
> > Cc: linux-hexagon@vger.kernel.org
> > Cc: linux-ia64@vger.kernel.org
> > Cc: loongarch@lists.linux.dev
> > Cc: linux-m68k@lists.linux-m68k.org
> > Cc: Michal Simek <monstr@monstr.eu>
> > Cc: Dinh Nguyen <dinguyen@kernel.org>
> > Cc: linux-mips@vger.kernel.org
> > Cc: openrisc@lists.librecores.org
> > Cc: linux-parisc@vger.kernel.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-riscv@lists.infradead.org
> > Cc: linux-s390@vger.kernel.org
> > Cc: linux-sh@vger.kernel.org
> > Cc: sparclinux@vger.kernel.org
> > Cc: linux-um@lists.infradead.org
> > Cc: linux-arch@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Link:
> > https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/
> 
> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>

Thanks!


* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-06-13 12:26   ` David Hildenbrand
@ 2023-06-13 16:14     ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 16:14 UTC (permalink / raw)
  To: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Torvalds, Linus,
	peterz, corbet, linux-kernel, jannh, dethoma, broonie,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, rppt, bsingharora, x86, oleg, fweimer, keescook,
	gorcunov, andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools,
	debug, linux-mm, Syromiatnikov, Eugene, Yang, Weijiang,
	linux-doc, dave.hansen, Eranian, Stephane
  Cc: linux-ia64, monstr, linux-hexagon, linux-s390, linux-csky,
	loongarch, openrisc, linux-m68k, linux-snps-arc, linux-parisc,
	sparclinux, linux-sh, dinguyen, linuxppc-dev, linux-arm-kernel,
	linux-mips, linux-alpha, torvalds, linux-um, linux-riscv

On Tue, 2023-06-13 at 14:26 +0200, David Hildenbrand wrote:
> 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!


* Re: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
  2023-06-13  7:44   ` Mike Rapoport
@ 2023-06-13 16:19     ` Edgecombe, Rick P
  2023-06-13 17:00       ` David Hildenbrand
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 16:19 UTC (permalink / raw)
  To: rppt
  Cc: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, linux-s390, Torvalds,
	Linus, peterz, corbet, linux-kernel, jannh, dethoma, broonie,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, bsingharora, x86, oleg, fweimer, keescook, gorcunov,
	andrew.cooper3, xen-devel, hpa, mingo, szabolcs.nagy, hjl.tools,
	linux-arm-kernel, debug, linux-mm, Syromiatnikov, Eugene, Yang,
	Weijiang, linux-doc, dave.hansen, Eranian, Stephane

On Tue, 2023-06-13 at 10:44 +0300, Mike Rapoport wrote:
> > Previous patches have done the first step, so next move the callers
> > that
> > don't have a VMA to pte_mkwrite_novma(). Also do the same for
> 
> I hear x86 maintainers asking to drop "previous patches" ;-)
> 
> Maybe
> This is the second step of the conversion that moves the callers ...

Really? I've not heard that. Just a strong aversion to "this patch".
I've got feedback to say "previous patches" and not "the last patch" so
it doesn't get stale. I guess it could be "previous changes".

> 
> > pmd_mkwrite(). This will be ok for the shadow stack feature, as
> > these
> > callers are on kernel memory which will not need to be made shadow
> > stack,
> > and the other architectures only currently support one type of
> > memory
> > in pte_mkwrite()
> > 
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-s390@vger.kernel.org
> > Cc: xen-devel@lists.xenproject.org
> > Cc: linux-arch@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>

Thanks!


* Re: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
  2023-06-13 12:27   ` David Hildenbrand
@ 2023-06-13 16:20     ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 16:20 UTC (permalink / raw)
  To: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Torvalds, Linus,
	peterz, corbet, linux-kernel, jannh, dethoma, broonie,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, rppt, bsingharora, x86, oleg, fweimer, keescook,
	gorcunov, andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools,
	debug, linux-mm, Syromiatnikov, Eugene, Yang, Weijiang,
	linux-doc, dave.hansen, Eranian, Stephane
  Cc: linux-arm-kernel, linux-s390, xen-devel

On Tue, 2023-06-13 at 14:27 +0200, David Hildenbrand wrote:
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!


* Re: [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA
  2023-06-13  7:42   ` Mike Rapoport
@ 2023-06-13 16:20     ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 16:20 UTC (permalink / raw)
  To: rppt
  Cc: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Torvalds, Linus,
	peterz, corbet, linux-kernel, jannh, dethoma, broonie,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, bsingharora, x86, oleg, fweimer, keescook, gorcunov,
	andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools, debug,
	linux-mm, Syromiatnikov, Eugene, Yang, Weijiang, linux-doc,
	dave.hansen, Eranian, Stephane

On Tue, 2023-06-13 at 10:42 +0300, Mike Rapoport wrote:
> > In previous patches, pte_mkwrite() was renamed
> > pte_mkwrite_novma() and
> > callers that don't have a VMA were changed to use
> > pte_mkwrite_novma(). So
> 
> Maybe
> This is the third step so pte_mkwrite() was renamed to
> pte_mkwrite_novma()
> and ...

Hmm, yea.

> 
> > now change pte_mkwrite() to take a VMA and change the remaining
> > callers to
> > pass a VMA. Apply the same changes for pmd_mkwrite().
> > 
> > No functional change.
> > 
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>

Thanks!

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA
  2023-06-13 12:28   ` David Hildenbrand
@ 2023-06-13 16:21     ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 16:21 UTC (permalink / raw)
  To: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Torvalds, Linus,
	peterz, corbet, linux-kernel, jannh, dethoma, broonie,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, rppt, bsingharora, x86, oleg, fweimer, keescook,
	gorcunov, andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools,
	debug, linux-mm, Syromiatnikov, Eugene, Yang, Weijiang,
	linux-doc, dave.hansen, Eranian, Stephane

On Tue, 2023-06-13 at 14:28 +0200, David Hildenbrand wrote:
> 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

> 
> Hopefully we'll get this landed soon :)

Me too. Thanks for your help on coming up with the idea to pass the VMA
in.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
  2023-06-13 16:19     ` Edgecombe, Rick P
@ 2023-06-13 17:00       ` David Hildenbrand
  2023-06-14 17:00         ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: David Hildenbrand @ 2023-06-13 17:00 UTC (permalink / raw)
  To: Edgecombe, Rick P, rppt
  Cc: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, Schimpe, Christina, linux-s390, Torvalds, Linus,
	peterz, corbet, linux-kernel, jannh, dethoma, broonie,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, bsingharora, x86, oleg, fweimer, keescook, gorcunov,
	andrew.cooper3, xen-devel, hpa, mingo, szabolcs.nagy, hjl.tools,
	linux-arm-kernel, debug, linux-mm, Syromiatnikov, Eugene, Yang,
	Weijiang, linux-doc, dave.hansen, Eranian, Stephane

On 13.06.23 18:19, Edgecombe, Rick P wrote:
> On Tue, 2023-06-13 at 10:44 +0300, Mike Rapoport wrote:
>>> Previous patches have done the first step, so next move the callers
>>> that
>>> don't have a VMA to pte_mkwrite_novma(). Also do the same for
>>
>> I hear x86 maintainers asking to drop "previous patches" ;-)
>>
>> Maybe
>> This is the second step of the conversion that moves the callers ...
> 
> Really? I've not heard that. Just a strong aversion to "this patch".
> I've got feedback to say "previous patches" and not "the last patch" so
> it doesn't get stale. I guess it could be "previous changes".

Talking about patches makes sense when discussing literal patches sent to 
the mailing list. In the git log, it's commits, and "future commits" or 
"follow-up work".

Yes, we use "patches" all of the time in commit logs, especially when we 
include the cover letter in the commit message (as done frequently in 
the -mm tree).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13 15:15       ` Mark Brown
@ 2023-06-13 17:11         ` Edgecombe, Rick P
  2023-06-13 17:57           ` Mark Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 17:11 UTC (permalink / raw)
  To: fweimer, broonie
  Cc: akpm, Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy,
	nadav.amit, kirill.shutemov, david, Schimpe, Christina, Torvalds,
	Linus, peterz, corbet, linux-kernel, jannh, dethoma,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, rppt, bsingharora, x86, oleg, andrew.cooper3, keescook,
	gorcunov, Yu, Yu-cheng, hpa, mingo, szabolcs.nagy, hjl.tools,
	debug, linux-mm, Syromiatnikov, Eugene, Yang, Weijiang,
	linux-doc, dave.hansen, Eranian, Stephane

On Tue, 2023-06-13 at 16:15 +0100, Mark Brown wrote:
> > I would expect the integration with stack switching and unwinding
> > differs between architectures even if the core mechanism is
> > similar.
> > It's probably tempting to handle shadow stack placement
> > differently,
> > too.
> 
> Yeah, there's likely to be some differences (though given the amount
> of
> discussion on the x86 implementation I'm trying to follow the
> decisions
> there as much as reasonable on the basis that we should hopefully
> come
> to the same conclusions).  It seemed worth mentioning as a needless
bump, OTOH I definitely don't see it as critical.

Two things that came up as far as unifying the interface were:
1. The map_shadow_stack syscall
x86 shadow stack does some optional pre-populating of the shadow stack
memory. And in addition, not all types of memory are supported
(private anonymous only). This is partly to strengthen the security
(which might be a cross-arch thing) and also partly due to x86's
Write=0,Dirty=1 PTE bit combination. So a new syscall fit better. Some
core-mm folks were not super keen on overloading mmap() to start doing
things like writing to the memory being mapped, as well.

2. The arch_prctl() interface
While enable and disable might be shared, there are some arch-specific
stuff for x86 like enabling the WRSS instruction.

For x86 all of the exercising of the kernel interface was in arch
specific code, so unifying the kernel interface didn't save much on the
user side. If there turns out to be some unification opportunities when
everything is explored and decided on, we could have the option of
tying x86's feature into it later.

I think the map_shadow_stack syscall had the most debate. But the
arch_prctl() was mostly agreed on IIRC. The debate was mostly with
glibc folks and the riscv shadow stack developer.


For my part, the thing I would really like to see unified as much as
possible is at the app developer's interface (glibc/gcc). The idea
would be to make it easy for app developers to know if their app
supports shadow stack. There will probably be some differences, but it
would be great if there was mostly the same behavior and a small list
of differences. I'm thinking about the behavior of longjmp(),
swapcontext(), etc.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 00/42] Shadow stacks for userspace
  2023-06-13  3:12   ` Edgecombe, Rick P
@ 2023-06-13 17:44     ` Linus Torvalds
  2023-06-13 18:27       ` Linus Torvalds
  0 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2023-06-13 17:44 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, peterz, corbet,
	linux-kernel, jannh, dethoma, broonie, mike.kravetz, pavel, bp,
	rdunlap, linux-api, john.allen, arnd, jamorris, rppt,
	bsingharora, x86, oleg, fweimer, keescook, gorcunov,
	andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools, debug,
	linux-mm, Syromiatnikov, Eugene, Yang, Weijiang, linux-doc,
	dave.hansen, Eranian, Stephane

On Mon, Jun 12, 2023 at 8:12 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> Sure. I probably should have included that upfront. Here is a github
> repo:
> https://github.com/rpedgeco/linux/tree/user_shstk_v9
>
> I went ahead and included the tags[0] from last time in case that's
> useful, but unfortunately the github web interface is not very
> conducive to viewing the tag-based segmentation of the series. If
> having it in a korg repo would be useful, please let me know.

Oh, kernel.org vs github doesn't matter. I'm not actually merging this
yet, I'm just doing a fetch to then easily be able to look at it
locally in different formats.

I tend to like seeing small things in my MUA just because then I don't
switch back-and-forth between reading email and some gitk workflow,
and it is easy to just scan through the series and reply all in the
MUA.

But when it's some bigger piece, just doing a "git fetch" and then
being able to dissect it locally is really convenient.

Having worked with patches for three decades, I can read diffs in my
sleep - but it's still quite useful to say "give me the patches just
for *this* file" to just see how some specific area changed without
having to look at the other parts.

Or for example, that whole pte_mkwrite -> pte_mkwrite_novma patch is
much denser and more legible with color-coding and the --word-diff.

Anyway, I'm scanning through it right now. No comments yet, I only
just got started.

              Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13 17:11         ` Edgecombe, Rick P
@ 2023-06-13 17:57           ` Mark Brown
  2023-06-13 19:57             ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-06-13 17:57 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: fweimer, akpm, Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski,
	Andy, nadav.amit, kirill.shutemov, david, Schimpe, Christina,
	Torvalds, Linus, peterz, corbet, linux-kernel, jannh, dethoma,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, rppt, bsingharora, x86, oleg, andrew.cooper3, keescook,
	gorcunov, Yu, Yu-cheng, hpa, mingo, szabolcs.nagy, hjl.tools,
	debug, linux-mm, Syromiatnikov, Eugene, Yang, Weijiang,
	linux-doc, dave.hansen, Eranian, Stephane

On Tue, Jun 13, 2023 at 05:11:35PM +0000, Edgecombe, Rick P wrote:

> Two things that came up as far as unifying the interface were:
> 1. The map_shadow_stack syscall
> x86 shadow stack does some optional pre-populating of the shadow stack
> memory. And in addition, not all types of memory are supported
> (private anonymous only). This is partly to strengthen the security
> (which might be a cross-arch thing) and also partly due to x86's
> Write=0,Dirty=1 PTE bit combination. So a new syscall fit better. Some
> core-mm folks were not super keen on overloading mmap() to start doing
> things like writing to the memory being mapped, as well.

Right, the strengthening security bits made this one look cross arch -
that one wasn't worrying me.

> 2. The arch_prctl() interface
> While enable and disable might be shared, there are some arch-specific
> stuff for x86 like enabling the WRSS instruction.

> For x86 all of the exercising of the kernel interface was in arch
> specific code, so unifying the kernel interface didn't save much on the
> user side. If there turns out to be some unification opportunities when
> everything is explored and decided on, we could have the option of
> tying x86's feature into it later.

> I think the map_shadow_stack syscall had the most debate. But the
> arch_prctl() was mostly agreed on IIRC. The debate was mostly with
> glibc folks and the riscv shadow stack developer.

For arm64 we have an equivalentish thing to WRSS which lets us control
if userspace can explicitly push or pop values onto the shadow stack
(GCS for us) so it all maps on well - before I noticed that it was
arch_prctl() I was looking at it and thinking it worked for us.  At the
minute I've taken the prctl() patch from the riscv series and added in a
flag for writability since we just don't have an arch_prctl(), this
isn't a huge deal but it just seemed like needless effort to wonder why
it's different.

> For my part, the thing I would really like to see unified as much as
> possible is at the app developer's interface (glibc/gcc). The idea
> would be to make it easy for app developers to know if their app
> supports shadow stack. There will probably be some differences, but it
> would be great if there was mostly the same behavior and a small list
> of differences. I'm thinking about the behavior of longjmp(),
> swapcontext(), etc.

Yes, very much so.  sigaltstack() too.


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY
  2023-06-13  0:10 ` [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY Rick Edgecombe
  2023-06-13 16:01   ` Edgecombe, Rick P
@ 2023-06-13 17:58   ` Linus Torvalds
  2023-06-13 19:37     ` Edgecombe, Rick P
  1 sibling, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2023-06-13 17:58 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, broonie, Yu-cheng Yu, Pengfei Xu

Small nit.

On Mon, Jun 12, 2023 at 5:14 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> +static inline unsigned long mksaveddirty_shift(unsigned long v)
> +{
> +       unsigned long cond = !(v & (1 << _PAGE_BIT_RW));
> +
> +       v |= ((v >> _PAGE_BIT_DIRTY) & cond) << _PAGE_BIT_SAVED_DIRTY;
> +       v &= ~(cond << _PAGE_BIT_DIRTY);

I assume you checked that the compiler does the right thing here?

Because the above is kind of an odd way to do things, I feel.

You use boolean operators and then work with an "unsigned long" and
then shift things by hand. So you're kind of mixing two different
mental models.

To me, it would be more natural to do that 'cond' calculation as

        unsigned long cond = (~v >> _PAGE_BIT_RW) & 1;

and keep everything in the "bitops" domain.

I suspect - and hope - that the compiler is smart enough to turn that
boolean test into just the shift, but if that's the intent, why not
just write it with that in mind and not have that "both ways" model?

> +static inline unsigned long clear_saveddirty_shift(unsigned long v)
> +{
> +       unsigned long cond = !!(v & (1 << _PAGE_BIT_RW));

Same comment here.

             Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 11/42] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
  2023-06-13  0:10 ` [PATCH v9 11/42] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY Rick Edgecombe
@ 2023-06-13 18:01   ` Linus Torvalds
  0 siblings, 0 replies; 151+ messages in thread
From: Linus Torvalds @ 2023-06-13 18:01 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, broonie, Yu-cheng Yu, Pengfei Xu

On Mon, Jun 12, 2023 at 5:14 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1189,7 +1189,17 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>                                       unsigned long addr, pte_t *ptep)
>  {
> -       clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> +       /*
> +        * Avoid accidentally creating shadow stack PTEs
> +        * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
> +        * the hardware setting Dirty=1.
> +        */
> +       pte_t old_pte, new_pte;
> +
> +       old_pte = READ_ONCE(*ptep);
> +       do {
> +               new_pte = pte_wrprotect(old_pte);
> +       } while (!try_cmpxchg((long *)&ptep->pte, (long *)&old_pte, *(long *)&new_pte));
>  }

Thanks. Much nicer with this all being done just one way and no need
for ifdeffery on config options and runtime static branches.

                  Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 00/42] Shadow stacks for userspace
  2023-06-13 17:44     ` Linus Torvalds
@ 2023-06-13 18:27       ` Linus Torvalds
  2023-06-13 19:38         ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2023-06-13 18:27 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: akpm, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, peterz, corbet,
	linux-kernel, jannh, dethoma, broonie, mike.kravetz, pavel, bp,
	rdunlap, linux-api, john.allen, arnd, jamorris, rppt,
	bsingharora, x86, oleg, fweimer, keescook, gorcunov,
	andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools, debug,
	linux-mm, Syromiatnikov, Eugene, Yang, Weijiang, linux-doc,
	dave.hansen, Eranian, Stephane

On Tue, Jun 13, 2023 at 10:44 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, I'm scanning through it right now. No comments yet, I only
> just got started.

Well, it all looked fine from a quick scan. One small comment, and
even that was just a minor stylistic nit.

I didn't actually look through the x86 state infrastructure side -
I'll just trust that is fine, and it doesn't interact with anything
else, so I don't really worry about it. I mainly care about the VM
side not causing problems, and the changes on that side all looked
fine.

             Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY
  2023-06-13 17:58   ` Linus Torvalds
@ 2023-06-13 19:37     ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 19:37 UTC (permalink / raw)
  To: Torvalds, Linus
  Cc: akpm, Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy,
	nadav.amit, kirill.shutemov, david, Schimpe, Christina, peterz,
	corbet, linux-kernel, jannh, dethoma, broonie, mike.kravetz,
	pavel, bp, rdunlap, linux-api, john.allen, arnd, jamorris, rppt,
	bsingharora, x86, oleg, fweimer, keescook, gorcunov, Yu,
	Yu-cheng, andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools,
	debug, linux-mm, Syromiatnikov, Eugene, Yang, Weijiang,
	linux-doc, dave.hansen, Eranian, Stephane

On Tue, 2023-06-13 at 10:58 -0700, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 5:14 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > +static inline unsigned long mksaveddirty_shift(unsigned long v)
> > +{
> > +       unsigned long cond = !(v & (1 << _PAGE_BIT_RW));
> > +
> > +       v |= ((v >> _PAGE_BIT_DIRTY) & cond) <<
> > _PAGE_BIT_SAVED_DIRTY;
> > +       v &= ~(cond << _PAGE_BIT_DIRTY);
> 
> I assume you checked that the compiler does the right thing here?
> 
> Because the above is kind of an odd way to do things, I feel.
> 
> You use boolean operators and then work with an "unsigned long" and
> then shift things by hand. So you're kind of mixing two different
> mental models.
> 
> To me, it would be more natural to do that 'cond' calculation as
> 
>         unsigned long cond = (~v >> _PAGE_BIT_RW) & 1;
> 
> and keep everything in the "bitops" domain.

That makes sense. It lets the reader's brain stay in bitmath mode.

> 
> I suspect - and hope - that the compiler is smart enough to turn that
> boolean test into just the shift, but if that's the intent, why not
> just write it with that in mind and not have that "both ways" model?

Well, it wasn't for this reason, but gcc likes to emit two more
instructions for the boolean-less version. Clang generates identical
code. If it makes this complicated code any simpler to read, it's
probably still worth it.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 00/42] Shadow stacks for userspace
  2023-06-13 18:27       ` Linus Torvalds
@ 2023-06-13 19:38         ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 19:38 UTC (permalink / raw)
  To: Torvalds, Linus
  Cc: tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, peterz, corbet,
	linux-kernel, dethoma, broonie, jannh, x86, pavel, bp, rdunlap,
	linux-api, rppt, jamorris, arnd, john.allen, bsingharora,
	mike.kravetz, oleg, fweimer, keescook, gorcunov, andrew.cooper3,
	hpa, mingo, szabolcs.nagy, hjl.tools, debug, linux-mm,
	Syromiatnikov, Eugene, akpm, Yang, Weijiang, dave.hansen,
	linux-doc, Eranian, Stephane

On Tue, 2023-06-13 at 11:27 -0700, Linus Torvalds wrote:
> On Tue, Jun 13, 2023 at 10:44 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > 
> > Anyway, I'm scanning through it right now. No comments yet, I only
> > just got started.
> 
> Well, it all looked fine from a quick scan. One small comment, and
> even that was just a minor stylistic nit.
> 
> I didn't actually look through the x86 state infrastructure side -
> I'll just trust that is fine, and it doesn't interact with anything
> else, so I don't really worry about it. I mainly care about the VM
> side not causing problems, and the changes on that side all looked
> fine.

Thanks!

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13 17:57           ` Mark Brown
@ 2023-06-13 19:57             ` Edgecombe, Rick P
  2023-06-14 10:43               ` szabolcs.nagy
  2023-06-14 13:12               ` Mark Brown
  0 siblings, 2 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-13 19:57 UTC (permalink / raw)
  To: broonie
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, linux-doc, peterz,
	corbet, linux-kernel, dethoma, jannh, mike.kravetz, pavel, bp,
	rdunlap, linux-api, rppt, jamorris, arnd, john.allen,
	bsingharora, x86, oleg, andrew.cooper3, keescook, gorcunov,
	fweimer, Yu, Yu-cheng, hpa, mingo, szabolcs.nagy, hjl.tools,
	debug, linux-mm, Syromiatnikov, Eugene, Torvalds, Linus, akpm,
	dave.hansen, Yang, Weijiang, Eranian, Stephane

On Tue, 2023-06-13 at 18:57 +0100, Mark Brown wrote:
> > > > For my part, the thing I would really like to see unified as much
> > > > as possible is at the app developer's interface (glibc/gcc). The
> > > > idea would be to make it easy for app developers to know if their
> > > > app supports shadow stack. There will probably be some differences,
> > > > but it would be great if there was mostly the same behavior and a
> > > > small list of differences. I'm thinking about the behavior of
> > > > longjmp(), swapcontext(), etc.
> > 
> > Yes, very much so.  sigaltstack() too.

For alt shadow stacks, this is what I came up with:
https://lore.kernel.org/lkml/20220929222936.14584-40-rick.p.edgecombe@intel.com/

Unfortunately it can't work automatically with sigaltstack(). Since it
has to be a new thing anyway, it's been left for the future. I guess
that might have a better chance of being cross arch.


BTW, last time this series accidentally broke an arm config and made it
all the way through the robots up to Linus. Would you mind giving
patches 1-3 a check?

Thanks,

Rick

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 04/42] mm: Re-introduce vm_flags to do_mmap()
  2023-06-13  0:10 ` [PATCH v9 04/42] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
@ 2023-06-14  8:49   ` David Hildenbrand
  2023-06-14 23:30   ` Mark Brown
  1 sibling, 0 replies; 151+ messages in thread
From: David Hildenbrand @ 2023-06-14  8:49 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	debug, szabolcs.nagy, torvalds, broonie
  Cc: Yu-cheng Yu, Peter Collingbourne, Pengfei Xu

On 13.06.23 02:10, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> There was no more caller passing vm_flags to do_mmap(), and vm_flags was
> removed from the function's input by:
> 
>      commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").
> 
> There is a new user now.  Shadow stack allocation passes VM_SHADOW_STACK to
> do_mmap().  Thus, re-introduce vm_flags to do_mmap().
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
> Reviewed-by: Peter Collingbourne <pcc@google.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 05/42] mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  2023-06-13  0:10 ` [PATCH v9 05/42] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
@ 2023-06-14  8:50   ` David Hildenbrand
  0 siblings, 0 replies; 151+ messages in thread
From: David Hildenbrand @ 2023-06-14  8:50 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	debug, szabolcs.nagy, torvalds, broonie
  Cc: Yu-cheng Yu, Axel Rasmussen, Peter Xu, Pengfei Xu

On 13.06.23 02:10, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> Future patches will introduce a new VM flag VM_SHADOW_STACK that will be
> VM_HIGH_ARCH_BIT_5. VM_HIGH_ARCH_BIT_1 through VM_HIGH_ARCH_BIT_4 are
> bits 32-36, and bit 37 is the unrelated VM_UFFD_MINOR_BIT. For the sake
> of order, make all VM_HIGH_ARCH_BITs stay together by moving
> VM_UFFD_MINOR_BIT from 37 to 38. This will allow VM_SHADOW_STACK to be
> introduced as 37.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Acked-by: Peter Xu <peterx@redhat.com>
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> ---
>   include/linux/mm.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9ec20cbb20c1..6f52c1e7c640 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -370,7 +370,7 @@ extern unsigned int kobjsize(const void *objp);
>   #endif
>   
>   #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> -# define VM_UFFD_MINOR_BIT	37
> +# define VM_UFFD_MINOR_BIT	38
>   # define VM_UFFD_MINOR		BIT(VM_UFFD_MINOR_BIT)	/* UFFD minor faults */
>   #else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
>   # define VM_UFFD_MINOR		VM_NONE

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 14/42] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2023-06-13  0:10 ` [PATCH v9 14/42] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
@ 2023-06-14  8:50   ` David Hildenbrand
  2023-06-14 23:31   ` Mark Brown
  1 sibling, 0 replies; 151+ messages in thread
From: David Hildenbrand @ 2023-06-14  8:50 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	debug, szabolcs.nagy, torvalds, broonie
  Cc: Yu-cheng Yu, Pengfei Xu

On 13.06.23 02:10, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> New hardware extensions implement support for shadow stack memory, such
> as x86 Control-flow Enforcement Technology (CET). Add a new VM flag to
> identify these areas, for example, to be used to properly indicate shadow
> stack PTEs to the hardware.
> 
> Shadow stack VMA creation will be tightly controlled and limited to
> anonymous memory to make the implementation simpler and since that is all
> that is required. The solution will rely on pte_mkwrite() to create the
> shadow stack PTEs, so it will not be required for vm_get_page_prot() to
> learn how to create shadow stack memory. For this reason document that
> VM_SHADOW_STACK should not be mixed with VM_SHARED.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> ---
>   Documentation/filesystems/proc.rst | 1 +
>   fs/proc/task_mmu.c                 | 3 +++
>   include/linux/mm.h                 | 8 ++++++++
>   3 files changed, 12 insertions(+)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 7897a7dafcbc..6ccb57089a06 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -566,6 +566,7 @@ encoded manner. The codes are the following:
>       mt    arm64 MTE allocation tags are enabled
>       um    userfaultfd missing tracking
>       uw    userfaultfd wr-protect tracking
> +    ss    shadow stack page
>       ==    =======================================
>   
>   Note that there is no guarantee that every flag and associated mnemonic will
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 420510f6a545..38b19a757281 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>   #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
>   		[ilog2(VM_UFFD_MINOR)]	= "ui",
>   #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +		[ilog2(VM_SHADOW_STACK)] = "ss",
> +#endif
>   	};
>   	size_t i;
>   
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6f52c1e7c640..fb17cbd531ac 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -319,11 +319,13 @@ extern unsigned int kobjsize(const void *objp);
>   #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
>   #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
>   #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
>   #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
>   #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
>   #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
>   #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
>   #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
>   #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>   
>   #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -339,6 +341,12 @@ extern unsigned int kobjsize(const void *objp);
>   #endif
>   #endif /* CONFIG_ARCH_HAS_PKEYS */
>   
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +# define VM_SHADOW_STACK	VM_HIGH_ARCH_5 /* Should not be set with VM_SHARED */
> +#else
> +# define VM_SHADOW_STACK	VM_NONE
> +#endif
> +
>   #if defined(CONFIG_X86)
>   # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
>   #elif defined(CONFIG_PPC)

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13 19:57             ` Edgecombe, Rick P
@ 2023-06-14 10:43               ` szabolcs.nagy
  2023-06-14 16:57                 ` Edgecombe, Rick P
  2023-06-14 13:12               ` Mark Brown
  1 sibling, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-14 10:43 UTC (permalink / raw)
  To: Edgecombe, Rick P, broonie
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, linux-doc, peterz,
	corbet, linux-kernel, dethoma, jannh, mike.kravetz, pavel, bp,
	rdunlap, linux-api, rppt, jamorris, arnd, john.allen,
	bsingharora, x86, oleg, andrew.cooper3, keescook, gorcunov,
	fweimer, Yu, Yu-cheng, hpa, mingo, hjl.tools, debug, linux-mm,
	Syromiatnikov, Eugene, Torvalds, Linus, akpm, dave.hansen, Yang,
	Weijiang, Eranian, Stephane

The 06/13/2023 19:57, Edgecombe, Rick P wrote:
> On Tue, 2023-06-13 at 18:57 +0100, Mark Brown wrote:
> > > > > For my part, the thing I would really like to see unified as
> > > > > much as possible is at the app developer's interface
> > > > > (glibc/gcc). The idea would be to make it easy for app
> > > > > developers to know if their app supports shadow stack. There
> > > > > will probably be some differences, but it would be great if
> > > > > there was mostly the same behavior and a small list of
> > > > > differences. I'm thinking about the behavior of longjmp(),
> > > > > swapcontext(), etc.
> > >
> > > Yes, very much so.  sigaltcontext() too.
>
> For alt shadow stack's, this is what I came up with:
> https://lore.kernel.org/lkml/20220929222936.14584-40-rick.p.edgecombe@intel.com/
>
> Unfortunately it can't work automatically with sigaltstack(). Since it
> has to be a new thing anyway, it's been left for the future. I guess
> that might have a better chance of being cross arch.

i don't think you can add sigaltshstk later.

libgcc already has unwinder code for shstk and that cannot handle
discontinuous shadow stack. (may affect longjmp too depending on
how it is implemented)

we can change the unwinder now to know how to switch shstk when
it unwinds the signal frame and backport that to systems that
want to support shstk. or we can introduce a new elf marking
scheme just for sigaltshstk when it is added so incompatibility
can be detected. or we simply not support unwinding with
sigaltshstk which would make it pretty much useless in practice.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13 19:57             ` Edgecombe, Rick P
  2023-06-14 10:43               ` szabolcs.nagy
@ 2023-06-14 13:12               ` Mark Brown
  1 sibling, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-06-14 13:12 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, linux-doc, peterz,
	corbet, linux-kernel, dethoma, jannh, mike.kravetz, pavel, bp,
	rdunlap, linux-api, rppt, jamorris, arnd, john.allen,
	bsingharora, x86, oleg, andrew.cooper3, keescook, gorcunov,
	fweimer, Yu, Yu-cheng, hpa, mingo, szabolcs.nagy, hjl.tools,
	debug, linux-mm, Syromiatnikov, Eugene, Torvalds, Linus, akpm,
	dave.hansen, Yang, Weijiang, Eranian, Stephane

[-- Attachment #1: Type: text/plain, Size: 956 bytes --]

On Tue, Jun 13, 2023 at 07:57:37PM +0000, Edgecombe, Rick P wrote:

> For alt shadow stack's, this is what I came up with:
> https://lore.kernel.org/lkml/20220929222936.14584-40-rick.p.edgecombe@intel.com/

> Unfortunately it can't work automatically with sigaltstack(). Since it
> has to be a new thing anyway, it's been left for the future. I guess
> that might have a better chance of being cross arch.

Yeah, I've not seen and can't think of anything that's entirely
satisfactory either.  Like Szabolcs says I do think we need a story on
this.

> BTW, last time this series accidentally broke an arm config and made it
> all the way through the robots up to Linus. Would you mind giving
> patches 1-3 a check?

I'm in the middle of importing the whole series into my development
branch, but note that I'm only really working with arm64 not arm so
might miss stuff the bots would hit.  Hopefully there should be some
Tested-bys coming for arm64 anyway.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-14 10:43               ` szabolcs.nagy
@ 2023-06-14 16:57                 ` Edgecombe, Rick P
  2023-06-19  8:47                   ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-14 16:57 UTC (permalink / raw)
  To: broonie, szabolcs.nagy
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Yang, Weijiang,
	peterz, corbet, linux-kernel, dethoma, jannh, mike.kravetz,
	pavel, bp, rdunlap, linux-api, rppt, jamorris, arnd, john.allen,
	bsingharora, x86, oleg, fweimer, keescook, andrew.cooper3,
	gorcunov, Yu, Yu-cheng, hpa, mingo, hjl.tools, debug, linux-mm,
	Syromiatnikov, Eugene, linux-doc, Torvalds, Linus, dave.hansen,
	akpm, Eranian, Stephane

On Wed, 2023-06-14 at 11:43 +0100, szabolcs.nagy@arm.com wrote:
> The 06/13/2023 19:57, Edgecombe, Rick P wrote:
> > On Tue, 2023-06-13 at 18:57 +0100, Mark Brown wrote:
> > > > > > For my part, the thing I would really like to see unified as
> > > > > > much as possible is at the app developer's interface
> > > > > > (glibc/gcc). The idea would be to make it easy for app
> > > > > > developers to know if their app supports shadow stack. There
> > > > > > will probably be some differences, but it would be great if
> > > > > > there was mostly the same behavior and a small list of
> > > > > > differences. I'm thinking about the behavior of longjmp(),
> > > > > > swapcontext(), etc.
> > > > 
> > > > Yes, very much so.  sigaltcontext() too.
> > 
> > For alt shadow stack's, this is what I came up with:
> > https://lore.kernel.org/lkml/20220929222936.14584-40-rick.p.edgecombe@intel.com/
> > 
> > Unfortunately it can't work automatically with sigaltstack(). Since
> > it
> > has to be a new thing anyway, it's been left for the future. I
> > guess
> > that might have a better chance of being cross arch.
> 
> i don't think you can add sigaltshstk later.
> 
> libgcc already has unwinder code for shstk and that cannot handle
> discontinuous shadow stack.

Are you referring to the existing C++ exception unwinding code that
expects a different signal frame format? Yea this is a problem, but I
don't see how it's a problem with any solutions now that will be harder
later. I mentioned it when I brought up all the app compatibility
problems.[0]

The problem is that GCC expects a fixed 8-byte shadow stack
signal frame. The format in these patches is such that it can be
expanded for the sake of supporting alt shadow stack later, but it
happens to be a fixed 8 bytes for now, so it will work seamlessly with
these old GCCs. HJ has some patches to fix GCC to jump over a
dynamically sized shadow stack signal frame, but this of course won't
stop old GCCs from generating binaries that won't work with an
expanded frame.

I was waffling on whether it would be better to pad the shadow stack
[1] signal frame to start, this would break compatibility with any
binaries that use this -fnon-call-exceptions feature (if there are
any), but would set us up better for the future if we got away with it.
On one hand we are already juggling some compatibility issues so maybe
it's not too much worse, but on the other hand the kernel is trying its
best to be as compatible as it can given the situation. It doesn't
*need* to break this compatibility at this point.

In the end I thought it was better to deal with it later.

>  (may affect longjmp too depending on
> how it is implemented)

glibc's longjmp ignores everything it skips over and just does
INCSSP until it gets back to the setjmp point. So it is not affected by
the shadow stack signal frame format. I don't think we can support
longjmping off an alt shadow stack unless we enable WRSS or get the
kernel's help. So this was to be declared unsupported.

> 
> we can change the unwinder now to know how to switch shstk when
> it unwinds the signal frame and backport that to systems that
> want to support shstk. or we can introduce a new elf marking
> scheme just for sigaltshstk when it is added so incompatibility
> can be detected. or we simply not support unwinding with
> sigaltshstk which would make it pretty much useless in practice.

Yea, I was thinking along the same lines. Someday we could easily need
some new marker. Maybe because we want to add something, or maybe
because of the pre-existing userspace. In that case, this
implementation will get the ball rolling and we can learn more about
how shadow stack will be used. So if we need to break compatibility
with any apps, we would not really be in a different situation than we
are already in (if we are going to take proper care to not break
userspace). So if/when that happens all the learnings can go into the
clean break.

But if it's not clear, an unwinder that properly uses the format in
these patches should work from an alt shadow stack implemented like the
RFC linked earlier in the thread. At least it will be able to read back
the shadow stack starting from the alt shadow stack, but it can't
actually resume control flow from where it unwound to. For that we need
WRSS or some kernel help.

[0]
https://lore.kernel.org/lkml/7d8133c7e0186bdaeb3893c1c808148dc0d11945.camel@intel.com/


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
  2023-06-13 17:00       ` David Hildenbrand
@ 2023-06-14 17:00         ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-14 17:00 UTC (permalink / raw)
  To: david, rppt
  Cc: linux-doc, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, szabolcs.nagy, Schimpe, Christina, peterz,
	corbet, linux-kernel, dethoma, broonie, linux-s390, x86, pavel,
	bp, mike.kravetz, linux-api, john.allen, jamorris, arnd,
	bsingharora, xen-devel, jannh, oleg, fweimer, keescook, gorcunov,
	rdunlap, hpa, mingo, andrew.cooper3, hjl.tools, linux-arm-kernel,
	debug, linux-mm, Syromiatnikov, Eugene, Torvalds, Linus, akpm,
	dave.hansen, Yang, Weijiang, Eranian, Stephane

On Tue, 2023-06-13 at 19:00 +0200, David Hildenbrand wrote:
> On 13.06.23 18:19, Edgecombe, Rick P wrote:
> > On Tue, 2023-06-13 at 10:44 +0300, Mike Rapoport wrote:
> > > > Previous patches have done the first step, so next move the
> > > > callers
> > > > that
> > > > don't have a VMA to pte_mkwrite_novma(). Also do the same for
> > > 
> > > I hear x86 maintainers asking to drop "previous patches" ;-)
> > > 
> > > Maybe
> > > This is the second step of the conversion that moves the callers
> > > ...
> > 
> > Really? I've not heard that. Just a strong aversion to "this
> > patch".
> > I've got feedback to say "previous patches" and not "the last
> > patch" so
> > it doesn't get stale. I guess it could be "previous changes".
> 
> Talking about patches make sense when discussing literal patches sent
> to 
> the mailing list. In the git log, it's commit, and "future commits"
> or 
> "follow-up work".
> 
> Yes, we use "patches" all of the time in commit logs, especially when
> we 
>   include the cover letter in the commit message (as done frequently
> in 
> the -mm tree).

I think I'll switch over to talking about "changes". If you talk about
commits it doesn't make as much sense when they are still just patches.
Thanks.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 04/42] mm: Re-introduce vm_flags to do_mmap()
  2023-06-13  0:10 ` [PATCH v9 04/42] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
  2023-06-14  8:49   ` David Hildenbrand
@ 2023-06-14 23:30   ` Mark Brown
  1 sibling, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-06-14 23:30 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, Yu-cheng Yu,
	Peter Collingbourne, Pengfei Xu

[-- Attachment #1: Type: text/plain, Size: 555 bytes --]

On Mon, Jun 12, 2023 at 05:10:30PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> There were no more callers passing vm_flags to do_mmap(), and vm_flags was
> removed from the function's input by:
> 
>     commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").
> 
> There is a new user now.  Shadow stack allocation passes VM_SHADOW_STACK to
> do_mmap().  Thus, re-introduce vm_flags to do_mmap().

Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 14/42] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2023-06-13  0:10 ` [PATCH v9 14/42] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
  2023-06-14  8:50   ` David Hildenbrand
@ 2023-06-14 23:31   ` Mark Brown
  1 sibling, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-06-14 23:31 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, Yu-cheng Yu, Pengfei Xu

[-- Attachment #1: Type: text/plain, Size: 463 bytes --]

On Mon, Jun 12, 2023 at 05:10:40PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> New hardware extensions implement support for shadow stack memory, such
> as x86 Control-flow Enforcement Technology (CET). Add a new VM flag to
> identify these areas, for example, to be used to properly indicate shadow
> stack PTEs to the hardware.

Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 16/42] mm: Add guard pages around a shadow stack.
  2023-06-13  0:10 ` [PATCH v9 16/42] mm: Add guard pages around a shadow stack Rick Edgecombe
@ 2023-06-14 23:34   ` Mark Brown
  2023-06-22 18:21   ` Matthew Wilcox
  1 sibling, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-06-14 23:34 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, Yu-cheng Yu, Pengfei Xu

[-- Attachment #1: Type: text/plain, Size: 339 bytes --]

On Mon, Jun 12, 2023 at 05:10:42PM -0700, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.

Reviewed-by: Mark Brown <broonie@kernel.org>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 17/42] mm: Warn on shadow stack memory in wrong vma
  2023-06-13  0:10 ` [PATCH v9 17/42] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
@ 2023-06-14 23:35   ` Mark Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-06-14 23:35 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, Pengfei Xu

[-- Attachment #1: Type: text/plain, Size: 339 bytes --]

On Mon, Jun 12, 2023 at 05:10:43PM -0700, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.

Reviewed-by: Mark Brown <broonie@kernel.org>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 00/42] Shadow stacks for userspace
  2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
                   ` (42 preceding siblings ...)
  2023-06-13  1:34 ` [PATCH v9 00/42] Shadow stacks for userspace Linus Torvalds
@ 2023-06-14 23:45 ` Mark Brown
  43 siblings, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-06-14 23:45 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds

[-- Attachment #1: Type: text/plain, Size: 698 bytes --]

On Mon, Jun 12, 2023 at 05:10:26PM -0700, Rick Edgecombe wrote:

> This series implements Shadow Stacks for userspace using x86's Control-flow 
> Enforcement Technology (CET). CET consists of two related security
> features: shadow stacks and indirect branch tracking. This series
> implements just the  shadow stack part of this feature, and just for
> userspace.

I've been using the generic changes here for the work I've been doing on
arm64's similar GCS feature, while that is still very much WIP and
hasn't been posted anywhere most of the common code here has been
exercised.  I've been through the patches that I've specifically checked
or used.  Thanks for all the work here.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-06-13  0:10 ` [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Rick Edgecombe
                     ` (2 preceding siblings ...)
  2023-06-13 12:26   ` David Hildenbrand
@ 2023-06-19  4:27   ` Helge Deller
  2023-07-14 22:57   ` Mark Brown
  4 siblings, 0 replies; 151+ messages in thread
From: Helge Deller @ 2023-06-19  4:27 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, linux-kernel; +Cc: linux-parisc

On 6/13/23 02:10, Rick Edgecombe wrote:
> The x86 Shadow stack feature includes a new type of memory called shadow
> stack. This shadow stack memory has some unusual properties, which requires
> some core mm changes to function properly.
>
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
> that the x86 implementation of it can know whether to create regular
> writable memory or shadow stack memory.
>
> But there are a couple of challenges to this. Modifying the signatures of
> each arch pte_mkwrite() implementation would be error prone because some
> are generated with macros and would need to be re-implemented. Also, some
> pte_mkwrite() callers operate on kernel memory without a VMA.
>
> So this can be done in a three step process. First pte_mkwrite() can be
> renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
> added that just calls pte_mkwrite_novma(). Next callers without a VMA can
> be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
> can be changed to take/pass a VMA.
>
> Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and
> adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same
> pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(),
> create a new arch config HAS_HUGE_PAGE that can be used to tell if
> pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the
> compiler would not be able to find pmd_mkwrite_novma().
>
> No functional change.
>
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-snps-arc@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-csky@vger.kernel.org
> Cc: linux-hexagon@vger.kernel.org
> Cc: linux-ia64@vger.kernel.org
> Cc: loongarch@lists.linux.dev
> Cc: linux-m68k@lists.linux-m68k.org
> Cc: Michal Simek <monstr@monstr.eu>
> Cc: Dinh Nguyen <dinguyen@kernel.org>
> Cc: linux-mips@vger.kernel.org
> Cc: openrisc@lists.librecores.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-riscv@lists.infradead.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-um@lists.infradead.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/
> ---
> Hi Non-x86 Arch’s,
>
> x86 has a feature that allows for the creation of a special type of
> writable memory (shadow stack) that is only writable in limited specific
> ways. Previously, changes were proposed to core MM code to teach it to
> decide when to create normally writable memory or the special shadow stack
> writable memory, but David Hildenbrand suggested[0] to change
> pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
> moved into x86 code. Later Linus suggested a less error-prone way[1] to go
> about this after the first attempt had a bug.
>
> Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
> changes. So that is why you are seeing some patches out of a big x86
> series pop up in your arch mailing list. There is no functional change.
> After this refactor, the shadow stack series goes on to use the arch
> helpers to push arch memory details inside arch/x86 and other arch's
> with upcoming shadow stack features.
>
> Testing was just 0-day build testing.
>
> Hopefully that is enough context. Thanks!
>
> [0] https://lore.kernel.org/lkml/0e29a2d0-08d8-bcd6-ff26-4bea0e4037b0@redhat.com/
> [1] https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/
> ---
>   arch/parisc/include/asm/pgtable.h            |  2 +-

Acked-by: Helge Deller <deller@gmx.de> # parisc

Helge

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-14 16:57                 ` Edgecombe, Rick P
@ 2023-06-19  8:47                   ` szabolcs.nagy
  2023-06-19 16:44                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-19  8:47 UTC (permalink / raw)
  To: Edgecombe, Rick P, broonie
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Yang, Weijiang,
	peterz, corbet, linux-kernel, dethoma, jannh, mike.kravetz,
	pavel, bp, rdunlap, linux-api, rppt, jamorris, arnd, john.allen,
	bsingharora, x86, oleg, fweimer, keescook, andrew.cooper3,
	gorcunov, Yu, Yu-cheng, hpa, mingo, hjl.tools, debug, linux-mm,
	Syromiatnikov, Eugene, linux-doc, Torvalds, Linus, dave.hansen,
	akpm, Eranian, Stephane, nd

The 06/14/2023 16:57, Edgecombe, Rick P wrote:
> On Wed, 2023-06-14 at 11:43 +0100, szabolcs.nagy@arm.com wrote:
> > i don't think you can add sigaltshstk later.
> > 
> > libgcc already has unwinder code for shstk and that cannot handle
> > discontinuous shadow stack.
> 
> Are you referring to the existing C++ exception unwinding code that
> expects a different signal frame format? Yea this is a problem, but I
> don't see how it's a problem with any solutions now that will be harder
> later. I mentioned it when I brought up all the app compatibility
> problems.[0]

there is old unwinder code incompatible with the current patches,
but that was fixed. however the new unwind code assumes signal
entry pushes one extra token that just has to be popped from the
shstk. this abi cannot be expanded, which means

1) kernel cannot push more tokens for more integrity checks
   (or to add whatever other features)

2) sigaltshstk cannot work.

if the unwinder instead interprets the token to be the old ssp and
either incssp or switch to that ssp (depending on continuous or
discontinuous shstk, which the unwinder can detect), then 1) and 2)
are fixed.

but currently the distributed unwinder binary is incompatible with
1) and 2) so sigaltshstk cannot be added later. breaking the unwind
abi is not acceptable.

> The problem is that gcc expects a fixed 8 byte sized shadow stack
> signal frame. The format in these patches is such that it can be
> expanded for the sake of supporting alt shadow stack later, but it
> happens to be a fixed 8 bytes for now, so it will work seamlessly with
> these old gcc's. HJ has some patches to fix GCC to jump over a
> dynamically sized shadow stack signal frame, but this of course won't
> stop old gcc's from generating binaries that won't work with an
> expanded frame.
> 
> I was waffling on whether it would be better to pad the shadow stack
> [1] signal frame to start, this would break compatibility with any
> binaries that use this -fnon-call-exceptions feature (if there are
> any), but would set us up better for the future if we got away with it.

i don't see how -fnon-call-exceptions is relevant.

you can unwind from a signal handler (this is not a c++ question
but unwind abi question) and in practice eh works e.g. if the
signal is raised (sync or async) in a frame where there are no
cleanup handlers registered. in practice code rarely relies on
this (because it's not valid in c++). the main user of this i
know of is the glibc cancellation implementation. (that is special
in that it never catches the exception so ssp does not have to be
updated for things to work, but in principle the unwinder should
still verify the entries on shstk, otherwise the security
guarantees are broken and the cleanup handlers can be hijacked.
there are glibc abi issues that prevent fixing this, but in other
libcs this may be still relevant).

> On one hand we are already juggling some compatibility issues so maybe
> it's not too much worse, but on the other hand the kernel is trying its
> best to be as compatible as it can given the situation. It doesn't
> *need* to break this compatibility at this point.
> 
> In the end I thought it was better to deal with it later.
> 
> >  (may affect longjmp too depending on
> > how it is implemented)
> 
> glibc's longjmp ignores anything everything it skips over and just does
> INCSSP until it gets back to the setjmp point. So it is not affected by
> the shadow stack signal frame format. I don't think we can support
> longjmping off an alt shadow stack unless we enable WRSS or get the
> kernel's help. So this was to be declared as unsupported.

longjmp can support discontinuous shadow stack without wrss.
the current code proposed to glibc does not, which is wrong
(it breaks altshstk and green thread users like qemu for no
good reason).

declaring things unsupported means you have to go around to
audit and mark binaries accordingly.

> > we can change the unwinder now to know how to switch shstk when
> > it unwinds the signal frame and backport that to systems that
> > want to support shstk. or we can introduce a new elf marking
> > scheme just for sigaltshstk when it is added so incompatibility
> > can be detected. or we simply not support unwinding with
> > sigaltshstk which would make it pretty much useless in practice.
> 
> Yea, I was thinking along the same lines. Someday we could easily need
> some new marker. Maybe because we want to add something, or maybe
> because of the pre-existing userspace. In that case, this
> implementation will get the ball rolling and we can learn more about
> how shadow stack will be used. So if we need to break compatibility
> with any apps, we would not really be in a different situation than we
> are already in (if we are going to take proper care to not break
> userspace). So if/when that happens all the learning's can go into the
> clean break.
> 
> But if it's not clear, unwinder's that properly use the format in these
> patches should work from an alt shadow stack implemented like that RFC
> linked earlier in the thread. At least it will be able to read back the
> shadow stack starting from the alt shadow stack, it can't actually
> resume control flow from where it unwound to. For that we need WRSS or
> some kernel help.

wrss is not needed to resume control flow on a different shstk.

(if you needed wrss then the map_shadow_stack would be useless.)

> 
> [0]
> https://lore.kernel.org/lkml/7d8133c7e0186bdaeb3893c1c808148dc0d11945.camel@intel.com/
> 

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-19  8:47                   ` szabolcs.nagy
@ 2023-06-19 16:44                     ` Edgecombe, Rick P
  2023-06-20  9:17                       ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-19 16:44 UTC (permalink / raw)
  To: broonie, szabolcs.nagy
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, akpm, peterz, corbet,
	linux-kernel, nd, dethoma, jannh, x86, pavel, bp, rdunlap,
	linux-api, rppt, jamorris, arnd, john.allen, bsingharora,
	mike.kravetz, andrew.cooper3, oleg, keescook, gorcunov, fweimer,
	Yu, Yu-cheng, hpa, mingo, hjl.tools, debug, linux-mm,
	Syromiatnikov, Eugene, Yang, Weijiang, linux-doc, dave.hansen,
	Torvalds, Linus, Eranian, Stephane

On Mon, 2023-06-19 at 09:47 +0100, szabolcs.nagy@arm.com wrote:
> The 06/14/2023 16:57, Edgecombe, Rick P wrote:
> > On Wed, 2023-06-14 at 11:43 +0100, szabolcs.nagy@arm.com wrote:
> > > i dont think you can add sigaltshstk later.
> > > 
> > > libgcc already has unwinder code for shstk and that cannot handle
> > > discontinous shadow stack.
> > 
> > Are you referring to the existing C++ exception unwinding code that
> > expects a different signal frame format? Yea this is a problem, but
> > I
> > don't see how it's a problem with any solutions now that will be
> > harder
> > later. I mentioned it when I brought up all the app compatibility
> > problems.[0]
> 
> there is old unwinder code incompatible with the current patches,
> but that was fixed. however the new unwind code assumes signal
> entry pushes one extra token that just have to be popped from the
> shstk. this abi cannot be expanded which means
> 
> 1) kernel cannot push more tokens for more integrity checks
>    (or to add whatever other features)
> 
> 2) sigaltshstk cannot work.
> 
> if the unwinder instead interprets the token to be the old ssp and
> either incssp or switch to that ssp (depending on continous or
> discontinous shstk, which the unwinder can detect), then 1) and 2)
> are fixed.
> 
> but currently the distributed unwinder binary is incompatible with
> 1) and 2) so sigaltshstk cannot be added later. breaking the unwind
> abi is not acceptable.

Can you point me to what you are talking about? I tested adding fields
to the shadow stack on top of these changes. It worked with HJ's new
(unposted I think) C++ changes for gcc. Adding fields doesn't work with
the existing gcc because it assumes a fixed size.

> 
> > The problem is that gcc expects a fixed 8 byte sized shadow stack
> > signal frame. The format in these patches is such that it can be
> > expanded for the sake of supporting alt shadow stack later, but it
> > happens to be a fixed 8 bytes for now, so it will work seamlessly
> > with
> > these old gcc's. HJ has some patches to fix GCC to jump over a
> > dynamically sized shadow stack signal frame, but this of course
> > won't
> > stop old gcc's from generating binaries that won't work with an
> > expanded frame.
> > 
> > I was waffling on whether it would be better to pad the shadow
> > stack
> > [1] signal frame to start, this would break compatibility with any
> > binaries that use this -fnon-call-exceptions feature (if there are
> > any), but would set us up better for the future if we got away with
> > it.
> 
> i don't see how -fnon-call-exceptions is relevant.

It uses unwinder code that does assume a fixed shadow stack signal
frame size. Since gcc 8.5 I think. So these compilers will continue to
generate code that assumes a fixed frame size. This is one of the
limitations we have for not moving to a new elf bit.

> 
> you can unwind from a signal handler (this is not a c++ question
> but unwind abi question) and in practice eh works e.g. if the
> signal is raised (sync or async) in a frame where there are no
> cleanup handlers registered. in practice code rarely relies on
> this (because it's not valid in c++). the main user of this i
> know of is the glibc cancellation implmentation. (that is special
> in that it never catches the exception so ssp does not have to be
> updated for things to work, but in principle the unwinder should
> still verify the entries on shstk, otherwise the security
> guarantees are broken and the cleanup handlers can be hijacked.
> there are glibc abi issues that prevent fixing this, but in other
> libcs this may be still relevant).

I'm not fully sure what you are trying to say here. The glibc shadow
stack stuff that is there today supports unwinding through a signal
handler. The longjmp code (unlike -fnon-call-exceptions) doesn't look at
the shstk signal frame. It just does INCSSP until it reaches its
desired SSP, not caring what it is INCSSPing over.

> 
> > On one hand we are already juggling some compatibility issues so
> > maybe
> > it's not too much worse, but on the other hand the kernel is trying
> > its
> > best to be as compatible as it can given the situation. It doesn't
> > *need* to break this compatibility at this point.
> > 
> > In the end I thought it was better to deal with it later.
> > 
> > >  (may affect longjmp too depending on
> > > how it is implemented)
> > 
> > glibc's longjmp ignores anything everything it skips over and just
> > does
> > INCSSP until it gets back to the setjmp point. So it is not
> > affected by
> > the shadow stack signal frame format. I don't think we can support
> > longjmping off an alt shadow stack unless we enable WRSS or get the
> > kernel's help. So this was to be declared as unsupported.
> 
> longjmp can support discontinous shadow stack without wrss.
> the current code proposed to glibc does not, which is wrong
> (it breaks altshstk and green thread users like qemu for no
> good reason).
> 
> declaring things unsupported means you have to go around to
> audit and mark binaries accordingly.

The idea that all apps can be supported without auditing has been
assumed to be impossible by everyone I've talked to, including the
GLIBC developers deeply versed in the architectural limitations of this
feature. So if you have a magic solution, then that is a notable claim
and I think you should propose it instead of just alluding to the fact
that there is one.

The only non-WRSS "longjmp from an alt shadow stack solution" that I
can think of would have something like a new syscall performing some
limited shadow stack actions normally prohibited in userspace by the
architecture. We'd have to think through how this would impact the
security. There are a lot of security/compatibility tradeoffs to parse
in this. So also, just because something can be done, doesn't mean we
should do it. I think the philosophy at this point is, let's get the
basics working that can support most apps, and learn more about stuff
like where this bar is in the real world.

> 
> > > we can change the unwinder now to know how to switch shstk when
> > > it unwinds the signal frame and backport that to systems that
> > > want to support shstk. or we can introduce a new elf marking
> > > scheme just for sigaltshstk when it is added so incompatibility
> > > can be detected. or we simply not support unwinding with
> > > sigaltshstk which would make it pretty much useless in practice.
> > 
> > Yea, I was thinking along the same lines. Someday we could easily
> > need
> > some new marker. Maybe because we want to add something, or maybe
> > because of the pre-existing userspace. In that case, this
> > implementation will get the ball rolling and we can learn more
> > about
> > how shadow stack will be used. So if we need to break compatibility
> > with any apps, we would not really be in a different situation than
> > we
> > are already in (if we are going to take proper care to not break
> > userspace). So if/when that happens all the learning's can go into
> > the
> > clean break.
> > 
> > But if it's not clear, unwinder's that properly use the format in
> > these
> > patches should work from an alt shadow stack implemented like that
> > RFC
> > linked earlier in the thread. At least it will be able to read back
> > the
> > shadow stack starting from the alt shadow stack, it can't actually
> > resume control flow from where it unwound to. For that we need WRSS
> > or
> > some kernel help.
> 
> wrss is not needed to resume control flow on a different shstk.

WRSS lets you resume control flow at arbitrary points by writing your
own restore token. Otherwise there are restrictions.

> 
> (if you needed wrss then the map_shadow_stack would be useless.)

map_shadow_stack is usually prepopulated with a token, otherwise it
does need WRSS to create one on it.

> 
> > 
> > [0]
> > https://lore.kernel.org/lkml/7d8133c7e0186bdaeb3893c1c808148dc0d11945.camel@intel.com/
> > 


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-19 16:44                     ` Edgecombe, Rick P
@ 2023-06-20  9:17                       ` szabolcs.nagy
  2023-06-20 19:34                         ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-20  9:17 UTC (permalink / raw)
  To: Edgecombe, Rick P, broonie
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, akpm, peterz, corbet,
	linux-kernel, nd, dethoma, jannh, x86, pavel, bp, rdunlap,
	linux-api, rppt, jamorris, arnd, john.allen, bsingharora,
	mike.kravetz, andrew.cooper3, oleg, keescook, gorcunov, fweimer,
	Yu, Yu-cheng, hpa, mingo, hjl.tools, debug, linux-mm,
	Syromiatnikov, Eugene, Yang, Weijiang, linux-doc, dave.hansen,
	Torvalds, Linus, Eranian, Stephane

The 06/19/2023 16:44, Edgecombe, Rick P wrote:
> On Mon, 2023-06-19 at 09:47 +0100, szabolcs.nagy@arm.com wrote:
> > The 06/14/2023 16:57, Edgecombe, Rick P wrote:
> > > On Wed, 2023-06-14 at 11:43 +0100, szabolcs.nagy@arm.com wrote:
> > > > i dont think you can add sigaltshstk later.
> > > > 
> > > > libgcc already has unwinder code for shstk and that cannot handle
> > > > discontinous shadow stack.
> > > 
> > > Are you referring to the existing C++ exception unwinding code that
> > > expects a different signal frame format? Yea this is a problem, but
> > > I
> > > don't see how it's a problem with any solutions now that will be
> > > harder
> > > later. I mentioned it when I brought up all the app compatibility
> > > problems.[0]
> > 
> > there is old unwinder code incompatible with the current patches,
> > but that was fixed. however the new unwind code assumes signal
> > entry pushes one extra token that just have to be popped from the
> > shstk. this abi cannot be expanded which means
> > 
> > 1) kernel cannot push more tokens for more integrity checks
> >    (or to add whatever other features)
> > 
> > 2) sigaltshstk cannot work.
> > 
> > if the unwinder instead interprets the token to be the old ssp and
> > either incssp or switch to that ssp (depending on continous or
> > discontinous shstk, which the unwinder can detect), then 1) and 2)
> > are fixed.
> > 
> > but currently the distributed unwinder binary is incompatible with
> > 1) and 2) so sigaltshstk cannot be added later. breaking the unwind
> > abi is not acceptable.
> 
> Can you point me to what you are talking about? I tested adding fields
> to the shadow stack on top of these changes. It worked with HJ's new
> (unposted I think) C++ changes for gcc. Adding fields doesn't work with
> the existing gcc because it assumes a fixed size.

if there is a fix that's good, i haven't seen it.

my point was that the current unwinder works with current kernel
patches, but does not allow future extensions, which prevents
sigaltshstk from working. the unwinder is not versioned so this cannot
be fixed later. it only works if distros ensure shstk is disabled
until the unwinder is fixed. (however there is no way to detect
old unwinder if somebody builds gcc from source.)

also note that there is generic code in the unwinder that will
deal with this and likely the x86 patches will conflict with
arm and riscv etc patches that try to fix the same issue..
so posting patches on the tools side of the abi would be useful
at this point.

> > > The problem is that gcc expects a fixed 8 byte sized shadow stack
> > > signal frame. The format in these patches is such that it can be
> > > expanded for the sake of supporting alt shadow stack later, but it
> > > happens to be a fixed 8 bytes for now, so it will work seamlessly
> > > with
> > > these old gcc's. HJ has some patches to fix GCC to jump over a
> > > dynamically sized shadow stack signal frame, but this of course
> > > won't
> > > stop old gcc's from generating binaries that won't work with an
> > > expanded frame.
> > > 
> > > I was waffling on whether it would be better to pad the shadow
> > > stack
> > > [1] signal frame to start, this would break compatibility with any
> > > binaries that use this -fnon-call-exceptions feature (if there are
> > > any), but would set us up better for the future if we got away with
> > > it.
> > 
> > i don't see how -fnon-call-exceptions is relevant.
> 
> It uses unwinder code that does assume a fixed shadow stack signal
> frame size. Since gcc 8.5 I think. So these compilers will continue to
> generate code that assumes a fixed frame size. This is one of the
> limitations we have for not moving to a new elf bit.

how does "fixed shadow stack signal frame size" relate to
"-fnon-call-exceptions"?

if there were instruction boundaries within a function where the
ret addr is not yet pushed or already popped from the shstk then
the flag would be relevant, but since push/pop happens atomically
at function entry/return -fnon-call-exceptions makes no
difference as far as shstk unwinding is concerned.

> > you can unwind from a signal handler (this is not a c++ question
> > but unwind abi question) and in practice eh works e.g. if the
> > signal is raised (sync or async) in a frame where there are no
> > cleanup handlers registered. in practice code rarely relies on
> > this (because it's not valid in c++). the main user of this i
> > know of is the glibc cancellation implmentation. (that is special
> > in that it never catches the exception so ssp does not have to be
> > updated for things to work, but in principle the unwinder should
> > still verify the entries on shstk, otherwise the security
> > guarantees are broken and the cleanup handlers can be hijacked.
> > there are glibc abi issues that prevent fixing this, but in other
> > libcs this may be still relevant).
> 
> I'm not fully sure what you are trying to say here. The glibc shadow

you mentioned -fnon-call-exceptions and i'm saying even without that
unwinding from signal handler is relevant and has to track shstk and
ideally even update it for eh control transfer (i.e. properly switch
to a different shstk in case of discontinuous shstk).

> stack stuff that is there today supports unwinding through a signal
> handler. The longjmp code (unlike fnon-call-exections) doesn't look at
> the shstk signal frame. It just does INCSSP until it reaches its
> desired SSP, not caring what it is INCSSPing over.

x86 longjmp has different problems (cannot handle discontinuous
shstk now).

glibc cancellation is a mix of unwinding and special longjmp and
it is currently broken in that the unwind bit cannot verify the
return addresses. the unwinder does control transfer to cleanup
handlers so control flow hijack is possible in principle on a
corrupt stack (though i don't think cancellation is a practical
attack surface).

> > longjmp can support discontinous shadow stack without wrss.
> > the current code proposed to glibc does not, which is wrong
> > (it breaks altshstk and green thread users like qemu for no
> > good reason).
> > 
> > declaring things unsupported means you have to go around to
> > audit and mark binaries accordingly.
> 
> The idea that all apps can be supported without auditing has been
> assumed to be impossible by everyone I've talked to, including the
> GLIBC developers deeply versed in the architectural limitations of this
> feature. So if you have a magic solution, then that is a notable claim
> and I think you should propose it instead of just alluding to the fact
> that there is one.

there is no magic, longjmp should be implemented as:

	target_ssp = read from jmpbuf;
	current_ssp = read ssp;
	for (p = target_ssp; p != current_ssp; p--) {
		if (*p == restore-token) {
			// target_ssp is on a different shstk.
			switch_shstk_to(p);
			break;
		}
	}
	for (; p != target_ssp; p++)
		// ssp is now on the same shstk as target.
		inc_ssp();

this is what setcontext is doing and longjmp can do the same:
for programs that always longjmp within the same shstk the first
loop is just p = current_ssp, but it also works when longjmp
target is on a different shstk assuming nothing is running on
that shstk, which is only possible if there is a restore token
on top.

this implies if the kernel switches shstk on signal entry it has
to add a restore-token on the switched away shstk.

> The only non-WRSS "longjmp from an alt shadow stack solution" that I
> can think of would have something like a new syscall performing some
> limited shadow stack actions normally prohibited in userspace by the

there is setcontext and swapcontext already doing an shstk
switch, i don't see why you think longjmp is different and
needs magic syscalls or wrss.

> architecture. We'd have to think through how this would impact the
> security. There are a lot of security/compatibility tradeoffs to parse
> in this. So also, just because something can be done, doesn't mean we
> should do it. I think the philosophy at this point is, lets get the
> basics working that can support most apps, and learn more about stuff
> like where this bar is in the real world.

i think longjmp should really be discussed with libc devs,
not on the kernel list, since they know the practical
constraints and trade-offs better. however longjmp is
relevant for the signal abi design so it's not ideal to
push a linux abi and then have the libc side discussion
later..

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-20  9:17                       ` szabolcs.nagy
@ 2023-06-20 19:34                         ` Edgecombe, Rick P
  2023-06-21 11:36                           ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-20 19:34 UTC (permalink / raw)
  To: broonie, szabolcs.nagy
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Torvalds, Linus,
	peterz, corbet, linux-kernel, nd, dethoma, jannh, mike.kravetz,
	x86, bp, rdunlap, linux-api, rppt, jamorris, arnd, john.allen,
	bsingharora, debug, pavel, oleg, andrew.cooper3, keescook,
	gorcunov, fweimer, Yu, Yu-cheng, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, akpm, Yang, Weijiang, dave.hansen,
	linux-doc, Eranian, Stephane

On Tue, 2023-06-20 at 10:17 +0100, szabolcs.nagy@arm.com wrote:
> if there is a fix that's good, i haven't seen it.
> 
> my point was that the current unwinder works with current kernel
> patches, but does not allow future extensions which prevents
> sigaltshstk to work. the unwinder is not versioned so this cannot
> be fixed later. it only works if distros ensure shstk is disabled
> until the unwinder is fixed. (however there is no way to detect
> old unwinder if somebody builds gcc from source.)

This is a problem the kernel is having to deal with, not causing. The
userspace changes were upstreamed before the kernel. Userspace folks
are adamantly against moving to a new elf bit, to start over with a
clean slate. I tried everything to influence this and was not
successful. So I'm still not sure what the proposal here is for the
kernel.

I am guessing that the -fnon-call-exceptions/expanded frame size
incompatibilities could end up causing something to grow an opt-in at
some point.

> 
> also note that there is generic code in the unwinder that will
> deal with this and likely the x86 patches will conflict with
> arm and riscv etc patches that try to fix the same issue..
> so posting patches on the tools side of the abi would be useful
> at this point.

The glibc patches are unfortunately mostly upstream already. See HJ for
the diff that targets the new enabling interface. From lessons learned
earlier in this effort, he was not going to push those changes before the
kernel support was upstream. There shouldn't be any glibc changes to
signal or longjmp stuff in those AFAIK though.

[ snip ]

> how does "fixed shadow stack signal frame size" relates to
> "-fnon-call-exceptions"?
> 
> if there were instruction boundaries within a function where the
> ret addr is not yet pushed or already poped from the shstk then
> the flag would be relevant, but since push/pop happens atomically
> at function entry/return -fnon-call-exceptions makes no
> difference as far as shstk unwinding is concerned.

As I said, the existing unwinding code for -fnon-call-exceptions
assumes a fixed shadow stack signal frame size of 8 bytes. Since the
exception is thrown out of a signal, it needs to know how to unwind
through the shadow stack signal frame.

[ snip ]

> there is no magic, longjmp should be implemented as:
> 
>         target_ssp = read from jmpbuf;
>         current_ssp = read ssp;
>         for (p = target_ssp; p != current_ssp; p--) {
>                 if (*p == restore-token) {
>                         // target_ssp is on a different shstk.
>                         switch_shstk_to(p);
>                         break;
>                 }
>         }
>         for (; p != target_ssp; p++)
>                 // ssp is now on the same shstk as target.
>                 inc_ssp();
> 
> this is what setcontext is doing and longjmp can do the same:
> for programs that always longjmp within the same shstk the first
> loop is just p = current_ssp, but it also works when longjmp
> target is on a different shstk assuming nothing is running on
> that shstk, which is only possible if there is a restore token
> on top.
> 
> this implies if the kernel switches shstk on signal entry it has
> to add a restore-token on the switched away shstk.

I actually did a POC for this, but rejected it. The problem is, if
there is a shadow stack overflow at that point then the kernel can't
push the shadow stack token to the old stack. And shadow stack overflow
is exactly the alt shadow stack use case. So it doesn't really solve
the problem.

This reasoning was actually elaborated on when the alt shadow stack
patches were posted. And it looks like I previously pointed you at it.

This history here is quite long and complicated, but I’ve done my best
to summarize it in the cover letters. It would be helpful if you could
review those links.

[ snip ]

> i think longjmp should really be discussed with libc devs,
> not on the kernel list, since they know the practical
> constraints and trade-offs better. however longjmp is
> relevant for the signal abi design so it's not ideal to
> push a linux abi and then have the libc side discussion
> later..

It sounds like you are aware of the limitations the pre-existing
upstream userspace places on the shadow stack signal frame. We also
previously discussed how the kernel had to work around other aspects of
upstream userspace that assumed undecided kernel ABI. How on earth are
you getting that the kernel ABI is being pushed before input from the
userspace side? The situation is the opposite.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-20 19:34                         ` Edgecombe, Rick P
@ 2023-06-21 11:36                           ` szabolcs.nagy
  2023-06-21 18:54                             ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-21 11:36 UTC (permalink / raw)
  To: Edgecombe, Rick P, broonie
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Torvalds, Linus,
	peterz, corbet, linux-kernel, nd, dethoma, jannh, mike.kravetz,
	x86, bp, rdunlap, linux-api, rppt, jamorris, arnd, john.allen,
	bsingharora, debug, pavel, oleg, andrew.cooper3, keescook,
	gorcunov, fweimer, Yu, Yu-cheng, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, akpm, Yang, Weijiang, dave.hansen,
	linux-doc, Eranian, Stephane

The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> On Tue, 2023-06-20 at 10:17 +0100, szabolcs.nagy@arm.com wrote:
> > if there is a fix that's good, i haven't seen it.
> > 
> > my point was that the current unwinder works with current kernel
> > patches, but does not allow future extensions which prevents
> > sigaltshstk to work. the unwinder is not versioned so this cannot
> > be fixed later. it only works if distros ensure shstk is disabled
> > until the unwinder is fixed. (however there is no way to detect
> > old unwinder if somebody builds gcc from source.)
> 
> This is a problem the kernel is having to deal with, not causing. The
> userspace changes were upstreamed before the kernel. Userspace folks
> are adamantly against moving to a new elf bit, to start over with a
> clean slate. I tried everything to influence this and was not
> successful. So I'm still not sure what the proposal here is for the
> kernel.

i agree, the glibc and libgcc patches should not have been accepted
before a linux abi.

but the other direction also holds: the linux patches should not be
pushed before the userspace design is discussed. (the current code
upstream is wrong, and new code for the proposed linux abi is not
posted yet. this is not your fault, i'm saying it here, because the
discussion is here.)

> I am guessing that the fnon-call-exceptions/expanded frame size
> incompatibilities could end up causing something to grow an opt-in at
> some point.

there are independent userspace components and not every component
has a chance to opt-in.

> > how does "fixed shadow stack signal frame size" relates to
> > "-fnon-call-exceptions"?
> > 
> > if there were instruction boundaries within a function where the
> > ret addr is not yet pushed or already poped from the shstk then
> > the flag would be relevant, but since push/pop happens atomically
> > at function entry/return -fnon-call-exceptions makes no
> > difference as far as shstk unwinding is concerned.
> 
> As I said, the existing unwinding code for fnon-call-excecptions
> assumes a fixed shadow stack signal frame size of 8 bytes. Since the
> exception is thrown out of a signal, it needs to know how to unwind
> through the shadow stack signal frame.

sorry but there is some misunderstanding about -fnon-call-exceptions.

it is for emitting cleanup and exception handler data for a function
such that throwing from certain instructions within that function
works, while normally only throwing from calls works.

it is not about *unwinding* from an async signal handler, which is
-fasynchronous-unwind-tables and should always work on linux, nor for
dealing with cleanup/exception handlers above the interrupted frame
(likewise it works on linux without special cflags).

as far as i can tell the current unwinder handles shstk unwinding
correctly across signal handlers (sync or async and cleanup/exceptions
handlers too), i see no issue with "fixed shadow stack signal frame
size of 8 bytes" other than future extensions and discontinuous shstk.

> > there is no magic, longjmp should be implemented as:
> > 
> >         target_ssp = read from jmpbuf;
> >         current_ssp = read ssp;
> >         for (p = target_ssp; p != current_ssp; p--) {
> >                 if (*p == restore-token) {
> >                         // target_ssp is on a different shstk.
> >                         switch_shstk_to(p);
> >                         break;
> >                 }
> >         }
> >         for (; p != target_ssp; p++)
> >                 // ssp is now on the same shstk as target.
> >                 inc_ssp();
> > 
> > this is what setcontext is doing and longjmp can do the same:
> > for programs that always longjmp within the same shstk the first
> > loop is just p = current_ssp, but it also works when longjmp
> > target is on a different shstk assuming nothing is running on
> > that shstk, which is only possible if there is a restore token
> > on top.
> > 
> > this implies if the kernel switches shstk on signal entry it has
> > to add a restore-token on the switched away shstk.
> 
> I actually did a POC for this, but rejected it. The problem is, if
> there is a shadow stack overflow at that point then the kernel can't
> push the shadow stack token to the old stack. And shadow stack overflow
> is exactly the alt shadow stack use case. So it doesn't really solve
> the problem.

the restore token in the alt shstk case does not regress anything but
makes some use-cases work.

alt shadow stack is important if code tries to jump in and out of
signal handlers (dosemu does this with swapcontext) and for that a
restore token is needed.

alt shadow stack is important if the original shstk did not overflow
but the signal handler would overflow it (small thread stack, huge
sigaltstack case).

alt shadow stack is also important for crash reporting on shstk
overflow even if longjmp does not work then. longjmp to a makecontext
stack would still work and longjmp back to the original stack can be
made to mostly work by an altshstk option to overwrite the top entry
with a restore token on overflow (this can break unwinding though).


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-21 11:36                           ` szabolcs.nagy
@ 2023-06-21 18:54                             ` Edgecombe, Rick P
  2023-06-21 22:22                               ` Edgecombe, Rick P
                                                 ` (2 more replies)
  0 siblings, 3 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-21 18:54 UTC (permalink / raw)
  To: broonie, szabolcs.nagy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, linux-doc, peterz,
	corbet, nd, dethoma, jannh, linux-kernel, debug, pavel, bp,
	mike.kravetz, linux-api, rppt, jamorris, arnd, john.allen,
	rdunlap, bsingharora, oleg, andrew.cooper3, keescook, x86,
	gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, Torvalds, Linus, akpm, dave.hansen, Yang,
	Weijiang, Eranian, Stephane

On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > On Tue, 2023-06-20 at 10:17 +0100, szabolcs.nagy@arm.com wrote:
> > > > > > if there is a fix that's good, i haven't seen it.
> > > > > > 
> > > > > > my point was that the current unwinder works with current
> > > > > > kernel
> > > > > > patches, but does not allow future extensions which
> > > > > > prevents
> > > > > > sigaltshstk to work. the unwinder is not versioned so this
> > > > > > cannot
> > > > > > be fixed later. it only works if distros ensure shstk is
> > > > > > disabled
> > > > > > until the unwinder is fixed. (however there is no way to
> > > > > > detect
> > > > > > old unwinder if somebody builds gcc from source.)
> > > > 
> > > > This is a problem the kernel is having to deal with, not
> > > > causing. The userspace changes were upstreamed before the
> > > > kernel. Userspace folks
> > > > are adamantly against moving to a new elf bit, to start over
> > > > with a
> > > > clean slate. I tried everything to influence this and was not
> > > > successful. So I'm still not sure what the proposal here is for
> > > > the
> > > > kernel.
> > 
> > i agree, the glibc and libgcc patches should not have been accepted
> > before a linux abi.
> > 
> > but the other direction also holds: the linux patches should not be
> > pushed before the userspace design is discussed. (the current code
> > upstream is wrong, and new code for the proposed linux abi is not
> > posted yet. this is not your fault, i'm saying it here, because the
> > discussion is here.)

This series has been discussed with glibc/gcc developers regularly
throughout the enabling effort. In fact there have been ongoing
discussions about future shadow stack functionality. 

It's not like this feature has been a fast or hidden effort. You are
just walking into the tail end of it. (much of it predates my
involvement BTW, including the initial glibc support)

AFAIK HJ presented the enabling changes at some glibc meeting. The
signal side of glibc is unchanged from what is already upstream. So I'm
not sure characterizing it that way is fair. It seems you were not part
of those old discussions, but that might be because your interest is
new. In any case we are constrained by some of these earlier outcomes.
More on that below.

> > 
> > > > I am guessing that the fnon-call-exceptions/expanded frame size
> > > > incompatibilities could end up causing something to grow an
> > > > opt-in at
> > > > some point.
> > 
> > there are independent userspace components and not every component
> > has a chance to opt-in.
> > 
> > > > > > how does "fixed shadow stack signal frame size" relates to
> > > > > > "-fnon-call-exceptions"?
> > > > > > 
> > > > > > if there were instruction boundaries within a function
> > > > > > where the
> > > > > > ret addr is not yet pushed or already poped from the shstk
> > > > > > then
> > > > > > the flag would be relevant, but since push/pop happens
> > > > > > atomically
> > > > > > at function entry/return -fnon-call-exceptions makes no
> > > > > > difference as far as shstk unwinding is concerned.
> > > > 
> > > > As I said, the existing unwinding code for fnon-call-
> > > > exceptions
> > > > assumes a fixed shadow stack signal frame size of 8 bytes.
> > > > Since the
> > > > exception is thrown out of a signal, it needs to know how to
> > > > unwind
> > > > through the shadow stack signal frame.
> > 
> > sorry but there is some misunderstanding about -fnon-call-
> > exceptions.
> > 
> > it is for emitting cleanup and exception handler data for a
> > function
> > such that throwing from certain instructions within that function
> > works, while normally only throwing from calls work.
> > 
> > it is not about *unwinding* from an async signal handler, which is
> > -fasynchronous-unwind-tables and should always work on linux, nor
> > for
> > dealing with cleanup/exception handlers above the interrupted frame
> > (likewise it works on linux without special cflags).
> > 
> > as far as i can tell the current unwinder handles shstk unwinding
> > correctly across signal handlers (sync or async and
> > cleanup/exceptions
> > handlers too), i see no issue with "fixed shadow stack signal frame
> > size of 8 bytes" other than future extensions and discontinous
> > shstk.

HJ, can you link your patch that makes it extensible and we can clear
this up? Maybe the issue extends beyond fnon-call-exceptions, but that
is where I reproduced it.

> > 
> > > > > > there is no magic, longjmp should be implemented as:
> > > > > > 
> > > > > >         target_ssp = read from jmpbuf;
> > > > > >         current_ssp = read ssp;
> > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > >                 if (*p == restore-token) {
> > > > > >                         // target_ssp is on a different
> > > > > > shstk.
> > > > > >                         switch_shstk_to(p);
> > > > > >                         break;
> > > > > >                 }
> > > > > >         }
> > > > > >         for (; p != target_ssp; p++)
> > > > > >                 // ssp is now on the same shstk as target.
> > > > > >                 inc_ssp();
> > > > > > 
> > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > same:
> > > > > > for programs that always longjmp within the same shstk the
> > > > > > first
> > > > > > loop is just p = current_ssp, but it also works when
> > > > > > longjmp
> > > > > > target is on a different shstk assuming nothing is running
> > > > > > on
> > > > > > that shstk, which is only possible if there is a restore
> > > > > > token
> > > > > > on top.
> > > > > > 
> > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > it has
> > > > > > to add a restore-token on the switched away shstk.
> > > > 
> > > > I actually did a POC for this, but rejected it. The problem is,
> > > > if there is a shadow stack overflow at that point then the kernel
> > > > can't push the shadow stack token to the old stack. And shadow
> > > > stack overflow is exactly the alt shadow stack use case. So it
> > > > doesn't really solve the problem.
> > 
> > the restore token in the alt shstk case does not regress anything
> > but
> > makes some use-cases work.
> > 
> > alt shadow stack is important if code tries to jump in and out of
> > signal handlers (dosemu does this with swapcontext) and for that a
> > restore token is needed.
> > 
> > alt shadow stack is important if the original shstk did not
> > overflow
> > but the signal handler would overflow it (small thread stack, huge
> > sigaltstack case).
> > 
> > alt shadow stack is also important for crash reporting on shstk
> > overflow even if longjmp does not work then. longjmp to a
> > makecontext
> > stack would still work and longjmp back to the original stack can
> > be
> > made to mostly work by an altshstk option to overwrite the top
> > entry
> > with a restore token on overflow (this can break unwinding though).
> > 

There was previously a request to create an alt shadow stack for the
purpose of handling shadow stack overflow. So you are now suggesting
to exclude that and instead target a different use case for alt shadow
stack?

But I'm not sure how much we should change the ABI at this point since
we are constrained by existing userspace. If you read the history, we
may end up needing to deprecate the whole elf bit for this and other
reasons.

So should we struggle to find a way to grow the existing ABI without
disturbing the existing userspace? Or should we start with something,
finally, and see where we need to grow and maybe get a chance at a
fresh start to grow it?

Like, maybe 3 people will show up saying "hey, I *really* need to use
shadow stack and longjmp from a ucontext stack", and no one says
anything about shadow stack overflow. Then we know what to do. And
maybe dosemu decides it doesn't need to implement shadow stack (highly
likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
was created for dosemu, and the alt shadow stack patch adopted this
behavior. So it's speculation that there is even a problem in that
scenario.

Or maybe people just enable WRSS for longjmp() and directly jump back
to the setjmp() point. Do most people want fast setjmp/longjmp() at the
cost of a little security?

Even if, with enough discussion, we could optimize for all
hypotheticals without real user feedback, I don't see how it helps
users to hold shadow stack. So I think we should move forward with the
current ABI.



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-21 18:54                             ` Edgecombe, Rick P
@ 2023-06-21 22:22                               ` Edgecombe, Rick P
  2023-06-21 23:05                                 ` H.J. Lu
  2023-06-22  8:27                                 ` szabolcs.nagy
  2023-06-21 23:02                               ` H.J. Lu
  2023-06-22  9:18                               ` szabolcs.nagy
  2 siblings, 2 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-21 22:22 UTC (permalink / raw)
  To: broonie, szabolcs.nagy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, linux-doc, peterz,
	corbet, nd, dethoma, jannh, linux-kernel, debug, pavel, bp,
	mike.kravetz, linux-api, rppt, jamorris, arnd, john.allen,
	rdunlap, bsingharora, oleg, andrew.cooper3, keescook, x86,
	gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, Torvalds, Linus, akpm, dave.hansen, Yang,
	Weijiang, Eranian, Stephane

On Wed, 2023-06-21 at 11:54 -0700, Rick Edgecombe wrote:
> > > > > > > there is no magic, longjmp should be implemented as:
> > > > > > > 
> > > > > > >         target_ssp = read from jmpbuf;
> > > > > > >         current_ssp = read ssp;
> > > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > > >                 if (*p == restore-token) {
> > > > > > >                         // target_ssp is on a different
> > > > > > > shstk.
> > > > > > >                         switch_shstk_to(p);
> > > > > > >                         break;
> > > > > > >                 }
> > > > > > >         }
> > > > > > >         for (; p != target_ssp; p++)
> > > > > > >                 // ssp is now on the same shstk as
> > > > > > > target.
> > > > > > >                 inc_ssp();
> > > > > > > 
> > > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > > same:
> > > > > > > for programs that always longjmp within the same shstk
> > > > > > > the
> > > > > > > first
> > > > > > > loop is just p = current_ssp, but it also works when
> > > > > > > longjmp
> > > > > > > target is on a different shstk assuming nothing is
> > > > > > > running
> > > > > > > on
> > > > > > > that shstk, which is only possible if there is a restore
> > > > > > > token
> > > > > > > on top.
> > > > > > > 
> > > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > > it has
> > > > > > > to add a restore-token on the switched away shstk.

Wait a second, the claim is that the kernel should add a restore token
on the current shadow stack before handling a signal, to allow
unwinding from an alt shadow stack, right? But in this series there is
no alt shadow stack, so the signal will be handled on the current shadow
stack. If the user stays on the current shadow stack, the existing
simple INCSSP based solution will work.

If the user swapcontext()'s away while handling a signal (which *is*
currently supported) they will leave their own restore token on the old
stack. Hypothetically glibc could unwind back through a series of
ucontext stacks by pivoting, if it kept some metadata somewhere about
where to restore to. So there are actually already enough tokens to
make it back in this case, glibc just doesn't do this.

But how does the proposed token placed by the kernel on the original
stack help this problem? The longjmp() would have to be able to find
the location of the restore tokens somehow, which would not necessarily
be near the setjmp() point. The signal token could even be on a
different shadow stack.

So I think the above is short of a design for a universally compatible
longjmp().

Which makes me think that if we did want to make a more compatible
longjmp(), a better way to do it might be an arch_prctl that emits a
token at the current SSP. This would be loosening up the security
somewhat (it would have to be an opt-in), but less so than enabling
WRSS. But it would also be way simpler, work for all cases (I think),
and be faster (maybe?) than INCSSPing through a bunch of stacks.

I'm also not sure leaving a token on signal doesn't weaken the security
in its own way as well. Any thread could then swap to that token.
Whereas the shadow stack signal frame ssp pointer can only be used
from the shadow stack the signal was handled on.

So I think, in addition to blocking the shadow stack overflow use case
in the future, leaving a token behind on signal will not really help
longjmp(). (or at least I'm not following)


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-21 18:54                             ` Edgecombe, Rick P
  2023-06-21 22:22                               ` Edgecombe, Rick P
@ 2023-06-21 23:02                               ` H.J. Lu
  2023-06-22  7:40                                 ` szabolcs.nagy
  2023-06-22  9:18                               ` szabolcs.nagy
  2 siblings, 1 reply; 151+ messages in thread
From: H.J. Lu @ 2023-06-21 23:02 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: broonie, szabolcs.nagy, Xu, Pengfei, tglx, linux-arch, kcc,
	Lutomirski, Andy, nadav.amit, kirill.shutemov, david, Schimpe,
	Christina, linux-doc, peterz, corbet, nd, dethoma, jannh,
	linux-kernel, debug, pavel, bp, mike.kravetz, linux-api, rppt,
	jamorris, arnd, john.allen, rdunlap, bsingharora, oleg,
	andrew.cooper3, keescook, x86, gorcunov, Yu, Yu-cheng, fweimer,
	hpa, mingo, linux-mm, Syromiatnikov, Eugene, Torvalds, Linus,
	akpm, dave.hansen, Yang, Weijiang, Eranian, Stephane

On Wed, Jun 21, 2023 at 11:54 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > On Tue, 2023-06-20 at 10:17 +0100, szabolcs.nagy@arm.com wrote:
> > > > > > > if there is a fix that's good, i haven't seen it.
> > > > > > >
> > > > > > > my point was that the current unwinder works with current
> > > > > > > kernel
> > > > > > > patches, but does not allow future extensions which
> > > > > > > prevents
> > > > > > > sigaltshstk to work. the unwinder is not versioned so this
> > > > > > > cannot
> > > > > > > be fixed later. it only works if distros ensure shstk is
> > > > > > > disabled
> > > > > > > until the unwinder is fixed. (however there is no way to
> > > > > > > detect
> > > > > > > old unwinder if somebody builds gcc from source.)
> > > > >
> > > > > This is a problem the kernel is having to deal with, not
> > > > > causing. The userspace changes were upstreamed before the
> > > > > kernel. Userspace folks
> > > > > are adamantly against moving to a new elf bit, to start over
> > > > > with a
> > > > > clean slate. I tried everything to influence this and was not
> > > > > successful. So I'm still not sure what the proposal here is for
> > > > > the
> > > > > kernel.
> > >
> > > i agree, the glibc and libgcc patches should not have been accepted
> > > before a linux abi.
> > >
> > > but the other direction also holds: the linux patches should not be
> > > pushed before the userspace design is discussed. (the current code
> > > upstream is wrong, and new code for the proposed linux abi is not
> > > posted yet. this is not your fault, i'm saying it here, because the
> > > discussion is here.)
>
> This series has been discussed with glibc/gcc developers regularly
> throughout the enabling effort. In fact there have been ongoing
> discussions about future shadow stack functionality.
>
> It's not like this feature has been a fast or hidden effort. You are
> just walking into the tail end of it. (much of it predates my
> involvement BTW, including the initial glibc support)
>
> AFAIK HJ presented the enabling changes at some glibc meeting. The
> signal side of glibc is unchanged from what is already upstream. So I'm
> not sure characterizing it that way is fair. It seems you were not part
> of those old discussions, but that might be because your interest is
> new. In any case we are constrained by some of these earlier outcomes.
> More on that below.
>
> > >
> > > > > I am guessing that the fnon-call-exceptions/expanded frame size
> > > > > incompatibilities could end up causing something to grow an
> > > > > opt-in at
> > > > > some point.
> > >
> > > there are independent userspace components and not every component
> > > has a chance to opt-in.
> > >
> > > > > > > how does "fixed shadow stack signal frame size" relates to
> > > > > > > "-fnon-call-exceptions"?
> > > > > > >
> > > > > > > if there were instruction boundaries within a function
> > > > > > > where the
> > > > > > > ret addr is not yet pushed or already poped from the shstk
> > > > > > > then
> > > > > > > the flag would be relevant, but since push/pop happens
> > > > > > > atomically
> > > > > > > at function entry/return -fnon-call-exceptions makes no
> > > > > > > difference as far as shstk unwinding is concerned.
> > > > >
> > > > > As I said, the existing unwinding code for fnon-call-
> > > > > exceptions
> > > > > assumes a fixed shadow stack signal frame size of 8 bytes.
> > > > > Since the
> > > > > exception is thrown out of a signal, it needs to know how to
> > > > > unwind
> > > > > through the shadow stack signal frame.
> > >
> > > sorry but there is some misunderstanding about -fnon-call-
> > > exceptions.
> > >
> > > it is for emitting cleanup and exception handler data for a
> > > function
> > > such that throwing from certain instructions within that function
> > > works, while normally only throwing from calls work.
> > >
> > > it is not about *unwinding* from an async signal handler, which is
> > > -fasynchronous-unwind-tables and should always work on linux, nor
> > > for
> > > dealing with cleanup/exception handlers above the interrupted frame
> > > (likewise it works on linux without special cflags).
> > >
> > > as far as i can tell the current unwinder handles shstk unwinding
> > > correctly across signal handlers (sync or async and
> > > cleanup/exceptions
> > > handlers too), i see no issue with "fixed shadow stack signal frame
> > > size of 8 bytes" other than future extensions and discontinous
> > > shstk.
>
> HJ, can you link your patch that makes it extensible and we can clear
> this up? Maybe the issue extends beyond fnon-call-exceptions, but that
> is where I reproduced it.

Here is the patch:

https://gitlab.com/x86-gcc/gcc/-/commit/aab4c24b67b5f05b72e52a3eaae005c2277710b9

> > >
> > > > > > > there is no magic, longjmp should be implemented as:
> > > > > > >
> > > > > > >         target_ssp = read from jmpbuf;
> > > > > > >         current_ssp = read ssp;
> > > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > > >                 if (*p == restore-token) {
> > > > > > >                         // target_ssp is on a different
> > > > > > > shstk.
> > > > > > >                         switch_shstk_to(p);
> > > > > > >                         break;
> > > > > > >                 }
> > > > > > >         }
> > > > > > >         for (; p != target_ssp; p++)
> > > > > > >                 // ssp is now on the same shstk as target.
> > > > > > >                 inc_ssp();
> > > > > > >
> > > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > > same:
> > > > > > > for programs that always longjmp within the same shstk the
> > > > > > > first
> > > > > > > loop is just p = current_ssp, but it also works when
> > > > > > > longjmp
> > > > > > > target is on a different shstk assuming nothing is running
> > > > > > > on
> > > > > > > that shstk, which is only possible if there is a restore
> > > > > > > token
> > > > > > > on top.
> > > > > > >
> > > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > > it has
> > > > > > > to add a restore-token on the switched away shstk.
> > > > >
> > > > > I actually did a POC for this, but rejected it. The problem
> > > > > is, if there is a shadow stack overflow at that point then the
> > > > > kernel can't push the shadow stack token to the old stack. And
> > > > > shadow stack overflow is exactly the alt shadow stack use case.
> > > > > So it doesn't really solve the problem.
> > >
> > > the restore token in the alt shstk case does not regress anything
> > > but
> > > makes some use-cases work.
> > >
> > > alt shadow stack is important if code tries to jump in and out of
> > > signal handlers (dosemu does this with swapcontext) and for that a
> > > restore token is needed.
> > >
> > > alt shadow stack is important if the original shstk did not
> > > overflow
> > > but the signal handler would overflow it (small thread stack, huge
> > > sigaltstack case).
> > >
> > > alt shadow stack is also important for crash reporting on shstk
> > > overflow even if longjmp does not work then. longjmp to a
> > > makecontext
> > > stack would still work and longjmp back to the original stack can
> > > be
> > > made to mostly work by an altshstk option to overwrite the top
> > > entry
> > > with a restore token on overflow (this can break unwinding though).
> > >
>
> There was previously a request to create an alt shadow stack for the
> purpose of handling shadow stack overflow. So you are now suggesting to
> to exclude that and instead target a different use case for alt shadow
> stack?
>
> But I'm not sure how much we should change the ABI at this point since
> we are constrained by existing userspace. If you read the history, we
> may end up needing to deprecate the whole elf bit for this and other
> reasons.
>
> So should we struggle to find a way to grow the existing ABI without
> disturbing the existing userspace? Or should we start with something,
> finally, and see where we need to grow and maybe get a chance at a
> fresh start to grow it?
>
> Like, maybe 3 people will show up saying "hey, I *really* need to use
> shadow stack and longjmp from a ucontext stack", and no one says
> anything about shadow stack overflow. Then we know what to do. And
> maybe dosemu decides it doesn't need to implement shadow stack (highly
> likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> was created for dosemu, and the alt shadow stack patch adopted this
> behavior. So it's speculation that there is even a problem in that
> scenario.
>
> Or maybe people just enable WRSS for longjmp() and directly jump back
> to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> cost of a little security?
>
> Even if, with enough discussion, we could optimize for all
> hypotheticals without real user feedback, I don't see how it helps
> users to hold shadow stack. So I think we should move forward with the
> current ABI.
>
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-21 22:22                               ` Edgecombe, Rick P
@ 2023-06-21 23:05                                 ` H.J. Lu
  2023-06-21 23:15                                   ` Edgecombe, Rick P
  2023-06-22  8:27                                 ` szabolcs.nagy
  1 sibling, 1 reply; 151+ messages in thread
From: H.J. Lu @ 2023-06-21 23:05 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: broonie, szabolcs.nagy, Xu, Pengfei, tglx, linux-arch, kcc,
	Lutomirski, Andy, nadav.amit, kirill.shutemov, david, Schimpe,
	Christina, linux-doc, peterz, corbet, nd, dethoma, jannh,
	linux-kernel, debug, pavel, bp, mike.kravetz, linux-api, rppt,
	jamorris, arnd, john.allen, rdunlap, bsingharora, oleg,
	andrew.cooper3, keescook, x86, gorcunov, Yu, Yu-cheng, fweimer,
	hpa, mingo, linux-mm, Syromiatnikov, Eugene, Torvalds, Linus,
	akpm, dave.hansen, Yang, Weijiang, Eranian, Stephane

On Wed, Jun 21, 2023 at 3:23 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2023-06-21 at 11:54 -0700, Rick Edgecombe wrote:
> > > > > > > > there is no magic, longjmp should be implemented as:
> > > > > > > >
> > > > > > > >         target_ssp = read from jmpbuf;
> > > > > > > >         current_ssp = read ssp;
> > > > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > > > >                 if (*p == restore-token) {
> > > > > > > >                         // target_ssp is on a different
> > > > > > > > shstk.
> > > > > > > >                         switch_shstk_to(p);
> > > > > > > >                         break;
> > > > > > > >                 }
> > > > > > > >         }
> > > > > > > >         for (; p != target_ssp; p++)
> > > > > > > >                 // ssp is now on the same shstk as
> > > > > > > > target.
> > > > > > > >                 inc_ssp();
> > > > > > > >
> > > > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > > > same:
> > > > > > > > for programs that always longjmp within the same shstk
> > > > > > > > the
> > > > > > > > first
> > > > > > > > loop is just p = current_ssp, but it also works when
> > > > > > > > longjmp
> > > > > > > > target is on a different shstk assuming nothing is
> > > > > > > > running
> > > > > > > > on
> > > > > > > > that shstk, which is only possible if there is a restore
> > > > > > > > token
> > > > > > > > on top.
> > > > > > > >
> > > > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > > > it has
> > > > > > > > to add a restore-token on the switched away shstk.
>
> Wait a second, the claim is that the kernel should add a restore token
> on the current shadow stack before handling a signal, to allow to
> unwind from an alt shadow stack, right? But in this series there is not
> an alt shadow stack, so signal will be handled on the current shadow
> stack. If the user stays on the current shadow stack, the existing
> simple INCSSP based solution will work.
>
> If the user swapcontext()'s away while handling a signal (which *is*
> currently supported) they will leave their own restore token on the old
> stack. Hypothetically glibc could unwind back through a series of
> ucontext stacks by pivoting, if it kept some metadata somewhere about
> where to restore to. So there are actually already enough tokens to
> make it back in this case, glibc just doesn't do this.
>
> But how does the proposed token placed by the kernel on the original
> stack help this problem? The longjmp() would have to be able to find
> the location of the restore tokens somehow, which would not necessarily
> be near the setjmp() point. The signal token could even be on a
> different shadow stack.
>
> So I think the above is short of a design for a universally compatible
> longjmp().
>
> Which makes me think if we did want to make a more compatible longjmp()
> a better the way to do it might be an arch_prctl that emits a token at
> the current SSP. This would be loosening up the security somewhat (have
> to be an opt-in), but less so then enabling WRSS. But it would also be
> way simpler, work for all cases (I think), and be faster (maybe?) than
> INCSSPing through a bunch of stacks.

Since longjmp isn't required to be called after setjmp, leaving a restore
token doesn't work when longjmp isn't called.

> I'm also not sure leaving a token on signal doesn't weaken the security
> it it's own way as well. Any thread could then swap to that token.
> Where as the shadow stack signal frame ssp pointer can only be used
> from the shadow stack the signal was handled on.
>
> So I think, in addition to blocking the shadow stack overflow use case
> in the future, leaving a token behind on signal will not really help
> longjmp(). (or at least I'm not following)
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-21 23:05                                 ` H.J. Lu
@ 2023-06-21 23:15                                   ` Edgecombe, Rick P
  2023-06-22  1:07                                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-21 23:15 UTC (permalink / raw)
  To: hjl.tools
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	szabolcs.nagy, david, kirill.shutemov, Schimpe, Christina, Yang,
	Weijiang, peterz, corbet, nd, broonie, jannh, linux-kernel,
	debug, pavel, bp, rdunlap, linux-api, rppt, jamorris, arnd,
	john.allen, bsingharora, mike.kravetz, dethoma, andrew.cooper3,
	oleg, keescook, x86, gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo,
	linux-mm, Syromiatnikov, Eugene, linux-doc, Torvalds, Linus,
	dave.hansen, akpm, Eranian, Stephane

On Wed, 2023-06-21 at 16:05 -0700, H.J. Lu wrote:
> > Which makes me think if we did want to make a more compatible
> > longjmp()
> > a better the way to do it might be an arch_prctl that emits a token
> > at
> > the current SSP. This would be loosening up the security somewhat
> > (have
> > to be an opt-in), but less so then enabling WRSS. But it would also
> > be
> > way simpler, work for all cases (I think), and be faster (maybe?)
> > than
> > INCSSPing through a bunch of stacks.
> 
> Since longjmp isn't required to be called after setjmp, leaving a
> restore
> token doesn't work when longjmp isn't called.

Oh good point. Hmm.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-21 23:15                                   ` Edgecombe, Rick P
@ 2023-06-22  1:07                                     ` Edgecombe, Rick P
  2023-06-22  3:23                                       ` H.J. Lu
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-22  1:07 UTC (permalink / raw)
  To: hjl.tools
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	szabolcs.nagy, david, kirill.shutemov, Schimpe, Christina, Yang,
	Weijiang, peterz, corbet, nd, broonie, jannh, linux-kernel,
	debug, pavel, bp, rdunlap, linux-api, rppt, jamorris, arnd,
	john.allen, bsingharora, mike.kravetz, dethoma, andrew.cooper3,
	oleg, keescook, x86, gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo,
	linux-mm, Syromiatnikov, Eugene, linux-doc, Torvalds, Linus,
	dave.hansen, akpm, Eranian, Stephane

On Wed, 2023-06-21 at 16:15 -0700, Rick Edgecombe wrote:
> On Wed, 2023-06-21 at 16:05 -0700, H.J. Lu wrote:
> > > Which makes me think if we did want to make a more compatible
> > > longjmp()
> > > a better the way to do it might be an arch_prctl that emits a
> > > token
> > > at
> > > the current SSP. This would be loosening up the security somewhat
> > > (have
> > > to be an opt-in), but less so then enabling WRSS. But it would
> > > also
> > > be
> > > way simpler, work for all cases (I think), and be faster (maybe?)
> > > than
> > > INCSSPing through a bunch of stacks.
> > 
> > Since longjmp isn't required to be called after setjmp, leaving a
> > restore
> > token doesn't work when longjmp isn't called.
> 
> Oh good point. Hmm.

Just had a quick chat with HJ on this. It seems like it *might* be able
to be made to work. How it would go is setjmp() could act as a wrapper
by calling its own return address (the function that called setjmp()).
This would mean in the case of longjmp() not being called, control flow
would return through setjmp() before returning from the calling function.
This would allow libc to do a RSTORSSP when returning through setjmp()
in the case where longjmp() was not called, essentially skipping over the
kernel-placed restore token, and then return from setjmp() like normal.
In the case of longjmp() being called, it could RSTORSSP directly to the
token, and then return from setjmp().

Another option could be getting the compilers help to do the RSTORSSP
in the case of longjmp() not being called. Apparently compilers are
aware of setjmp() and already do special things around it (makes sense
I guess, but news to me).

And also, this all would actually work with IBT, because the compiler
knows already to add an endbr at that point right after setjmp().

I think neither of us was ready to bet on it, but thought maybe it
could work. And even if it works, it's much more complicated than I
first thought, so I don't like it as much. It's also unclear what a
change like that would mean for security.

As for unwinding through the existing swapcontext()-placed restore
tokens, the problem was as assumed - that it's difficult to find them.
Even brute force options like doing manual searches for a nearby token
to use turned up edge cases pretty quickly. So I think that kind of
leaves us where we were originally, with no known solutions that
wouldn't require breaking kernel ABI changes.


Are you interested in helping get longjmp() from a ucontext stack
working for shadow stack? One other thing that came up in the
conversation was that while it is known that some apps are doing this,
there are no tests for mixing longjmp and ucontext in glibc. So we may
not know which combinations of mixing them together even work in the
non-shadow stack case.

It could be useful to add some tests for this to glibc and we could get
some clarity on what behaviors shadow stack would actually need to
support.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22  1:07                                     ` Edgecombe, Rick P
@ 2023-06-22  3:23                                       ` H.J. Lu
  0 siblings, 0 replies; 151+ messages in thread
From: H.J. Lu @ 2023-06-22  3:23 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	szabolcs.nagy, david, kirill.shutemov, Schimpe, Christina, Yang,
	Weijiang, peterz, corbet, nd, broonie, jannh, linux-kernel,
	debug, pavel, bp, rdunlap, linux-api, rppt, jamorris, arnd,
	john.allen, bsingharora, mike.kravetz, dethoma, andrew.cooper3,
	oleg, keescook, x86, gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo,
	linux-mm, Syromiatnikov, Eugene, linux-doc, Torvalds, Linus,
	dave.hansen, akpm, Eranian, Stephane

On Wed, Jun 21, 2023 at 6:07 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2023-06-21 at 16:15 -0700, Rick Edgecombe wrote:
> > On Wed, 2023-06-21 at 16:05 -0700, H.J. Lu wrote:
> > > > Which makes me think if we did want to make a more compatible
> > > > longjmp()
> > > > a better the way to do it might be an arch_prctl that emits a
> > > > token
> > > > at
> > > > the current SSP. This would be loosening up the security somewhat
> > > > (have
> > > > to be an opt-in), but less so then enabling WRSS. But it would
> > > > also
> > > > be
> > > > way simpler, work for all cases (I think), and be faster (maybe?)
> > > > than
> > > > INCSSPing through a bunch of stacks.
> > >
> > > Since longjmp isn't required to be called after setjmp, leaving a
> > > restore
> > > token doesn't work when longjmp isn't called.
> >
> > Oh good point. Hmm.
>
> Just had a quick chat with HJ on this. It seems like it *might* be able
> to made to work. How it would go is setjmp() could act as a wrapper by
> calling it's own return address (the function that called setjmp()).
> This would mean in the case of longjmp() not being called, control flow
> would return through setjmp() before returning from the calling method.

It may not work since we can't tell if RAX (return value) is set by longjmp
or function return.

> This would allow libc to do a RSTORSSP when returning though setjmp()
> in the non-shadow stack case, and essentially skip over the kernel
> placed restore token, and then return from setjmp() like normal. In the
> case of longjmp() being called, it could RSTORSSP directly to the
> token, and then return from setjmp().
>
> Another option could be getting the compilers help to do the RSTORSSP
> in the case of longjmp() not being called. Apparently compilers are
> aware of setjmp() and already do special things around it (makes sense
> I guess, but news to me).
>
> And also, this all would actually work with IBT, because the compiler
> knows already to add an endbr at that point right after setjmp().
>
> I think neither of us were ready to bet on it, but thought maybe it
> could work. And even if it works it's much more complicated than I
> first thought, so I don't like it as much. It's also unclear what a
> change like that would mean for security.
>
> As for unwinding through the existing swapcontext() placed restore
> tokens, the problem was as assumed - that it's difficult to find them.
> Even considering brute force options like doing manual searches for a
> nearby token to use turned up edge cases pretty quick. So I think that
> kind of leaves us where we were originally, with no known solutions
> that would require breaking kernel ABI changes.
>
>
> Are you interested in helping get longjmp() from a ucontext stack
> working for shadow stack? One other thing that came up in the
> conversation was that while it is known that some apps are doing this,
> there are no tests for mixing longjmp and ucontext in glibc. So we may
> not know which combinations of mixing them together even work in the
> non-shadow stack case.
>
> It could be useful to add some tests for this to glibc and we could get
> some clarity on what behaviors shadow stack would actually need to
> support.



-- 
H.J.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-21 23:02                               ` H.J. Lu
@ 2023-06-22  7:40                                 ` szabolcs.nagy
  2023-06-22 16:46                                   ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-22  7:40 UTC (permalink / raw)
  To: H.J. Lu, Edgecombe, Rick P
  Cc: broonie, Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy,
	nadav.amit, kirill.shutemov, david, Schimpe, Christina,
	linux-doc, peterz, corbet, nd, dethoma, jannh, linux-kernel,
	debug, pavel, bp, mike.kravetz, linux-api, rppt, jamorris, arnd,
	john.allen, rdunlap, bsingharora, oleg, andrew.cooper3, keescook,
	x86, gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo, linux-mm,
	Syromiatnikov, Eugene, Torvalds, Linus, akpm, dave.hansen, Yang,
	Weijiang, Eranian, Stephane

The 06/21/2023 16:02, H.J. Lu wrote:
> On Wed, Jun 21, 2023 at 11:54 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > HJ, can you link your patch that makes it extensible and we can clear
> > this up? Maybe the issue extends beyond fnon-call-exceptions, but that
> > is where I reproduced it.
> 
> Here is the patch:
> 
> https://gitlab.com/x86-gcc/gcc/-/commit/aab4c24b67b5f05b72e52a3eaae005c2277710b9

ok i don't see how this is related to fnon-call-exceptions..

but it is wrong if the shstk is ever discontinuous even though the
shstk format would allow that. this was my original point in this
thread: with the current unwinder code you cannot add altshstk later.
and no, introducing new binary markings and rebuilding the world just
for altshstk is not OK. so we have to decide now whether we want alt
shstk in the future or not, i gave reasons why we would want it.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-21 22:22                               ` Edgecombe, Rick P
  2023-06-21 23:05                                 ` H.J. Lu
@ 2023-06-22  8:27                                 ` szabolcs.nagy
  2023-06-22 16:47                                   ` Edgecombe, Rick P
  1 sibling, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-22  8:27 UTC (permalink / raw)
  To: Edgecombe, Rick P, broonie
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, linux-doc, peterz,
	corbet, nd, dethoma, jannh, linux-kernel, debug, pavel, bp,
	mike.kravetz, linux-api, rppt, jamorris, arnd, john.allen,
	rdunlap, bsingharora, oleg, andrew.cooper3, keescook, x86,
	gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, Torvalds, Linus, akpm, dave.hansen, Yang,
	Weijiang, Eranian, Stephane

The 06/21/2023 22:22, Edgecombe, Rick P wrote:
> On Wed, 2023-06-21 at 11:54 -0700, Rick Edgecombe wrote:
> > > > > > > > there is no magic, longjmp should be implemented as:
> > > > > > > > 
> > > > > > > >         target_ssp = read from jmpbuf;
> > > > > > > >         current_ssp = read ssp;
> > > > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > > > >                 if (*p == restore-token) {
> > > > > > > >                         // target_ssp is on a different
> > > > > > > > shstk.
> > > > > > > >                         switch_shstk_to(p);
> > > > > > > >                         break;
> > > > > > > >                 }
> > > > > > > >         }
> > > > > > > >         for (; p != target_ssp; p++)
> > > > > > > >                 // ssp is now on the same shstk as
> > > > > > > > target.
> > > > > > > >                 inc_ssp();
> > > > > > > > 
> > > > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > > > same:
> > > > > > > > for programs that always longjmp within the same shstk
> > > > > > > > the
> > > > > > > > first
> > > > > > > > loop is just p = current_ssp, but it also works when
> > > > > > > > longjmp
> > > > > > > > target is on a different shstk assuming nothing is
> > > > > > > > running
> > > > > > > > on
> > > > > > > > that shstk, which is only possible if there is a restore
> > > > > > > > token
> > > > > > > > on top.
> > > > > > > > 
> > > > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > > > it has
> > > > > > > > to add a restore-token on the switched away shstk.
> 
> Wait a second, the claim is that the kernel should add a restore token
> on the current shadow stack before handling a signal, to allow to
> unwind from an alt shadow stack, right? But in this series there is not
> an alt shadow stack, so signal will be handled on the current shadow
> stack. If the user stays on the current shadow stack, the existing
> simple INCSSP based solution will work.

yes.

> If the user swapcontext()'s away while handling a signal (which *is*
> currently supported) they will leave their own restore token on the old
> stack. Hypothetically glibc could unwind back through a series of
> ucontext stacks by pivoting, if it kept some metadata somewhere about
> where to restore to. So there are actually already enough tokens to
> make it back in this case, glibc just doesn't do this.

swapcontext is currently *not* supported: for it to work you have to
be able to jump *back* into the signal handler, which does not work if
the swapcontext target is on the original thread stack (instead of
say a makecontext stack).

jumping back can only be supported if alt stack can be paired with
an alt shadow stack.

unwinding across a series of signal interrupts should work even
with discontinuous shstk. libgcc does not implement this, which is
a problem i think.

> But how does the proposed token placed by the kernel on the original
> stack help this problem? The longjmp() would have to be able to find
> the location of the restore tokens somehow, which would not necessarily
> be near the setjmp() point. The signal token could even be on a
> different shadow stack.

i posted the exact longjmp code and it takes care of this case.

setjmp does not need to do anything special.

the invariant is that an shstk is either capped by a restore token
or in use by some executing task. this is guaranteed architecturally
(when shstk is switched with an instruction) and should be guaranteed
by the kernel too (when shstk is switched by the kernel).

> I'm also not sure leaving a token on signal doesn't weaken the security
> in its own way as well. Any thread could then swap to that token.
> Whereas the shadow stack signal frame ssp pointer can only be used
> from the shadow stack the signal was handled on.

as far as i'm concerned it is a valid programming model to switch
to a stack that is currently not in use and we should always allow
that. (signal handled on an alt stack may not return)

> So I think, in addition to blocking the shadow stack overflow use case
> in the future, leaving a token behind on signal will not really help
> longjmp(). (or at least I'm not following)

the restore token must only be added if shstk is switched
(currently it is not switched so don't add it, however if
we agree on this then the unwinder can be fixed accordingly.)

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-21 18:54                             ` Edgecombe, Rick P
  2023-06-21 22:22                               ` Edgecombe, Rick P
  2023-06-21 23:02                               ` H.J. Lu
@ 2023-06-22  9:18                               ` szabolcs.nagy
  2023-06-22 15:26                                 ` Andy Lutomirski
  2 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-22  9:18 UTC (permalink / raw)
  To: Edgecombe, Rick P, broonie
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, linux-doc, peterz,
	corbet, nd, dethoma, jannh, linux-kernel, debug, pavel, bp,
	mike.kravetz, linux-api, rppt, jamorris, arnd, john.allen,
	rdunlap, bsingharora, oleg, andrew.cooper3, keescook, x86,
	gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, Torvalds, Linus, akpm, dave.hansen, Yang,
	Weijiang, Eranian, Stephane

The 06/21/2023 18:54, Edgecombe, Rick P wrote:
> On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > I actually did a POC for this, but rejected it. The problem is,
> > > > > if
> > > > > there is a shadow stack overflow at that point then the kernel
> > > > > > > can't
> > > > > push the shadow stack token to the old stack. And shadow stack
> > > > > > > overflow
> > > > > is exactly the alt shadow stack use case. So it doesn't really
> > > > > > > solve
> > > > > the problem.
> > > 
> > > the restore token in the alt shstk case does not regress anything
> > > but
> > > makes some use-cases work.
> > > 
> > > alt shadow stack is important if code tries to jump in and out of
> > > signal handlers (dosemu does this with swapcontext) and for that a
> > > restore token is needed.
> > > 
> > > alt shadow stack is important if the original shstk did not
> > > overflow
> > > but the signal handler would overflow it (small thread stack, huge
> > > sigaltstack case).
> > > 
> > > alt shadow stack is also important for crash reporting on shstk
> > > overflow even if longjmp does not work then. longjmp to a
> > > makecontext
> > > stack would still work and longjmp back to the original stack can
> > > be
> > > made to mostly work by an altshstk option to overwrite the top
> > > entry
> > > with a restore token on overflow (this can break unwinding though).
> > > 
> 
> There was previously a request to create an alt shadow stack for the
> purpose of handling shadow stack overflow. So you are now suggesting
> to exclude that and instead target a different use case for alt shadow
> stack?

that is not what i said.

> But I'm not sure how much we should change the ABI at this point since
> we are constrained by existing userspace. If you read the history, we
> may end up needing to deprecate the whole elf bit for this and other
> reasons.

i'm not against deprecating the elf bit, but i think binary
marking will be difficult for this kind of feature no matter what
(code may be incompatible for complex runtime dependent reasons).

> So should we struggle to find a way to grow the existing ABI without
> disturbing the existing userspace? Or should we start with something,
> finally, and see where we need to grow and maybe get a chance at a
> fresh start to grow it?
> 
> Like, maybe 3 people will show up saying "hey, I *really* need to use
> shadow stack and longjmp from a ucontext stack", and no one says
> anything about shadow stack overflow. Then we know what to do. And
> maybe dosemu decides it doesn't need to implement shadow stack (highly
> likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> was created for dosemu, and the alt shadow stack patch adopted this
> behavior. So it's speculation that there is even a problem in that
> scenario.
> 
> Or maybe people just enable WRSS for longjmp() and directly jump back
> to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> cost of a little security?
> 
> Even if, with enough discussion, we could optimize for all
> hypotheticals without real user feedback, I don't see how it helps
> users to hold shadow stack. So I think we should move forward with the
> current ABI.

you may not get a second chance to fix a security feature.
it will be just disabled if it causes problems.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22  9:18                               ` szabolcs.nagy
@ 2023-06-22 15:26                                 ` Andy Lutomirski
  2023-06-22 16:42                                   ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Andy Lutomirski @ 2023-06-22 15:26 UTC (permalink / raw)
  To: szabolcs.nagy
  Cc: Edgecombe, Rick P, broonie, Xu, Pengfei, tglx, linux-arch, kcc,
	Lutomirski, Andy, nadav.amit, kirill.shutemov, david, Schimpe,
	Christina, linux-doc, peterz, corbet, nd, dethoma, jannh,
	linux-kernel, debug, pavel, bp, mike.kravetz, linux-api, rppt,
	jamorris, arnd, john.allen, rdunlap, bsingharora, oleg,
	andrew.cooper3, keescook, x86, gorcunov, Yu, Yu-cheng, fweimer,
	hpa, mingo, hjl.tools, linux-mm, Syromiatnikov, Eugene, Torvalds,
	Linus, akpm, dave.hansen, Yang, Weijiang, Eranian, Stephane

On Thu, Jun 22, 2023 at 2:28 AM szabolcs.nagy@arm.com
<szabolcs.nagy@arm.com> wrote:
>
> The 06/21/2023 18:54, Edgecombe, Rick P wrote:
> > On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > > I actually did a POC for this, but rejected it. The problem is,
> > > > > > if
> > > > > > there is a shadow stack overflow at that point then the kernel
> > > > > > > > can't
> > > > > > push the shadow stack token to the old stack. And shadow stack
> > > > > > > > overflow
> > > > > > is exactly the alt shadow stack use case. So it doesn't really
> > > > > > > > solve
> > > > > > the problem.
> > > >
> > > > the restore token in the alt shstk case does not regress anything
> > > > but
> > > > makes some use-cases work.
> > > >
> > > > alt shadow stack is important if code tries to jump in and out of
> > > > signal handlers (dosemu does this with swapcontext) and for that a
> > > > restore token is needed.
> > > >
> > > > alt shadow stack is important if the original shstk did not
> > > > overflow
> > > > but the signal handler would overflow it (small thread stack, huge
> > > > sigaltstack case).
> > > >
> > > > alt shadow stack is also important for crash reporting on shstk
> > > > overflow even if longjmp does not work then. longjmp to a
> > > > makecontext
> > > > stack would still work and longjmp back to the original stack can
> > > > be
> > > > made to mostly work by an altshstk option to overwrite the top
> > > > entry
> > > > with a restore token on overflow (this can break unwinding though).
> > > >
> >
> > There was previously a request to create an alt shadow stack for the
> > purpose of handling shadow stack overflow. So you are now suggesting to
> > to exclude that and instead target a different use case for alt shadow
> > stack?
>
> that is not what i said.
>
> > But I'm not sure how much we should change the ABI at this point since
> > we are constrained by existing userspace. If you read the history, we
> > may end up needing to deprecate the whole elf bit for this and other
> > reasons.
>
> i'm not against deprecating the elf bit, but i think binary
> marking will be difficult for this kind of feature no matter what
> (code may be incompatible for complex runtime dependent reasons).
>
> > So should we struggle to find a way to grow the existing ABI without
> > disturbing the existing userspace? Or should we start with something,
> > finally, and see where we need to grow and maybe get a chance at a
> > fresh start to grow it?
> >
> > Like, maybe 3 people will show up saying "hey, I *really* need to use
> > shadow stack and longjmp from a ucontext stack", and no one says
> > anything about shadow stack overflow. Then we know what to do. And
> > maybe dosemu decides it doesn't need to implement shadow stack (highly
> > likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> > was created for dosemu, and the alt shadow stack patch adopted this
> > behavior. So it's speculation that there is even a problem in that
> > scenario.
> >
> > Or maybe people just enable WRSS for longjmp() and directly jump back
> > to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> > cost of a little security?
> >
> > Even if, with enough discussion, we could optimize for all
> > hypotheticals without real user feedback, I don't see how it helps
> > users to hold shadow stack. So I think we should move forward with the
> > current ABI.
>
> you may not get a second chance to fix a security feature.
> it will be just disabled if it causes problems.

*I* would use altshadowstack.

I run a production system (that cares about correctness *and*
performance, but that's not really relevant here -- SHSTK ought to be
fast).  And, if it crashes, I want to know why.  So I handle SIGSEGV,
etc., so I have good logs if it crashes.  And I want those same logs if
I overflow the stack.

That being said, I have no need for longjmp or siglongjmp for this.  I
use exit(2) to escape.

For what it's worth, setjmp/longjmp is a bad API.  The actual pattern
that ought to work well (and that could be supported well by fancy
compilers and non-C languages, as I understand it) is more like a
function call that has two ways out.  Like this (pseudo-C):

void function(struct better_jmp_buf &buf, args...)
{
   ...
       if (condition)
          better_long_jump(buf);  // long jumps out!
       // could also pass buf to another function
   ...
       // could also return normally
}

better_call_with_jmp_buf(function, args);

*This* could support altshadowstack just fine.  And many users might
be okay with the understanding that, if altshadowstack is on, you have
to use a better long jump to get out (or a normal sigreturn or _exit).
No one is getting an altshadowstack signal handler without code
changes.

siglongjmp() could support altshadowstack with help from the kernel,
but we probably don't want to go there.

--Andy

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22 15:26                                 ` Andy Lutomirski
@ 2023-06-22 16:42                                   ` szabolcs.nagy
  2023-06-22 23:18                                     ` Edgecombe, Rick P
  2023-06-25 23:52                                     ` Andy Lutomirski
  0 siblings, 2 replies; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-22 16:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Edgecombe, Rick P, broonie, Xu, Pengfei, tglx, linux-arch, kcc,
	nadav.amit, kirill.shutemov, david, Schimpe, Christina,
	linux-doc, peterz, corbet, nd, dethoma, jannh, linux-kernel,
	debug, pavel, bp, mike.kravetz, linux-api, rppt, jamorris, arnd,
	john.allen, rdunlap, bsingharora, oleg, andrew.cooper3, keescook,
	x86, gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo, hjl.tools,
	linux-mm, Syromiatnikov, Eugene, Torvalds, Linus, akpm,
	dave.hansen, Yang, Weijiang, Eranian, Stephane

The 06/22/2023 08:26, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023 at 2:28 AM szabolcs.nagy@arm.com
> <szabolcs.nagy@arm.com> wrote:
> >
> > The 06/21/2023 18:54, Edgecombe, Rick P wrote:
> > > On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > > > I actually did a POC for this, but rejected it. The problem is,
> > > > > > > if
> > > > > > > there is a shadow stack overflow at that point then the kernel
> > > > > > > > > can't
> > > > > > > push the shadow stack token to the old stack. And shadow stack
> > > > > > > > > overflow
> > > > > > > is exactly the alt shadow stack use case. So it doesn't really
> > > > > > > > > solve
> > > > > > > the problem.
> > > > >
> > > > > the restore token in the alt shstk case does not regress anything
> > > > > but
> > > > > makes some use-cases work.
> > > > >
> > > > > alt shadow stack is important if code tries to jump in and out of
> > > > > signal handlers (dosemu does this with swapcontext) and for that a
> > > > > restore token is needed.
> > > > >
> > > > > alt shadow stack is important if the original shstk did not
> > > > > overflow
> > > > > but the signal handler would overflow it (small thread stack, huge
> > > > > sigaltstack case).
> > > > >
> > > > > alt shadow stack is also important for crash reporting on shstk
> > > > > overflow even if longjmp does not work then. longjmp to a
> > > > > makecontext
> > > > > stack would still work and longjmp back to the original stack can
> > > > > be
> > > > > made to mostly work by an altshstk option to overwrite the top
> > > > > entry
> > > > > with a restore token on overflow (this can break unwinding though).
> > > > >
> > >
> > > There was previously a request to create an alt shadow stack for the
> > > purpose of handling shadow stack overflow. So you are now suggesting to
> > > to exclude that and instead target a different use case for alt shadow
> > > stack?
> >
> > that is not what i said.
> >
> > > But I'm not sure how much we should change the ABI at this point since
> > > we are constrained by existing userspace. If you read the history, we
> > > may end up needing to deprecate the whole elf bit for this and other
> > > reasons.
> >
> > i'm not against deprecating the elf bit, but i think binary
> > marking will be difficult for this kind of feature no matter what
> > (code may be incompatible for complex runtime dependent reasons).
> >
> > > So should we struggle to find a way to grow the existing ABI without
> > > disturbing the existing userspace? Or should we start with something,
> > > finally, and see where we need to grow and maybe get a chance at a
> > > fresh start to grow it?
> > >
> > > Like, maybe 3 people will show up saying "hey, I *really* need to use
> > > shadow stack and longjmp from a ucontext stack", and no one says
> > > anything about shadow stack overflow. Then we know what to do. And
> > > maybe dosemu decides it doesn't need to implement shadow stack (highly
> > > likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> > > was created for dosemu, and the alt shadow stack patch adopted this
> > > behavior. So it's speculation that there is even a problem in that
> > > scenario.
> > >
> > > Or maybe people just enable WRSS for longjmp() and directly jump back
> > > to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> > > cost of a little security?
> > >
> > > Even if, with enough discussion, we could optimize for all
> > > hypotheticals without real user feedback, I don't see how it helps
> > > users to hold shadow stack. So I think we should move forward with the
> > > current ABI.
> >
> > you may not get a second chance to fix a security feature.
> > it will be just disabled if it causes problems.
> 
> *I* would use altshadowstack.
> 
> I run a production system (that cares about correctness *and*
> performance, but that's not really relevant here -- SHSTK ought to be
> fast).  And, if it crashes, I want to know why.  So I handle SIGSEGV,
> etc so I have good logs if it crashes.  And I want those same logs if
> I overflow the stack.
> 
> That being said, I have no need for longjmp or siglongjmp for this.  I
> use exit(2) to escape.

the same crash handler that prints a log on shstk overflow should
work when a different cause of SIGSEGV is recoverable via longjmp.
to me this means that alt shstk must work with longjmp, at least in
the non-shstk-overflow case (the overflow case itself can be declared
non-recoverable).

> For what it's worth, setjmp/longjmp is a bad API.  The actual pattern
> that ought to work well (and that could be supported well by fancy
> compilers and non-C languages, as I understand it) is more like a
> function call that has two ways out.  Like this (pseudo-C):
> 
> void function(struct better_jmp_buf &buf, args...)
> {
>    ...
>        if (condition)
>           better_long_jump(buf);  // long jumps out!
>        // could also pass buf to another function
>    ...
>        // could also return normally
> }
> 
> better_call_with_jmp_buf(function, args);
> 
> *This* could support altshadowstack just fine.  And many users might
> be okay with the understanding that, if altshadowstack is on, you have
> to use a better long jump to get out (or a normal sigreturn or _exit).

i don't understand why this would work fine when longjmp does not.
how does the shstk switch happen?

> No one is getting an altshadowstack signal handler without code
> changes.

assuming the same component is doing the alt shstk setup as the
longjmp.

> siglongjmp() could support altshadowstack with help from the kernel,
> but we probably don't want to go there.

what kind of help? maybe we need that help..

e.g. if the signal frame token is detected by longjmp on
the shstk, then doing an rt_sigreturn with the right signal
frame context allows longjmp to continue unwinding the shstk.
however, the kernel sigcontext layout can change and userspace
may not know it, so longjmp needs a helper, but only in the
jump-across-signal-frame case.

(this is a different design than what i proposed earlier,
it also makes longjmp from alt shstk work without wrss,
the downside is that longjmp across makecontext needs a
separate solution then which implies that all shstk needs
a detectable token at the end of the shstk.. so again
something that we have to get right now and cannot add
later.)

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22  7:40                                 ` szabolcs.nagy
@ 2023-06-22 16:46                                   ` Edgecombe, Rick P
  2023-06-26 14:08                                     ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-22 16:46 UTC (permalink / raw)
  To: szabolcs.nagy, hjl.tools
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Yang, Weijiang,
	peterz, corbet, nd, dethoma, jannh, linux-kernel, debug,
	mike.kravetz, bp, rdunlap, linux-api, rppt, jamorris, pavel,
	john.allen, bsingharora, x86, broonie, andrew.cooper3, oleg,
	keescook, gorcunov, arnd, Yu, Yu-cheng, fweimer, hpa, mingo,
	linux-mm, Syromiatnikov, Eugene, linux-doc, Torvalds, Linus,
	dave.hansen, akpm, Eranian, Stephane

On Thu, 2023-06-22 at 08:40 +0100, szabolcs.nagy@arm.com wrote:
> The 06/21/2023 16:02, H.J. Lu wrote:
> > On Wed, Jun 21, 2023 at 11:54 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > > HJ, can you link your patch that makes it extensible and we can
> > > clear
> > > this up? Maybe the issue extends beyond fnon-call-exceptions, but
> > > that
> > > is where I reproduced it.
> > 
> > Here is the patch:
> > 
> > https://gitlab.com/x86-gcc/gcc/-/commit/aab4c24b67b5f05b72e52a3eaae005c2277710b9
> 
> ok i don't see how this is related to fnon-call-exceptions..

I don't know what to tell you. I'm not a compiler expert, but a simple
fnon-call-exceptions test was how I reproduced it. If there is some
other use case for throwing an exception out of a signal handler, I
don't see how it makes fnon-call-exceptions unrelated.

> 
> but it is wrong if the shstk is ever discontinuous even though the
> shstk format would allow that. this was my original point in this
> thread: with current unwinder code you cannot add altshstk later.
> and no, introducing new binary markings and rebuild the world just
> for altshstk is not OK. so we have to decide now if we want alt
> shstk in the future or not, i gave reasons why we would want it.

The point was that the old GCCs restrict expanding the shadow stack
signal frame, which is a prerequisite to supporting alt shadow stacks.

So the existing userspace prevents us from supporting regular alt
shadow stack, before we even get to fancy unwinding alt shadow stack.
Some kind of ABI opt-in will likely be required. Maybe a new ELF bit,
at which point we can take advantage of what we learned.

You previously said:

On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> as far as i can tell the current unwinder handles shstk unwinding
> correctly across signal handlers (sync or async and
> cleanup/exceptions
> handlers too), i see no issue with "fixed shadow stack signal frame
> size of 8 bytes" other than future extensions and discontinuous shstk.

I took that to mean that you didn't see how the existing unwinder
prevented alt shadow stacks. Hopefully we're all on the same page now.

BTW, when alt shadow stacks were POCed, I hadn't encountered this GCC
behavior yet. So it was assumed it could be bolted on later without
disturbing anything. If Linus or someone wants to say we're ok with
breaking these old GCCs in this way, the first thing I would do would
be to pad the shadow stack signal frame with room for alt shadow stack
and more. I actually have a patch to do this, but alas we are already
pushing it regression-wise.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22  8:27                                 ` szabolcs.nagy
@ 2023-06-22 16:47                                   ` Edgecombe, Rick P
  2023-06-23 16:25                                     ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-22 16:47 UTC (permalink / raw)
  To: broonie, szabolcs.nagy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Yang, Weijiang,
	peterz, corbet, nd, dethoma, jannh, linux-kernel, debug,
	mike.kravetz, bp, x86, linux-api, rppt, jamorris, pavel,
	john.allen, bsingharora, andrew.cooper3, oleg, keescook,
	gorcunov, arnd, Yu, Yu-cheng, rdunlap, fweimer, hpa, mingo,
	hjl.tools, linux-mm, Syromiatnikov, Eugene, linux-doc, Torvalds,
	Linus, dave.hansen, akpm, Eranian, Stephane

On Thu, 2023-06-22 at 09:27 +0100, szabolcs.nagy@arm.com wrote:


[ snip ]

> swapcontext is currently *not* supported: for it to work you have to
> be able to jump *back* into the signal handler, which does not work
> if
> the swapcontext target is on the original thread stack (instead of
> say a makecontext stack).
> 
> jumping back can only be supported if alt stack can be paired with
> an alt shadow stack.
> 
> unwinding across a series of signal interrupts should work even
> with discontinuous shstk. libgcc does not implement this which is
> a problem i think.

It would be helpful if you could enumerate the cases you think are
important to support. I would like to see how we could support them in
the future in some mode.

> 
> > But how does the proposed token placed by the kernel on the
> > original
> > stack help this problem? The longjmp() would have to be able to
> > find
> > the location of the restore tokens somehow, which would not
> > necessarily
> > be near the setjmp() point. The signal token could even be on a
> > different shadow stack.
> 
> i posted the exact longjmp code and it takes care of this case.

I see how it works for the simple case of longjmp() from an alt shadow
stack. I would still prefer a different solution that works with the
overflow case. (probably new kernel functionality)

But I don't see how it works for unwinding off of a ucontext stack. Or
unwinding off of a ucontext stack that was swapped to from an alt
shadow stack.

> 
> setjmp does not need to do anything special.
> 
> the invariant is that an shstk is either capped by a restore token
> or in use by some executing task. this is guaranteed architecturally
> (when shstk is switched with an instruction) and should be guaranteed
> by the kernel too (when shstk is switched by the kernel).
> 
> > I'm also not sure leaving a token on signal doesn't weaken the
> > security
> in its own way as well. Any thread could then swap to that token.
> > Where as the shadow stack signal frame ssp pointer can only be used
> > from the shadow stack the signal was handled on.
> 
> as far as i'm concerned it is a valid programming model to switch
> to a stack that is currently not in use and we should always allow
> that. (signal handled on an alt stack may not return)

Some people just want shadow stack as, in effect, a super stack canary,
and want everything to "just work". Some people want to lock things down
as much as possible and change their code to do it if needed.

It sure is a challenge to figure out where the happy medium is. Ideally
there would be several modes so all the users could be happy, but I
wouldn't want to start with that for multiple reasons. Although we do
have WRSS today, which can support pretty much everything
functionality-wise.

But in any case, we have limited room for movement. I actually had some
other ABI tweaks fully ready to post around the tracing usages, but I
just thought given the situation, it was better to start with what we
have. This project could really use a time machine...

> 
> > So I think, in addition to blocking the shadow stack overflow use
> > case
> > in the future, leaving a token behind on signal will not really
> > help
> > longjmp(). (or at least I'm not following)
> 
> the restore token must only be added if shstk is switched
> (currently it is not switched so don't add it, however if
> we agree on this then the unwinder can be fixed accordingly.)

I do agree that a token should not be added when the stack is not
switched, as today. But I don't agree on this solution for alt shadow
stacks. Again, I actually built that exact design in actual code, so
it's not a NIH thing. It just doesn't work for valid use cases.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 16/42] mm: Add guard pages around a shadow stack.
  2023-06-13  0:10 ` [PATCH v9 16/42] mm: Add guard pages around a shadow stack Rick Edgecombe
  2023-06-14 23:34   ` Mark Brown
@ 2023-06-22 18:21   ` Matthew Wilcox
  2023-06-22 18:27     ` Edgecombe, Rick P
  1 sibling, 1 reply; 151+ messages in thread
From: Matthew Wilcox @ 2023-06-22 18:21 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, broonie, Yu-cheng Yu,
	Pengfei Xu

On Mon, Jun 12, 2023 at 05:10:42PM -0700, Rick Edgecombe wrote:
> +++ b/include/linux/mm.h
> @@ -342,7 +342,36 @@ extern unsigned int kobjsize(const void *objp);
>  #endif /* CONFIG_ARCH_HAS_PKEYS */
>  
>  #ifdef CONFIG_X86_USER_SHADOW_STACK
> -# define VM_SHADOW_STACK	VM_HIGH_ARCH_5 /* Should not be set with VM_SHARED */
> +/*
> + * This flag should not be set with VM_SHARED because of lack of support
> + * in core mm. It will also get a guard page. This helps userspace protect
> + * itself from attacks. The reasoning is as follows:
> + *
> + * The shadow stack pointer (SSP) is moved by CALL, RET, and INCSSPQ. The
> + * INCSSP instruction can increment the shadow stack pointer. It is the
> + * shadow stack analog of an instruction like:
> + *
> + *   addq $0x80, %rsp
> + *
> + * However, there is one important difference between an ADD on %rsp
> + * and INCSSP. In addition to modifying SSP, INCSSP also reads from the
> + * memory of the first and last elements that were "popped". It can be
> + * thought of as acting like this:
> + *
> + * READ_ONCE(ssp);       // read+discard top element on stack
> + * ssp += nr_to_pop * 8; // move the shadow stack
> + * READ_ONCE(ssp-8);     // read+discard last popped stack element
> + *
> + * The maximum distance INCSSP can move the SSP is 2040 bytes, before
> + * it would read the memory. Therefore a single page gap will be enough
> + * to prevent any operation from shifting the SSP to an adjacent stack,
> + * since it would have to land in the gap at least once, causing a
> + * fault.
> + *
> + * Prevent using INCSSP to move the SSP between shadow stacks by
> + * having a PAGE_SIZE guard gap.
> + */
> +# define VM_SHADOW_STACK	VM_HIGH_ARCH_5
>  #else
>  # define VM_SHADOW_STACK	VM_NONE
>  #endif

This is a lot of very x86-specific language in a generic header file.
I'm sure there's a better place for all this text.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 16/42] mm: Add guard pages around a shadow stack.
  2023-06-22 18:21   ` Matthew Wilcox
@ 2023-06-22 18:27     ` Edgecombe, Rick P
  2023-06-23  7:40       ` Mike Rapoport
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-22 18:27 UTC (permalink / raw)
  To: willy
  Cc: akpm, Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy,
	nadav.amit, kirill.shutemov, david, Schimpe, Christina, Torvalds,
	Linus, peterz, corbet, linux-kernel, jannh, dethoma, broonie,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, rppt, bsingharora, x86, oleg, fweimer, keescook,
	gorcunov, Yu, Yu-cheng, andrew.cooper3, hpa, mingo,
	szabolcs.nagy, hjl.tools, debug, linux-mm, Syromiatnikov, Eugene,
	Yang, Weijiang, linux-doc, dave.hansen, Eranian, Stephane

On Thu, 2023-06-22 at 19:21 +0100, Matthew Wilcox wrote:
> On Mon, Jun 12, 2023 at 05:10:42PM -0700, Rick Edgecombe wrote:
> > +++ b/include/linux/mm.h
> > @@ -342,7 +342,36 @@ extern unsigned int kobjsize(const void
> > *objp);
> >   #endif /* CONFIG_ARCH_HAS_PKEYS */
> >   
> >   #ifdef CONFIG_X86_USER_SHADOW_STACK
> > -# define VM_SHADOW_STACK       VM_HIGH_ARCH_5 /* Should not be set
> > with VM_SHARED */
> > +/*
> > + * This flag should not be set with VM_SHARED because of lack of
> > support
> > + * core mm. It will also get a guard page. This helps userspace
> > protect
> > + * itself from attacks. The reasoning is as follows:
> > + *
> > + * The shadow stack pointer(SSP) is moved by CALL, RET, and
> > INCSSPQ. The
> > + * INCSSP instruction can increment the shadow stack pointer. It
> > is the
> > + * shadow stack analog of an instruction like:
> > + *
> > + *   addq $0x80, %rsp
> > + *
> > + * However, there is one important difference between an ADD on
> > %rsp
> > + * and INCSSP. In addition to modifying SSP, INCSSP also reads
> > from the
> > + * memory of the first and last elements that were "popped". It
> > can be
> > + * thought of as acting like this:
> > + *
> > + * READ_ONCE(ssp);       // read+discard top element on stack
> > + * ssp += nr_to_pop * 8; // move the shadow stack
> > + * READ_ONCE(ssp-8);     // read+discard last popped stack element
> > + *
> > + * The maximum distance INCSSP can move the SSP is 2040 bytes,
> > before
> > + * it would read the memory. Therefore a single page gap will be
> > enough
> > + * to prevent any operation from shifting the SSP to an adjacent
> > stack,
> > + * since it would have to land in the gap at least once, causing a
> > + * fault.
> > + *
> > + * Prevent using INCSSP to move the SSP between shadow stacks by
> > + * having a PAGE_SIZE guard gap.
> > + */
> > +# define VM_SHADOW_STACK       VM_HIGH_ARCH_5
> >   #else
> >   # define VM_SHADOW_STACK      VM_NONE
> >   #endif
> 
> This is a lot of very x86-specific language in a generic header file.
> I'm sure there's a better place for all this text.

Yes, I couldn't find another place for it. This was the reasoning:
https://lore.kernel.org/lkml/07deaffc10b1b68721bbbce370e145d8fec2a494.camel@intel.com/

Did you have any particular place in mind?

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22 16:42                                   ` szabolcs.nagy
@ 2023-06-22 23:18                                     ` Edgecombe, Rick P
  2023-06-29 16:07                                       ` szabolcs.nagy
  2023-06-25 23:52                                     ` Andy Lutomirski
  1 sibling, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-22 23:18 UTC (permalink / raw)
  To: szabolcs.nagy, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, Yang, Weijiang, peterz, corbet, nd,
	broonie, dethoma, linux-kernel, x86, debug, bp, rdunlap,
	linux-api, rppt, jamorris, pavel, john.allen, bsingharora,
	mike.kravetz, jannh, andrew.cooper3, oleg, keescook, gorcunov,
	arnd, Yu, Yu-cheng, fweimer, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, linux-doc, Torvalds, Linus, dave.hansen,
	akpm, Eranian, Stephane

On Thu, 2023-06-22 at 17:42 +0100, szabolcs.nagy@arm.com wrote:
> the downside is that longjmp across makecontext needs a
> separate solution then which implies that all shstk needs
> a detectable token at the end of the shstk.. so again
> something that we have to get right now and cannot add
> later.)

This sounds like some scheme to search for a token on another stack,
which if so, you haven't elaborated on.

I'm not going to be able to contribute on this thread much over the
next week, but if you think you know to solve problems which have
remained unsolved for years, please spell out the solutions.

I'd also appreciate if you could spell out exactly which:
 - ucontext
 - signal
 - longjmp
 - custom library stack switching

patterns you think shadow stack should support working together.
Because even after all these mails, I'm still not sure exactly what you
are trying to achieve.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 16/42] mm: Add guard pages around a shadow stack.
  2023-06-22 18:27     ` Edgecombe, Rick P
@ 2023-06-23  7:40       ` Mike Rapoport
  2023-06-23 12:17         ` Mark Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Mike Rapoport @ 2023-06-23  7:40 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: willy, akpm, Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski,
	Andy, nadav.amit, kirill.shutemov, david, Schimpe, Christina,
	Torvalds, Linus, peterz, corbet, linux-kernel, jannh, dethoma,
	broonie, mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen,
	arnd, jamorris, bsingharora, x86, oleg, fweimer, keescook,
	gorcunov, Yu, Yu-cheng, andrew.cooper3, hpa, mingo,
	szabolcs.nagy, hjl.tools, debug, linux-mm, Syromiatnikov, Eugene,
	Yang, Weijiang, linux-doc, dave.hansen, Eranian, Stephane

On Thu, Jun 22, 2023 at 06:27:40PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2023-06-22 at 19:21 +0100, Matthew Wilcox wrote:
> > On Mon, Jun 12, 2023 at 05:10:42PM -0700, Rick Edgecombe wrote:
> > > +++ b/include/linux/mm.h
> > > @@ -342,7 +342,36 @@ extern unsigned int kobjsize(const void
> > > *objp);
> > >   #endif /* CONFIG_ARCH_HAS_PKEYS */
> > >   
> > >   #ifdef CONFIG_X86_USER_SHADOW_STACK
> > > -# define VM_SHADOW_STACK       VM_HIGH_ARCH_5 /* Should not be set
> > > with VM_SHARED */
> > > +/*
> > > + * This flag should not be set with VM_SHARED because of lack of
> > > support
> > > + * core mm. It will also get a guard page. This helps userspace
> > > protect
> > > + * itself from attacks. The reasoning is as follows:
> > > + *
> > > + * The shadow stack pointer(SSP) is moved by CALL, RET, and
> > > INCSSPQ. The
> > > + * INCSSP instruction can increment the shadow stack pointer. It
> > > is the
> > > + * shadow stack analog of an instruction like:
> > > + *
> > > + *   addq $0x80, %rsp
> > > + *
> > > + * However, there is one important difference between an ADD on
> > > %rsp
> > > + * and INCSSP. In addition to modifying SSP, INCSSP also reads
> > > from the
> > > + * memory of the first and last elements that were "popped". It
> > > can be
> > > + * thought of as acting like this:
> > > + *
> > > + * READ_ONCE(ssp);       // read+discard top element on stack
> > > + * ssp += nr_to_pop * 8; // move the shadow stack
> > > + * READ_ONCE(ssp-8);     // read+discard last popped stack element
> > > + *
> > > + * The maximum distance INCSSP can move the SSP is 2040 bytes,
> > > before
> > > + * it would read the memory. Therefore a single page gap will be
> > > enough
> > > + * to prevent any operation from shifting the SSP to an adjacent
> > > stack,
> > > + * since it would have to land in the gap at least once, causing a
> > > + * fault.
> > > + *
> > > + * Prevent using INCSSP to move the SSP between shadow stacks by
> > > + * having a PAGE_SIZE guard gap.
> > > + */
> > > +# define VM_SHADOW_STACK       VM_HIGH_ARCH_5
> > >   #else
> > >   # define VM_SHADOW_STACK      VM_NONE
> > >   #endif
> > 
> > This is a lot of very x86-specific language in a generic header file.
> > I'm sure there's a better place for all this text.
> 
> Yes, I couldn't find another place for it. This was the reasoning:
> https://lore.kernel.org/lkml/07deaffc10b1b68721bbbce370e145d8fec2a494.camel@intel.com/
> 
> Did you have any particular place in mind?

Since it's near CONFIG_X86_USER_SHADOW_STACK the comment in mm.h could be 

/*
 * VMA is used for shadow stack and implies guard pages.
 * See arch/x86/kernel/shstk.c for details
 */

and the long reasoning comment can be moved near alloc_shstk in
arch/x86/kernel/shstk.c

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 16/42] mm: Add guard pages around a shadow stack.
  2023-06-23  7:40       ` Mike Rapoport
@ 2023-06-23 12:17         ` Mark Brown
  2023-06-25 16:44           ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-06-23 12:17 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Edgecombe, Rick P, willy, akpm, Xu, Pengfei, tglx, linux-arch,
	kcc, Lutomirski, Andy, nadav.amit, kirill.shutemov, david,
	Schimpe, Christina, Torvalds, Linus, peterz, corbet,
	linux-kernel, jannh, dethoma, mike.kravetz, pavel, bp, rdunlap,
	linux-api, john.allen, arnd, jamorris, bsingharora, x86, oleg,
	fweimer, keescook, gorcunov, Yu, Yu-cheng, andrew.cooper3, hpa,
	mingo, szabolcs.nagy, hjl.tools, debug, linux-mm, Syromiatnikov,
	Eugene, Yang, Weijiang, linux-doc, dave.hansen, Eranian,
	Stephane

[-- Attachment #1: Type: text/plain, Size: 1318 bytes --]

On Fri, Jun 23, 2023 at 10:40:00AM +0300, Mike Rapoport wrote:
> On Thu, Jun 22, 2023 at 06:27:40PM +0000, Edgecombe, Rick P wrote:

> > Yes, I couldn't find another place for it. This was the reasoning:
> > https://lore.kernel.org/lkml/07deaffc10b1b68721bbbce370e145d8fec2a494.camel@intel.com/

> > Did you have any particular place in mind?

> Since it's near CONFIG_X86_USER_SHADOW_STACK the comment in mm.h could be 

> /*
>  * VMA is used for shadow stack and implies guard pages.
>  * See arch/x86/kernel/shstk.c for details
>  */

> and the long reasoning comment can be moved near alloc_shstk in
> arch/x86/kernel/shstk.h

This isn't an x86 specific concept, arm64 has a very similar extension
called Guarded Control Stack (which I should be publishing changes for
in the not too distant future) and riscv also has something.  For arm64
I'm using the generic mm changes wholesale, we have a similar need for
guard pages around the GCS and while the mechanics of accessing are
different the requirement ends up being the same.  Perhaps we could just
rewrite the comment to say that guard pages prevent over/underflow of
the stack by userspace and that a single page is sufficient for all
current architectures, with the details of the working for x86 put in
some x86 specific place?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22 16:47                                   ` Edgecombe, Rick P
@ 2023-06-23 16:25                                     ` szabolcs.nagy
  2023-06-25 18:48                                       ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-23 16:25 UTC (permalink / raw)
  To: Edgecombe, Rick P, broonie
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Yang, Weijiang,
	peterz, corbet, nd, dethoma, jannh, linux-kernel, debug,
	mike.kravetz, bp, x86, linux-api, rppt, jamorris, pavel,
	john.allen, bsingharora, andrew.cooper3, oleg, keescook,
	gorcunov, arnd, Yu, Yu-cheng, rdunlap, fweimer, hpa, mingo,
	hjl.tools, linux-mm, Syromiatnikov, Eugene, linux-doc, Torvalds,
	Linus, dave.hansen, akpm, Eranian, Stephane

The 06/22/2023 16:47, Edgecombe, Rick P wrote:
> On Thu, 2023-06-22 at 09:27 +0100, szabolcs.nagy@arm.com wrote:
> 
> 
> [ snip ]
> 
> > swapcontext is currently *not* supported: for it to work you have to
> > be able to jump *back* into the signal handler, which does not work
> > if
> > the swapcontext target is on the original thread stack (instead of
> > say a makecontext stack).
> > 
> > jumping back can only be supported if alt stack can be paired with
> > an alt shadow stack.
> > 
> > unwinding across a series of signal interrupts should work even
> > with discontinuous shstk. libgcc does not implement this which is
> > a problem i think.
> 
> I would be helpful if you could enumerate the cases you think are
> important to support. I would like to see how we could support them in
> the future in some mode.
> 
> > 
> > > But how does the proposed token placed by the kernel on the
> > > original
> > > stack help this problem? The longjmp() would have to be able to
> > > find
> > > the location of the restore tokens somehow, which would not
> > > necessarily
> > > be near the setjmp() point. The signal token could even be on a
> > > different shadow stack.
> > 
> > i posted the exact longjmp code and it takes care of this case.
> 
> I see how it works for the simple case of longjmp() from an alt shadow
> stack. I would still prefer a different solution that works with the
> overflow case. (probably new kernel functionality)
> 
> But I don't see how it works for unwinding off of a ucontext stack. Or
> unwinding off of an ucontext stack that was swapped to from an alt
> shadow stack.

why?

a stack can be active or inactive (task executing on it or not),
and if inactive it can be live or dead (has stack frames that can
be jumped to or not).

this is independent of shadow stacks: longjmp is only valid if the
target is either the same active stack or an inactive live stack.
(there are cases that may seem to work, but fundamentally broken
and not supportable: e.g. two tasks executing on the same stack
where the one above happens to not clobber frames deep enough to
collide with the task below.)

the proposed longjmp design works for both cases. no assumption is
made about ucontext or signals other than the shadow stack for an
inactive live stack ends in a restore token, which is guaranteed by
the isa so we only need the kernel to do the same when it switches
shadow stacks. then longjmp works by construction.

the only wart is that an overflowed shadow stack is inactive dead
instead of inactive live because the token cannot be added. (note
that handling shstk overflow and avoiding shstk overflow during
signal handling still works with alt shstk!)

an alternative solution is to allow jump to inactive dead stack
if that's created by a signal interrupt. for that a syscall is
needed and longjmp has to detect if the target stack is dead or
live. (the kernel also has to be able to tell if switching to the
dead stack is valid for security reasons.) i don't know if this
is doable (if we allow some hacks it's doable).

unwinding across signal handlers is just a matter of having
enough information at the signal frame to continue, it does
not have to follow crazy jumps or weird coroutine things:
that does not work without shadow stacks either. but unwind
across alt stack frame should work.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 16/42] mm: Add guard pages around a shadow stack.
  2023-06-23 12:17         ` Mark Brown
@ 2023-06-25 16:44           ` Edgecombe, Rick P
  2023-06-26 12:45             ` Mark Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-25 16:44 UTC (permalink / raw)
  To: broonie, rppt
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, linux-doc, peterz,
	corbet, linux-kernel, dethoma, jannh, mike.kravetz, pavel, bp,
	rdunlap, linux-api, john.allen, jamorris, arnd, bsingharora, x86,
	oleg, fweimer, keescook, willy, gorcunov, Yu, Yu-cheng,
	andrew.cooper3, hpa, mingo, szabolcs.nagy, hjl.tools, debug,
	linux-mm, Syromiatnikov, Eugene, Torvalds, Linus, akpm,
	dave.hansen, Yang, Weijiang, Eranian, Stephane

On Fri, 2023-06-23 at 13:17 +0100, Mark Brown wrote:
> On Fri, Jun 23, 2023 at 10:40:00AM +0300, Mike Rapoport wrote:
> > On Thu, Jun 22, 2023 at 06:27:40PM +0000, Edgecombe, Rick P wrote:
> 
> > > Yes, I couldn't find another place for it. This was the
> > > reasoning:
> > > 
> https://lore.kernel.org/lkml/07deaffc10b1b68721bbbce370e145d8fec2a494.camel@intel.com/
> 
> > > Did you have any particular place in mind?
> 
> > Since it's near CONFIG_X86_USER_SHADOW_STACK the comment in mm.h
> could be 
> 
> > /*
> >   * VMA is used for shadow stack and implies guard pages.
> >   * See arch/x86/kernel/shstk.c for details
> >   */
> 
> > and the long reasoning comment can be moved near alloc_shstk in
> > arch/x86/kernel/shstk.h

Makes sense. Not sure why I didn't think of this earlier.

> 
> This isn't an x86 specific concept, arm64 has a very similar
> extension
> called Guarded Control Stack (which I should be publishing changes
> for
> in the not too distant future) and riscv also has something.  For
> arm64
> I'm using the generic mm changes wholesale, we have a similar need
> for
> guard pages around the GCS and while the mechanics of accessing are
> different the requirement ends up being the same.  Perhaps we could
> just
> rewrite the comment to say that guard pages prevent over/underflow of
> the stack by userspace and that a single page is sufficient for all
> current architectures, with the details of the working for x86 put in
> some x86 specific place?

Something sort of similar came up in regards to the riscv series, about
adding something like an is_shadow_stack_vma() helper. The plan was to
not make too many assumptions about the final details of the other
shadow stack features and leave that for refactoring. I think some kind
of generic comment like you suggest makes sense, but I don't want to
try to assert any arch specifics for features that are not upstream. It
should be very easy to tweak the comment when the time comes.

The points about x86 details not belonging in non-arch headers and
having some arch generic explanation in the file are well taken though.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-23 16:25                                     ` szabolcs.nagy
@ 2023-06-25 18:48                                       ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-25 18:48 UTC (permalink / raw)
  To: broonie, szabolcs.nagy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, akpm, peterz, corbet,
	nd, dethoma, jannh, linux-kernel, debug, mike.kravetz, bp, x86,
	linux-api, rppt, jamorris, arnd, john.allen, bsingharora, pavel,
	andrew.cooper3, oleg, keescook, gorcunov, fweimer, Yu, Yu-cheng,
	rdunlap, hpa, mingo, hjl.tools, linux-mm, Syromiatnikov, Eugene,
	Yang, Weijiang, linux-doc, dave.hansen, Torvalds, Linus, Eranian,
	Stephane

On Fri, 2023-06-23 at 17:25 +0100, szabolcs.nagy@arm.com wrote:
> why?
> 
> a stack can be active or inactive (task executing on it or not),
> and if inactive it can be live or dead (has stack frames that can
> be jumped to or not).
> 
> this is independent of shadow stacks: longjmp is only valid if the
> target is either the same active stack or an inactive live stack.
> (there are cases that may seem to work, but fundamentally broken
> and not supportable: e.g. two tasks executing on the same stack
> where the one above happens to not clobber frames deep enough to
> collide with the task below.)
> 
> the proposed longjmp design works for both cases. no assumption is
> made about ucontext or signals other than the shadow stack for an
> inactive live stack ends in a restore token, 

One of the problems for the case of longjmp() from another stack is
how to find the old stack's token. HJ and I had previously discussed
searching for the token from the target SSP forward, but the problems
are that it is 1, not guaranteed to be there and 2, pretty awkward and
potentially slow.

> which is guaranteed by
> the isa so we only need the kernel to do the same when it switches
> shadow stacks. then longjmp works by construction.

No it's not.

> 
> the only wart is that an overflowed shadow stack is inactive dead
> instead of inactive live because the token cannot be added. (note
> that handling shstk overflow and avoiding shstk overflow during
> signal handling still works with alt shstk!)

I thought we were on the same page as far as pushing a restore token on
signal not being robust against shadow stack overflow, so is this a new
idea? Usually people around here say "code talks", but all I'm asking
for is a full explanation of what you are trying to accomplish and what
the idea is. Otherwise this is asking to hold up this feature based on
hand waving.

Could you answer the questions here, along with a full description of
your proposal:
https://lore.kernel.org/all/1cd67ae45fc379fd82d2745190e4caf74e67499e.camel@intel.com/

If we are talking past each other, it could help to do a levelset at
this point.

> 
> an alternative solution is to allow jump to inactive dead stack
> if that's created by a signal interrupt. for that a syscall is
> needed and longjmp has to detect if the target stack is dead or
> live. (the kernel also has to be able to tell if switching to the
> dead stack is valid for security reasons.) i don't know if this
> is doable (if we allow some hacks it's doable).
> 
> unwinding across signal handlers is just a matter of having
> enough information at the signal frame to continue, it does
> not have to follow crazy jumps or weird coroutine things:
> that does not work without shadow stacks either. but unwind
> across alt stack frame should work.

Even if there is some workable idea, there is already a bunch of 
userspace built around the existing solution, and users waiting for it.

In addition the whole discussion around alt shadow stack cases will
require alt shadow stacks to be implemented and we might be constrained
there anyway.

If we want to take learnings and do something new, let's build it
around a new elf bit. This current kernel ABI is to support the old elf
bit userspace stack, which has been solidified by the existing upstream
userspace.

I had thought we should start from scratch now and proposed a patch to
block the old elf bit to force this, but lost that battle with other
glibc developers.

Speaking of which, I don't see any enthusiasm from any other glibc
developers that have been involved in this previously. Who were you
thinking was going to implement any of this on the glibc side? Have you
made any progress in getting any of them onboard with this order of
operations in the months since you first brought up changing the
design?


* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22 16:42                                   ` szabolcs.nagy
  2023-06-22 23:18                                     ` Edgecombe, Rick P
@ 2023-06-25 23:52                                     ` Andy Lutomirski
  1 sibling, 0 replies; 151+ messages in thread
From: Andy Lutomirski @ 2023-06-25 23:52 UTC (permalink / raw)
  To: szabolcs.nagy
  Cc: Andy Lutomirski, Edgecombe, Rick P, broonie, Xu, Pengfei, tglx,
	linux-arch, kcc, nadav.amit, kirill.shutemov, david, Schimpe,
	Christina, linux-doc, peterz, corbet, nd, dethoma, jannh,
	linux-kernel, debug, pavel, bp, mike.kravetz, linux-api, rppt,
	jamorris, arnd, john.allen, rdunlap, bsingharora, oleg,
	andrew.cooper3, keescook, x86, gorcunov, Yu, Yu-cheng, fweimer,
	hpa, mingo, hjl.tools, linux-mm, Syromiatnikov, Eugene, Torvalds,
	Linus, akpm, dave.hansen, Yang, Weijiang, Eranian, Stephane

On Thu, Jun 22, 2023 at 9:43 AM szabolcs.nagy@arm.com
<szabolcs.nagy@arm.com> wrote:
>
> The 06/22/2023 08:26, Andy Lutomirski wrote:
> > On Thu, Jun 22, 2023 at 2:28 AM szabolcs.nagy@arm.com
> > <szabolcs.nagy@arm.com> wrote:
> > >
> > > The 06/21/2023 18:54, Edgecombe, Rick P wrote:
> > > > On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > > > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > > > > I actually did a POC for this, but rejected it. The problem is,
> > > > > > > > if there is a shadow stack overflow at that point then the
> > > > > > > > kernel can't push the shadow stack token to the old stack. And
> > > > > > > > shadow stack overflow is exactly the alt shadow stack use case.
> > > > > > > > So it doesn't really solve the problem.
> > > > > >
> > > > > > the restore token in the alt shstk case does not regress anything
> > > > > > but
> > > > > > makes some use-cases work.
> > > > > >
> > > > > > alt shadow stack is important if code tries to jump in and out of
> > > > > > signal handlers (dosemu does this with swapcontext) and for that a
> > > > > > restore token is needed.
> > > > > >
> > > > > > alt shadow stack is important if the original shstk did not
> > > > > > overflow
> > > > > > but the signal handler would overflow it (small thread stack, huge
> > > > > > sigaltstack case).
> > > > > >
> > > > > > alt shadow stack is also important for crash reporting on shstk
> > > > > > overflow even if longjmp does not work then. longjmp to a
> > > > > > makecontext
> > > > > > stack would still work and longjmp back to the original stack can
> > > > > > be
> > > > > > made to mostly work by an altshstk option to overwrite the top
> > > > > > entry
> > > > > > with a restore token on overflow (this can break unwinding though).
> > > > > >
> > > >
> > > > There was previously a request to create an alt shadow stack for the
> > > > purpose of handling shadow stack overflow. So you are now suggesting
> > > > to exclude that and instead target a different use case for alt shadow
> > > > stack?
> > >
> > > that is not what i said.
> > >
> > > > But I'm not sure how much we should change the ABI at this point since
> > > > we are constrained by existing userspace. If you read the history, we
> > > > may end up needing to deprecate the whole elf bit for this and other
> > > > reasons.
> > >
> > > i'm not against deprecating the elf bit, but i think binary
> > > marking will be difficult for this kind of feature no matter what
> > > (code may be incompatible for complex runtime dependent reasons).
> > >
> > > > So should we struggle to find a way to grow the existing ABI without
> > > > disturbing the existing userspace? Or should we start with something,
> > > > finally, and see where we need to grow and maybe get a chance at a
> > > > fresh start to grow it?
> > > >
> > > > Like, maybe 3 people will show up saying "hey, I *really* need to use
> > > > shadow stack and longjmp from a ucontext stack", and no one says
> > > > anything about shadow stack overflow. Then we know what to do. And
> > > > maybe dosemu decides it doesn't need to implement shadow stack (highly
> > > > likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> > > > was created for dosemu, and the alt shadow stack patch adopted this
> > > > behavior. So it's speculation that there is even a problem in that
> > > > scenario.
> > > >
> > > > Or maybe people just enable WRSS for longjmp() and directly jump back
> > > > to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> > > > cost of a little security?
> > > >
> > > > Even if, with enough discussion, we could optimize for all
> > > > hypotheticals without real user feedback, I don't see how it helps
> > > > users to hold shadow stack. So I think we should move forward with the
> > > > current ABI.
> > >
> > > you may not get a second chance to fix a security feature.
> > > it will be just disabled if it causes problems.
> >
> > *I* would use altshadowstack.
> >
> > I run a production system (that cares about correctness *and*
> > performance, but that's not really relevant here -- SHSTK ought to be
> > fast).  And, if it crashes, I want to know why.  So I handle SIGSEGV,
> > etc so I have good logs if it crashes.  And I want those same logs if
> > I overflow the stack.
> >
> > That being said, I have no need for longjmp or siglongjmp for this.  I
> > use exit(2) to escape.
>
> the same crash handler that prints a log on shstk overflow should
> work when a different cause of SIGSEGV is recoverable via longjmp.
> to me this means that alt shstk must work with longjmp at least in
> the non-shstk overflow case (which can be declared non-recoverable).

Sure, but how many SIGSEGV handlers would use altshadowstack and
*also, in the same handler* ever resume?  Not mine.  Obviously I'm
only one sample.

>
> > For what it's worth, setjmp/longjmp is a bad API.  The actual pattern
> > that ought to work well (and that could be supported well by fancy
> > compilers and non-C languages, as I understand it) is more like a
> > function call that has two ways out.  Like this (pseudo-C):
> >
> > void function(struct better_jmp_buf &buf, args...)
> > {
> >    ...
> >        if (condition)
> >           better_long_jump(buf);  // long jumps out!
> >        // could also pass buf to another function
> >    ...
> >        // could also return normally
> > }
> >
> > better_call_with_jmp_buf(function, args);
> >
> > *This* could support altshadowstack just fine.  And many users might
> > be okay with the understanding that, if altshadowstack is on, you have
> > to use a better long jump to get out (or a normal sigreturn or _exit).
>
> i don't understand why this would work fine when longjmp does not.
> how does the shstk switch happen?

Ugh, I think this may have some issues given how the ISA works.  Sigh.

I was imagining that better_call_with_jmp_buf would push a restore
token on the shadow stack, then call the passed-in function, then, on
a successful return, INCSSP over the token and continue on.
better_long_jump() would RSTORSSP to the saved token.

But I'm not sure how to write the token without WRUSS.
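(Sketching just the shape of that API: a plain-C model of
better_call_with_jmp_buf using ordinary setjmp/longjmp, with the
shadow-stack token handling left out entirely -- exactly the part the
WRUSS question is about. All names here are hypothetical.)

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

struct better_jmp_buf {
	jmp_buf env;
};

/* The "second way out": long jump back to better_call_with_jmp_buf(). */
static void better_long_jump(struct better_jmp_buf *buf)
{
	longjmp(buf->env, 1);
}

/*
 * Call fn; return true if it returned normally, false if it escaped
 * via better_long_jump().  On a shadow stack system this is where a
 * restore token would be pushed before the call and consumed after.
 */
static bool better_call_with_jmp_buf(void (*fn)(struct better_jmp_buf *, void *),
				     void *arg)
{
	struct better_jmp_buf buf;

	if (setjmp(buf.env))
		return false;	/* escaped via better_long_jump() */
	fn(&buf, arg);
	return true;		/* normal return */
}

/* Example callees exercising both exits. */
static void exits_normally(struct better_jmp_buf *buf, void *arg)
{
	(void)buf; (void)arg;
}

static void jumps_out(struct better_jmp_buf *buf, void *arg)
{
	(void)arg;
	better_long_jump(buf);
}
```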

What *could* be done, which would be nasty and
sigaltshadowstack-specific, is to have a jump out of a signal handler
provide a pointer to the signal frame (siginfo_t or ucontext pointer),
and the kernel would assist it in switching the shadow stack back.
Eww.

--Andy

>
> > No one is getting an altshadowstack signal handler without code
> > changes.
>
> assuming the same component is doing the alt shstk setup as the
> longjmp.
>
> > siglongjmp() could support altshadowstack with help from the kernel,
> > but we probably don't want to go there.
>
> what kind of help? maybe we need that help..
>
> e.g. if the signal frame token is detected by longjmp on
> the shstk then doing an rt_sigreturn with the right signal
> frame context allows longjmp to continue unwinding the shstk.
> however kernel sigcontext layout can change so userspace may
> not know it so longjmp needs a helper, but only in the jump
> across signal frame case.
>
> (this is a different design than what i proposed earlier,
> it also makes longjmp from alt shstk work without wrss,
> the downside is that longjmp across makecontext needs a
> separate solution then which implies that all shstk needs
> a detectable token at the end of the shstk.. so again
> something that we have to get right now and cannot add
> later.)


* Re: [PATCH v9 16/42] mm: Add guard pages around a shadow stack.
  2023-06-25 16:44           ` Edgecombe, Rick P
@ 2023-06-26 12:45             ` Mark Brown
  2023-07-06 23:32               ` [PATCH] x86/shstk: Move arch detail comment out of core mm Rick Edgecombe
  0 siblings, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-06-26 12:45 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: rppt, Xu, Pengfei, tglx, kcc, linux-arch, Lutomirski, Andy,
	nadav.amit, kirill.shutemov, david, Schimpe, Christina,
	linux-doc, peterz, corbet, linux-kernel, dethoma, jannh,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen,
	jamorris, arnd, bsingharora, x86, oleg, fweimer, keescook, willy,
	gorcunov, Yu, Yu-cheng, andrew.cooper3, hpa, mingo,
	szabolcs.nagy, hjl.tools, debug, linux-mm, Syromiatnikov, Eugene,
	Torvalds, Linus, akpm, dave.hansen, Yang, Weijiang, Eranian,
	Stephane

[-- Attachment #1: Type: text/plain, Size: 1741 bytes --]

On Sun, Jun 25, 2023 at 04:44:32PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2023-06-23 at 13:17 +0100, Mark Brown wrote:

> > This isn't an x86 specific concept, arm64 has a very similar
> > extension
> > called Guarded Control Stack (which I should be publishing changes
> > for
> > in the not too distant future) and riscv also has something.  For
> > arm64
> > I'm using the generic mm changes wholesale, we have a similar need
> > for
> > guard pages around the GCS and while the mechanics of accessing are
> > different the requirement ends up being the same.  Perhaps we could
> > just
> > rewrite the comment to say that guard pages prevent over/underflow of
> > the stack by userspace and that a single page is sufficient for all
> > current architectures, with the details of the working for x86 put in
> > some x86 specific place?

> Something sort of similar came up in regards to the riscv series, about
> adding something like an is_shadow_stack_vma() helper. The plan was to
> not make too many assumptions about the final details of the other
> shadow stack features and leave that for refactoring. I think some kind
> of generic comment like you suggest makes sense, but I don't want to
> try to assert any arch specifics for features that are not upstream. It
> should be very easy to tweak the comment when the time comes.

> The points about x86 details not belonging in non-arch headers and
> having some arch generic explanation in the file are well taken though.

I think a statement to the effect that "this works for currently
supported architectures" is fine, if something comes along with
additional requirements then the comment can be adjusted as part of
merging the new thing.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22 16:46                                   ` Edgecombe, Rick P
@ 2023-06-26 14:08                                     ` szabolcs.nagy
  2023-06-28  1:23                                       ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-26 14:08 UTC (permalink / raw)
  To: Edgecombe, Rick P, hjl.tools
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Yang, Weijiang,
	peterz, corbet, nd, dethoma, jannh, linux-kernel, debug,
	mike.kravetz, bp, rdunlap, linux-api, rppt, jamorris, pavel,
	john.allen, bsingharora, x86, broonie, andrew.cooper3, oleg,
	keescook, gorcunov, arnd, Yu, Yu-cheng, fweimer, hpa, mingo,
	linux-mm, Syromiatnikov, Eugene, linux-doc, Torvalds, Linus,
	dave.hansen, akpm, Eranian, Stephane

The 06/22/2023 16:46, Edgecombe, Rick P wrote:
> You previously said:
> 
> On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > as far as i can tell the current unwinder handles shstk unwinding
> > correctly across signal handlers (sync or async and
> > cleanup/exceptions
> > handlers too), i see no issue with "fixed shadow stack signal frame
> > size of 8 bytes" other than future extensions and discontinous shstk.
> 
> I took that to mean that you didn't see how the existing unwinder
> prevented alt shadow stacks. Hopefully we're all on the same page now.

well alt shstk is discontinuous.

there were two separate confusions:

- your mention of -fnon-call-exceptions threw me off since that is a
very specific corner case.

- i was talking about an unwind design that can deal with altshstk
which requires ssp switch. (current unwinder does not support this,
but i assumed we can add that now and ignore old broken unwinders).
you are saying that alt shstk support needs additional shstk tokens
in the signal frame to maintain alt shstk state for the kernel.

i think now we are on the same page.

> BTW, when alt shadow stack's were POCed, I hadn't encountered this GCC
> behavior yet. So it was assumed it could be bolted on later without
> disturbing anything. If Linus or someone wants to say we're ok with
> breaking these old GCCs in this way, the first thing I would do would
> be to pad the shadow stack signal frame with room for alt shadow stack
> and more. I actually have a patch to do this, but alas we are already
> pushing it regression wise.

sounds like it will be hard to add alt shstk later.

(i think maintaining alt shstk state on the stack instead of
shstk should work too. but if that does not work, then alt
shstk will require another abi opt-in.)


* Re: [PATCH v9 28/42] x86/shstk: Add user-mode shadow stack support
  2023-06-13  0:10 ` [PATCH v9 28/42] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
@ 2023-06-27 17:20   ` Mark Brown
  2023-06-27 23:46     ` Dave Hansen
  0 siblings, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-06-27 17:20 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, Yu-cheng Yu, Pengfei Xu


On Mon, Jun 12, 2023 at 05:10:54PM -0700, Rick Edgecombe wrote:

> +static void unmap_shadow_stack(u64 base, u64 size)
> +{
> +	while (1) {
> +		int r;
> +
> +		r = vm_munmap(base, size);
> +
> +		/*
> +		 * vm_munmap() returns -EINTR when mmap_lock is held by
> +		 * something else, and that lock should not be held for a
> +		 * long time.  Retry it for the case.
> +		 */
> +		if (r == -EINTR) {
> +			cond_resched();
> +			continue;
> +		}

This looks generic, not even shadow stack specific - was there any
discussion of making it a vm_munmap_retry() (that's not a great name...)
or similar?  I didn't see any in old versions of the thread but I
might've missed something.
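(For reference, a userspace sketch of what such a generic retry helper
could look like. The operation is abstracted as a callback so the loop
can be shown outside the kernel; a real vm_munmap() wrapper would call
cond_resched() between attempts, as in the quoted hunk. The names are
hypothetical.)

```c
#include <assert.h>
#include <errno.h>

/*
 * Retry an operation as long as it reports -EINTR, returning its
 * final result.  In the kernel version the body of the loop would
 * also cond_resched() before retrying.
 */
static int call_retrying_on_eintr(int (*op)(void *), void *arg)
{
	int r;

	do {
		r = op(arg);
	} while (r == -EINTR);

	return r;
}

/* Example operation that fails with -EINTR a few times, then succeeds. */
static int flaky_op(void *arg)
{
	int *remaining = arg;

	return (*remaining)-- > 0 ? -EINTR : 0;
}
```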



* Re: [PATCH v9 28/42] x86/shstk: Add user-mode shadow stack support
  2023-06-27 17:20   ` Mark Brown
@ 2023-06-27 23:46     ` Dave Hansen
  2023-06-28  0:37       ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: Dave Hansen @ 2023-06-27 23:46 UTC (permalink / raw)
  To: Mark Brown, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, Yu-cheng Yu, Pengfei Xu

On 6/27/23 10:20, Mark Brown wrote:
> On Mon, Jun 12, 2023 at 05:10:54PM -0700, Rick Edgecombe wrote:
> 
>> +static void unmap_shadow_stack(u64 base, u64 size)
>> +{
>> +	while (1) {
>> +		int r;
>> +
>> +		r = vm_munmap(base, size);
>> +
>> +		/*
>> +		 * vm_munmap() returns -EINTR when mmap_lock is held by
>> +		 * something else, and that lock should not be held for a
>> +		 * long time.  Retry it for the case.
>> +		 */
>> +		if (r == -EINTR) {
>> +			cond_resched();
>> +			continue;
>> +		}
> This looks generic, not even shadow stack specific - was there any
> discussion of making it a vm_munmap_retry() (that's not a great name...)
> or similar?  I didn't see any in old versions of the thread but I
> might've missed something.

Yeah, that looks odd.  Also odd is that none of the other users of
vm_munmap() bother to check the return value (except for one passthrough
in the nommu code).

I don't think the EINTR happens during contention, though.  It's really
there to be able to break out in the face of SIGKILL.  I think that's why
nobody handles it: the task is dying anyway and nobody cares.

Rick, was this hunk here for a specific reason or were you just trying
to be diligent in handling errors?


* Re: [PATCH v9 28/42] x86/shstk: Add user-mode shadow stack support
  2023-06-27 23:46     ` Dave Hansen
@ 2023-06-28  0:37       ` Edgecombe, Rick P
  2023-07-06 23:38         ` [PATCH] x86/shstk: Don't retry vm_munmap() on -EINTR Rick Edgecombe
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-28  0:37 UTC (permalink / raw)
  To: broonie, Hansen, Dave
  Cc: akpm, Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy,
	nadav.amit, kirill.shutemov, david, Schimpe, Christina, Torvalds,
	Linus, peterz, corbet, linux-kernel, jannh, dethoma,
	mike.kravetz, pavel, bp, rdunlap, linux-api, john.allen, arnd,
	jamorris, rppt, bsingharora, x86, oleg, fweimer, keescook,
	gorcunov, Yu, Yu-cheng, andrew.cooper3, hpa, mingo,
	szabolcs.nagy, hjl.tools, debug, linux-mm, Syromiatnikov, Eugene,
	Yang, Weijiang, linux-doc, dave.hansen, Eranian, Stephane

On Tue, 2023-06-27 at 16:46 -0700, Dave Hansen wrote:
> > This looks generic, not even shadow stack specific - was there any
> > discussion of making it a vm_munmap_retry() (that's not a great
> > name...)
> > or similar?  I didn't see any in old versions of the thread but I
> > might've missed something.
> 
> Yeah, that looks odd.  Also odd is that none of the other users of
> vm_munmap() bother to check the return value (except for one
> passthrough
> in the nommu code).
> 
> I don't think the EINTR happens during contention, though.  It's really
> there to be able to break out in the face of SIGKILL.  I think that's why
> nobody handles it: the task is dying anyway and nobody cares.
> 
> Rick, was this hunk here for a specific reason or were you just trying
> to be diligent in handling errors?

I'm not aware of any specific required cases. I think it is just pure
diligence, originating from this comment:
https://lore.kernel.org/lkml/9847845a-749d-47a3-2a1d-bcc7c35f1bdd@intel.com/#t

I didn't see a need to remove it. 

The SIGKILL certainly sounds like something more true than the
comment, but I can't do much of a dive until next week. I would think
we need to handle EINTR differently to not WARN at least.

Yea, some of it does seem like the kind of thing that could live
outside of x86 shstk.c. But I'm not sure about the WARN part. That
should probably live in the caller. I guess someday we might even find
some shadow stack specific logic that could be cross-arch.



* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-26 14:08                                     ` szabolcs.nagy
@ 2023-06-28  1:23                                       ` Edgecombe, Rick P
  0 siblings, 0 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-06-28  1:23 UTC (permalink / raw)
  To: szabolcs.nagy, hjl.tools
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, akpm, peterz, corbet,
	nd, dethoma, jannh, linux-kernel, debug, pavel, bp, rdunlap,
	linux-api, rppt, jamorris, arnd, john.allen, bsingharora,
	mike.kravetz, broonie, andrew.cooper3, oleg, keescook, x86,
	gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo, linux-mm,
	Syromiatnikov, Eugene, Yang, Weijiang, linux-doc, dave.hansen,
	Torvalds, Linus, Eranian, Stephane

On Mon, 2023-06-26 at 15:08 +0100, szabolcs.nagy@arm.com wrote:
> The 06/22/2023 16:46, Edgecombe, Rick P wrote:
> > You previously said:
> > 
> > On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > as far as i can tell the current unwinder handles shstk unwinding
> > > correctly across signal handlers (sync or async and
> > > cleanup/exceptions
> > > handlers too), i see no issue with "fixed shadow stack signal
> > > frame
> > > size of 8 bytes" other than future extensions and discontinous
> > > shstk.
> > 
> > I took that to mean that you didn't see how the existing
> > unwinder
> > prevented alt shadow stacks. Hopefully we're all on the same page
> > now. 
> 
> well alt shstk is discontinuous.
> 
> there were two separate confusions:
> 
> - your mention of -fnon-call-exceptions threw me off since that is a
> very specific corner case.
> 
> - i was talking about an unwind design that can deal with altshstk
> which requires ssp switch. (current unwinder does not support this,
> but i assumed we can add that now and ignore old broken unwinders).
> you are saying that alt shstk support needs additional shstk tokens
> in the signal frame to maintain alt shstk state for the kernel.
> 
> i think now we are on the same page.

I don't think I fully understand your point still. It would really help
if you could do a levelset summary, like I asked on other branches of
this thread.

> 
> > BTW, when alt shadow stack's were POCed, I hadn't encountered this
> > GCC
> > behavior yet. So it was assumed it could be bolted on later without
> > disturbing anything. If Linus or someone wants to say we're ok with
> > breaking these old GCCs in this way, the first thing I would do
> > would
> > be to pad the shadow stack signal frame with room for alt shadow
> > stack
> > and more. I actually have a patch to do this, but alas we are
> > already
> > pushing it regression wise.
> 
> sounds like it will be hard to add alt shstk later.

Adding alt shadow stack without moving the elf bit runs the risk of
regressions because of the old GCCs. But moving the elf bit is easy
(from the technical side at least). There were several threads about
this in the past.

So I don't think it would be any harder than it is now.

> 
> (i think maintaining alt shstk state on the stack instead of
> shstk should work too. but if that does not work, then alt
> shstk will require another abi opt-in.)

The x86 sigframe is pretty full AFAIK. There are already features that
require an opt in. It might be true that the alt shadow stack base and
size don't need to be protected like the previous SSP. I'd have to
think about it. It could be nice to optionally put some IBT stuff on
the shadow stack as well, which would require growing the frame, and at
that point the alt shadow stack stuff might as well go there too.



* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-22 23:18                                     ` Edgecombe, Rick P
@ 2023-06-29 16:07                                       ` szabolcs.nagy
  2023-07-02 18:03                                         ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-06-29 16:07 UTC (permalink / raw)
  To: Edgecombe, Rick P, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, Yang, Weijiang, peterz, corbet, nd,
	broonie, dethoma, linux-kernel, x86, debug, bp, rdunlap,
	linux-api, rppt, jamorris, pavel, john.allen, bsingharora,
	mike.kravetz, jannh, andrew.cooper3, oleg, keescook, gorcunov,
	arnd, Yu, Yu-cheng, fweimer, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, linux-doc, Torvalds, Linus, dave.hansen,
	akpm, Eranian, Stephane

The 06/22/2023 23:18, Edgecombe, Rick P wrote:
> I'd also appreciate if you could spell out exactly which:
>  - ucontext
>  - signal
>  - longjmp
>  - custom library stack switching
> 
> patterns you think shadow stack should support working together.
> Because even after all these mails, I'm still not sure exactly what you
> are trying to achieve.

i'm trying to support two operations (in any combination):

(1) jump up the current (active) stack.

(2) jump to a live frame in a different inactive but live stack.
    the old stack becomes inactive (= no task executes on it)
    and live (= has valid frames to jump to).

with

(3) the runtime must manage the shadow stacks transparently.
    (= portable c code does not need modifications)

mapping this to c apis:

- swapcontext, setcontext, longjmp, custom stack switching are jump
  operations. (there are conditions under which (1) and (2) must work,
  further details don't matter.)

- makecontext creates an inactive live stack.

- signal is only special if it executes on an alt stack: on signal
  entry the alt stack becomes active and the interrupted stack
  inactive but live. (nested signals execute on the alt stack until
  that is left either via a jump or signal return.)

- unwinding can be implemented with jump operations (it needs some
  other things but that's out of scope here).

the patterns that shadow stack should support falls out of this model.
(e.g. posix does not allow jumping from one thread to the stack of a
different thread, but the model does not care about that, it only
cares if the target stack is inactive and live then jump should work.)
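(the jump-validity rule above, written out as a toy check -- an
illustration of the model only, not kernel code; all names invented:)

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A stack is active when a task is executing on it; an inactive stack
 * is live when it still has frames that can be jumped to, dead when
 * it does not.
 */
enum stack_state {
	STACK_ACTIVE,
	STACK_INACTIVE_LIVE,
	STACK_INACTIVE_DEAD,
};

/*
 * A jump is valid up the currently active stack (operation 1) or to
 * an inactive live stack (operation 2); never to a dead stack or to a
 * stack some other task is executing on.
 */
static bool jump_valid(enum stack_state target, bool target_is_current)
{
	if (target == STACK_ACTIVE)
		return target_is_current;
	return target == STACK_INACTIVE_LIVE;
}
```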

some observations:

- it is necessary for jump to detect case (2) and then switch to the
  target shadow stack. this is also sufficient to implement it. (note:
  the restore token can be used for detection since that is guaranteed
  to be present when user code creates an inactive live stack and is
  not present anywhere else by design. a different marking can be used
  if the inactive live stack is created by the kernel, but then the
  kernel has to provide a switch method, e.g. syscall. this should not
  be controversial.)

- in this model two live stacks cannot use the same shadow stack since
  jumping between the two stacks is allowed in both directions, but
  jumping within a shadow stack only works in one direction. (also two
  tasks could execute on the same shadow stack then. and it makes
  shadow stack size accounting problematic.)

- so sharing shadow stack with alt stack is broken. (the model is
  right in the sense that valid posix code can trigger the issue. we
  can ignore that corner case and adjust the model so the shared
  shadow stack works for alt stack, but it likely does not change the
  jump design: eventually we want alt shadow stack.)

- shadow stack cannot always be managed by the runtime transparently:
  it has to be allocated for makecontext and alt stack in situations
  where allocation failure cannot be handled. more alarmingly the
  destruction of stacks may not be visible to the runtime so the
  corresponding shadow stacks leak. my preferred way to fix this is
  new apis that are shadow stack compatible (e.g. shadow_makecontext
  with shadow_freecontext) and marking the incompatible apis as such.
  portable code then can decide to update to new apis, run with shstk
  disabled or accept the leaks and OOM failures. the current approach
  needs ifdef __CET__ in user code for makecontext and sigaltstack
  has many issues.

- i'm still not happy with the shadow stack sizing. and would like to
  have a token at the end of the shadow stack to allow scanning. and
  it would be nice to deal with shadow stack overflow. and there is
  async disable on dlopen. so there are things to work on.

i understand that the proposed linux abi makes most existing binaries
with shstk marking work, which is relevant for x86.

for a while i thought we can fix the remaining issues even if that
means breaking existing shstk binaries (just bump the abi marking).
now it seems the issues can only be addressed in a future abi break.

which means x86 linux will likely end up maintaining two incompatible
abis and the future one will need user code and build system changes,
not just runtime changes. it is not a small incremental change to add
alt shadow stack support for example.

i don't think the maintenance burden of two shadow stack abis is the
right path for arm64 to follow, so the shadow stack semantics will
likely become divergent not common across targets.

i hope my position is now clearer.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-29 16:07                                       ` szabolcs.nagy
@ 2023-07-02 18:03                                         ` Edgecombe, Rick P
  2023-07-03 13:32                                           ` Mark Brown
  2023-07-03 18:19                                           ` szabolcs.nagy
  0 siblings, 2 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-07-02 18:03 UTC (permalink / raw)
  To: szabolcs.nagy, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, akpm, peterz, corbet, nd, broonie,
	jannh, linux-kernel, x86, debug, bp, rdunlap, linux-api, rppt,
	jamorris, pavel, john.allen, bsingharora, mike.kravetz, dethoma,
	andrew.cooper3, oleg, keescook, gorcunov, arnd, Yu, Yu-cheng,
	fweimer, hpa, mingo, hjl.tools, linux-mm, Syromiatnikov, Eugene,
	Yang, Weijiang, linux-doc, dave.hansen, Torvalds, Linus, Eranian,
	Stephane

On Thu, 2023-06-29 at 17:07 +0100, szabolcs.nagy@arm.com wrote:
> The 06/22/2023 23:18, Edgecombe, Rick P wrote:
> > I'd also appreciate if you could spell out exactly which:
> >  - ucontext
> >  - signal
> >  - longjmp
> >  - custom library stack switching
> > 
> > patterns you think shadow stack should support working together.
> > Because even after all these mails, I'm still not sure exactly what
> > you
> > are trying to achieve.

Hi Szabolcs,

Thanks for writing all this up. It is helpful to understand where you
are coming from. Please don't miss my point at the very bottom of this
response.

> 
> i'm trying to support two operations (in any combination):
> 
> (1) jump up the current (active) stack.
> 
> (2) jump to a live frame in a different inactive but live stack.
>     the old stack becomes inactive (= no task executes on it)
>     and live (= has valid frames to jump to).
> 
> with
> 
> (3) the runtime must manage the shadow stacks transparently.
>     (= portable c code does not need modifications)
> 
> mapping this to c apis:
> 
> - swapcontext, setcontext, longjmp, custom stack switching are jump
>   operations. (there are conditions under which (1) and (2) must
> work,
>   further details don't matter.)
> 
> - makecontext creates an inactive live stack.
> 
> - signal is only special if it executes on an alt stack: on signal
>   entry the alt stack becomes active and the interrupted stack
>   inactive but live. (nested signals execute on the alt stack until
>   that is left either via a jump or signal return.)
> 
> - unwinding can be implemented with jump operations (it needs some
>   other things but that's out of scope here).
> 
> the patterns that shadow stack should support falls out of this
> model.
> (e.g. posix does not allow jumping from one thread to the stack of a
> different thread, but the model does not care about that, it only
> cares if the target stack is inactive and live then jump should
> work.)
> 
> some observations:
> 
> - it is necessary for jump to detect case (2) and then switch to the
>   target shadow stack. this is also sufficient to implement it.
> (note:
>   the restore token can be used for detection since that is
> guaranteed
>   to be present when user code creates an inactive live stack and is
>   not present anywhere else by design. a different marking can be
> used
>   if the inactive live stack is created by the kernel, but then the
>   kernel has to provide a switch method, e.g. syscall. this should
> not
>   be controversial.)

For x86's shadow stack you can jump to a new stack without leaving a
token behind. I don't know if maybe we could make it a rule in the
x86_64 ABI that you should always leave a token if you are going to
mark the SHSTK elf bit. But if anything did this, then longjmp() could
never make it back to the stack where setjmp() was called without
kernel help.

> 
> - in this model two live stacks cannot use the same shadow stack
> since
>   jumping between the two stacks is allowed in both directions, but
>   jumping within a shadow stack only works in one direction. (also
> two
>   tasks could execute on the same shadow stack then. and it makes
>   shadow stack size accounting problematic.)
> 
> - so sharing shadow stack with alt stack is broken. (the model is
>   right in the sense that valid posix code can trigger the issue.

Could you spell out what "the issue" is that can be triggered?

>  we
>   can ignore that corner case and adjust the model so the shared
>   shadow stack works for alt stack, but it likely does not change the
>   jump design: eventually we want alt shadow stack.)

As we discussed previously, alt shadow stack can't work transparently
with existing code due to the sigaltstack API. I wonder if maybe you
are trying to get at something else, and I'm not following.

> 
> - shadow stack cannot always be managed by the runtime transparently:
>   it has to be allocated for makecontext and alt stack in situations
>   where allocation failure cannot be handled. more alarmingly the
>   destruction of stacks may not be visible to the runtime so the
>   corresponding shadow stacks leak. my preferred way to fix this is
>   new apis that are shadow stack compatible (e.g. shadow_makecontext
>   with shadow_freecontext) and marking the incompatible apis as such.
>   portable code then can decide to update to new apis, run with shstk
>   disabled or accept the leaks and OOM failures. the current approach
>   needs ifdef __CET__ in user code for makecontext and sigaltstack
>   has many issues.

This sounds reasonable to me on the face of it. It seems mostly
unrelated to the kernel ABI and purely a userspace thing.

> 
> - i'm still not happy with the shadow stack sizing. and would like to
>   have a token at the end of the shadow stack to allow scanning. and
>   it would be nice to deal with shadow stack overflow. and there is
>   async disable on dlopen. so there are things to work on.

I was imagining that for tracing-only users, it might make sense to run
with WRSS enabled. This would mean libcs could write their own restore
tokens. In the case of longjmp() it could be simple and fast: the
implementation could just write a token at the target SSP and switch to
it. Non-C runtimes that want to use it for backtracing could also write
their own preferred stack markers or other data. It is also a whole
different solution to what is being discussed.

But over the course of this thread, I can now imagine a little more
how a top of stack marker could possibly be useful for non-tracing
usages. I have a patch prepared for this, and I tested whether adding
it later could disturb anything in userspace. The only thing that I
found was that gdb might output a slightly different stack trace. So
it would be a user visible change, if not a regression.

One reason I still held off on it is that the plan for the expanded
shadow stack signal frame includes using a 0 frame, to avoid a forgery
scenario. The token that makes sense for the end of stack marker is
also a 0 frame. So if userspace that looks for the end of stack marker
scans for the 0 frame without checking if it is part of an expanded
shadow stack signal frame, then it could make more trouble for alt
shadow stack.

So since they are tied together, I thought to hold off on it for now.
I don't want to try to squeeze around the upstream userspace; I think
a version 2 should be a clean slate on a new elf bit.

> 
> i understand that the proposed linux abi makes most existing binaries
> with shstk marking work, which is relevant for x86.
> 
> for a while i thought we can fix the remaining issues even if that
> means breaking existing shstk binaries (just bump the abi marking).
> now it seems the issues can only be addressed in a future abi break.

Adding a new arch_prctl() ENABLE value was the plan. Not sure what you
mean by ABI break vs version bump. The plan was to add the new features
without userspace regression by putting any behavior behind a different
enable option. This relies on userspace to add a new elf bit, and to
use it.

> 
> which means x86 linux will likely end up maintaining two incompatible
> abis and the future one will need user code and build system changes,
> not just runtime changes. it is not a small incremental change to add
> alt shadow stack support for example.
> 
> i don't think the maintenance burden of two shadow stack abis is the
> right path for arm64 to follow, so the shadow stack semantics will
> likely become divergent not common across targets.

Unfortunately we are at a bit of an information asymmetry here because
the ARM spec and patches are not public. It may be part of the cause of
the confusion.

> 
> i hope my position is now clearer.

It kind of sounds like you don't like the x86 glibc implementation, and
you want to make sure the kernel can support whatever new solution you
are working on. I am on board with the goal of having some generic set
of rules to make portable code work for other architectures' shadow
stacks. But I think how close we can get to that goal or what it looks
like is an open question, for several reasons:
 1. Not everyone can see all the specs
 2. No POCs have been done (or at least shared)
 3. It's not clear what needs to be supported (yes, I know you have 
    made a rough proposal here, but it sounds like on the x86 glibc 
    side at least it's not even clear what non-shadow stack stack 
    switching operations can work together)

But towards these goals, I think your technical requests are:

1. Leave a token on switching to an alt shadow stack. As discussed
earlier, we can't do this because of the overflow issues. Also, since
alt shadow stack cannot be transparent to existing software anyway, it
should be ok to introduce limitations. So I think this one is a no.
What we could do is introduce security weakening kernel helpers, but
this would make sense to come with alt shadow stack support.
2. Add an end token at the top of the shadow stack. Because of the
existing userspace restriction interactions, this is complicated to
evaluate but I think we *could* do this now. There are pros and cons.
3. Support more options for shadow stack sizing. (I think you are
referring to this conversation:
https://lore.kernel.org/lkml/ZAIgrXQ4670gxlE4@arm.com/). I don't see
why this is needed for the base implementation. If ARM wants to add a
new rlimit or clone variant, I don't see why x86 can't support it
later.

So if we add 2, are you satisfied? Or otherwise, on the non-technical
request side, are you asking to hold off on x86 shadow stack, in order
to co-develop a unified solution?

I think the existing solution will see use in the meantime, including
for the development of all the x86 specific JIT implementations.


And finally, what I think is the most important point in all of this:

I think that *how* it gets used will be a better guide for further
development than us debating. For example the main pain point that has
come up so far is the problems around dlopen(). And the future work
that has been scoped has been for the kernel to help out in this area.
This is based on _user_ (distro) requests.

Any apps that don't work with shadow stack limitations can simply not
enable shadow stack. You and I are debating these specific API
combinations, but we can't know whether they are actually the best
place to focus development efforts. And the early signs are this is NOT
the most important problem to solve.


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-02 18:03                                         ` Edgecombe, Rick P
@ 2023-07-03 13:32                                           ` Mark Brown
  2023-07-03 18:19                                           ` szabolcs.nagy
  1 sibling, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-07-03 13:32 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: szabolcs.nagy, Lutomirski, Andy, Xu, Pengfei, tglx, linux-arch,
	kcc, nadav.amit, kirill.shutemov, david, Schimpe, Christina,
	akpm, peterz, corbet, nd, jannh, linux-kernel, x86, debug, bp,
	rdunlap, linux-api, rppt, jamorris, pavel, john.allen,
	bsingharora, mike.kravetz, dethoma, andrew.cooper3, oleg,
	keescook, gorcunov, arnd, Yu, Yu-cheng, fweimer, hpa, mingo,
	hjl.tools, linux-mm, Syromiatnikov, Eugene, Yang, Weijiang,
	linux-doc, dave.hansen, Torvalds, Linus, Eranian, Stephane

[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]

On Sun, Jul 02, 2023 at 06:03:42PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2023-06-29 at 17:07 +0100, szabolcs.nagy@arm.com wrote:

> > which means x86 linux will likely end up maintaining two incompatible
> > abis and the future one will need user code and build system changes,
> > not just runtime changes. it is not a small incremental change to add
> > alt shadow stack support for example.

> > i don't think the maintenance burden of two shadow stack abis is the
> > right path for arm64 to follow, so the shadow stack semantics will
> > likely become divergent not common across targets.

> Unfortunately we are at a bit of an information asymmetry here because
> the ARM spec and patches are not public. It may be part of the cause of
> the confusion.

While the descriptive text part of the spec is not yet integrated into
the ARM ARM, the architecture XML describing the instructions and system
registers is there; the document is numbered DDI0601:

    https://developer.arm.com/documentation/ddi0601/

The GCS-specific instructions and system registers are all named
beginning with GCS; it's aarch64 only.

Hopefully I should have something out next week for the kernel.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-02 18:03                                         ` Edgecombe, Rick P
  2023-07-03 13:32                                           ` Mark Brown
@ 2023-07-03 18:19                                           ` szabolcs.nagy
  2023-07-03 18:38                                             ` Mark Brown
                                                               ` (2 more replies)
  1 sibling, 3 replies; 151+ messages in thread
From: szabolcs.nagy @ 2023-07-03 18:19 UTC (permalink / raw)
  To: Edgecombe, Rick P, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, akpm, peterz, corbet, nd, broonie,
	jannh, linux-kernel, x86, debug, bp, rdunlap, linux-api, rppt,
	jamorris, pavel, john.allen, bsingharora, mike.kravetz, dethoma,
	andrew.cooper3, oleg, keescook, gorcunov, arnd, Yu, Yu-cheng,
	fweimer, hpa, mingo, hjl.tools, linux-mm, Syromiatnikov, Eugene,
	Yang, Weijiang, linux-doc, dave.hansen, Torvalds, Linus, Eranian,
	Stephane

The 07/02/2023 18:03, Edgecombe, Rick P wrote:
> On Thu, 2023-06-29 at 17:07 +0100, szabolcs.nagy@arm.com wrote:
> > The 06/22/2023 23:18, Edgecombe, Rick P wrote:
> > > I'd also appreciate if you could spell out exactly which:
> > >  - ucontext
> > >  - signal
> > >  - longjmp
> > >  - custom library stack switching
> > > 
> > > patterns you think shadow stack should support working together.
> > > Because even after all these mails, I'm still not sure exactly what
> > > you
> > > are trying to achieve.
> 
> Hi Szabolcs,
> 
> Thanks for writing all this up. It is helpful to understand where you
> are coming from. Please don't miss my point at the very bottom of this
> response.
> 
> > 
> > i'm trying to support two operations (in any combination):
> > 
> > (1) jump up the current (active) stack.
> > 
> > (2) jump to a live frame in a different inactive but live stack.
> >     the old stack becomes inactive (= no task executes on it)
> >     and live (= has valid frames to jump to).
> > 
> > with
> > 
> > (3) the runtime must manage the shadow stacks transparently.
> >     (= portable c code does not need modifications)
> > 
> > mapping this to c apis:
> > 
> > - swapcontext, setcontext, longjmp, custom stack switching are jump
> >   operations. (there are conditions under which (1) and (2) must
> > work,
> >   further details don't matter.)
> > 
> > - makecontext creates an inactive live stack.
> > 
> > - signal is only special if it executes on an alt stack: on signal
> >   entry the alt stack becomes active and the interrupted stack
> >   inactive but live. (nested signals execute on the alt stack until
> >   that is left either via a jump or signal return.)
> > 
> > - unwinding can be implemented with jump operations (it needs some
> >   other things but that's out of scope here).
> > 
> > the patterns that shadow stack should support falls out of this
> > model.
> > (e.g. posix does not allow jumping from one thread to the stack of a
> > different thread, but the model does not care about that, it only
> > cares if the target stack is inactive and live then jump should
> > work.)
> > 
> > some observations:
> > 
> > - it is necessary for jump to detect case (2) and then switch to the
> >   target shadow stack. this is also sufficient to implement it.
> > (note:
> >   the restore token can be used for detection since that is
> > guaranteed
> >   to be present when user code creates an inactive live stack and is
> >   not present anywhere else by design. a different marking can be
> > used
> >   if the inactive live stack is created by the kernel, but then the
> >   kernel has to provide a switch method, e.g. syscall. this should
> > not
> >   be controversial.)
> 
> For x86's shadow stack you can jump to a new stack without leaving a
> token behind. I don't know if maybe we could make it a rule in the
> x86_64 ABI that you should always leave a token if you are going to
> mark the SHSTK elf bit. But if anything did this, then longjmp() could
> never make it back to the stack where setjmp() was called without
> kernel help.

ok. i didn't know that.

the restore token means the stack is valid to execute on later. so
from libc pov if the token is missing then jumping to that stack is
simply ub. (the kernel could help working around missing tokens but
presumably the point of not adding a token is exactly to prevent
any kind of jumps to that stack later).

so i don't think this changes much. (other than that the x86 model has
inactive dead stacks that cannot be jumped to, and this is enforced.
but libc jumps are not supposed to leave such dead stacks behind so
libc jumps would have to add the restore token when doing a switch.)

> > - in this model two live stacks cannot use the same shadow stack
> > since
> >   jumping between the two stacks is allowed in both directions, but
> >   jumping within a shadow stack only works in one direction. (also
> > two
> >   tasks could execute on the same shadow stack then. and it makes
> >   shadow stack size accounting problematic.)
> > 
> > - so sharing shadow stack with alt stack is broken. (the model is
> >   right in the sense that valid posix code can trigger the issue.
> 
> Could you spell out what "the issue" is that can be triggered?

i meant jumping back from the main to the alt stack:

in main:
setup sig alt stack
setjmp buf1
	raise signal on first return
	longjmp buf2 on second return

in signal handler:
setjmp buf2
	longjmp buf1 on first return
	can continue after second return

in my reading of posix this is valid (and works if signals are masked
such that the alt stack is not clobbered when jumping away from it).

but cannot work with a single shared shadow stack.

> >  we
> >   can ignore that corner case and adjust the model so the shared
> >   shadow stack works for alt stack, but it likely does not change the
> >   jump design: eventually we want alt shadow stack.)
> 
> As we discussed previously, alt shadow stack can't work transparently
> with existing code due to the sigaltstack API. I wonder if maybe you
> are trying to get at something else, and I'm not following.

i would like a jump design that works with alt shadow stack.

...
> So since they are tied together, and I thought to hold off on it for
> now. I don't want to try to squeeze around the upstream userspace, I
> think a version 2 should be a clean slate on a new elf bit.

ok.

> > i understand that the proposed linux abi makes most existing binaries
> > with shstk marking work, which is relevant for x86.
> > 
> > for a while i thought we can fix the remaining issues even if that
> > means breaking existing shstk binaries (just bump the abi marking).
> > now it seems the issues can only be addressed in a future abi break.
> 
> Adding a new arch_prctl() ENABLE value was the plan. Not sure what you
> mean by ABI break vs version bump. The plan was to add the new features
> without userspace regression by putting any behavior behind a different
> enable option. This relies on userspace to add a new elf bit, and to
> use it.

yes future abi break was the wrong wording: new abi with new elf bit
is what i meant. i.e. x86 will have an abi v1 and abi v2, where v2
abi is not compatible e.g. with v1 unwinder.

> > which means x86 linux will likely end up maintaining two incompatible
> > abis and the future one will need user code and build system changes,
> > not just runtime changes. it is not a small incremental change to add
> > alt shadow stack support for example.
> > 
> > i don't think the maintenance burden of two shadow stack abis is the
> > right path for arm64 to follow, so the shadow stack semantics will
> > likely become divergent not common across targets.
> 
> Unfortunately we are at a bit of an information asymmetry here because
> the ARM spec and patches are not public. It may be part of the cause of
> the confusion.

yes that's unfortunate. but in this case i just meant that arm64
does not have existing marked binaries to worry about. so it seems
wrong to do a v1 abi and later a v2 abi to fix that up.

> It kind of sounds like you don't like the x86 glibc implementation. And
> you want to make sure the kernel can support whatever a new solution is
> that you are working on. I am on board with the goal of having some
> generic set of rules to make portable code work for other architectures
> shadow stacks. But I think how close we can get to that goal or what it
> looks like is an open question. For several reasons:
>  1. Not everyone can see all the specs
>  2. No POCs have been done (or at least shared)
>  3. It's not clear what needs to be supported (yes, I know you have 
>     made a rough proposal here, but it sounds like on the x86 glibc 
>     side at least it's not even clear what non-shadow stack stack 
>     switching operations can work together)
> 
> But towards these goals, I think your technical requests are:
> 
> 1. Leave a token on switching to an alt shadow stack. As discussed
> earlier, we can't do this because of the overflow issues. Also since,

i don't think the overflow discussion was conclusive.

the kernel could modify the top entry instead of adding a new token.
and provide a syscall to switch to that top entry undoing the
modification.

there might be corner cases and likely not much space for encoding
a return address and special marking in an entry. but otherwise
this makes jumping to an overflowed shadow stack work.

but it is also a valid design to just not support jumping out of alt
stack in the specific case of a shadow stack overflow (treat that as
a fatal error), but still allow the jump in less fatal situations.

> alt shadow stack cannot be transparent to existing software anyway, it

maybe not in glibc, but a libc can internally use alt shadow stack
in sigaltstack instead of exposing a separate sigaltshadowstack api.
(this is what a strictly posix-conforming implementation has to do
to support shadow stacks.) leaking shadow stacks is not a correctness
issue unless it prevents the program from working (the shadow stack
for the main thread likely wastes more memory than all the alt stack
leaks. if the leaks become dominant in a thread the sigaltstack
libc api can just fail).

> should be ok to introduce limitations. So I think this one is a no.
> What we could do is introduce security weakening kernel helpers, but
> this would make sense to come with alt shadow stack support.

yes this would be for abi v2 on x86.

> 2. Add an end token at the top of the shadow stack. Because of the
> existing userspace restriction interactions, this is complicated to
> evaluate but I think we *could* do this now. There are pros and cons.

yes, this is a minor point. (and ok to do differently across targets)

> 3. Support more options for shadow stack sizing. (I think you are
> referring to this conversation:
> https://lore.kernel.org/lkml/ZAIgrXQ4670gxlE4@arm.com/). I don't see
> why this is needed for the base implementation. If ARM wants to add a
> new rlimit or clone variant, I don't see why x86 can't support it
> later.

i think it can be added later.

but it may be important for deployment on some platforms, since a
libc (or other language runtime) may want to set the shadow stack
size differently than the kernel default, because

- languages allocating large arrays on the stack
  (too big shadow stack can cause OOM with overcommit off and
  rlimits can be hit like RLIMIT_DATA, RLIMIT_AS because of it)

- tiny thread stack but big sigaltstack (musl libc, go).

> So if we add 2, are you satisfied? Or otherwise, on the non-technical
> request side, are you asking to hold off on x86 shadow stack, in order
> to co-develop a unified solution?

well a unified solution would need a v2 abi with new elf bit,
and it seems the preference is a v1 abi first, so at this point
i'm just trying to understand the potential brokenness in v1
and possible solutions for v2 since if there is a v2 i would
like that to be compatible across targets.

> I think the existing solution will see use in the meantime, including
> for the development of all the x86 specific JIT implementations.

i think some of that work has to be redone if there is a v2 abi,
which is why i thought having 2 abis is too much maintenance work.

> And finally, what I think is the most important point in all of this:
> 
> I think that *how* it gets used will be a better guide for further
> development than us debating. For example the main pain point that has
> come up so far is the problems around dlopen(). And the future work
> that has been scoped has been for the kernel to help out in this area.
> This is based on _user_ (distro) requests.

i think dlopen (and how it is used) is part of the api/abi design.

in the presence of programs that can load any library (e.g. python
exe) it is difficult to make use of shstk without runtime disable.

however if dlopen gets support for runtime disable, reducing the
number of incompatible libraries over time is still relevant,
which requires abi/api design that allows that.

> Any apps that don't work with shadow stack limitations can simply not
> enable shadow stack. You and me are debating these specific API
> combinations, but we can't know whether they are actually the best
> place to focus development efforts. And the early signs are this is NOT
> the most important problem to solve.

disabling shadow stack is not simple (removing the elf marking from
an exe is often not possible/appropriate, but using an env var that
affects an entire process tree has problems too). but i don't have
a solution for this (it is likely a userspace issue).

debating the api issues was at least useful for me to understand
what can go into the x86 v1 abi and what may go into a v2 abi.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-03 18:19                                           ` szabolcs.nagy
@ 2023-07-03 18:38                                             ` Mark Brown
  2023-07-03 18:49                                             ` Florian Weimer
  2023-07-05 18:45                                             ` Edgecombe, Rick P
  2 siblings, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-07-03 18:38 UTC (permalink / raw)
  To: szabolcs.nagy
  Cc: Edgecombe, Rick P, Lutomirski, Andy, Xu, Pengfei, tglx,
	linux-arch, kcc, nadav.amit, kirill.shutemov, david, Schimpe,
	Christina, akpm, peterz, corbet, nd, jannh, linux-kernel, x86,
	debug, bp, rdunlap, linux-api, rppt, jamorris, pavel, john.allen,
	bsingharora, mike.kravetz, dethoma, andrew.cooper3, oleg,
	keescook, gorcunov, arnd, Yu, Yu-cheng, fweimer, hpa, mingo,
	hjl.tools, linux-mm, Syromiatnikov, Eugene, Yang, Weijiang,
	linux-doc, dave.hansen, Torvalds, Linus, Eranian, Stephane

[-- Attachment #1: Type: text/plain, Size: 834 bytes --]

On Mon, Jul 03, 2023 at 07:19:00PM +0100, szabolcs.nagy@arm.com wrote:
> The 07/02/2023 18:03, Edgecombe, Rick P wrote:

> > 3. Support more options for shadow stack sizing. (I think you are
> > referring to this conversation:
> > https://lore.kernel.org/lkml/ZAIgrXQ4670gxlE4@arm.com/). I don't see
> > why this is needed for the base implementation. If ARM wants to add a
> > new rlimit or clone variant, I don't see why x86 can't support it
> > later.

> i think it can be added later.

> but it may be important for deployment on some platforms, since a
> libc (or other language runtime) may want to set the shadow stack
> size differently than the kernel default, because

I agree that this can be deferred to later; so long as we err on the
side of large stacks now, we can provide an additional limit without
confusing things.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-03 18:19                                           ` szabolcs.nagy
  2023-07-03 18:38                                             ` Mark Brown
@ 2023-07-03 18:49                                             ` Florian Weimer
  2023-07-04 11:33                                               ` Szabolcs Nagy
  2023-07-05 18:45                                             ` Edgecombe, Rick P
  2 siblings, 1 reply; 151+ messages in thread
From: Florian Weimer @ 2023-07-03 18:49 UTC (permalink / raw)
  To: szabolcs.nagy
  Cc: Edgecombe, Rick P, Lutomirski, Andy, Xu, Pengfei, tglx,
	linux-arch, kcc, nadav.amit, kirill.shutemov, david, Schimpe,
	Christina, akpm, peterz, corbet, nd, broonie, jannh,
	linux-kernel, x86, debug, bp, rdunlap, linux-api, rppt, jamorris,
	pavel, john.allen, bsingharora, mike.kravetz, dethoma,
	andrew.cooper3, oleg, keescook, gorcunov, arnd, Yu, Yu-cheng,
	hpa, mingo, hjl.tools, linux-mm, Syromiatnikov, Eugene, Yang,
	Weijiang, linux-doc, dave.hansen, Torvalds, Linus, Eranian,
	Stephane

* szabolcs:

>> alt shadow stack cannot be transparent to existing software anyway, it
>
> maybe not in glibc, but a libc can internally use alt shadow stack
> in sigaltstack instead of exposing a separate sigaltshadowstack api.
> (this is what a strict posix conform implementation has to do to
> support shadow stacks), leaking shadow stacks is not a correctness
> issue unless it prevents the program working (the shadow stack for
> the main thread likely wastes more memory than all the alt stack
> leaks. if the leaks become dominant in a thread the sigaltstack
> libc api can just fail).

It should be possible in theory to carve out pages from sigaltstack and
push a shadow stack page and a guard page as part of the signal frame.
As far as I understand it, the signal frame layout is not ABI, so it's
possible to hide arbitrary stuff in it.  I'm just saying that it looks
possible, not that it's a good idea.

Perhaps that's not realistic with 64K pages, though.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-03 18:49                                             ` Florian Weimer
@ 2023-07-04 11:33                                               ` Szabolcs Nagy
  0 siblings, 0 replies; 151+ messages in thread
From: Szabolcs Nagy @ 2023-07-04 11:33 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Edgecombe, Rick P, Lutomirski, Andy, Xu, Pengfei, tglx,
	linux-arch, kcc, nadav.amit, kirill.shutemov, david, Schimpe,
	Christina, akpm, peterz, corbet, nd, broonie, jannh,
	linux-kernel, x86, debug, bp, rdunlap, linux-api, rppt, jamorris,
	pavel, john.allen, bsingharora, mike.kravetz, dethoma,
	andrew.cooper3, oleg, keescook, gorcunov, arnd, Yu, Yu-cheng,
	hpa, mingo, hjl.tools, linux-mm, Syromiatnikov, Eugene, Yang,
	Weijiang, linux-doc, dave.hansen, Torvalds, Linus, Eranian,
	Stephane

The 07/03/2023 20:49, Florian Weimer wrote:
> * szabolcs:
> 
> >> alt shadow stack cannot be transparent to existing software anyway, it
> >
> > maybe not in glibc, but a libc can internally use alt shadow stack
> > in sigaltstack instead of exposing a separate sigaltshadowstack api.
> > (this is what a strict posix conform implementation has to do to
> > support shadow stacks), leaking shadow stacks is not a correctness
> > issue unless it prevents the program working (the shadow stack for
> > the main thread likely wastes more memory than all the alt stack
> > leaks. if the leaks become dominant in a thread the sigaltstack
> > libc api can just fail).
> 
> It should be possible in theory to carve out pages from sigaltstack and
> push a shadow stack page and a guard page as part of the signal frame.
> As far as I understand it, the signal frame layout is not ABI, so it's
> possible to hide arbitrary stuff in it.  I'm just saying that it looks
> possible, not that it's a good idea.
> 
> Perhaps that's not realistic with 64K pages, though.

interesting idea, but it would not work transparently:

the user expects the alt stack memory to be usable as normal
memory after longjmping out of a signal handler.

this would break code in practice, e.g. when a malloced alt
stack is passed to free(): the contract there is to not
allow changes to the underlying mapping (this affects malloc
interposition, so it is not possible to paper over inside
the libc malloc).

so signal entry cannot change the mappings of alt stack.

i think kernel internal alt shadow stack allocation works
in practice where their lifetime is the same as the thread
lifetime. it is sketchy as os interface but doing it in
userspace should be fine i think (it's policy what kind of
sigaltstack usage is allowed). the kernel is easier in the
sense that if there is actual sigreturn then the alt shadow
stack can be freed, while libc cannot catch this case (at
least not easily). leaked shadow stacks can also have
security implications but reuse of an old alt shadow stack
sounds like a minor issue in practice.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-03 18:19                                           ` szabolcs.nagy
  2023-07-03 18:38                                             ` Mark Brown
  2023-07-03 18:49                                             ` Florian Weimer
@ 2023-07-05 18:45                                             ` Edgecombe, Rick P
  2023-07-05 19:10                                               ` Mark Brown
  2023-07-06 13:07                                               ` szabolcs.nagy
  2 siblings, 2 replies; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-07-05 18:45 UTC (permalink / raw)
  To: szabolcs.nagy, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, Torvalds, Linus, peterz, corbet, nd,
	broonie, jannh, linux-kernel, debug, pavel, bp, rdunlap,
	linux-api, rppt, jamorris, arnd, john.allen, bsingharora,
	mike.kravetz, dethoma, oleg, andrew.cooper3, keescook, gorcunov,
	fweimer, Yu, Yu-cheng, hpa, x86, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, akpm, Yang, Weijiang, dave.hansen,
	linux-doc, Eranian, Stephane

On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> Could you spell out what "the issue" is that can be triggered?
> 
> i meant jumping back from the main to the alt stack:
> 
> in main:
> setup sig alt stack
> setjmp buf1
>         raise signal on first return
>         longjmp buf2 on second return
> 
> in signal handler:
> setjmp buf2
>         longjmp buf1 on first return
>         can continue after second return
> 
> in my reading of posix this is valid (and works if signals are masked
> such that the alt stack is not clobbered when jumping away from it).
> 
> but cannot work with a single shared shadow stack.

Ah, I see. To make this work seamlessly, you would need to have
automatic alt shadow stacks, and as we previously discussed this is not
possible with the existing sigaltstack API. (Or at least it seemed like
a closed discussion to me).

If there is a solution, then we are currently missing a detailed
proposal. It looks like further down you proposed leaking alt shadow
stacks (quoted up here near the related discussion):

On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> maybe not in glibc, but a libc can internally use alt shadow stack
> in sigaltstack instead of exposing a separate sigaltshadowstack api.
> (this is what a strict posix conform implementation has to do to
> support shadow stacks), leaking shadow stacks is not a correctness
> issue unless it prevents the program working (the shadow stack for
> the main thread likely wastes more memory than all the alt stack
> leaks. if the leaks become dominant in a thread the sigaltstack
> libc api can just fail).

It seems like your priority must be to make sure pure C apps don't have
to make any changes in order to not crash with shadow stack enabled,
even at the expense of performance and memory usage. Do you
have some formalized priorities or design philosophy you can share?

Earlier you suggested glibc should create new interfaces to handle
makecontext() (makes sense). Shouldn't the same thing happen here? In
which case we are in code-changes territory and we should ask ourselves
what apps really need.

> 
> > >   we
> > >   can ignore that corner case and adjust the model so the shared
> > >   shadow stack works for alt stack, but it likely does not change
> > > the
> > >   jump design: eventually we want alt shadow stack.)
> > 
> > As we discussed previously, alt shadow stack can't work
> > transparently
> > with existing code due to the sigaltstack API. I wonder if maybe
> > you
> > are trying to get at something else, and I'm not following.
> 
> i would like a jump design that works with alt shadow stack.

A shadow stack switch could happen based on the following scenarios:
 1. Alt shadow stack
 2. ucontext
 3. custom stack switching logic

If we leave a token on signal, then 1 and 2 could be guaranteed to have
a token *somewhere* above where setjmp() could have been called.

The algorithm could be to search from the target SSP up the stack until
it finds a token, and then switch to it and INCSSP back to the SSP of
the setjmp() point. This is what we are talking about, right?

And the two problems are:
 - Alt shadow stack overflow problem
 - In the case of (3) there might not be a token
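
In rough pseudocode (illustrative names only -- is_restore_token(),
rstorssp() and incssp() stand in for whatever the real mechanism would
be, this is not a proposed API), the algorithm would be something like:

```c
/* Pseudocode sketch of the token-scan longjmp described above.
 * target_ssp is the shadow stack pointer saved at setjmp time; on
 * x86 the shadow stack grows down, so newer entries sit at lower
 * addresses and "up the stack" means decreasing addresses. */
void shstk_longjmp_scan(unsigned long target_ssp)
{
	unsigned long ssp = target_ssp;

	/* 1. scan toward the top of the stack for a restore token
	 *    left by the kernel (signal) or ucontext switching code;
	 *    in case (3) there may be no token and this fails */
	while (!is_restore_token(ssp))
		ssp -= 8;

	rstorssp(ssp);			/* 2. switch SSP to the token */
	incssp((target_ssp - ssp) / 8);	/* 3. pop back to the setjmp SSP */
}
```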

Let's ignore these problems for a second - now we have a solution that
allows you to longjmp() back from an alt stack or ucontext stack. Or at
least it works functionally. But is it going to actually work for
people who are using longjmp() for things that are supposed to be fast?
Like, is this the tradeoff people want? I see some references to fiber
switching implementations using longjmp(). I wonder if the existing
INCSSP loops are not going to be ideal for every usage already, and
this sounds like going further down that road.

For jumping out occasionally in some error case, it seems it would be
useful. But I think we are then talking about targeting a subset of
people using these stack switching patterns.

Looking at the docs Mark linked (thanks!), ARM has generic GCS PUSH and
POP shadow stack instructions? Can ARM just push a restore token at
setjmp time, like I was trying to figure out earlier with a push token
arch_prctl? It would be good to understand how ARM is going to
implement this with these differences in what is allowed by the HW.

If there are differences in how locked down/functional the hardware
implementations are, and if we want to have some unified set of rules
for apps, there will need to be some give and take. The x86 approach was
mostly to not support all behaviors and ask apps to either change or
not enable shadow stacks. We don't want one architecture to have to do
a bunch of strange things, but we also don't want one to lose some key
end user value.

I'm thinking that for pure tracing users, glibc might do things a lot
differently (use of WRSS to speed things up). So I'm guessing we will
end up with at least one more "policy" on the x86 side.

I wonder if maybe we should have something like a "max compatibility"
policy/mode where arm/x86/riscv could all behave the same from the
glibc caller perspective. We could add kernel help to achieve this for
any implementation that is more locked down. And maybe that is x86's v2
ABI. I don't know, just sort of thinking out loud at this point. And
this sort of gets back to the point I keep making: if we need to decide
tradeoffs, it would be great to get some users to start using this and
start telling us what they want. Are people caring mostly about
security, compatibility or performance?

[snip]


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-05 18:45                                             ` Edgecombe, Rick P
@ 2023-07-05 19:10                                               ` Mark Brown
  2023-07-05 19:17                                                 ` Edgecombe, Rick P
  2023-07-06 13:07                                               ` szabolcs.nagy
  1 sibling, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-07-05 19:10 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: szabolcs.nagy, Lutomirski, Andy, Xu, Pengfei, tglx, kcc,
	linux-arch, nadav.amit, kirill.shutemov, david, Schimpe,
	Christina, Torvalds, Linus, peterz, corbet, nd, jannh,
	linux-kernel, debug, pavel, bp, rdunlap, linux-api, rppt,
	jamorris, arnd, john.allen, bsingharora, mike.kravetz, dethoma,
	oleg, andrew.cooper3, keescook, gorcunov, fweimer, Yu, Yu-cheng,
	hpa, x86, mingo, hjl.tools, linux-mm, Syromiatnikov, Eugene,
	akpm, Yang, Weijiang, dave.hansen, linux-doc, Eranian, Stephane

[-- Attachment #1: Type: text/plain, Size: 1548 bytes --]

On Wed, Jul 05, 2023 at 06:45:38PM +0000, Edgecombe, Rick P wrote:

> Looking at the docs Mark linked (thanks!), ARM has generic GCS PUSH and
> POP shadow stack instructions? Can ARM just push a restore token at
> setjmp time, like I was trying to figure out earlier with a push token
> arch_prctl? It would be good to understand how ARM is going to
> implement this with these differences in what is allowed by the HW.

> If there are differences in how locked down/functional the hardware
> implementations are, and if we want to have some unified set of rules
> for apps, there will need to be some give and take. The x86 approach was
> mostly to not support all behaviors and ask apps to either change or
> not enable shadow stacks. We don't want one architecture to have to do
> a bunch of strange things, but we also don't want one to lose some key
> end user value.

GCS is all or nothing, either the hardware supports GCS or it doesn't.
There are finer grained hypervisor traps (see HFGxTR_EL2 in the system
registers) but they aren't intended to be used to disable partial
functionality and there's a strong chance we'd just disable the feature
in the face of such usage.  The kernel does have the option to control
which functionality is exposed to userspace, in particular we have
separate controls for use of the GCS, the push/pop instructions and the
store instructions (similarly to the control x86 has for WRSS).
Similarly to the handling of WRSS in your series my patches allow
userspace to choose which of these features are enabled.


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-05 19:10                                               ` Mark Brown
@ 2023-07-05 19:17                                                 ` Edgecombe, Rick P
  2023-07-05 19:29                                                   ` Mark Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-07-05 19:17 UTC (permalink / raw)
  To: broonie
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	szabolcs.nagy, david, kirill.shutemov, Schimpe, Christina,
	linux-doc, peterz, corbet, nd, linux-kernel, dethoma, jannh,
	debug, mike.kravetz, bp, rdunlap, linux-api, john.allen,
	jamorris, arnd, rppt, bsingharora, x86, pavel, andrew.cooper3,
	oleg, keescook, gorcunov, fweimer, Yu, Yu-cheng, hpa, mingo,
	hjl.tools, linux-mm, Syromiatnikov, Eugene, Torvalds, Linus,
	akpm, dave.hansen, Yang, Weijiang, Eranian, Stephane

On Wed, 2023-07-05 at 20:10 +0100, Mark Brown wrote:
> On Wed, Jul 05, 2023 at 06:45:38PM +0000, Edgecombe, Rick P wrote:
> 
> > Looking at the docs Mark linked (thanks!), ARM has generic GCS PUSH
> > and
> > POP shadow stack instructions? Can ARM just push a restore token at
> > setjmp time, like I was trying to figure out earlier with a push
> > token
> > arch_prctl? It would be good to understand how ARM is going to
> > implement this with these differences in what is allowed by the HW.
> 
> > If there are differences in how locked down/functional the hardware
> > implementations are, and if we want to have some unified set of
> > rules
> > for apps, there will need to be some give and take. The x86 approach
> > was
> > mostly to not support all behaviors and ask apps to either change
> > or
> > not enable shadow stacks. We don't want one architecture to have to
> > do
> > a bunch of strange things, but we also don't want one to lose some
> > key
> > end user value.
> 
> GCS is all or nothing, either the hardware supports GCS or it
> doesn't.
> There are finer grained hypervisor traps (see HFGxTR_EL2 in the
> system
> registers) but they aren't intended to be used to disable partial
> functionality and there's a strong chance we'd just disable the
> feature
> in the face of such usage.  The kernel does have the option to
> control
> which functionality is exposed to userspace, in particular we have
> separate controls for use of the GCS, the push/pop instructions and
> the
> store instructions (similarly to the control x86 has for WRSS).
> Similarly to the handling of WRSS in your series my patches allow
> userspace to choose which of these features are enabled.

Ah, interesting, thanks for the extra info. So which features is glibc
planning to use? (probably more of a question for Szabolcs). Are push
and pop controllable separately?

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-05 19:17                                                 ` Edgecombe, Rick P
@ 2023-07-05 19:29                                                   ` Mark Brown
  2023-07-06 13:14                                                     ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-07-05 19:29 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	szabolcs.nagy, david, kirill.shutemov, Schimpe, Christina,
	linux-doc, peterz, corbet, nd, linux-kernel, dethoma, jannh,
	debug, mike.kravetz, bp, rdunlap, linux-api, john.allen,
	jamorris, arnd, rppt, bsingharora, x86, pavel, andrew.cooper3,
	oleg, keescook, gorcunov, fweimer, Yu, Yu-cheng, hpa, mingo,
	hjl.tools, linux-mm, Syromiatnikov, Eugene, Torvalds, Linus,
	akpm, dave.hansen, Yang, Weijiang, Eranian, Stephane


On Wed, Jul 05, 2023 at 07:17:25PM +0000, Edgecombe, Rick P wrote:

> Ah, interesting, thanks for the extra info. So which features is glibc
> planning to use? (probably more of a question for Szabolcs). Are push
> and pop controllable separately?

Push and pop are one control, you get both or neither.

I'll defer to Szabolcs on glibc plans.


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-05 18:45                                             ` Edgecombe, Rick P
  2023-07-05 19:10                                               ` Mark Brown
@ 2023-07-06 13:07                                               ` szabolcs.nagy
  2023-07-06 18:25                                                 ` Edgecombe, Rick P
  1 sibling, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-07-06 13:07 UTC (permalink / raw)
  To: Edgecombe, Rick P, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, Torvalds, Linus, peterz, corbet, nd,
	broonie, jannh, linux-kernel, debug, pavel, bp, rdunlap,
	linux-api, rppt, jamorris, arnd, john.allen, bsingharora,
	mike.kravetz, dethoma, oleg, andrew.cooper3, keescook, gorcunov,
	fweimer, Yu, Yu-cheng, hpa, x86, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, akpm, Yang, Weijiang, dave.hansen,
	linux-doc, Eranian, Stephane

The 07/05/2023 18:45, Edgecombe, Rick P wrote:
> On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> > Could you spell out what "the issue" is that can be triggered?
> > 
> > i meant jumping back from the main to the alt stack:
> > 
> > in main:
> > setup sig alt stack
> > setjmp buf1
> >         raise signal on first return
> >         longjmp buf2 on second return
> > 
> > in signal handler:
> > setjmp buf2
> >         longjmp buf1 on first return
> >         can continue after second return
> > 
> > in my reading of posix this is valid (and works if signals are masked
> > such that the alt stack is not clobbered when jumping away from it).
> > 
> > but cannot work with a single shared shadow stack.
> 
> Ah, I see. To make this work seamlessly, you would need to have
> automatic alt shadow stacks, and as we previously discussed this is not
> possible with the existing sigaltstack API. (Or at least it seemed like
> a closed discussion to me).
> 
> If there is a solution, then we are currently missing a detailed
> proposal. It looks like further down you proposed leaking alt shadow
> stacks (quoted up here near the related discussion):
> 
> On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> > maybe not in glibc, but a libc can internally use alt shadow stack
> > in sigaltstack instead of exposing a separate sigaltshadowstack api.
> > (this is what a strict posix conform implementation has to do to
> > support shadow stacks), leaking shadow stacks is not a correctness
> > issue unless it prevents the program working (the shadow stack for
> > the main thread likely wastes more memory than all the alt stack
> > leaks. if the leaks become dominant in a thread the sigaltstack
> > libc api can just fail).
> 
> It seems like your priority must be to make sure pure C apps don't have
> to make any changes in order to not crash with shadow stack enabled.
> And this at the expense of any performance and memory usage. Do you
> have some formalized priorities or design philosophy you can share?
> 
> Earlier you suggested glibc should create new interfaces to handle
> makecontext() (makes sense). Shouldn't the same thing happen here? In
> which case we are in code-changes territory and we should ask ourselves
> what apps really need.

instead of a priority, i'd say "posix-conforming c apps work
without change" is a benchmark i use to see if the design
is sound.

i do not have a particular workload (or distro) in mind, so
i have to reason through the cases that make sense and that
the current linux syscall abi allows, but that fail or are
difficult to support with shadow stacks.

one such case is jumping back to an alt stack (i.e. an
inactive but still-live alt stack):

- with shared shadow stack this does not work in many cases.

- with alt shadow stack this extends the lifetime beyond the
  point it becomes inactive (so it cannot be freed).

if there are no inactive live alt stacks then *both* shared
and implicit alt shadow stacks work. and to me it looked
like implicit alt shadow stack is simply the better of the
two (more alt shadow stack use-cases are supported, and
shadow stack overflow can be handled. drawback: complications
due to the discontinuous shadow stack.)

on arm64 i personally don't like the idea of "deal with alt
shadow stack later" because it likely requires a v2 abi
affecting the unwinder and jump implementations. (later
extensions are fine if they are bw compat and discoverable)

one nasty case is shadow stack overflow handling, but i
think i have a solution for that (not the nicest thing:
it involves setting the top bit on the last entry on the
shadow stack instead of adding a new entry to it. + a new
syscall that can switch to this entry. i haven't convinced
myself about this yet).

> 
> > 
> > > >   we
> > > >   can ignore that corner case and adjust the model so the shared
> > > >   shadow stack works for alt stack, but it likely does not change
> > > > the
> > > >   jump design: eventually we want alt shadow stack.)
> > > 
> > > As we discussed previously, alt shadow stack can't work
> > > transparently
> > > with existing code due to the sigaltstack API. I wonder if maybe
> > > you
> > > are trying to get at something else, and I'm not following.
> > 
> > i would like a jump design that works with alt shadow stack.
> 
> A shadow stack switch could happen based on the following scenarios:
>  1. Alt shadow stack
>  2. ucontext
>  3. custom stack switching logic
> 
> If we leave a token on signal, then 1 and 2 could be guaranteed to have
> a token *somewhere* above where setjmp() could have been called.
> 
> The algorithm could be to search from the target SSP up the stack until
> it finds a token, and then switch to it and INCSSP back to the SSP of
> the setjmp() point. This is what we are talking about, right?
> 
> And the two problems are:
>  - Alt shadow stack overflow problem
>  - In the case of (3) there might not be a token
> 
> Let's ignore these problems for a second - now we have a solution that
> allows you to longjmp() back from an alt stack or ucontext stack. Or at
> least it works functionally. But is it going to actually work for
> people who are using longjmp() for things that are supposed to be fast?

slow longjmp is bad. (well, longjmp is actually always slow
in glibc because it sets the signal mask with a syscall, but
there are other jump operations that don't do this and want
to be fast, so yes we want fast jumps to be possible).

jumping up the shadow stack is at least linear time in the
number of frames jumped over (which already sounds like a
significant slowdown; however, this is amortized by the fact
that the stack frames had to be created at some point, and
that creation is actually a lot more expensive because it
involves write operations, so a zero-cost jump gives no
asymptotic speedup over a linear-cost jump as far as i can
see).

with my proposed solution the jump is still linear. (i know
x86 incssp can jump many entries at a time and does not have
to actually read and check every entry, but technically it's
linear time too: you have to do at least one read per page to
keep the guard-page protection). this all looks fine to me
even for extreme made-up workloads.
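
as an illustration of "linear but amortized" (pseudocode, names
illustrative; incssp() stands for the x86 instruction, whose count
operand is limited to 8 bits and which touches the shadow stack as it
advances, so chunks smaller than a page cannot skip a guard page):

```c
/* pseudocode: pop n shadow stack entries in chunks.  each incssp
 * reads the shadow stack, and 255 entries (2040 bytes) is less
 * than a page, so a guard page always faults before being skipped. */
while (n > 0) {
	unsigned k = n < 255 ? n : 255;	/* incssp count is 8 bits */
	incssp(k);
	n -= k;
}
```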

> Like, is this the tradeoff people want? I see some references to fiber
> switching implementations using longjmp(). I wonder if the existing
> INCSSP loops are not going to be ideal for every usage already, and
> this sounds like going further down that road.
> 
> For jumping out occasionally in some error case, it seems it would be
> useful. But I think we are then talking about targeting a subset of
> people using these stack switching patterns.
> 

i simply don't see any trade-off here (i expect no measurable
difference between a scanning and a non-scanning approach,
even in a microbenchmark that does longjmp in a loop,
independently of the stack-switch pattern and even if the
non-scanning implementation can use wrss).

> Looking at the docs Mark linked (thanks!), ARM has generic GCS PUSH and
> POP shadow stack instructions? Can ARM just push a restore token at
> setjmp time, like I was trying to figure out earlier with a push token
> arch_prctl? It would be good to understand how ARM is going to
> implement this with these differences in what is allowed by the HW.
> 
> If there are differences in how locked down/functional the hardware
> implementations are, and if we want to have some unified set of rules
> for apps, there will need to be some give and take. The x86 approach was
> mostly to not support all behaviors and ask apps to either change or
> not enable shadow stacks. We don't want one architecture to have to do
> a bunch of strange things, but we also don't want one to lose some key
> end user value.
> 
> I'm thinking that for pure tracing users, glibc might do things a lot
> differently (use of WRSS to speed things up). So I'm guessing we will
> end up with at least one more "policy" on the x86 side.
> 
> I wonder if maybe we should have something like a "max compatibility"
> policy/mode where arm/x86/riscv could all behave the same from the
> glibc caller perspective. We could add kernel help to achieve this for
> any implementation that is more locked down. And maybe that is x86's v2
> ABI. I don't know, just sort of thinking out loud at this point. And
> this sort of gets back to the point I keep making: if we need to decide
> tradeoffs, it would be great to get some users to start using this and
> start telling us what they want. Are people caring mostly about
> security, compatibility or performance?
> 
> [snip]
> 

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-05 19:29                                                   ` Mark Brown
@ 2023-07-06 13:14                                                     ` szabolcs.nagy
  2023-07-06 14:24                                                       ` Mark Brown
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-07-06 13:14 UTC (permalink / raw)
  To: Mark Brown, Edgecombe, Rick P
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	david, kirill.shutemov, Schimpe, Christina, linux-doc, peterz,
	corbet, nd, linux-kernel, dethoma, jannh, debug, mike.kravetz,
	bp, rdunlap, linux-api, john.allen, jamorris, arnd, rppt,
	bsingharora, x86, pavel, andrew.cooper3, oleg, keescook,
	gorcunov, fweimer, Yu, Yu-cheng, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, Torvalds, Linus, akpm, dave.hansen, Yang,
	Weijiang, Eranian, Stephane

The 07/05/2023 20:29, Mark Brown wrote:
> On Wed, Jul 05, 2023 at 07:17:25PM +0000, Edgecombe, Rick P wrote:
> 
> > Ah, interesting, thanks for the extra info. So which features is glibc
> > planning to use? (probably more of a question for Szabolcs). Are push
> > and pop controllable separately?
> 
> Push and pop are one control, you get both or neither.
> 
> I'll defer to Szabolcs on glibc plans.

gcspopm is always available (essentially *ssp++; this is used
for longjmp).

i haven't planned anything yet for the other modes (i don't
know of anything where a writable shadow stack is better than
just turning the feature off, so i expect we will at most have
a glibc tunable env var to enable it, but it will not affect
glibc behaviour otherwise).

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-06 13:14                                                     ` szabolcs.nagy
@ 2023-07-06 14:24                                                       ` Mark Brown
  2023-07-06 16:59                                                         ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-07-06 14:24 UTC (permalink / raw)
  To: szabolcs.nagy
  Cc: Edgecombe, Rick P, Xu, Pengfei, tglx, linux-arch, kcc,
	Lutomirski, Andy, nadav.amit, david, kirill.shutemov, Schimpe,
	Christina, linux-doc, peterz, corbet, nd, linux-kernel, dethoma,
	jannh, debug, mike.kravetz, bp, rdunlap, linux-api, john.allen,
	jamorris, arnd, rppt, bsingharora, x86, pavel, andrew.cooper3,
	oleg, keescook, gorcunov, fweimer, Yu, Yu-cheng, hpa, mingo,
	hjl.tools, linux-mm, Syromiatnikov, Eugene, Torvalds, Linus,
	akpm, dave.hansen, Yang, Weijiang, Eranian, Stephane


On Thu, Jul 06, 2023 at 02:14:40PM +0100, szabolcs.nagy@arm.com wrote:
> The 07/05/2023 20:29, Mark Brown wrote:

> > Push and pop are one control, you get both or neither.

> gcspopm is always available (essentially *ssp++, this is used
> for longjmp).

Ah, sorry - I misremembered there.  You're right, it's only push that we
have control over.


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-06 14:24                                                       ` Mark Brown
@ 2023-07-06 16:59                                                         ` Edgecombe, Rick P
  2023-07-06 19:03                                                           ` Mark Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-07-06 16:59 UTC (permalink / raw)
  To: broonie, szabolcs.nagy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski, Andy, nadav.amit,
	kirill.shutemov, david, Schimpe, Christina, Yang, Weijiang,
	peterz, corbet, nd, dethoma, jannh, linux-kernel, debug, pavel,
	bp, mike.kravetz, linux-api, rppt, jamorris, arnd, john.allen,
	rdunlap, x86, oleg, andrew.cooper3, keescook, bsingharora,
	gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo, hjl.tools, linux-mm,
	Syromiatnikov, Eugene, linux-doc, Torvalds, Linus, dave.hansen,
	akpm, Eranian, Stephane

On Thu, 2023-07-06 at 15:24 +0100, Mark Brown wrote:
> On Thu, Jul 06, 2023 at 02:14:40PM +0100,
> szabolcs.nagy@arm.com wrote:
> > The 07/05/2023 20:29, Mark Brown wrote:
> 
> > > Push and pop are one control, you get both or neither.
> 
> > gcspopm is always available (essentially *ssp++, this is used
> > for longjmp).
> 
> Ah, sorry - I misremembered there.  You're right, it's only push that
> we
> have control over.

Ah, ok! So if you are not planning to enable the push mode then the
features are pretty well aligned, except:
 - On x86 it is possible to switch stacks without leaving a token 
   behind.
 - The GCSPOPM/INCSSP looping may require longer loops on ARM 
   because it only pops one at a time.

If you are not going to use GCSPUSHM by default, then I think we
*should* be able to have some unified set of rules for developers for
glibc behaviors at least.


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-06 13:07                                               ` szabolcs.nagy
@ 2023-07-06 18:25                                                 ` Edgecombe, Rick P
  2023-07-07 15:25                                                   ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-07-06 18:25 UTC (permalink / raw)
  To: szabolcs.nagy, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, linux-doc, peterz, corbet, nd,
	broonie, dethoma, linux-kernel, debug, mike.kravetz, bp, rdunlap,
	linux-api, rppt, jamorris, arnd, john.allen, bsingharora, x86,
	pavel, oleg, andrew.cooper3, keescook, gorcunov, fweimer, jannh,
	Yu, Yu-cheng, hpa, mingo, hjl.tools, linux-mm, Syromiatnikov,
	Eugene, Torvalds, Linus, akpm, dave.hansen, Yang, Weijiang,
	Eranian, Stephane

On Thu, 2023-07-06 at 14:07 +0100, szabolcs.nagy@arm.com wrote:

[ snip ]
> 
> instead of priority, i'd say "posix conform c apps work
> without change" is a benchmark i use to see if the design
> is sound.

This involves leaking shadow stacks for sigaltstack and makecontext,
though, right? This seems kind of wrong. It might be useful for
avoiding crashes at all costs, but probably shouldn't be the long term
solution. I thought your API updates were the right direction.

But, I'm of course not a glibc developer. HJ and friends would have to
agree to all of that.

> 
> i do not have a particular workload (or distro) in mind, so
> i have to reason through the cases that make sense and the
> current linux syscall abi allows, but fail or difficult to
> support with shadow stacks.
> 
> one such case is jumping back to an alt stack (i.e. inactive
> live alt stack):
> 
> - with shared shadow stack this does not work in many cases.
> 
> - with alt shadow stack this extends the lifetime beyond the
>   point it becomes inactive (so it cannot be freed).
> 
> if there are no inactive live alt stacks then *both* shared
> and implicit alt shadow stack works. and to me it looked
> like implicit alt shadow stack is simply the better of the two
> (more alt shadow stack use-cases are supported, shadow stack
> overflow can be handled. drawback: complications due to the
> discontinuous shadow stack.)
> 
> on arm64 i personally don't like the idea of "deal with alt
> shadow stack later" because it likely requires a v2 abi
> affecting the unwinder and jump implementations. (later
> extensions are fine if they are bw compat and discoverable)

I think you could do it, if your signal handler can push data on the
shadow stack like x86 does. I'd start with a padded shadow stack signal
frame though, and not trust userspace to parse it.

> 
> one nasty case is shadow stack overflow handling, but i
> think i have a solution for that (not the nicest thing:
> it involves setting the top bit on the last entry on the
> shadow stack instead of adding a new entry to it. + a new
> syscall that can switch to this entry. i haven't convinced
> myself about this yet).

There might be some complicated thing around storing the last shadow
stack entry into the shadow stack sigframe and restoring it on
sigreturn. Then writing a token from the kernel to where the saved
frame was to live there in the meantime.

But to me this whole search, restore and INCSSP thing is suspect at
this point though. We could also increase compatibility and performance
more simply, by adding kernel help, at the expense of security.


[ snip ]

> slow longjmp is bad. (well longjmp is actually always slow
> in glibc because it sets the signalmask with a syscall, but
> there are other jump operations that don't do this and want
> to be fast so yes we want fast jump to be possible).
> 
> jumping up the shadow stack is at least linear time in the
> number of frames jumped over (which already sounds significant
> slowdown however this is amortized by the fact that the stack
> frames had to be created at some point and that is actually a
> lot more expensive because it involves write operations, so a
> zero cost jump will not do any asymptotic speedup compared to
> a linear cost jump as far as i can see.).
> 
> with my proposed solution the jump is still linear. (i know
> x86 incssp can jump many entries at a time and does not have
> to actually read and check the entries, but technically it's
> linear time too: you have to do at least one read per page to
> have the guardpage protection). this all looks fine to me
> even for extreme made up workloads.

Well I guess we are talking about hypothetical performance. But linear
time is still worse than O(1). And I thought longjmp() was supposed to
be an O(1) type thing.


Separate from all of this...now that all the constraints are clearer,
if you have changed your mind on whether this series is ready, could
you comment at the top of this thread something to that effect? I'm
imagining not many are reading so far down at this point.

For my part, I think we should go forward with what we have on the
kernel side, unless glibc/gcc developers would like to start by
deprecating the existing binaries. I just talked with HJ, and he has
not changed his plans around this. If anyone else in that community has
(Florian?), please speak up. But otherwise I think it's better to start
getting real world feedback and grow based on that.


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-06 16:59                                                         ` Edgecombe, Rick P
@ 2023-07-06 19:03                                                           ` Mark Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-07-06 19:03 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: szabolcs.nagy, Xu, Pengfei, tglx, linux-arch, kcc, Lutomirski,
	Andy, nadav.amit, kirill.shutemov, david, Schimpe, Christina,
	Yang, Weijiang, peterz, corbet, nd, dethoma, jannh, linux-kernel,
	debug, pavel, bp, mike.kravetz, linux-api, rppt, jamorris, arnd,
	john.allen, rdunlap, x86, oleg, andrew.cooper3, keescook,
	bsingharora, gorcunov, Yu, Yu-cheng, fweimer, hpa, mingo,
	hjl.tools, linux-mm, Syromiatnikov, Eugene, linux-doc, Torvalds,
	Linus, dave.hansen, akpm, Eranian, Stephane

[-- Attachment #1: Type: text/plain, Size: 1538 bytes --]

On Thu, Jul 06, 2023 at 04:59:45PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2023-07-06 at 15:24 +0100, Mark Brown wrote:
> > szabolcs.nagy@arm.com wrote:
> > > The 07/05/2023 20:29, Mark Brown wrote:

> > > gcspopm is always available (essentially *ssp++, this is used
> > > for longjmp).

> > Ah, sorry - I misremembered there.  You're right, it's only push that
> > we
> > have control over.

FWIW the confusion there was due to some of the hypervisor features which do
tie some of the push and pop instructions together.

> Ah, ok! So if you are not planning to enable the push mode then the
> features are pretty well aligned, except:
>  - On x86 it is possible to switch stacks without leaving a token 
>    behind.
>  - The GCSPOPM/INCSSP looping may require longer loops on ARM 
>    because it only pops one at a time.

> If you are not going to use GCSPUSHM by default, then I think we
> *should* be able to have some unified set of rules for developers for
> glibc behaviors at least.

Yes, the only case where I am aware of consciously diverging in any
substantial way is that we do not free the GCS when GCS is disabled by
userspace, we just disable the updates and checks, and reenabling after
disabling is not supported.  We have demand for disabling at runtime so
we want to keep the stack around for things like a running unwinder but
we don't see a practical use for reenabling so didn't worry about
figuring out what would make sense for userspace.  glibc isn't going to
be using that though.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* [PATCH] x86/shstk: Move arch detail comment out of core mm
  2023-06-26 12:45             ` Mark Brown
@ 2023-07-06 23:32               ` Rick Edgecombe
  2023-07-07 15:08                 ` Mark Brown
  2023-08-01 16:52                 ` Mike Rapoport
  0 siblings, 2 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-07-06 23:32 UTC (permalink / raw)
  To: broonie
  Cc: akpm, andrew.cooper3, arnd, bp, bsingharora, christina.schimpe,
	corbet, dave.hansen, david, debug, dethoma, eranian, esyr,
	fweimer, gorcunov, hjl.tools, hpa, jamorris, jannh, john.allen,
	kcc, keescook, kirill.shutemov, linux-api, linux-arch, linux-doc,
	linux-kernel, linux-mm, luto, mike.kravetz, mingo, nadav.amit,
	oleg, pavel, pengfei.xu, peterz, rdunlap, rick.p.edgecombe, rppt,
	szabolcs.nagy, tglx, torvalds, weijiang.yang, willy, x86,
	yu-cheng.yu

The comment around VM_SHADOW_STACK in mm.h refers to a lot of
x86-specific details that don't belong in a cross-arch file. Move these
out of core mm, and just leave the non-arch details.

Since the comment includes some useful details that would be good to
retain in the source somewhere, put the arch-specific parts in
arch/x86/kernel/shstk.c near alloc_shstk(), where memory of this type is
allocated. Include a reference to the existence of the x86 details near
the VM_SHADOW_STACK definition in mm.h.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/shstk.c | 25 +++++++++++++++++++++++++
 include/linux/mm.h      | 32 ++++++--------------------------
 2 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index b26810c7cd1c..47f5204b0fa9 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -72,6 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
 	return 0;
 }
 
+/*
+ * VM_SHADOW_STACK will have a guard page. This helps userspace protect
+ * itself from attacks. The reasoning is as follows:
+ *
+ * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The
+ * INCSSP instruction can increment the shadow stack pointer. It is the
+ * shadow stack analog of an instruction like:
+ *
+ *   addq $0x80, %rsp
+ *
+ * However, there is one important difference between an ADD on %rsp
+ * and INCSSP. In addition to modifying SSP, INCSSP also reads from the
+ * memory of the first and last elements that were "popped". It can be
+ * thought of as acting like this:
+ *
+ * READ_ONCE(ssp);       // read+discard top element on stack
+ * ssp += nr_to_pop * 8; // move the shadow stack
+ * READ_ONCE(ssp-8);     // read+discard last popped stack element
+ *
+ * The maximum distance INCSSP can move the SSP is 2040 bytes, before
+ * it would read the memory. Therefore a single page gap will be enough
+ * to prevent any operation from shifting the SSP to an adjacent stack,
+ * since it would have to land in the gap at least once, causing a
+ * fault.
+ */
 static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
 				 unsigned long token_offset, bool set_res_tok)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 535c58d3b2e4..b647cf2e94ea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -343,33 +343,13 @@ extern unsigned int kobjsize(const void *objp);
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
 /*
- * This flag should not be set with VM_SHARED because of lack of support
- * core mm. It will also get a guard page. This helps userspace protect
- * itself from attacks. The reasoning is as follows:
+ * VM_SHADOW_STACK should not be set with VM_SHARED because of lack of
+ * support in core mm.
  *
- * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The
- * INCSSP instruction can increment the shadow stack pointer. It is the
- * shadow stack analog of an instruction like:
- *
- *   addq $0x80, %rsp
- *
- * However, there is one important difference between an ADD on %rsp
- * and INCSSP. In addition to modifying SSP, INCSSP also reads from the
- * memory of the first and last elements that were "popped". It can be
- * thought of as acting like this:
- *
- * READ_ONCE(ssp);       // read+discard top element on stack
- * ssp += nr_to_pop * 8; // move the shadow stack
- * READ_ONCE(ssp-8);     // read+discard last popped stack element
- *
- * The maximum distance INCSSP can move the SSP is 2040 bytes, before
- * it would read the memory. Therefore a single page gap will be enough
- * to prevent any operation from shifting the SSP to an adjacent stack,
- * since it would have to land in the gap at least once, causing a
- * fault.
- *
- * Prevent using INCSSP to move the SSP between shadow stacks by
- * having a PAGE_SIZE guard gap.
+ * These VMAs will get a single end guard page. This helps userspace protect
+ * itself from attacks. A single page is enough for current shadow stack archs
+ * (x86). See the comments near alloc_shstk() in arch/x86/kernel/shstk.c
+ * for more details on the guard size.
  */
 # define VM_SHADOW_STACK	VM_HIGH_ARCH_5
 #else
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* [PATCH] x86/shstk: Don't retry vm_munmap() on -EINTR
  2023-06-28  0:37       ` Edgecombe, Rick P
@ 2023-07-06 23:38         ` Rick Edgecombe
  0 siblings, 0 replies; 151+ messages in thread
From: Rick Edgecombe @ 2023-07-06 23:38 UTC (permalink / raw)
  To: rick.p.edgecombe
  Cc: akpm, andrew.cooper3, arnd, bp, broonie, bsingharora,
	christina.schimpe, corbet, dave.hansen, dave.hansen, david,
	debug, dethoma, eranian, esyr, fweimer, gorcunov, hjl.tools, hpa,
	jamorris, jannh, john.allen, kcc, keescook, kirill.shutemov,
	linux-api, linux-arch, linux-doc, linux-kernel, linux-mm, luto,
	mike.kravetz, mingo, nadav.amit, oleg, pavel, pengfei.xu, peterz,
	rdunlap, rppt, szabolcs.nagy, tglx, torvalds, weijiang.yang, x86,
	yu-cheng.yu

The existing comment around handling vm_munmap() failure when freeing a
shadow stack is wrong. It asserts that vm_munmap() returns -EINTR when
the mmap lock is only being held for a short time, and so the caller
should retry. Based on this wrong understanding, unmap_shadow_stack() will
loop retrying vm_munmap().

What -EINTR actually means in this case is that the process is going
away (see ae79878), and the whole MM will be torn down soon. In order
to facilitate this, the task should not linger in the kernel retrying,
but should exit as quickly as possible. So don't loop in this scenario;
just abandon the operation and let exit_mmap() clean it up. Also, update
the comment
to reflect the actual meaning of the error code.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/shstk.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 47f5204b0fa9..cd10d074a444 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -134,28 +134,24 @@ static unsigned long adjust_shstk_size(unsigned long size)
 
 static void unmap_shadow_stack(u64 base, u64 size)
 {
-	while (1) {
-		int r;
+	int r;
 
-		r = vm_munmap(base, size);
+	r = vm_munmap(base, size);
 
-		/*
-		 * vm_munmap() returns -EINTR when mmap_lock is held by
-		 * something else, and that lock should not be held for a
-		 * long time.  Retry it for the case.
-		 */
-		if (r == -EINTR) {
-			cond_resched();
-			continue;
-		}
+	/*
+	 * mmap_write_lock_killable() failed with -EINTR. This means
+	 * the process is about to die and have its MM cleaned up.
+	 * This task shouldn't ever make it back to userspace. In this
+	 * case it is ok to leak a shadow stack, so just exit out.
+	 */
+	if (r == -EINTR)
+		return;
 
-		/*
-		 * For all other types of vm_munmap() failure, either the
-		 * system is out of memory or there is bug.
-		 */
-		WARN_ON_ONCE(r);
-		break;
-	}
+	/*
+	 * For all other types of vm_munmap() failure, either the
+	 * system is out of memory or there is a bug.
+	 */
+	WARN_ON_ONCE(r);
 }
 
 static int shstk_setup(void)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* Re: [PATCH] x86/shstk: Move arch detail comment out of core mm
  2023-07-06 23:32               ` [PATCH] x86/shstk: Move arch detail comment out of core mm Rick Edgecombe
@ 2023-07-07 15:08                 ` Mark Brown
  2023-08-01 16:52                 ` Mike Rapoport
  1 sibling, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-07-07 15:08 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: akpm, andrew.cooper3, arnd, bp, bsingharora, christina.schimpe,
	corbet, dave.hansen, david, debug, dethoma, eranian, esyr,
	fweimer, gorcunov, hjl.tools, hpa, jamorris, jannh, john.allen,
	kcc, keescook, kirill.shutemov, linux-api, linux-arch, linux-doc,
	linux-kernel, linux-mm, luto, mike.kravetz, mingo, nadav.amit,
	oleg, pavel, pengfei.xu, peterz, rdunlap, rppt, szabolcs.nagy,
	tglx, torvalds, weijiang.yang, willy, x86, yu-cheng.yu

[-- Attachment #1: Type: text/plain, Size: 638 bytes --]

On Thu, Jul 06, 2023 at 04:32:48PM -0700, Rick Edgecombe wrote:
> The comment around VM_SHADOW_STACK in mm.h refers to a lot of
> x86-specific details that don't belong in a cross-arch file. Move these
> out of core mm, and just leave the non-arch details.
> 
> Since the comment includes some useful details that would be good to
> retain in the source somewhere, put the arch-specific parts in
> arch/x86/kernel/shstk.c near alloc_shstk(), where memory of this type is
> allocated. Include a reference to the existence of the x86 details near
> the VM_SHADOW_STACK definition in mm.h.

Reviewed-by: Mark Brown <broonie@kernel.org>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-06 18:25                                                 ` Edgecombe, Rick P
@ 2023-07-07 15:25                                                   ` szabolcs.nagy
  2023-07-07 17:37                                                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-07-07 15:25 UTC (permalink / raw)
  To: Edgecombe, Rick P, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, kcc, linux-arch, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, linux-doc, peterz, corbet, nd,
	broonie, dethoma, linux-kernel, debug, mike.kravetz, bp, rdunlap,
	linux-api, rppt, jamorris, arnd, john.allen, bsingharora, x86,
	pavel, oleg, andrew.cooper3, keescook, gorcunov, fweimer, jannh,
	Yu, Yu-cheng, hpa, mingo, hjl.tools, linux-mm, Syromiatnikov,
	Eugene, Torvalds, Linus, akpm, dave.hansen, Yang, Weijiang,
	Eranian, Stephane

The 07/06/2023 18:25, Edgecombe, Rick P wrote:
> On Thu, 2023-07-06 at 14:07 +0100, szabolcs.nagy@arm.com wrote:
> 
> [ snip ]
> > 
> > instead of priority, i'd say "posix conform c apps work
> > without change" is a benchmark i use to see if the design
> > is sound.
> 
> This involves leaking shadow stacks for sigaltstack and makecontext,
> though, right? This seems kind of wrong. It might be useful for
> avoiding crashes at all costs, but probably shouldn't be the long term
> solution. I thought your API updates were the right direction.

new apis are not enough.

existing apis either must do something reasonable or shstk must
be disabled when that api is used (sigaltstack, makecontext).

the disable does not really work as the apis are widely used
and there is no 'disable shstk locally': it is viral when a
widely used dependency is affected.

so apis must do something reasonable. however there will be
remaining issues and that will need new apis which can take
a long time to transition to.

> > one nasty case is shadow stack overflow handling, but i
> > think i have a solution for that (not the nicest thing:
> > it involves setting the top bit on the last entry on the
> > shadow stack instead of adding a new entry to it. + a new
> > syscall that can switch to this entry. i haven't convinced
> > myself about this yet).
> 
> There might be some complicated thing around storing the last shadow
> stack entry into the shadow stack sigframe and restoring it on
> sigreturn. Then writing a token from the kernel to where the saved
> frame was to live there in the meantime.

this only works if you jump from the alt stack to the overflowed
stack, not if you jump somewhere else in between.

this would be nice to solve but it is not the most important case.

> 
> But to me this whole search, restore and INCSSP thing is suspect at
> this point though. We could also increase compatibility and performance
> more simply, by adding kernel help, at the expense of security.

what is the kernel help? (and security trade-off)

and why is scan+restore+incssp suspect?

the reasons i've seen are

- some ppl might not add restore token: sounds fine, it's already
  ub to jump to such stack either way.

- it is slow: please give an example where there is slowdown.
  (the slowdown has to be larger than the call/ret overhead)

- jump to overflowed shadow stack: i'm fairly sure this can be done
  (but indeed complicated), and if that's not acceptable then not
  supporting this case is better than not supporting reliable
  crash handling (alt stack handler can overflow the shadow stack
  and shadow stack overflow cannot be handled).

> > slow longjmp is bad. (well longjmp is actually always slow
> > in glibc because it sets the signalmask with a syscall, but
> > there are other jump operations that don't do this and want
> > to be fast so yes we want fast jump to be possible).
> > 
> > jumping up the shadow stack is at least linear time in the
> > number of frames jumped over (which already sounds significant
> > slowdown however this is amortized by the fact that the stack
> > frames had to be created at some point and that is actually a
> > lot more expensive because it involves write operations, so a
> > zero cost jump will not do any asymptotic speedup compared to
> > a linear cost jump as far as i can see.).
> > 
> > with my proposed solution the jump is still linear. (i know
> > x86 incssp can jump many entries at a time and does not have
> > to actually read and check the entries, but technically it's
> > linear time too: you have to do at least one read per page to
> > have the guardpage protection). this all looks fine to me
> > even for extreme made up workloads.
> 
> Well I guess we are talking about hypothetical performance. But linear
> time is still worse than O(1). And I thought longjmp() was supposed to
> be an O(1) type thing.

longjmp is not O(1) with your proposed abi.

and i don't think linear time is worse than O(1) in this case.

> Separate from all of this...now that all the constraints are clearer,
> if you have changed your mind on whether this series is ready, could
> you comment at the top of this thread something to that effect? I'm
> imagining not many are reading so far down at this point.
> 
> For my part, I think we should go forward with what we have on the
> kernel side, unless glibc/gcc developers would like to start by
> deprecating the existing binaries. I just talked with HJ, and he has
> not changed his plans around this. If anyone else in that community has
> (Florian?), please speak up. But otherwise I think it's better to start
> getting real world feedback and grow based on that.
> 

the x86 v1 abi tries to be compatible with existing unwinders.
(are there other binaries that constrain v1? portable code
should be fine as they rely on libc which we can still change)

i will have to discuss the arm plan with the arm kernel devs.
the ugly bit i want to avoid on arm is to have to reimplement
unwind and jump ops to make alt shadow stack work in a v2 abi.

i think the worst bit of the x86 v1 abi is that crash handlers
don't work reliably (e.g. a crash on a tiny makecontext stack
with the usual sigaltstack crash handler can unrecoverably fail
during crash handling). i guess this can be somewhat mitigated
by both linux and libc adding an extra page to the shadow stack
size to guarantee that alt stack handlers with certain depth
always work.


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-07 15:25                                                   ` szabolcs.nagy
@ 2023-07-07 17:37                                                     ` Edgecombe, Rick P
  2023-07-10 16:54                                                       ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-07-07 17:37 UTC (permalink / raw)
  To: szabolcs.nagy, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, Yang, Weijiang, peterz, corbet, nd,
	dethoma, broonie, linux-kernel, x86, pavel, bp, debug, linux-api,
	rppt, jamorris, arnd, john.allen, rdunlap, mike.kravetz, jannh,
	oleg, andrew.cooper3, keescook, gorcunov, fweimer, Yu, Yu-cheng,
	bsingharora, hpa, mingo, hjl.tools, linux-mm, Syromiatnikov,
	Eugene, linux-doc, Torvalds, Linus, dave.hansen, akpm, Eranian,
	Stephane

On Fri, 2023-07-07 at 16:25 +0100, szabolcs.nagy@arm.com wrote:
> > Separate from all of this...now that all the constraints are
> > clearer,
> > if you have changed your mind on whether this series is ready,
> > could
> > you comment at the top of this thread something to that effect? I'm
> > imagining not many are reading so far down at this point.
> > 
> > For my part, I think we should go forward with what we have on the
> > kernel side, unless glibc/gcc developers would like to start by
> > deprecating the existing binaries. I just talked with HJ, and he
> > has
> > not changed his plans around this. If anyone else in that community
> > has
> > (Florian?), please speak up. But otherwise I think it's better to
> > start
> > getting real world feedback and grow based on that.
> > 
> 
> the x86 v1 abi tries to be compatible with existing unwinders.
> (are there other binaries that constrain v1? portable code
> should be fine as they rely on libc which we can still change)
> 
> i will have to discuss the arm plan with the arm kernel devs.
> the ugly bit i want to avoid on arm is to have to reimplement
> unwind and jump ops to make alt shadow stack work in a v2 abi.
> 
> i think the worst bit of the x86 v1 abi is that crash handlers
> don't work reliably (e.g. a crash on a tiny makecontext stack
> with the usual sigaltstack crash handler can unrecoverably fail
> during crash handling). i guess this can be somewhat mitigated
> by both linux and libc adding an extra page to the shadow stack
> size to guarantee that alt stack handlers with certain depth
> always work.

Some mails back, I listed the three things you might be asking for from
the kernel side and pointedly asked you to clarify. The only one you
still were wishing for up front was "Leave a token on switching to an
alt shadow stack."

But how you want to use this involves a lot of details for how glibc
will work (automatic shadow stack for sigaltstack, scan-restore-incssp,
etc). I think you first need to get the story straight with other libc
developers, otherwise this is just brainstorming. I'm not a glibc
contributor, so winning me over is only half the battle.

Only after that is settled do we get to the problem of the old libgcc
unwinders, and how it is a challenge to even add alt shadow stack given
glibc's plans and the existing binaries.

Once that is solved we are at the overflow problem, and the current
state of thinking on that is "i'm fairly sure this can be done (but
indeed complicated)".

So I think we are still missing any actionable requests that should
hold this up.

Is this a reasonable summary?

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-07 17:37                                                     ` Edgecombe, Rick P
@ 2023-07-10 16:54                                                       ` szabolcs.nagy
  2023-07-10 22:56                                                         ` Edgecombe, Rick P
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-07-10 16:54 UTC (permalink / raw)
  To: Edgecombe, Rick P, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, Yang, Weijiang, peterz, corbet, nd,
	dethoma, broonie, linux-kernel, x86, pavel, bp, debug, linux-api,
	rppt, jamorris, arnd, john.allen, rdunlap, mike.kravetz, jannh,
	oleg, andrew.cooper3, keescook, gorcunov, fweimer, Yu, Yu-cheng,
	bsingharora, hpa, mingo, hjl.tools, linux-mm, Syromiatnikov,
	Eugene, linux-doc, Torvalds, Linus, dave.hansen, akpm, Eranian,
	Stephane

The 07/07/2023 17:37, Edgecombe, Rick P wrote:
> On Fri, 2023-07-07 at 16:25 +0100, szabolcs.nagy@arm.com wrote:
> > > Separate from all of this...now that all the constraints are
> > > clearer,
> > > if you have changed your mind on whether this series is ready,
> > > could
> > > you comment at the top of this thread something to that effect? I'm
> > > imagining not many are reading so far down at this point.
> > > 
> > > For my part, I think we should go forward with what we have on the
> > > kernel side, unless glibc/gcc developers would like to start by
> > > deprecating the existing binaries. I just talked with HJ, and he
> > > has
> > > not changed his plans around this. If anyone else in that community
> > > has
> > > (Florian?), please speak up. But otherwise I think it's better to
> > > start
> > > getting real world feedback and grow based on that.
> > > 
> > 
> > the x86 v1 abi tries to be compatible with existing unwinders.
> > (are there other binaries that constrain v1? portable code
> > should be fine as they rely on libc which we can still change)
> > 
> > i will have to discuss the arm plan with the arm kernel devs.
> > the ugly bit i want to avoid on arm is to have to reimplement
> > unwind and jump ops to make alt shadow stack work in a v2 abi.
> > 
> > i think the worst bit of the x86 v1 abi is that crash handlers
> > don't work reliably (e.g. a crash on a tiny makecontext stack
> > with the usual sigaltstack crash handler can unrecoverably fail
> > during crash handling). i guess this can be somewhat mitigated
> > by both linux and libc adding an extra page to the shadow stack
> > size to guarantee that alt stack handlers with certain depth
> > always work.
> 
> Some mails back, I listed the three things you might be asking for from
> the kernel side and pointedly asked you to clarify. The only one you
> still were wishing for up front was "Leave a token on switching to an
> alt shadow stack."
> 
> But how you want to use this involves a lot of details for how glibc
> will work (automatic shadow stack for sigaltstack, scan-restore-incssp,
> etc). I think you first need to get the story straight with other libc
> developers, otherwise this is just brainstorming. I'm not a glibc
> contributor, so winning me over is only half the battle.
> 
> Only after that is settled do we get to the problem of the old libgcc
> unwinders, and how it is a challenge to even add alt shadow stack given
> glibc's plans and the existing binaries.
> 
> Once that is solved we are at the overflow problem, and the current
> state of thinking on that is "i'm fairly sure this can be done (but
> indeed complicated)".
> 
> So I think we are still missing any actionable requests that should
> hold this up.
> 
> Is this a reasonable summary?

not entirely.

the high level requirement is a design that

a) does not break many existing sigaltstack uses,

b) allows implementing jump and unwind that support the
   relevant use-cases around signals and stack switches
   with minimal userspace changes.

where (b) has nothing to add to v1 abi: existing unwind
binaries mean this needs a v2 abi. (the point of discussing
v2 ahead of time is to understand the cost of v2 and the
divergence wrt targets without abi compat issue.)

for (a) my actionable suggestion was to account altstack
when sizing shadow stacks. to document an altstack call
depth limit on the libc level (e.g. fixed 100 is fine) we
need guarantees from the kernel. (consider recursive calls
overflowing the stack with altstack crash handler: for this
to be reliable shadow stack size > stack size is needed.
but the diff can be tiny e.g. 1 page is enough.)

your previous 3 actionable item list was

1. add token when handling signals on altstack.

this falls under (b). your summary is correct that this
requires sorting out many fiddly details.

2. top of stack token.

this can work differently across targets so i have nothing
against the x86 v1 abi, but on arm64 we plan to have this.

3. more shadow stack sizing policies.

this can be done in the future except the default policy
should be fixed for (a) and a smaller size introduces the
overflow issue which may require v2.

in short the only important change for v1 is shstk sizing.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-10 16:54                                                       ` szabolcs.nagy
@ 2023-07-10 22:56                                                         ` Edgecombe, Rick P
  2023-07-11  8:08                                                           ` szabolcs.nagy
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-07-10 22:56 UTC (permalink / raw)
  To: szabolcs.nagy, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, akpm, peterz, corbet, nd, broonie,
	dethoma, linux-kernel, x86, pavel, bp, debug, linux-api, rppt,
	john.allen, jamorris, rdunlap, mike.kravetz, jannh, oleg,
	andrew.cooper3, keescook, gorcunov, arnd, Yu, Yu-cheng, fweimer,
	hpa, mingo, hjl.tools, bsingharora, linux-mm, Syromiatnikov,
	Eugene, Yang, Weijiang, linux-doc, dave.hansen, Torvalds, Linus,
	Eranian, Stephane

On Mon, 2023-07-10 at 17:54 +0100, szabolcs.nagy@arm.com wrote:
> > Some mails back, I listed the three things you might be asking for
> > from
> > the kernel side and pointedly asked you to clarify. The only one
> > you
> > still were wishing for up front was "Leave a token on switching to
> > an
> > alt shadow stack."
> > 
> > But how you want to use this involves a lot of details for how
> > glibc
> > will work (automatic shadow stack for sigaltstack, scan-restore-
> > incssp,
> > etc). I think you first need to get the story straight with other
> > libc
> > developers, otherwise this is just brainstorming. I'm not a glibc
> > contributor, so winning me over is only half the battle.
> > 
> > Only after that is settled do we get to the problem of the old
> > libgcc
> > unwinders, and how it is a challenge to even add alt shadow stack
> > given
> > glibc's plans and the existing binaries.
> > 
> > Once that is solved we are at the overflow problem, and the current
> > state of thinking on that is "i'm fairly sure this can be done (but
> > indeed complicated)".
> > 
> > So I think we are still missing any actionable requests that should
> > hold this up.
> > 
> > Is this a reasonable summary?
> 
> not entirely.
> 
> the high level requirement is a design that
> 
> a) does not break many existing sigaltstack uses,
> 
> b) allows implementing jump and unwind that support the
>    relevant use-cases around signals and stack switches
>    with minimal userspace changes.

Please open a discussion with the other glibc developers that have been
involved with shadow stack regarding this subject (b). Please include me
(and probably AndyL would be interested?). I think we've talked it
through as much as you and I can at this point. Let's at least start a
new, more focused thread on the "unwind across stacks" problem. And also
get some consensus on the wisdom of the related suggestion to leak
shadow stacks in order to transparently support existing posix APIs.

> 
> where (b) has nothing to add to v1 abi: existing unwind
> binaries mean this needs a v2 abi. (the point of discussing
> v2 ahead of time is to understand the cost of v2 and the
> divergence wrt targets without abi compat issue.)
> 
> for (a) my actionable suggestion was to account altstack
> when sizing shadow stacks. to document an altstack call
> depth limit on the libc level (e.g. fixed 100 is fine) we
> need guarantees from the kernel. (consider recursive calls
> overflowing the stack with altstack crash handler: for this
> to be reliable shadow stack size > stack size is needed.
> but the diff can be tiny e.g. 1 page is enough.)
> 
> your previous 3 actionable item list was
> 
> 1. add token when handling signals on altstack.
> 
> this falls under (b). your summary is correct that this
> requires sorting out many fiddly details.
> 
> 2. top of stack token.
> 
> this can work differently across targets so i have nothing
> against the x86 v1 abi, but on arm64 we plan to have this.
> 
> 3. more shadow stack sizing policies.
> 
> this can be done in the future except the default policy
> should be fixed for (a) and a smaller size introduces the
> overflow issue which may require v2.
> 
> in short the only important change for v1 is shstk sizing.

I tried searching through this long thread and AFAICT this is a new
idea. Sorry if I missed something, but your previous answer on this (3)
seemed concerned with the opposite problem (oversized shadow stacks).

Quoted from a past mail:
On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> i think it can be added later.
> 
> but it may be important for deployment on some platforms, since a
> libc (or other language runtime) may want to set the shadow stack
> size differently than the kernel default, because
> 
> - languages allocating large arrays on the stack
>   (too big shadow stack can cause OOM with overcommit off and
>   rlimits can be hit like RLIMIT_DATA, RLIMIT_AS because of it)
> 
> - tiny thread stack but big sigaltstack (musl libc, go).

So you can probably see how I got the impression that 3 was closed.

But anyways, ok, so if we add a page to every thread-allocated shadow
stack, then you can guarantee that an alt stack can have some room to
handle at least a single alt stack signal, even in the case of
exhausting the entire stack by recursively making calls and pushing
nothing else to the stack. SS_AUTODISARM remains a bit muddy.

Also glibc would have to size ucontext shadow stacks with an additional
page as well. I think it would be good to get some other signs of
interest in this tweak, given the requirements for glibc to participate
in the scheme. Can you gather that quickly, so we can get this all
prepped again?

To me (unless I'm missing something), it seems like complicating the
equation for probably no real-world benefit due to the low chances of
exhausting a shadow stack. But if there is consensus on the glibc side,
then I'm happy to make the change to finally settle this discussion.



* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-10 22:56                                                         ` Edgecombe, Rick P
@ 2023-07-11  8:08                                                           ` szabolcs.nagy
  2023-07-12  9:39                                                             ` Szabolcs Nagy
  0 siblings, 1 reply; 151+ messages in thread
From: szabolcs.nagy @ 2023-07-11  8:08 UTC (permalink / raw)
  To: Edgecombe, Rick P, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, akpm, peterz, corbet, nd, broonie,
	dethoma, linux-kernel, x86, pavel, bp, debug, linux-api, rppt,
	john.allen, jamorris, rdunlap, mike.kravetz, jannh, oleg,
	andrew.cooper3, keescook, gorcunov, arnd, Yu, Yu-cheng, fweimer,
	hpa, mingo, hjl.tools, bsingharora, linux-mm, Syromiatnikov,
	Eugene, Yang, Weijiang, linux-doc, dave.hansen, Torvalds, Linus,
	Eranian, Stephane, libc-alpha, dalias, branislav.rankov

The 07/10/2023 22:56, Edgecombe, Rick P wrote:
> On Mon, 2023-07-10 at 17:54 +0100, szabolcs.nagy@arm.com wrote:
> > in short the only important change for v1 is shstk sizing.
> 
> I tried searching through this long thread and AFAICT this is a new
> idea. Sorry if I missed something, but your previous answer on this (3)
> seemed concerned with the opposite problem (oversized shadow stacks).
> 
> Quoted from a past mail:
> On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
...
> > - tiny thread stack but big sigaltstack (musl libc, go).

and 4 months earlier:

> Date: Fri, 3 Mar 2023 16:30:37 +0000
> Subject: Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
> > Looking at this again, I'm not sure why a new rlimit is needed. It
> > seems many of those points were just formulations of that the clone3
> > stack size was not used, but it actually is and just not documented. If
> > you disagree perhaps you could elaborate on what the requirements are
> > and we can see if it seems tricky to do in a follow up.
> 
> - tiny thread stack and deep signal stack.
> (note that this does not really work with glibc because it has
> implementation internal signals that don't run on alt stack,
> cannot be masked and don't fit on a tiny thread stack, but
> with other runtimes this can be a valid use-case, e.g. musl
> allows tiny thread stacks, < pagesize.)
...

that analysis was wrong: it considered handling any signals
at all. but if the stack overflows due to a recursive call
with empty stack frames, the stack gets used up at the same
rate as the shadow stack, and since the signal handler on
the alt stack uses the same shadow stack, that can easily
overflow too, independently of the stack size.

this is a big deal, as it affects what operations the libc
can support, and handling stack overflow is a common
requirement.

i originally argued for a fix using separate alt shadow
stacks, but since it became clear that does not work for
the x86 v1 abi i am recommending the size increase.

with shadow stack size = stack size + 1 page, libc can
document a depth limit for the alt stack that works.
(longjumping back to the alt stack is still broken,
but that's rarely a requirement.)

> Also glibc would have to size ucontext shadow stacks with an additional

yes, assuming glibc wants to support sigaltstack.

> page as well. I think it would be good to get some other signs of
> interest in this tweak, given the requirements for glibc to participate
> in the scheme. Can you gather that quickly, so we can get this all
> prepped again?

i can cc libc-alpha.

the decision is for x86 shadow stack linux abi to use

  shadow stack size = stack size

or

  shadow stack size = stack size + 1 page

as default policy when alt stack signals use the same
shadow stack, not a separate one.

note: smallest stack frame size is 8 bytes, same as the
shadow stack entry. on a target where the smallest frame
size is 2x the shadow stack entry size, the formula would
use (stack size / 2).

note: there is no api to change the policy from userspace
at this point.
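
a sketch of the proposed default policy (helper name and the 4K page
constant are mine):

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096UL

/* default shadow stack size for a thread with the given stack size: on
   x86 the smallest call frame (8 bytes, just the return address) matches
   the 8-byte shadow stack entry, so the base size equals the stack size;
   the extra page is headroom so an alt-stack signal handler can still
   make calls after the stack (and base shadow stack) are exhausted. on a
   target whose smallest frame is 16 bytes the base term would be
   stack_size / 2. */
static unsigned long shstk_default_size(unsigned long stack_size)
{
    return stack_size + PAGE_SIZE_BYTES;
}
```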


* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-07-11  8:08                                                           ` szabolcs.nagy
@ 2023-07-12  9:39                                                             ` Szabolcs Nagy
  0 siblings, 0 replies; 151+ messages in thread
From: Szabolcs Nagy @ 2023-07-12  9:39 UTC (permalink / raw)
  To: Edgecombe, Rick P, Lutomirski, Andy
  Cc: Xu, Pengfei, tglx, linux-arch, kcc, nadav.amit, kirill.shutemov,
	david, Schimpe, Christina, akpm, peterz, corbet, nd, broonie,
	dethoma, linux-kernel, x86, pavel, bp, debug, linux-api, rppt,
	john.allen, jamorris, rdunlap, mike.kravetz, jannh, oleg,
	andrew.cooper3, keescook, gorcunov, arnd, Yu, Yu-cheng, fweimer,
	hpa, mingo, hjl.tools, bsingharora, linux-mm, Syromiatnikov,
	Eugene, Yang, Weijiang, linux-doc, dave.hansen, Torvalds, Linus,
	Eranian, Stephane, libc-alpha, dalias, branislav.rankov

The 07/11/2023 09:08, szabolcs.nagy--- via Libc-alpha wrote:
> the decision is for x86 shadow stack linux abi to use
> 
>   shadow stack size = stack size
> 
> or
> 
>   shadow stack size = stack size + 1 page
> 
> as default policy when alt stack signals use the same
> shadow stack, not a separate one.
> 
> note: smallest stack frame size is 8 bytes, same as the
> shadow stack entry. on a target where smallest frame
> size is 2x shadow stack entry size, the formula would
> use (stack size / 2).

i convinced myself that shadow stack size = stack size
works:

libc can reserve N bytes on the initial stack frame, so
when the stack overflows there will be at least N bytes
on the shadow stack usable for signal handling.

this is only bad for tiny user-allocated stacks, where libc
should not consume too much stack space. but e.g. glibc
already uses >128 bytes on the initial stack frame for its
cancellation jumpbuf, so a 16-deep signal call stack is
already guaranteed to work.

the glibc makecontext code has to be adjusted, but that's
a libc side discussion.

the shadow stack of the main stack can still overflow, but
that requires increasing RLIMIT_STACK at runtime which is
not very common.
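
the guarantee above is just arithmetic (helper name is mine):

```c
#include <assert.h>

/* with shadow stack size == stack size and a minimum call frame of
   8 bytes, if libc leaves `reserved_bytes` unused in the initial stack
   frame, then at the moment the stack overflows at least
   reserved_bytes / 8 shadow stack entries are still free (one 8-byte
   entry per call), so a signal call stack that deep is guaranteed to
   fit */
static unsigned long guaranteed_signal_depth(unsigned long reserved_bytes)
{
    return reserved_bytes / 8;
}
```

e.g. reserving 128 bytes guarantees a 16-deep signal call stack, matching
the glibc cancellation-jumpbuf figure above.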


* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-06-13  0:10 ` [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Rick Edgecombe
                     ` (3 preceding siblings ...)
  2023-06-19  4:27   ` Helge Deller
@ 2023-07-14 22:57   ` Mark Brown
  2023-07-17 15:55     ` Edgecombe, Rick P
  4 siblings, 1 reply; 151+ messages in thread
From: Mark Brown @ 2023-07-14 22:57 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, szabolcs.nagy, torvalds, linux-alpha,
	linux-snps-arc, linux-arm-kernel, linux-csky, linux-hexagon,
	linux-ia64, loongarch, linux-m68k, Michal Simek, Dinh Nguyen,
	linux-mips, openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, Linus Torvalds

On Mon, Jun 12, 2023 at 05:10:27PM -0700, Rick Edgecombe wrote:
> The x86 Shadow stack feature includes a new type of memory called shadow
> stack. This shadow stack memory has some unusual properties, which requires
> some core mm changes to function properly.

This seems to break sparc64_defconfig when applied on top of v6.5-rc1:

In file included from /home/broonie/git/bisect/include/linux/mm.h:29,
                 from /home/broonie/git/bisect/net/core/skbuff.c:40:
/home/broonie/git/bisect/include/linux/pgtable.h: In function 'pmd_mkwrite':
/home/broonie/git/bisect/include/linux/pgtable.h:528:9: error: implicit declaration of function 'pmd_mkwrite_novma'; did you mean 'pte_mkwrite_novma'? [-Werror=implicit-function-declaration]
  return pmd_mkwrite_novma(pmd);
         ^~~~~~~~~~~~~~~~~
         pte_mkwrite_novma
/home/broonie/git/bisect/include/linux/pgtable.h:528:9: error: incompatible types when returning type 'int' but 'pmd_t' {aka 'struct <anonymous>'} was expected
  return pmd_mkwrite_novma(pmd);
         ^~~~~~~~~~~~~~~~~~~~~~

The same issue seems to apply with the version that was in -next based
on v6.4-rc4 too.
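
For reference, the generic fallback the compiler error points at works
roughly like this (types and the write bit are stubbed out so the
fragment stands alone; a sketch, not the exact include/linux/pgtable.h
code):

```c
#include <assert.h>

/* stand-ins for the kernel's pmd_t and its write bit, for illustration */
typedef struct { unsigned long val; } pmd_t;
#define FAKE_PMD_WRITE 0x2UL

/* after the rename, each arch is expected to provide the _novma() variant */
static pmd_t pmd_mkwrite_novma(pmd_t pmd)
{
    pmd.val |= FAKE_PMD_WRITE;
    return pmd;
}

/* the generic pmd_mkwrite() forwards to the arch's _novma() variant; the
   sparc64 build broke because the arch half of this pair was missing in
   that revision of the series */
static pmd_t pmd_mkwrite(pmd_t pmd)
{
    return pmd_mkwrite_novma(pmd);
}
```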



* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-07-14 22:57   ` Mark Brown
@ 2023-07-17 15:55     ` Edgecombe, Rick P
  2023-07-17 16:51       ` Mark Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Edgecombe, Rick P @ 2023-07-17 15:55 UTC (permalink / raw)
  To: broonie
  Cc: Schimpe, Christina, Yang, Weijiang, hjl.tools, x86, monstr, rppt,
	dave.hansen, linux-snps-arc, Torvalds, Linus, kirill.shutemov,
	linux-api, dinguyen, rdunlap, tglx, sparclinux, arnd, linux-ia64,
	Lutomirski, Andy, szabolcs.nagy, linux-kernel, linux-parisc,
	akpm, pavel, keescook, linuxppc-dev, gorcunov, andrew.cooper3,
	david, hpa, loongarch, peterz, linux-sh, nadav.amit, linux-m68k,
	linux-doc, openrisc, jamorris, mike.kravetz, debug, fweimer, kcc,
	linux-arch, mingo, linux-csky, linux-mips, john.allen, Eranian,
	Stephane, bsingharora, linux-alpha, linux-s390, linux-riscv,
	linux-um, linux-arm-kernel, torvalds, bp, corbet, linux-hexagon,
	dethoma, jannh, Syromiatnikov, Eugene, oleg, linux-mm

On Fri, 2023-07-14 at 23:57 +0100, Mark Brown wrote:
> On Mon, Jun 12, 2023 at 05:10:27PM -0700, Rick Edgecombe wrote:
> > The x86 Shadow stack feature includes a new type of memory called
> > shadow
> > stack. This shadow stack memory has some unusual properties, which
> > requires
> > some core mm changes to function properly.
> 
> This seems to break sparc64_defconfig when applied on top of v6.5-
> rc1:
> 
> In file included from /home/broonie/git/bisect/include/linux/mm.h:29,
>                  from /home/broonie/git/bisect/net/core/skbuff.c:40:
> /home/broonie/git/bisect/include/linux/pgtable.h: In function
> 'pmd_mkwrite':
> /home/broonie/git/bisect/include/linux/pgtable.h:528:9: error:
> implicit declaration of function 'pmd_mkwrite_novma'; did you mean
> 'pte_mkwrite_novma'? [-Werror=implicit-function-declaration]
>   return pmd_mkwrite_novma(pmd);
>          ^~~~~~~~~~~~~~~~~
>          pte_mkwrite_novma
> /home/broonie/git/bisect/include/linux/pgtable.h:528:9: error:
> incompatible types when returning type 'int' but 'pmd_t' {aka 'struct
> <anonymous>'} was expected
>   return pmd_mkwrite_novma(pmd);
>          ^~~~~~~~~~~~~~~~~~~~~~
> 
> The same issue seems to apply with the version that was in -next
> based
> on v6.4-rc4 too.

The version in your branch is not the same as the version in tip (which
had a squashed build fix). I was able to reproduce the build error with
your branch, but not with the one in tip rebased on v6.5-rc1. So can
you try this version:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/shstk&id=899223d69ce9f338056f4c41ef870d70040fc860




* Re: [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  2023-07-17 15:55     ` Edgecombe, Rick P
@ 2023-07-17 16:51       ` Mark Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Mark Brown @ 2023-07-17 16:51 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Schimpe, Christina, Yang, Weijiang, hjl.tools, x86, monstr, rppt,
	dave.hansen, linux-snps-arc, Torvalds, Linus, kirill.shutemov,
	linux-api, dinguyen, rdunlap, tglx, sparclinux, arnd, linux-ia64,
	Lutomirski, Andy, szabolcs.nagy, linux-kernel, linux-parisc,
	akpm, pavel, keescook, linuxppc-dev, gorcunov, andrew.cooper3,
	david, hpa, loongarch, peterz, linux-sh, nadav.amit, linux-m68k,
	linux-doc, openrisc, jamorris, mike.kravetz, debug, fweimer, kcc,
	linux-arch, mingo, linux-csky, linux-mips, john.allen, Eranian,
	Stephane, bsingharora, linux-alpha, linux-s390, linux-riscv,
	linux-um, linux-arm-kernel, torvalds, bp, corbet, linux-hexagon,
	dethoma, jannh, Syromiatnikov, Eugene, oleg, linux-mm


On Mon, Jul 17, 2023 at 03:55:50PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2023-07-14 at 23:57 +0100, Mark Brown wrote:

> > The same issue seems to apply with the version that was in -next
> > based
> > on v6.4-rc4 too.

> The version in your branch is not the same as the version in tip (which
> had a squashed build fix). I was able to reproduce the build error with
> your branch. But not with the one in tip rebased on v6.5-rc1. So can
> you try this version:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/shstk&id=899223d69ce9f338056f4c41ef870d70040fc860

Ah, I'd not seen that patch or that tip had been rebased. I'd actually
been using literally the branch from tip as my base at whatever point I
last noticed it changing, up until I rebased onto -rc1.



* Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description
  2023-06-13  0:10 ` [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description Rick Edgecombe
  2023-06-13 11:55   ` Mark Brown
@ 2023-07-18 19:32   ` Szabolcs Nagy
  1 sibling, 0 replies; 151+ messages in thread
From: Szabolcs Nagy @ 2023-07-18 19:32 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug, torvalds, broonie
  Cc: Yu-cheng Yu, Pengfei Xu

The 06/12/2023 17:10, Rick Edgecombe wrote:
> Introduce a new document on Control-flow Enforcement Technology (CET).
>
> Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>

i don't have more comments on this.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>



* Re: [PATCH] x86/shstk: Move arch detail comment out of core mm
  2023-07-06 23:32               ` [PATCH] x86/shstk: Move arch detail comment out of core mm Rick Edgecombe
  2023-07-07 15:08                 ` Mark Brown
@ 2023-08-01 16:52                 ` Mike Rapoport
  1 sibling, 0 replies; 151+ messages in thread
From: Mike Rapoport @ 2023-08-01 16:52 UTC (permalink / raw)
  To: Rick Edgecombe, Dave Hansen
  Cc: broonie, akpm, andrew.cooper3, arnd, bp, bsingharora,
	christina.schimpe, corbet, david, debug, dethoma, eranian, esyr,
	fweimer, gorcunov, hjl.tools, hpa, jamorris, jannh, john.allen,
	kcc, keescook, kirill.shutemov, linux-api, linux-arch, linux-doc,
	linux-kernel, linux-mm, luto, mike.kravetz, mingo, nadav.amit,
	oleg, pavel, pengfei.xu, peterz, rdunlap, szabolcs.nagy, tglx,
	torvalds, weijiang.yang, willy, x86, yu-cheng.yu

Hi Dave, Rick,

It seems it didn't get into the current tip.

On Thu, Jul 06, 2023 at 04:32:48PM -0700, Rick Edgecombe wrote:
> The comment around VM_SHADOW_STACK in mm.h refers to a lot of x86
> specific details that don't belong in a cross arch file. Remove these
> out of core mm, and just leave the non-arch details.
> 
> Since the comment includes some useful details that would be good to
> retain in the source somewhere, put the arch specifics parts in
> arch/x86/shstk.c near alloc_shstk(), where memory of this type is
> allocated. Include a reference to the existence of the x86 details near
> the VM_SHADOW_STACK definition mm.h.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kernel/shstk.c | 25 +++++++++++++++++++++++++
>  include/linux/mm.h      | 32 ++++++--------------------------
>  2 files changed, 31 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index b26810c7cd1c..47f5204b0fa9 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -72,6 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
>  	return 0;
>  }
>  
> +/*
> + * VM_SHADOW_STACK will have a guard page. This helps userspace protect
> + * itself from attacks. The reasoning is as follows:
> + *
> + * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The
> + * INCSSP instruction can increment the shadow stack pointer. It is the
> + * shadow stack analog of an instruction like:
> + *
> + *   addq $0x80, %rsp
> + *
> + * However, there is one important difference between an ADD on %rsp
> + * and INCSSP. In addition to modifying SSP, INCSSP also reads from the
> + * memory of the first and last elements that were "popped". It can be
> + * thought of as acting like this:
> + *
> + * READ_ONCE(ssp);       // read+discard top element on stack
> + * ssp += nr_to_pop * 8; // move the shadow stack
> + * READ_ONCE(ssp-8);     // read+discard last popped stack element
> + *
> + * The maximum distance INCSSP can move the SSP is 2040 bytes, before
> + * it would read the memory. Therefore a single page gap will be enough
> + * to prevent any operation from shifting the SSP to an adjacent stack,
> + * since it would have to land in the gap at least once, causing a
> + * fault.
> + */
>  static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
>  				 unsigned long token_offset, bool set_res_tok)
>  {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 535c58d3b2e4..b647cf2e94ea 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -343,33 +343,13 @@ extern unsigned int kobjsize(const void *objp);
>  
>  #ifdef CONFIG_X86_USER_SHADOW_STACK
>  /*
> - * This flag should not be set with VM_SHARED because of lack of support
> - * core mm. It will also get a guard page. This helps userspace protect
> - * itself from attacks. The reasoning is as follows:
> + * VM_SHADOW_STACK should not be set with VM_SHARED because of lack of
> + * support core mm.
>   *
> - * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The
> - * INCSSP instruction can increment the shadow stack pointer. It is the
> - * shadow stack analog of an instruction like:
> - *
> - *   addq $0x80, %rsp
> - *
> - * However, there is one important difference between an ADD on %rsp
> - * and INCSSP. In addition to modifying SSP, INCSSP also reads from the
> - * memory of the first and last elements that were "popped". It can be
> - * thought of as acting like this:
> - *
> - * READ_ONCE(ssp);       // read+discard top element on stack
> - * ssp += nr_to_pop * 8; // move the shadow stack
> - * READ_ONCE(ssp-8);     // read+discard last popped stack element
> - *
> - * The maximum distance INCSSP can move the SSP is 2040 bytes, before
> - * it would read the memory. Therefore a single page gap will be enough
> - * to prevent any operation from shifting the SSP to an adjacent stack,
> - * since it would have to land in the gap at least once, causing a
> - * fault.
> - *
> - * Prevent using INCSSP to move the SSP between shadow stacks by
> - * having a PAGE_SIZE guard gap.
> + * These VMAs will get a single end guard page. This helps userspace protect
> + * itself from attacks. A single page is enough for current shadow stack archs
> + * (x86). See the comments near alloc_shstk() in arch/x86/kernel/shstk.c
> + * for more details on the guard size.
>   */
>  # define VM_SHADOW_STACK	VM_HIGH_ARCH_5
>  #else
> -- 
> 2.34.1
> 

-- 
Sincerely yours,
Mike.
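
The 2040-byte figure in the comment the patch moves can be checked with a
line of arithmetic (constant names are mine):

```c
#include <assert.h>

#define SHSTK_ENTRY_SIZE 8UL      /* one 8-byte return address per entry */
#define INCSSP_MAX_ENTRIES 255UL  /* INCSSPQ takes an 8-bit entry count */
#define GUARD_PAGE_SIZE 4096UL

/* maximum distance a single INCSSPQ can move the SSP */
static unsigned long incssp_max_distance(void)
{
    return INCSSP_MAX_ENTRIES * SHSTK_ENTRY_SIZE; /* 2040 bytes */
}

/* a one-page guard gap is wider than any single INCSSP step, so the SSP
   cannot be stepped across it without touching the gap and faulting */
static int guard_gap_sufficient(void)
{
    return incssp_max_distance() < GUARD_PAGE_SIZE;
}
```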


end of thread, other threads:[~2023-08-01 16:53 UTC | newest]

Thread overview: 151+ messages
2023-06-13  0:10 [PATCH v9 00/42] Shadow stacks for userspace Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 01/42] mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() Rick Edgecombe
2023-06-13  7:19   ` Geert Uytterhoeven
2023-06-13 16:14     ` Edgecombe, Rick P
2023-06-13  7:43   ` Mike Rapoport
2023-06-13 16:14     ` Edgecombe, Rick P
2023-06-13 12:26   ` David Hildenbrand
2023-06-13 16:14     ` Edgecombe, Rick P
2023-06-19  4:27   ` Helge Deller
2023-07-14 22:57   ` Mark Brown
2023-07-17 15:55     ` Edgecombe, Rick P
2023-07-17 16:51       ` Mark Brown
2023-06-13  0:10 ` [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma() Rick Edgecombe
2023-06-13  7:44   ` Mike Rapoport
2023-06-13 16:19     ` Edgecombe, Rick P
2023-06-13 17:00       ` David Hildenbrand
2023-06-14 17:00         ` Edgecombe, Rick P
2023-06-13 12:27   ` David Hildenbrand
2023-06-13 16:20     ` Edgecombe, Rick P
2023-06-13  0:10 ` [PATCH v9 03/42] mm: Make pte_mkwrite() take a VMA Rick Edgecombe
2023-06-13  7:42   ` Mike Rapoport
2023-06-13 16:20     ` Edgecombe, Rick P
2023-06-13 12:28   ` David Hildenbrand
2023-06-13 16:21     ` Edgecombe, Rick P
2023-06-13  0:10 ` [PATCH v9 04/42] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
2023-06-14  8:49   ` David Hildenbrand
2023-06-14 23:30   ` Mark Brown
2023-06-13  0:10 ` [PATCH v9 05/42] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
2023-06-14  8:50   ` David Hildenbrand
2023-06-13  0:10 ` [PATCH v9 06/42] x86/shstk: Add Kconfig option for shadow stack Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 07/42] x86/traps: Move control protection handler to separate file Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 08/42] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 09/42] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 10/42] x86/mm: Introduce _PAGE_SAVED_DIRTY Rick Edgecombe
2023-06-13 16:01   ` Edgecombe, Rick P
2023-06-13 17:58   ` Linus Torvalds
2023-06-13 19:37     ` Edgecombe, Rick P
2023-06-13  0:10 ` [PATCH v9 11/42] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY Rick Edgecombe
2023-06-13 18:01   ` Linus Torvalds
2023-06-13  0:10 ` [PATCH v9 12/42] x86/mm: Start actually marking _PAGE_SAVED_DIRTY Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 13/42] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 14/42] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
2023-06-14  8:50   ` David Hildenbrand
2023-06-14 23:31   ` Mark Brown
2023-06-13  0:10 ` [PATCH v9 15/42] x86/mm: Check shadow stack page fault errors Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 16/42] mm: Add guard pages around a shadow stack Rick Edgecombe
2023-06-14 23:34   ` Mark Brown
2023-06-22 18:21   ` Matthew Wilcox
2023-06-22 18:27     ` Edgecombe, Rick P
2023-06-23  7:40       ` Mike Rapoport
2023-06-23 12:17         ` Mark Brown
2023-06-25 16:44           ` Edgecombe, Rick P
2023-06-26 12:45             ` Mark Brown
2023-07-06 23:32               ` [PATCH] x86/shstk: Move arch detail comment out of core mm Rick Edgecombe
2023-07-07 15:08                 ` Mark Brown
2023-08-01 16:52                 ` Mike Rapoport
2023-06-13  0:10 ` [PATCH v9 17/42] mm: Warn on shadow stack memory in wrong vma Rick Edgecombe
2023-06-14 23:35   ` Mark Brown
2023-06-13  0:10 ` [PATCH v9 18/42] x86/mm: Warn if create Write=0,Dirty=1 with raw prot Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 19/42] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 20/42] x86/mm: Introduce MAP_ABOVE4G Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 21/42] x86/mm: Teach pte_mkwrite() about stack memory Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 22/42] mm: Don't allow write GUPs to shadow " Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description Rick Edgecombe
2023-06-13 11:55   ` Mark Brown
2023-06-13 12:37     ` Florian Weimer
2023-06-13 15:15       ` Mark Brown
2023-06-13 17:11         ` Edgecombe, Rick P
2023-06-13 17:57           ` Mark Brown
2023-06-13 19:57             ` Edgecombe, Rick P
2023-06-14 10:43               ` szabolcs.nagy
2023-06-14 16:57                 ` Edgecombe, Rick P
2023-06-19  8:47                   ` szabolcs.nagy
2023-06-19 16:44                     ` Edgecombe, Rick P
2023-06-20  9:17                       ` szabolcs.nagy
2023-06-20 19:34                         ` Edgecombe, Rick P
2023-06-21 11:36                           ` szabolcs.nagy
2023-06-21 18:54                             ` Edgecombe, Rick P
2023-06-21 22:22                               ` Edgecombe, Rick P
2023-06-21 23:05                                 ` H.J. Lu
2023-06-21 23:15                                   ` Edgecombe, Rick P
2023-06-22  1:07                                     ` Edgecombe, Rick P
2023-06-22  3:23                                       ` H.J. Lu
2023-06-22  8:27                                 ` szabolcs.nagy
2023-06-22 16:47                                   ` Edgecombe, Rick P
2023-06-23 16:25                                     ` szabolcs.nagy
2023-06-25 18:48                                       ` Edgecombe, Rick P
2023-06-21 23:02                               ` H.J. Lu
2023-06-22  7:40                                 ` szabolcs.nagy
2023-06-22 16:46                                   ` Edgecombe, Rick P
2023-06-26 14:08                                     ` szabolcs.nagy
2023-06-28  1:23                                       ` Edgecombe, Rick P
2023-06-22  9:18                               ` szabolcs.nagy
2023-06-22 15:26                                 ` Andy Lutomirski
2023-06-22 16:42                                   ` szabolcs.nagy
2023-06-22 23:18                                     ` Edgecombe, Rick P
2023-06-29 16:07                                       ` szabolcs.nagy
2023-07-02 18:03                                         ` Edgecombe, Rick P
2023-07-03 13:32                                           ` Mark Brown
2023-07-03 18:19                                           ` szabolcs.nagy
2023-07-03 18:38                                             ` Mark Brown
2023-07-03 18:49                                             ` Florian Weimer
2023-07-04 11:33                                               ` Szabolcs Nagy
2023-07-05 18:45                                             ` Edgecombe, Rick P
2023-07-05 19:10                                               ` Mark Brown
2023-07-05 19:17                                                 ` Edgecombe, Rick P
2023-07-05 19:29                                                   ` Mark Brown
2023-07-06 13:14                                                     ` szabolcs.nagy
2023-07-06 14:24                                                       ` Mark Brown
2023-07-06 16:59                                                         ` Edgecombe, Rick P
2023-07-06 19:03                                                           ` Mark Brown
2023-07-06 13:07                                               ` szabolcs.nagy
2023-07-06 18:25                                                 ` Edgecombe, Rick P
2023-07-07 15:25                                                   ` szabolcs.nagy
2023-07-07 17:37                                                     ` Edgecombe, Rick P
2023-07-10 16:54                                                       ` szabolcs.nagy
2023-07-10 22:56                                                         ` Edgecombe, Rick P
2023-07-11  8:08                                                           ` szabolcs.nagy
2023-07-12  9:39                                                             ` Szabolcs Nagy
2023-06-25 23:52                                     ` Andy Lutomirski
2023-06-14 13:12               ` Mark Brown
2023-07-18 19:32   ` Szabolcs Nagy
2023-06-13  0:10 ` [PATCH v9 24/42] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 25/42] x86/fpu: Add helper for modifying xstate Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 26/42] x86: Introduce userspace API for shadow stack Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 27/42] x86/shstk: Add user control-protection fault handler Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 28/42] x86/shstk: Add user-mode shadow stack support Rick Edgecombe
2023-06-27 17:20   ` Mark Brown
2023-06-27 23:46     ` Dave Hansen
2023-06-28  0:37       ` Edgecombe, Rick P
2023-07-06 23:38         ` [PATCH] x86/shstk: Don't retry vm_munmap() on -EINTR Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 29/42] x86/shstk: Handle thread shadow stack Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 30/42] x86/shstk: Introduce routines modifying shstk Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 31/42] x86/shstk: Handle signals for shadow stack Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 32/42] x86/shstk: Check that SSP is aligned on sigreturn Rick Edgecombe
2023-06-13  0:10 ` [PATCH v9 33/42] x86/shstk: Check that signal frame is shadow stack mem Rick Edgecombe
2023-06-13  0:11 ` [PATCH v9 34/42] x86/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
2023-06-13  0:11 ` [PATCH v9 35/42] x86/shstk: Support WRSS for userspace Rick Edgecombe
2023-06-13  0:11 ` [PATCH v9 36/42] x86: Expose thread features in /proc/$PID/status Rick Edgecombe
2023-06-13  0:11 ` [PATCH v9 37/42] x86/shstk: Wire in shadow stack interface Rick Edgecombe
2023-06-13  0:11 ` [PATCH v9 38/42] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
2023-06-13  0:11 ` [PATCH v9 39/42] selftests/x86: Add shadow stack test Rick Edgecombe
2023-06-13  0:11 ` [PATCH v9 40/42] x86: Add PTRACE interface for shadow stack Rick Edgecombe
2023-06-13  0:11 ` [PATCH v9 41/42] x86/shstk: Add ARCH_SHSTK_UNLOCK Rick Edgecombe
2023-06-13  0:11 ` [PATCH v9 42/42] x86/shstk: Add ARCH_SHSTK_STATUS Rick Edgecombe
2023-06-13  1:34 ` [PATCH v9 00/42] Shadow stacks for userspace Linus Torvalds
2023-06-13  3:12   ` Edgecombe, Rick P
2023-06-13 17:44     ` Linus Torvalds
2023-06-13 18:27       ` Linus Torvalds
2023-06-13 19:38         ` Edgecombe, Rick P
2023-06-14 23:45 ` Mark Brown
