* [PATCH v2 00/39] Shadowstacks for userspace
@ 2022-09-29 22:28 Rick Edgecombe
  2022-09-29 22:28 ` [PATCH v2 01/39] Documentation/x86: Add CET description Rick Edgecombe
                   ` (39 more replies)
  0 siblings, 40 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:28 UTC
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

Hi,

This is an overdue followup to the “Shadow stacks for userspace” CET series. 
Thanks for all the comments on the first version [0]. They drove a decent 
amount of changes for v2. Since it has been a while, I’ll try to summarize the
areas that got major changes since last time. Smaller changes are listed in 
each patch.

The coverletter is organized into the following sections:
1. Shadow Stack Memory Solution
2. FPU API
3. Alt Shadow Stacks
4. Compatibility of Existing Binaries/Enabling Interface
5. CRIU Support
6. Bigger Selftest

Last time, two bigger pieces of new functionality were requested (Alt shadow
stack and CRIU support). Alt shadow stack support was requested, not because
there was an immediate need, but more because of the risk that signal ABI
decisions made now would create implementation problems if alt shadow stacks
were added later.

A POC for alt shadow stacks may be enough to gauge this risk. CRIU support may 
also not be something critical for day one, if glibc disables all existing 
binaries as described in section 4. So I marked the patches at the end that 
support those two things as RFC/OPTIONAL. The earlier patches will support a 
smaller, basic initial implementation. So I’m wondering if we could consider
just enabling the basics upstream first, assuming the RFC pieces here look
passable.

1. Shadow Stack Memory Solution
===============================
Dave had a lot of questions and feedback about how shadow stack memory is 
handled, including why shadow stack VMAs were not VM_WRITE. These questions 
prompted a revisit of the design, and in the end shadow stacks were switched
to be VM_WRITE. I’ve tried to summarize how shadow stack memory is supposed to 
work, with some examples of how MM features interact with shadow stack memory.

Shadow Stack Memory Summary
---------------------------
Integrating shadow stack memory into the kernel has two main challenges. First,
the kernel already creates Write=0,Dirty=1 PTEs, and now it can’t, or it will
inadvertently create shadow stack memory.

Second, shadow stack memory fits strangely into the existing concepts of
Copy-On-Write and “writable” memory. It is *sort of* writable, in that it can 
be changed by userspace, but sort of not in that it has Write=0 and can’t be 
written by normal mov-type accesses. So we still have the “writable” memory we 
always had, but now we also have another type of memory that is changeable from 
userspace. Another weird aspect is that memory has to be shadow stack, in order 
to serve a “shadow stack read”, so a shadow stack read also needs to cause 
something like a Copy-On-Write, as the result will be changeable from 
userspace.


Dealing with the new meaning of Dirty and Write bits
----------------------------------------------------
The first issue is solved by creating _PAGE_COW using a software PTE bit. This
is hidden inside the pgtable.h helpers, such that it *mostly* (more on this
later) happens without changing core code. Basically, pte_wrprotect() will
clear Dirty and set Cow=1 if the PTE was dirty. pte_mkdirty() sets Cow=1 if
the PTE was Write=0. Then pte_dirty() returns true for Dirty=1 or Cow=1.
Since this requires a little extra work, this behavior is compiled out when
shadow stack support is not enabled for the kernel.
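
As a rough userspace model of the helper behavior described above (this is not
the kernel code; the model_ names and the bit positions are made up for
illustration, and the real definitions live in the x86 pgtable headers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; assumptions, not the kernel's layout. */
#define MODEL_WRITE (1ull << 1)
#define MODEL_DIRTY (1ull << 6)
#define MODEL_COW   (1ull << 58)	/* stand-in for the software PTE bit */

typedef uint64_t model_pte_t;

/* A PTE counts as dirty if either hardware Dirty or software Cow is set. */
static bool model_pte_dirty(model_pte_t pte)
{
	return (pte & (MODEL_DIRTY | MODEL_COW)) != 0;
}

/* Clear Write. If the PTE was Dirty, move that state into Cow. */
static model_pte_t model_pte_wrprotect(model_pte_t pte)
{
	pte &= ~MODEL_WRITE;
	if (pte & MODEL_DIRTY) {
		pte &= ~MODEL_DIRTY;
		pte |= MODEL_COW;
	}
	/* Write=0,Dirty=1 (the shadow stack encoding) must never result. */
	assert((pte & (MODEL_WRITE | MODEL_DIRTY)) != MODEL_DIRTY);
	return pte;
}

/* Record dirtiness in Cow when Write=0, else in the hardware Dirty bit. */
static model_pte_t model_pte_mkdirty(model_pte_t pte)
{
	if (pte & MODEL_WRITE)
		return pte | MODEL_DIRTY;
	return pte | MODEL_COW;
}
```

The point of the model is that a write-protected PTE still reports dirty via
Cow, while the Write=0,Dirty=1 combination is left free to mean shadow stack.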


Dealing with a new type of writable memory
------------------------------------------
The other side of the problem - dealing with the concept-splitting new type of 
userspace changeable memory - leaves a few more loose ends. Probably the most
important thing is that we don’t want the kernel thinking that shadow stack 
memory is protected from changes from userspace. But we also don’t want the 
kernel to treat it like normal writable memory in some ways either, for example 
to get confused and inadvertently make it writable in the normal (PTE Write=1) 
sense.

The solution here is to treat shadow stack memory as a special class of
writable memory by updating places where memory is made writable to be aware of
it, and treat all shadow stack accesses as if they are writes.

Shadow stack accesses are always treated as write faults because even shadow 
stack reads need to be made (shadow stack) writable in order to service them. 
Logic creating PTEs then decides whether to create shadow stack or normal
writable memory by the VMA type. Most of this is encapsulated in 
maybe_mkwrite() but some differentiation needs to be open coded where 
pte_mkwrite() is called directly.

Shadow stack VMAs are a special type of writable memory, so they are created as
VM_WRITE | VM_SHADOW_STACK. The benefit of making them also VM_WRITE is that 
there is some existing logic around using VM_WRITE to make decisions in the 
kernel that apply to shadow stack memory as well.
        - Scheduling code decides whether to migrate a VMA depending on whether
          it’s VM_WRITE. The same reasoning should apply for shadow stack
          memory.
        - While there is no current interface for mmap()ing files as shadow
          stack, various drivers enforce non-writable mappings by checking
          !VM_WRITE and clearing VM_MAYWRITE. Because there is no longer a way
          to mmap() something arbitrarily as shadow stack, this can’t be hit.
          But this un-hittable wrong logic makes the design confusing and
          brittle.

The downside of having shadow stack memory have VM_WRITE is that any logic that 
assumes VM_WRITE means normally writable, for example open coded like:
if (flags & VM_WRITE)
	pte_mkwrite()
...will no longer be correct. It will need to be changed to have additional 
logic that knows about shadow stack. It turns out there are not too many of 
these cases and so this series just adds the logic.

The solution for this second issue also tweaks the behavior of pte_write() and
pte_dirty(). pte_write() checks whether a PTE is writable. Previously this was
only the case when Write=1, but now pte_write() also returns true for shadow
stack memory.
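
To make the VMA-type decision and the new pte_write() meaning concrete, here is
a simplified userspace sketch. The model_ names and flag values are invented for
illustration, and the real maybe_mkwrite() takes a VMA rather than raw flags:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; assumptions, not the kernel's definitions. */
#define MODEL_VM_WRITE        (1ul << 0)
#define MODEL_VM_SHADOW_STACK (1ul << 1)
#define MODEL_WRITE           (1ull << 1)
#define MODEL_DIRTY           (1ull << 6)

/* Shadow stack PTE encoding: Write=0, Dirty=1. */
static bool model_pte_shstk(uint64_t pte)
{
	return (pte & (MODEL_WRITE | MODEL_DIRTY)) == MODEL_DIRTY;
}

/* "Writable" now covers both normal writable and shadow stack PTEs. */
static bool model_pte_write(uint64_t pte)
{
	return (pte & MODEL_WRITE) || model_pte_shstk(pte);
}

/* Pick the kind of writable PTE from the VMA flags. */
static uint64_t model_maybe_mkwrite(uint64_t pte, unsigned long vm_flags)
{
	if (!(vm_flags & MODEL_VM_WRITE))
		return pte;	/* VMA is read-only, leave the PTE alone */

	if (vm_flags & MODEL_VM_SHADOW_STACK)
		pte = (pte & ~MODEL_WRITE) | MODEL_DIRTY;	/* shadow stack */
	else
		pte |= MODEL_WRITE;				/* normal write */

	assert(model_pte_write(pte));
	return pte;
}
```

Note that a VMA with VM_SHADOW_STACK but without VM_WRITE stays read-only in
this model, which matches the mprotect() behavior described further down.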

There are some additional areas that are probably worth commenting on:

        COW
        ---
        When a shadow stack page is shared as part of COW, it becomes read-only,
        just like normally writable memory would be. As part of the Dirty bit
        solution described above, pte_wrprotect() will move Dirty=1 to COW=1.
        This will leave the PTE in a read-only state automatically. Then when
        it takes a shadow stack access, it will perform COW, copying the page
        and making it writable. Logic added as part of the shadow stack memory
        solution will detect that the VMA is shadow stack and make the PTE a
        shadow stack PTE.

        mprotect()/VM_WRITE
        -------------------
        Shadow stack memory doesn’t have a PROT flag. It is created either
        internally in the kernel or via a special syscall. When it is created
        this way, the VMA gets VM_WRITE|VM_SHADOW_STACK. However, some
        functionality of the kernel will remove VM_WRITE, for example
        mprotect(). When this happens the memory is expected to be read only. So
        without any intervention, there may be a VMA that is VM_SHADOW_STACK and
        not VM_WRITE. We could try to prevent this from happening, (for example
        block mprotect() from operating on shadow stack memory), however some
        things like userfaultfd call mprotect() internally and depend on it to
        work.

        So mprotect()ing shadow stack memory can make it read-only (non-shadow
        stack). It can then become shadow stack again by mprotect()ing it with
        PROT_WRITE. It always keeps the VM_SHADOW_STACK, so that it can never
        become normally writable memory.

        GUP
        ---
        Shadow stack memory is generally treated as writable by the kernel, but
        it behaves differently than other writable memory with respect to GUP.
        FOLL_WRITE will not GUP shadow stack memory unless FOLL_FORCE is also
        set. Shadow stack memory is writable from the perspective of being
        changeable by userspace, but it is also protected memory from
        userspace’s perspective. So preventing it from being writable via
        FOLL_WRITE helps make it harder for userspace to arbitrarily write to
        it. However, like read-only memory, FOLL_FORCE can still write through
        it. This means shadow stacks can be written to via things like
        “/proc/self/mem”. Apps that want extra security will have to prevent
        access to kernel features that can write with FOLL_FORCE.
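
The GUP rule above can be sketched as a small predicate. This is a simplified
userspace model with invented model_ names and flag values; the real check in
mm/gup.c looks at more than this (VM_READ, VM_MAYWRITE and so on):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; assumptions, not the kernel's definitions. */
#define MODEL_VM_WRITE        (1ul << 0)
#define MODEL_VM_SHADOW_STACK (1ul << 1)
#define MODEL_FOLL_WRITE      (1u << 0)
#define MODEL_FOLL_FORCE      (1u << 1)

/* Would a GUP with these flags be allowed on a VMA with these flags? */
static bool model_gup_allowed(unsigned long vm_flags, unsigned int gup_flags)
{
	bool normal_write = (vm_flags & MODEL_VM_WRITE) &&
			    !(vm_flags & MODEL_VM_SHADOW_STACK);

	if (!(gup_flags & MODEL_FOLL_WRITE))
		return true;	/* read GUPs are outside this rule */
	if (normal_write)
		return true;
	/* Read-only and shadow stack memory both need FOLL_FORCE. */
	return (gup_flags & MODEL_FOLL_FORCE) != 0;
}

/* Usage example covering the shadow stack cases. */
static void model_gup_examples(void)
{
	unsigned long shstk = MODEL_VM_WRITE | MODEL_VM_SHADOW_STACK;

	assert(!model_gup_allowed(shstk, MODEL_FOLL_WRITE));
	assert(model_gup_allowed(shstk, MODEL_FOLL_WRITE | MODEL_FOLL_FORCE));
	assert(model_gup_allowed(MODEL_VM_WRITE, MODEL_FOLL_WRITE));
}
```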

2. FPU API
==========
The last version of this had an interface for modifying the FPU state in either 
the buffer or the registers to try to minimize saves and restores. Shortly 
after that, Thomas experimented with a different fpu optimization that was 
incompatible with how the interface kept state in the caller. So it doesn't
seem like a robust interface, and for this version the optimization piece of
the API is dropped and the force restore technique is used again.

3. Alt Shadow Stacks
====================
Andy Lutomirski asked about alt shadow stack support. The following describes 
the design of shadow stack support for signals and alt shadow stacks.

Signal handling and shadow stacks
---------------------------------
Signals push information about the execution context to the stack that will 
handle the signal. The data pushed is used to restore registers and other state
after the signal. In the case of handling the signal on a normal stack, the 
stack just needs to be unwound over the stack frame, but in the case of alt 
stacks, the saved stack pointer is important for the sigreturn to find its way
back to the thread stack. With shadow stack there is a new type of stack 
pointer, the shadow stack pointer (SSP), that needs to be restored. Just like 
the regular stack pointer, it needs to be saved somewhere in order to implement 
shadow alt stacks. Beyond supporting basic functionality, it would be nice if 
shadow stacks could make sigreturn oriented programming (SROP) attacks harder.

Alt stacks
----------
The automatically-created thread shadow stacks are sized such that shadow stack 
overflows should not normally be expected. However, especially since userspace 
can create and pivot to arbitrarily sized shadow stacks and we now optionally 
have WRSS, overflows are not impossible. To cover the case of shadow stack 
overflow, users may want to handle a signal on an alternate shadow stack.

Normal signal alt stacks had problems with using swapcontext() in the signal 
handler. Apps couldn’t do it safely, because a subsequent signal would 
overwrite the previous signal’s stack. The kernel would see the current stack 
pointer was not on the alt stack (since it swapcontext()ed off of it), so
would restart the signal from the end of the alt stack, clobbering the previous 
signal. The solution was to create a new flag that would change the signal 
behavior to disable alt stack switching while on the alt stack. Then new 
signals would be pushed onto the alt stack. On sigreturn, when the sigframe for 
the first signal that switched to the alt stack is encountered, the alt signal 
stack would be re-enabled. Then subsequent signals would start at the end of 
the alt stack again.

For regular alt stacks, this swapcontext() capable behavior is enabled by 
having the kernel clear its copy of the alt signal stack address and length 
after this data is saved to the sigframe. So when the first sigframe on the alt 
stack is sigreturn-ed, the alt stack is automatically restored.

In order to support swapcontext() on alt shadow stacks, we can have something 
similar where we push the SSP, alt shadow stack base and length to some kind of 
shadow stack sigframe. This leaves the question of where to push this data.

SROP
----
Similar to normal returns, sigreturns can be security sensitive. One exploit
technique (SROP) is to call sigreturn directly with the stack pointer at a
forged sigframe. So this involves being somewhere on the stack other than at a
real kernel-placed sigframe. These attacks can be made harder by placing
something on the protected shadow stack to signify that a specific location on 
the shadow stack corresponds to where sigreturn is supposed to be called. The 
kernel can check for this token during sigreturn, and then sigreturn can’t be 
called at arbitrary places on the stack.

Shadow stack signal format
--------------------------
So to handle alt shadow stacks we need to push some data onto a stack. To 
prevent SROP we need to push something to the shadow stack that the kernel can 
know it must have placed there itself. To support both we can push a special 
shadow stack sigframe to the shadow stack that contains the necessary alt stack 
restore data, in a format that couldn't possibly occur naturally. To be extra 
careful, this data should be written such that it can't be used as a regular 
shadow stack return address or a shadow stack token. To make sure it can’t be
used, data is pushed with the high bit (bit 63) set. This bit is a linear 
address bit in both the token format and a normal return address, so it should 
not conflict with anything. It puts any return address in the kernel half of 
the address space, so would never be created naturally by a userspace program. 
It will not be a valid restore token either, as the kernel address will never 
be pointing to the previous frame in the shadow stack.

When a signal hits, the format pushed to the stack that is handling the signal 
is four 8-byte values (since we are 64-bit only):
|1...old SSP|1...alt stack size|1...alt stack base|0|

The zero (without high bit set) at the end is pushed to act as a guard frame. 
An attacker cannot restore from a point where the frame processed would span 
two shadow stack sigframes because the kernel would detect the missing high 
bit.
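
A userspace sketch of this format may help. The struct and the model_ helpers
are invented for illustration; the field order simply follows the diagram above
left to right, and which direction that maps to in memory is an assumption
here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_FRAME_BIT (1ull << 63)

/* Four 8-byte slots, in the order shown in the diagram. */
struct model_shstk_sigframe {
	uint64_t old_ssp;
	uint64_t alt_size;
	uint64_t alt_base;
	uint64_t guard;		/* always 0, so the high bit is clear */
};

/* Tag each value with bit 63 so it cannot pass as a user return address. */
static struct model_shstk_sigframe
model_frame_encode(uint64_t ssp, uint64_t size, uint64_t base)
{
	struct model_shstk_sigframe f;

	/* Userspace addresses and sizes never have bit 63 set. */
	assert(!((ssp | size | base) & MODEL_FRAME_BIT));

	f.old_ssp  = ssp  | MODEL_FRAME_BIT;
	f.alt_size = size | MODEL_FRAME_BIT;
	f.alt_base = base | MODEL_FRAME_BIT;
	f.guard    = 0;
	return f;
}

/* What a sigreturn-time check could look like in this model. */
static bool model_frame_valid(const struct model_shstk_sigframe *f)
{
	return (f->old_ssp  & MODEL_FRAME_BIT) &&
	       (f->alt_size & MODEL_FRAME_BIT) &&
	       (f->alt_base & MODEL_FRAME_BIT) &&
	       f->guard == 0;
}

/* Strip the tag to recover the saved value. */
static uint64_t model_frame_value(uint64_t slot)
{
	return slot & ~MODEL_FRAME_BIT;
}
```

A view that is shifted by one slot lands the untagged guard zero in a position
where a tagged value is expected, so the check fails. That is the spanning case
the guard frame is meant to catch.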

setjmp()/longjmp()
------------------
In past designs for userspace shadow stacks, shadow alt stacks were not 
supported. Since there was only one shadow stack, longjmp() could jump out of a 
signal by using incssp to unwind the SSP to the place where the setjmp() was 
called. In order to support longjmp() off of an alt shadow stack, a restore 
token could be pushed to the original stack before switching to the alt stack. 
Userspace could search the alt stack for the alt stack sigframe to find the 
restore token, then restore back to it and continue unwinding. However, the 
main point of alt shadow stacks is to handle shadow stack overflows. So 
requiring there be space to push a token would prevent the feature from being 
used for its main purpose. So in this design nothing is pushed to the old
stack.

Since shadow alt stacks are a new feature, longjmp()ing from an alt shadow stack 
will simply not be supported. If a libc wants to support this it will need to
enable WRSS and write its own restore token. This could likely even let it
jump straight back to the setjmp() point and skip the whole incssp piece. It 
could even work for longjmp() after a swapcontext(). So this kernel design 
makes longjmp() support a security/compatibility tradeoff that the kernel is 
not entirely in charge of making.

sigaltshstk() syscall
---------------------
The sigaltstack() syscall works pretty well and is a familiar interface, so
sigaltshstk() is just a copy. It uses the same stack_t struct for transferring 
the shadow stack pointer, size and flags. For the flags however, it will not
honor the meaning of the existing flags. Future flags may not have sensible 
meanings for shadow stack, so sigaltshstk() will start from scratch for flag 
meanings. As long as we are making new flag meanings, we can make SS_AUTODISARM 
the default behavior for sigaltshstk(), and not require a flag. Today the only 
flag supported is SS_DISABLE, and a !SS_AUTODISARM mode is not supported.

sigaltshstk() is separate from sigaltstack(). You can have one without the 
other, neither or both together. Because the shadow stack specific state is 
pushed to the shadow stack, the two features don’t need to know about each 
other.

Preventing use as an arbitrary “set SSP”
----------------------------------------
So now when a signal hits it will jump to the location specified in 
sigaltshstk(). Currently (without WRSS), userspace doesn’t have the ability to 
arbitrarily set the SSP. But telling the kernel to set the SSP to an arbitrary 
point on signal is kind of like that. So there would be a weakening of the 
shadow stack protections unless additional checks are made. With the 
SS_AUTODISARM-style behavior, the SSP will only jump to the alt shadow stack if
it is not already there, otherwise the SSP is just pushed. So
we really only need to worry about the transition to the start of the alt 
shadow stack. So the kernel checks for a token whenever transitioning to the 
alt stack from a place other than the alt stack. This token can be placed when 
doing the allocation using the existing map_shadow_stack syscall.

RFC
---
Lastly, Andy Lutomirski raised the issue of alt shadow stacks (I think) out of 
concern that we might settle on an ABI that wouldn’t support them if there was 
later demand. The ABI of the sigreturn token was actually changed to support alt
shadow stacks here. So if this whole series feels like a lot of code, I wanted
to toss out the option of settling on how we could do alt shadow stacks
someday, but then leave the implementation until later.


4. Compatibility of Existing Binaries/Enabling Interface
========================================================
The last version of this dealt with the problem of old glibcs breaking against
future upstream shadow stack enabled kernels. Unfortunately, more userspace
issues have been found. In anticipation of kernel support, some distros have
apparently been force compiling applications with shadow stack support. Of
course compiling with shadow stack really mostly means marking the elf header 
bit as “this binary supports shadow stack”. And having this bit doesn’t
necessarily mean that the binary actually supports shadow stack. In the case of
JITing or other custom stack switching programs, it often doesn’t. I have come
across at least one popular distro package that completely fails to even start
up, so there are likely more issues hidden in less common code paths. None of
these apps will break until glibc is updated to use the new kernel API for
enabling shadow stack. They will simply not run with shadow stack.

Waiting until glibc updates to break packages might not technically be a kernel 
regression, but it’s not good either. With the current kernel API, the decision
of which binaries to enable shadow stack is left to userspace. So to prevent
breakages my plan is to engage the glibc community to detect and not enable CET
for these old binaries as part of the upstream of glibc CET support that will
work with the new kernel interface. Then only enable CET on future more
carefully compiled binaries. This will also lessen the impact of old CRIUs
(pre-Mike’s changes) failing to save shadow stack enabled programs, as most
existing binaries wouldn't all turn on with CET at once.

5. CRIU Support
===============
Big thanks to Mike Rapoport for a POC [1] that fixes CRIU to work with 
processes that enable shadow stacks. The general design is to allow CET 
features to be unlocked via ptrace only, then WRSS can be used to manipulate 
the shadow stack to allow CRIU’s sigreturn-oriented operation to continue to 
work. He needed a few tweaks to the kernel in order for CRIU to do this, 
including the general CET ptrace support that was missing in recent postings of 
CET. So this is added back in, as well as his new UNLOCK ptrace-only 
arch_prctl(). With the new plan of not trying to enable shadow stack for most 
apps all at once, I wonder if this functionality might also be a good candidate 
for a fast follow up. Note, this CRIU POC will need to be updated to target the 
final signal shadow stack format.

6. Bigger Selftest
==================
There is a new selftest that exercises the shadow stack kernel features without
any special glibc requirements. It manually enables shadow stack with
arch_prctl() and exercises shadow stack arch_prctl(), shadow stack MM,
userfaultfd, signal, and the 2 new syscalls.

[0] https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/lkml/YpYDKVjMEYVlV6Ya@kernel.org/


Kirill A. Shutemov (2):
  x86: Introduce userspace API for CET enabling
  x86: Expose thread features status in /proc/$PID/arch_status

Mike Rapoport (1):
  x86/cet/shstk: Add ARCH_CET_UNLOCK

Rick Edgecombe (11):
  x86/fpu: Add helper for modifying xstate
  mm: Don't allow write GUPs to shadow stack memory
  x86/cet/shstk: Introduce map_shadow_stack syscall
  x86/cet/shstk: Support wrss for userspace
  x86/cet/shstk: Wire in CET interface
  selftests/x86: Add shadow stack test
  x86/cpufeatures: Limit shadow stack to Intel CPUs
  x86: Separate out x86_regset for 32 and 64 bit
  x86: Improve formatting of user_regset arrays
  x86/fpu: Add helper for initing features
  x86: Add alt shadow stack support

Yu-cheng Yu (25):
  Documentation/x86: Add CET description
  x86/cet/shstk: Add Kconfig option for Shadow Stack
  x86/cpufeatures: Add CPU feature flags for shadow stacks
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86/cet: Add user control-protection fault handler
  x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  x86/mm: Move pmd_write(), pud_write() up in the file
  x86/mm: Introduce _PAGE_COW
  x86/mm: Update pte_modify for _PAGE_COW
  x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
    transition from _PAGE_DIRTY to _PAGE_COW
  mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  mm: Introduce VM_SHADOW_STACK for shadow stack memory
  x86/mm: Check Shadow Stack page fault errors
  x86/mm: Update maybe_mkwrite() for shadow stack
  mm: Fixup places that call pte_mkwrite() directly
  mm: Add guard pages around a shadow stack.
  mm/mmap: Add shadow stack pages to memory accounting
  mm/mprotect: Exclude shadow stack from preserve_write
  mm: Re-introduce vm_flags to do_mmap()
  x86/cet/shstk: Add user-mode shadow stack support
  x86/cet/shstk: Handle thread shadow stack
  x86/cet/shstk: Introduce routines modifying shstk
  x86/cet/shstk: Handle signals for shadow stack
  x86/cet: Add PTRACE interface for CET

 Documentation/filesystems/proc.rst            |   1 +
 Documentation/x86/cet.rst                     | 143 ++++
 Documentation/x86/index.rst                   |   1 +
 arch/arm/kernel/signal.c                      |   2 +-
 arch/arm64/kernel/signal.c                    |   2 +-
 arch/arm64/kernel/signal32.c                  |   2 +-
 arch/sparc/kernel/signal32.c                  |   2 +-
 arch/sparc/kernel/signal_64.c                 |   2 +-
 arch/x86/Kconfig                              |  18 +
 arch/x86/Kconfig.assembler                    |   5 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   2 +
 arch/x86/ia32/ia32_signal.c                   |   1 +
 arch/x86/include/asm/cet.h                    |  49 ++
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/fpu/api.h                |   6 +
 arch/x86/include/asm/fpu/regset.h             |   7 +-
 arch/x86/include/asm/fpu/sched.h              |   3 +-
 arch/x86/include/asm/fpu/types.h              |  14 +-
 arch/x86/include/asm/fpu/xstate.h             |   6 +-
 arch/x86/include/asm/idtentry.h               |   2 +-
 arch/x86/include/asm/mmu_context.h            |   2 +
 arch/x86/include/asm/msr-index.h              |   5 +
 arch/x86/include/asm/msr.h                    |  11 +
 arch/x86/include/asm/pgtable.h                | 314 ++++++++-
 arch/x86/include/asm/pgtable_types.h          |  48 +-
 arch/x86/include/asm/processor.h              |  11 +
 arch/x86/include/asm/special_insns.h          |  13 +
 arch/x86/include/asm/trap_pf.h                |   2 +
 arch/x86/include/uapi/asm/mman.h              |   2 +
 arch/x86/include/uapi/asm/prctl.h             |  10 +
 arch/x86/kernel/Makefile                      |   4 +
 arch/x86/kernel/cpu/common.c                  |  30 +-
 arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
 arch/x86/kernel/fpu/core.c                    |  59 +-
 arch/x86/kernel/fpu/regset.c                  |  95 +++
 arch/x86/kernel/fpu/xstate.c                  | 198 +++---
 arch/x86/kernel/fpu/xstate.h                  |   6 +
 arch/x86/kernel/idt.c                         |   2 +-
 arch/x86/kernel/proc.c                        |  63 ++
 arch/x86/kernel/process.c                     |  24 +-
 arch/x86/kernel/process_64.c                  |   8 +-
 arch/x86/kernel/ptrace.c                      | 188 +++--
 arch/x86/kernel/shstk.c                       | 628 +++++++++++++++++
 arch/x86/kernel/signal.c                      |  10 +
 arch/x86/kernel/signal_compat.c               |   2 +-
 arch/x86/kernel/traps.c                       |  98 ++-
 arch/x86/mm/fault.c                           |  21 +
 arch/x86/mm/mmap.c                            |  25 +
 arch/x86/mm/pat/set_memory.c                  |   2 +-
 arch/x86/xen/enlighten_pv.c                   |   2 +-
 arch/x86/xen/xen-asm.S                        |   2 +-
 fs/aio.c                                      |   2 +-
 fs/proc/task_mmu.c                            |   3 +
 include/linux/mm.h                            |  38 +-
 include/linux/pgtable.h                       |  14 +
 include/linux/syscalls.h                      |   2 +
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/asm-generic/unistd.h             |   2 +-
 include/uapi/linux/elf.h                      |   1 +
 ipc/shm.c                                     |   2 +-
 kernel/sys_ni.c                               |   2 +
 mm/gup.c                                      |   2 +-
 mm/huge_memory.c                              |  16 +-
 mm/memory.c                                   |   3 +-
 mm/migrate_device.c                           |   3 +-
 mm/mmap.c                                     |  22 +-
 mm/mprotect.c                                 |   7 +
 mm/nommu.c                                    |   4 +-
 mm/userfaultfd.c                              |  10 +-
 mm/util.c                                     |   2 +-
 tools/testing/selftests/x86/Makefile          |   4 +-
 .../testing/selftests/x86/test_shadow_stack.c | 646 ++++++++++++++++++
 73 files changed, 2670 insertions(+), 281 deletions(-)
 create mode 100644 Documentation/x86/cet.rst
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/kernel/proc.c
 create mode 100644 arch/x86/kernel/shstk.c
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c


base-commit: f76349cf41451c5c42a99f18a9163377e4b364ff
-- 
2.17.1



* [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
@ 2022-09-29 22:28 ` Rick Edgecombe
  2022-09-30  3:41   ` Bagas Sanjaya
                     ` (3 more replies)
  2022-09-29 22:28 ` [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack Rick Edgecombe
                   ` (38 subsequent siblings)
  39 siblings, 4 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:28 UTC
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Introduce a new document on Control-flow Enforcement Technology (CET).

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Updated to new arch_prctl() API
 - Add bit about new proc status

v1:
 - Update and clarify the docs.
 - Moved kernel parameters documentation to other patch.

 Documentation/x86/cet.rst   | 140 ++++++++++++++++++++++++++++++++++++
 Documentation/x86/index.rst |   1 +
 2 files changed, 141 insertions(+)
 create mode 100644 Documentation/x86/cet.rst

diff --git a/Documentation/x86/cet.rst b/Documentation/x86/cet.rst
new file mode 100644
index 000000000000..4a0dfb6830f9
--- /dev/null
+++ b/Documentation/x86/cet.rst
@@ -0,0 +1,140 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Control-flow Enforcement Technology (CET)
+=========================================
+
+Overview
+========
+
+Control-flow Enforcement Technology (CET) is a term referring to several
+related x86 processor features that provide protection against control
+flow hijacking attacks. The HW feature itself can be set up to protect
+both applications and the kernel. Only user-mode protection is implemented
+in the 64-bit kernel.
+
+CET introduces Shadow Stack and Indirect Branch Tracking. Shadow stack is
+a secondary stack allocated from memory and cannot be directly modified by
+applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. Indirect branch tracking verifies indirect
+CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
+opcodes. Not all CPUs have both Shadow Stack and Indirect Branch Tracking
+and only Shadow Stack is currently supported in the kernel.
+
+The Kconfig option is X86_SHADOW_STACK, and it can be disabled with
+the kernel parameter clearcpuid, like this: "clearcpuid=shstk".
+
+To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or LLVM v10.0.1
+or later are required. To build a CET-enabled application, GLIBC v2.28 or
+later is also required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF header and can be
+verified from readelf/llvm-readelf output:
+
+    readelf -n <application> | grep -a SHSTK
+        properties: x86 feature: SHSTK
+
+The kernel does not process these applications directly. Applications must
+enable them using the interface described in section 4. Typically this
+would be done in dynamic loader or static runtime objects, as is the case
+in glibc.
+
+Backward Compatibility
+======================
+
+GLIBC provides a few CET tunables via the GLIBC_TUNABLES environment
+variable:
+
+GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-WRSS
+    Turn off SHSTK/WRSS.
+
+GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
+    This controls how dlopen() handles SHSTK legacy libraries::
+
+        on         - continue with SHSTK enabled;
+        permissive - continue with SHSTK off.
+
+Details can be found in the GLIBC manual pages.
+
+CET arch_prctl()'s
+==================
+
+ELF features should be enabled by the loader using the below arch_prctl's.
+
+arch_prctl(ARCH_CET_ENABLE, unsigned int feature)
+    Enable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_CET_DISABLE, unsigned int feature)
+    Disable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_CET_LOCK, unsigned int features)
+    Lock in features at their current enabled or disabled status.
+
+The return values are as follows:
+    On success, return 0. On error, errno can be::
+
+        -EPERM if any of the passed features are locked.
+        -EOPNOTSUPP if the feature is not supported by the hardware or
+         disabled by kernel parameter.
+        -EINVAL if the arguments are invalid (non-existent feature, etc.)
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc status
+===========
+To check if an application is actually running with shadow stack, the
+user can read /proc/$PID/arch_status. It will report "wrss" or
+"shstk" depending on what is enabled.
+
+The implementation of the Shadow Stack
+======================================
+
+Shadow Stack size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. However,
+because a compat-mode application's address space is smaller, each of its
+threads' shadow stacks is sized MIN(1/4 RLIMIT_STACK, 4 GB).
+
+Signal
+------
+
+By default, the main program and its signal handlers use the same shadow
+stack. Because the shadow stack stores only return addresses, a large
+shadow stack covers the condition that both the program stack and the
+signal alternate stack run out.
+
+The kernel creates a restore token for the shadow stack and pushes the
+restorer address to the shadow stack. It then verifies that token when
+restoring from the signal handler.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow stack access triggers a page fault with the shadow stack access bit
+set in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread.
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index c73d133fd37c..9ac03055c4b5 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -22,6 +22,7 @@ x86-specific Documentation
    mtrr
    pat
    intel-hfi
+   cet
    iommu
    intel_txt
    amd-memory-encryption
-- 
2.17.1
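
The sizing rule in the "Shadow Stack size" section of the cet.rst text
above can be restated as a small userspace model. This is only a sketch
of the documented formula, MIN(RLIMIT_STACK, 4 GB) and
MIN(1/4 RLIMIT_STACK, 4 GB) for compat mode. The helper name and the
SZ_4G_MODEL constant are made up for illustration, and any alignment or
rounding the kernel may apply is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* 4 GB cap named in the documentation (constant name is hypothetical). */
#define SZ_4G_MODEL (UINT64_C(4) << 30)

/*
 * Hypothetical helper, not a kernel function: models only the documented
 * sizing rule. stack_rlimit is the RLIMIT_STACK value in bytes; a caller
 * can pass UINT64_MAX to stand in for an unlimited stack.
 */
static uint64_t shstk_size_model(uint64_t stack_rlimit, bool compat)
{
	/* Compat-mode tasks get a quarter of the stack limit. */
	uint64_t size = compat ? stack_rlimit / 4 : stack_rlimit;

	/* Either way the result is capped at 4 GB. */
	return size < SZ_4G_MODEL ? size : SZ_4G_MODEL;
}
```

With the common 8 MB stack limit, this model gives an 8 MB shadow stack
for a 64-bit task and 2 MB per thread for a compat-mode task.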



* [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
  2022-09-29 22:28 ` [PATCH v2 01/39] Documentation/x86: Add CET description Rick Edgecombe
@ 2022-09-29 22:28 ` Rick Edgecombe
  2022-10-03 13:40   ` Kirill A . Shutemov
                     ` (3 more replies)
  2022-09-29 22:29 ` [PATCH v2 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
                   ` (37 subsequent siblings)
  39 siblings, 4 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:28 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Shadow Stack provides protection against function return address
corruption. It is active when the processor supports it, the kernel has
CONFIG_X86_SHADOW_STACK enabled, and the application is built for the
feature. This is only implemented for the 64-bit kernel. When it is
enabled, legacy non-Shadow Stack applications continue to work, but without
protection.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Remove already wrong kernel size increase info (tglx)
 - Change prompt to remove "Intel" (tglx)
 - Update line about what CPUs are supported (Dave)

Yu-cheng v25:
 - Remove X86_CET and use X86_SHADOW_STACK directly.

Yu-cheng v24:
 - Update for the splitting X86_CET to X86_SHADOW_STACK and X86_IBT.

 arch/x86/Kconfig           | 18 ++++++++++++++++++
 arch/x86/Kconfig.assembler |  5 +++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f9920f1341c8..b68eb75887b8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,6 +26,7 @@ config X86_64
 	depends on 64BIT
 	# Options that are inherently 64-bit kernel only:
 	select ARCH_HAS_GIGANTIC_PAGE
+	select ARCH_HAS_SHADOW_STACK
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select HAVE_ARCH_SOFT_DIRTY
@@ -1936,6 +1937,23 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config ARCH_HAS_SHADOW_STACK
+	def_bool n
+
+config X86_SHADOW_STACK
+	prompt "X86 Shadow Stack"
+	def_bool n
+	depends on ARCH_HAS_SHADOW_STACK
+	select ARCH_USES_HIGH_VMA_FLAGS
+	help
+	  Shadow Stack protection is a hardware feature that detects function
+	  return address corruption. Today the kernel's support is limited to
+	  virtualizing it in KVM guests.
+
+	  CPUs supporting shadow stacks were first released in 2020.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 26b8c08e2fc4..00c79dd93651 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -19,3 +19,8 @@ config AS_TPAUSE
 	def_bool $(as-instr,tpause %ecx)
 	help
 	  Supported by binutils >= 2.31.1 and LLVM integrated assembler >= V7
+
+config AS_WRUSS
+	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
+	help
+	  Supported by binutils >= 2.31 and LLVM integrated assembler
-- 
2.17.1



* [PATCH v2 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
  2022-09-29 22:28 ` [PATCH v2 01/39] Documentation/x86: Add CET description Rick Edgecombe
  2022-09-29 22:28 ` [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 17:26   ` Kees Cook
  2022-10-14 16:20   ` Borislav Petkov
  2022-09-29 22:29 ` [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
                   ` (36 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The Control-Flow Enforcement Technology contains two related features,
one of which is Shadow Stacks. Future patches will utilize this feature
for shadow stack support in KVM, so add a CPU feature flag for Shadow
Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).

To protect shadow stack state from malicious modification, the registers
are only accessible in supervisor mode. This implementation
context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
on XSAVES.
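
For readers who want to look at the same CPUID bit from userspace, here
is a minimal sketch. It assumes GCC or Clang on x86, where <cpuid.h>
provides __get_cpuid_count(); the helper names are made up for
illustration. It only reports what the hardware enumerates and says
nothing about whether the kernel has shadow stack support enabled.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=7,ECX=0):ECX[bit 7], as given in the commit message. */
#define SHSTK_ECX_BIT 7

/* Hypothetical helper: decode the shadow stack bit from a raw ECX value. */
static bool ecx_reports_shstk(uint32_t ecx)
{
	return (ecx >> SHSTK_ECX_BIT) & 1u;
}

/* Hypothetical helper: query leaf 7, subleaf 0 on the running CPU. */
static bool cpu_reports_shstk(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* Returns 0 when leaf 7 is not available on this CPU. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;

	return ecx_reports_shstk(ecx);
#else
	return false;	/* not an x86 build, nothing to query */
#endif
}
```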

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Remove IBT reference in commit log (Kees)
 - Describe xsaves dependency using text from (Dave)

v1:
 - Remove IBT, can be added in a follow on IBT series.

Yu-cheng v25:
 - Make X86_FEATURE_IBT depend on X86_FEATURE_SHSTK.

Yu-cheng v24:
 - Update for splitting CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK and
   CONFIG_X86_IBT.
 - Move DISABLE_IBT definition to the IBT series.

 arch/x86/include/asm/cpufeatures.h       | 1 +
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 arch/x86/kernel/cpu/cpuid-deps.c         | 1 +
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ef4775c6db01..d0b49da95c70 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -365,6 +365,7 @@
 #define X86_FEATURE_OSPKE		(16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_WAITPKG		(16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
 #define X86_FEATURE_AVX512_VBMI2	(16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK		(16*32+ 7) /* Shadow Stack */
 #define X86_FEATURE_GFNI		(16*32+ 8) /* Galois Field New Instructions */
 #define X86_FEATURE_VAES		(16*32+ 9) /* Vector AES */
 #define X86_FEATURE_VPCLMULQDQ		(16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 33d2cd04d254..00fe41eee92d 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -87,6 +87,12 @@
 # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
 #endif
 
+#ifdef CONFIG_X86_SHADOW_STACK
+#define DISABLE_SHSTK	0
+#else
+#define DISABLE_SHSTK	(1 << (X86_FEATURE_SHSTK & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -107,7 +113,7 @@
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD)
+			 DISABLE_ENQCMD|DISABLE_SHSTK)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK19	0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index c881bcafba7d..bf1b55a1ba21 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -78,6 +78,7 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVES    },
 	{ X86_FEATURE_XFD,			X86_FEATURE_XGETBV1   },
 	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XFD       },
+	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
 	{}
 };
 
-- 
2.17.1



* [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (2 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 17:31   ` Kees Cook
                     ` (2 more replies)
  2022-09-29 22:29 ` [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
                   ` (35 subsequent siblings)
  39 siblings, 3 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Utilizing CET features requires a CR4 bit to be enabled as well as bits
to be set in CET MSRs. Setting the CR4 bit does three things:
 1. Enables the usage of the WRUSS instruction, which the kernel can use to
    write to userspace shadow stacks.
 2. Allows the individual aspects of CET to be enabled later via the MSR.
 3. Allows CET to be enabled in guests.

While future patches will allow the MSR values to be saved and restored
per task, the CR4 bit will allow for WRUSS to be used regardless of
whether a task's CET MSRs have been restored.

Kernel IBT already enables the CET CR4 bit when it detects IBT HW support
and is configured with kernel IBT. However, future patches that enable
userspace shadow stack support will need the bit set as well. So change
the logic to enable it in either case.

Clear MSR_IA32_U_CET in cet_disable() so that it can't live to see
userspace in a new kexec-ed kernel that has CR4.CET set from kernel IBT.
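
The "enable it in either case" logic can be modelled outside the kernel
as a pure function. This is an illustrative sketch only: the struct and
function names are hypothetical, the ENDBR enable bit is assumed to be
bit 2 of the supervisor CET MSR, and the IBT selftest step is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the ENDBR enable bit in the supervisor CET MSR. */
#define CET_ENDBR_EN_MODEL (UINT64_C(1) << 2)

/* Hypothetical result type: what the setup step would program. */
struct cet_setup_model {
	bool     set_cr4_cet;	/* whether CR4.CET gets set */
	uint64_t s_cet_msr;	/* value written to the supervisor CET MSR */
};

/*
 * Model of the decision described above: CR4.CET is set when either
 * kernel IBT or user shadow stack is in use, but the supervisor MSR
 * only gets the ENDBR bit in the kernel IBT case.
 */
static struct cet_setup_model setup_cet_model(bool kernel_ibt, bool user_shstk)
{
	struct cet_setup_model out = { .set_cr4_cet = false, .s_cet_msr = 0 };

	if (!kernel_ibt && !user_shstk)
		return out;	/* neither feature in use: touch nothing */

	if (kernel_ibt)
		out.s_cet_msr = CET_ENDBR_EN_MODEL;

	out.set_cr4_cet = true;
	return out;
}
```

The case of interest for this patch is user shadow stack without kernel
IBT: CR4.CET is set while the supervisor MSR is written as zero.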

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - In the shadow stack case, go back to only setting CR4.CET if the
   kernel is compiled with user shadow stack support.
 - Clear MSR_IA32_U_CET as well. (PeterZ)

KVM refresh:
 - Set CR4.CET if SHSTK or IBT are supported by HW, so that KVM can
   support CET even if IBT is disabled.
 - Drop no_user_shstk (Dave Hansen)
 - Elaborate on what the CR4 bit does in the commit log
 - Integrate with Kernel IBT logic

v1:
 - Moved kernel-parameters.txt changes here from patch 1.

Yu-cheng v25:
 - Remove software-defined X86_FEATURE_CET.

 arch/x86/kernel/cpu/common.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e508f239098..d7415bb556b2 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -598,16 +598,21 @@ __noendbr void ibt_restore(u64 save)
 
 static __always_inline void setup_cet(struct cpuinfo_x86 *c)
 {
-	u64 msr = CET_ENDBR_EN;
+	bool kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
+	bool user_shstk = IS_ENABLED(CONFIG_X86_SHADOW_STACK) &&
+			  cpu_feature_enabled(X86_FEATURE_SHSTK);
+	u64 msr = 0;
 
-	if (!HAS_KERNEL_IBT ||
-	    !cpu_feature_enabled(X86_FEATURE_IBT))
+	if (!kernel_ibt && !user_shstk)
 		return;
 
+	if (kernel_ibt)
+		msr = CET_ENDBR_EN;
+
 	wrmsrl(MSR_IA32_S_CET, msr);
 	cr4_set_bits(X86_CR4_CET);
 
-	if (!ibt_selftest()) {
+	if (kernel_ibt && !ibt_selftest()) {
 		pr_err("IBT selftest: Failed!\n");
 		setup_clear_cpu_cap(X86_FEATURE_IBT);
 		return;
@@ -616,10 +621,15 @@ static __always_inline void setup_cet(struct cpuinfo_x86 *c)
 
 __noendbr void cet_disable(void)
 {
-	if (cpu_feature_enabled(X86_FEATURE_IBT))
-		wrmsrl(MSR_IA32_S_CET, 0);
+	if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
+	      cpu_feature_enabled(X86_FEATURE_SHSTK)))
+		return;
+
+	wrmsrl(MSR_IA32_S_CET, 0);
+	wrmsrl(MSR_IA32_U_CET, 0);
 }
 
+
 /*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
-- 
2.17.1



* [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (3 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 17:40   ` Kees Cook
  2022-10-15  9:46   ` Borislav Petkov
  2022-09-29 22:29 ` [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate Rick Edgecombe
                   ` (34 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Shadow stack register state can be managed with XSAVE. The registers
can logically be separated into two groups:
        * Registers controlling user-mode operation
        * Registers controlling kernel-mode operation

The architecture has two new XSAVE state components: one for each of
those groups of registers. This lets an OS manage them separately if
it chooses. Future patches for host userspace and KVM guests will only
utilize the user-mode registers, so only configure XSAVE to save
user-mode registers. This state will add 16 bytes to the xsave buffer
size.

Future patches will use the user-mode XSAVE area to save guest user-mode
CET state. However, VMCS includes new fields for guest CET supervisor
states. KVM can use these to save and restore guest supervisor state, so
host supervisor XSAVE support is not required.

Adding this exacerbates the already unwieldy if statement in
check_xstate_against_struct() that handles warning about un-implemented
xfeatures. So refactor these checks by having XCHECK_SZ() set a bool when
it actually checks the xfeature. This ends up exceeding 80 chars, but was
better on balance than other options explored. Pass the bool as pointer to
make it clear that XCHECK_SZ() can change the variable.
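
The shape of that refactor can be shown with a small userspace sketch.
All names here are hypothetical stand-ins (MODEL_CHECK_SZ, the two model
structs, the component numbers 9 and 11 taken from the enum order in the
diff), and fprintf() replaces WARN_ONCE(). The point is only that the
macro records "a struct exists for this component" through a pointer, so
the caller no longer needs a hand-maintained list of holes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for two xstate components and their structs. */
enum { MODEL_PKRU = 9, MODEL_CET_USER = 11 };

struct pkru_model     { uint32_t pkru; uint32_t pad; };
struct cet_user_model { uint64_t user_cet; uint64_t user_ssp; };

/*
 * If nr matches this component, mark it as checked and warn when the
 * CPU-reported size differs from the software struct size.
 */
#define MODEL_CHECK_SZ(checked, sz, nr, nr_macro, type) do {		\
	if ((nr) == (nr_macro)) {					\
		*(checked) = true;					\
		if ((size_t)(sz) != sizeof(type))			\
			fprintf(stderr,					\
				"%s: struct is %zu bytes, cpu state %d bytes\n", \
				#nr_macro, sizeof(type), (sz));		\
	}								\
} while (0)

/* Returns false when no struct is known for component nr. */
static bool model_check_against_struct(int nr, int sz)
{
	bool checked = false;

	MODEL_CHECK_SZ(&checked, sz, nr, MODEL_PKRU,     struct pkru_model);
	MODEL_CHECK_SZ(&checked, sz, nr, MODEL_CET_USER, struct cet_user_model);

	return checked;
}
```

The two u64 fields of the user CET state also account for the 16 bytes
mentioned earlier in this commit message.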

While configuring user-mode XSAVE, clarify that kernel-mode registers are not
managed by XSAVE by defining the xfeature in
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED, like is done for XFEATURE_MASK_PT.
This serves more of a documentation as code purpose, and functionally,
only enables a few safety checks.

Both XSAVE state components are supervisor states, even the state
controlling user-mode operation. This is a departure from earlier features
like protection keys where the PKRU state is a normal user (non-supervisor)
state. Having the user state be supervisor-managed ensures there is no
direct, unprivileged access to it, making it harder for an attacker to
subvert CET.

To facilitate this privileged access, define the two user-mode CET MSRs,
and the bits defined in those MSRs relevant to future shadow stack
enablement patches.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Change name to XFEATURE_CET_KERNEL_UNUSED (peterz)

KVM refresh:
 - Reword commit log using some verbiage posted by Dave Hansen
 - Remove unlikely to be used supervisor cet xsave struct
 - Clarify that supervisor cet state is not saved by xsave
 - Remove unused supervisor MSRs

v1:
 - Remove outdated reference to sigreturn checks on msr's.

Yu-cheng v29:
 - Move CET MSR definition up in msr-index.h.

 arch/x86/include/asm/fpu/types.h  | 14 ++++-
 arch/x86/include/asm/fpu/xstate.h |  6 +-
 arch/x86/kernel/fpu/xstate.c      | 93 ++++++++++++++++---------------
 3 files changed, 63 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb7cd1139d97..344baad02b97 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -115,8 +115,8 @@ enum xfeature {
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 	XFEATURE_PKRU,
 	XFEATURE_PASID,
-	XFEATURE_RSRVD_COMP_11,
-	XFEATURE_RSRVD_COMP_12,
+	XFEATURE_CET_USER,
+	XFEATURE_CET_KERNEL_UNUSED,
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
@@ -138,6 +138,8 @@ enum xfeature {
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
+#define XFEATURE_MASK_CET_USER		(1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL_UNUSED)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
 #define XFEATURE_MASK_XTILE_CFG		(1 << XFEATURE_XTILE_CFG)
 #define XFEATURE_MASK_XTILE_DATA	(1 << XFEATURE_XTILE_DATA)
@@ -252,6 +254,14 @@ struct pkru_state {
 	u32				pad;
 } __packed;
 
+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+	u64 user_cet;			/* user control-flow settings */
+	u64 user_ssp;			/* user shadow stack pointer */
+};
+
 /*
  * State component 15: Architectural LBR configuration state.
  * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cd3dd170e23a..d4427b88ee12 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -50,7 +50,8 @@
 #define XFEATURE_MASK_USER_DYNAMIC	XFEATURE_MASK_XTILE_DATA
 
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+					    XFEATURE_MASK_CET_USER)
 
 /*
  * A supervisor state component may not always contain valuable information,
@@ -77,7 +78,8 @@
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+					      XFEATURE_MASK_CET_KERNEL)
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8340156bfd2..5e6a4867fd05 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -39,26 +39,26 @@
  */
 static const char *xfeature_names[] =
 {
-	"x87 floating point registers"	,
-	"SSE registers"			,
-	"AVX registers"			,
-	"MPX bounds registers"		,
-	"MPX CSR"			,
-	"AVX-512 opmask"		,
-	"AVX-512 Hi256"			,
-	"AVX-512 ZMM_Hi256"		,
-	"Processor Trace (unused)"	,
-	"Protection Keys User registers",
-	"PASID state",
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"unknown xstate feature"	,
-	"AMX Tile config"		,
-	"AMX Tile data"			,
-	"unknown xstate feature"	,
+	"x87 floating point registers"			,
+	"SSE registers"					,
+	"AVX registers"					,
+	"MPX bounds registers"				,
+	"MPX CSR"					,
+	"AVX-512 opmask"				,
+	"AVX-512 Hi256"					,
+	"AVX-512 ZMM_Hi256"				,
+	"Processor Trace (unused)"			,
+	"Protection Keys User registers"		,
+	"PASID state"					,
+	"Control-flow User registers"			,
+	"Control-flow Kernel registers (unused)"	,
+	"unknown xstate feature"			,
+	"unknown xstate feature"			,
+	"unknown xstate feature"			,
+	"unknown xstate feature"			,
+	"AMX Tile config"				,
+	"AMX Tile data"					,
+	"unknown xstate feature"			,
 };
 
 static unsigned short xsave_cpuid_features[] __initdata = {
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
 	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
 	[XFEATURE_PKRU]				= X86_FEATURE_PKU,
 	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
+	[XFEATURE_CET_USER]			= X86_FEATURE_SHSTK,
 	[XFEATURE_XTILE_CFG]			= X86_FEATURE_AMX_TILE,
 	[XFEATURE_XTILE_DATA]			= X86_FEATURE_AMX_TILE,
 };
@@ -276,6 +277,7 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
+	print_xstate_feature(XFEATURE_MASK_CET_USER);
 	print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
 	print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
@@ -344,6 +346,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
 	 XFEATURE_MASK_BNDREGS |		\
 	 XFEATURE_MASK_BNDCSR |			\
 	 XFEATURE_MASK_PASID |			\
+	 XFEATURE_MASK_CET_USER |		\
 	 XFEATURE_MASK_XTILE)
 
 /*
@@ -446,13 +449,14 @@ static void __init __xstate_dump_leaves(void)
 	}									\
 } while (0)
 
-#define XCHECK_SZ(sz, nr, nr_macro, __struct) do {			\
-	if ((nr == nr_macro) &&						\
-	    WARN_ONCE(sz != sizeof(__struct),				\
-		"%s: struct is %zu bytes, cpu state %d bytes\n",	\
-		__stringify(nr_macro), sizeof(__struct), sz)) {		\
-		__xstate_dump_leaves();					\
-	}								\
+#define XCHECK_SZ(checked, sz, nr, nr_macro, __struct) do {			\
+	if (nr == nr_macro) {							\
+		*checked = true;						\
+		if (WARN_ONCE(sz != sizeof(__struct),				\
+			      "%s: struct is %zu bytes, cpu state %d bytes\n",	\
+			      __stringify(nr_macro), sizeof(__struct), sz))	\
+			__xstate_dump_leaves();					\
+	}									\
 } while (0)
 
 /**
@@ -527,33 +531,30 @@ static bool __init check_xstate_against_struct(int nr)
 	 * Ask the CPU for the size of the state.
 	 */
 	int sz = xfeature_size(nr);
+	bool chked = false;
+
 	/*
 	 * Match each CPU state with the corresponding software
 	 * structure.
 	 */
-	XCHECK_SZ(sz, nr, XFEATURE_YMM,       struct ymmh_struct);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDREGS,   struct mpx_bndreg_state);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDCSR,    struct mpx_bndcsr_state);
-	XCHECK_SZ(sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
-	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
-	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
-	XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_YMM,       struct ymmh_struct);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_BNDREGS,   struct mpx_bndreg_state);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_BNDCSR,    struct mpx_bndcsr_state);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_PKRU,      struct pkru_state);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
+	XCHECK_SZ(&chked, sz, nr, XFEATURE_CET_USER,  struct cet_user_state);
 
 	/* The tile data size varies between implementations. */
-	if (nr == XFEATURE_XTILE_DATA)
+	if (nr == XFEATURE_XTILE_DATA) {
 		check_xtile_data_against_struct(sz);
+		chked = true;
+	}
 
-	/*
-	 * Make *SURE* to add any feature numbers in below if
-	 * there are "holes" in the xsave state component
-	 * numbers.
-	 */
-	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX) ||
-	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
-	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
+	if (!chked) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 		return false;
-- 
2.17.1



* [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (4 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 17:48   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 07/39] x86/cet: Add user control-protection fault handler Rick Edgecombe
                   ` (33 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

Just like user xfeatures, supervisor xfeatures can be active in the
registers or present in the task FPU buffer. If the registers are
active, the registers can be modified directly. If the registers are
not active, the modification must be performed on the task FPU buffer.

When the state is not active, the kernel could perform modifications
directly to the buffer. But in order for it to do that, it needs
to know where in the buffer the specific state it wants to modify is
located. Doing this is not robust against optimizations that compact
the FPU buffer, as each access would require computing where in the
buffer it is.

The easiest way to modify supervisor xfeature data is to force restore
the registers and write directly to the MSRs. Oftentimes this is just fine
anyway as the registers need to be restored before returning to userspace.
Do this for now, leaving buffer writing optimizations for the future.

Add a new function fpu_lock_and_load() that can simultaneously call
fpregs_lock() and do this restore. Also perform some extra sanity
checks in this function since this will be used in non-fpu focused code.
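
As a rough illustration of the approach, the toy model below treats a
task's supervisor state as living either in a saved buffer or in the
live registers. Everything in it is hypothetical (struct task_model, the
*_model helpers, the single "ssp" value). It is not the kernel
implementation, only a sketch of "force the restore first, then modify
the live copy".

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for one task's supervisor xfeature state. */
struct task_model {
	bool     need_load;	/* stands in for TIF_NEED_FPU_LOAD */
	uint64_t buffer_ssp;	/* copy saved in the task FPU buffer */
	uint64_t reg_ssp;	/* stands in for the live register/MSR */
	int      lock_depth;	/* stands in for the fpregs lock */
};

/* Take the lock and make sure the live registers hold the task state. */
static void lock_and_load_model(struct task_model *t)
{
	t->lock_depth++;

	if (t->need_load) {
		t->reg_ssp = t->buffer_ssp;	/* forced restore */
		t->need_load = false;
	}
}

static void unlock_model(struct task_model *t)
{
	t->lock_depth--;
}

/* A caller never needs to know where in the buffer the state lives. */
static void set_ssp_model(struct task_model *t, uint64_t ssp)
{
	lock_and_load_model(t);
	t->reg_ssp = ssp;	/* write the live copy directly */
	unlock_model(t);
}
```

The trade-off mirrors the commit message: the caller pays for a restore
it might not strictly need, in exchange for not computing offsets into a
possibly compacted buffer.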

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Drop optimization of writing directly to the buffer, and change API
   accordingly.
 - fpregs_lock_and_load() suggested by tglx
 - Some commit log verbiage from dhansen

v1:
 - New patch.

 arch/x86/include/asm/fpu/api.h |  6 ++++++
 arch/x86/kernel/fpu/core.c     | 19 +++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 503a577814b2..3a86ee18ae99 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -82,6 +82,12 @@ static inline void fpregs_unlock(void)
 		preempt_enable();
 }
 
+/*
+ * Lock and load the fpu state into the registers, if they are not already
+ * loaded.
+ */
+void fpu_lock_and_load(void);
+
 #ifdef CONFIG_X86_DEBUG_FPU
 extern void fpregs_assert_state_consistent(void);
 #else
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 3b28c5b25e12..778d3054ccc7 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -756,6 +756,25 @@ void switch_fpu_return(void)
 }
 EXPORT_SYMBOL_GPL(switch_fpu_return);
 
+void fpu_lock_and_load(void)
+{
+	/*
+	 * fpregs_lock() only disables preemption (mostly). So modifying state
+	 * in an interrupt could screw up some in progress fpregs operation,
+	 * but appear to work. Warn about it.
+	 */
+	WARN_ON_ONCE(!irq_fpu_usable());
+	WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+	fpregs_lock();
+
+	fpregs_assert_state_consistent();
+
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		fpregs_restore_userregs();
+}
+EXPORT_SYMBOL_GPL(fpu_lock_and_load);
+
 #ifdef CONFIG_X86_DEBUG_FPU
 /*
  * If current FPU state according to its tracking (loaded FPU context on this
-- 
2.17.1



* [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (5 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 14:01   ` Kirill A . Shutemov
                     ` (4 more replies)
  2022-09-29 22:29 ` [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
                   ` (32 subsequent siblings)
  39 siblings, 5 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT. Refactor this fault handler into separate user and kernel handlers,
like the page fault handler. Add a control-protection handler for usermode.

The control-protection fault handler works in a similar way as the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.
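
From the userspace side, a SIGSEGV handler could tell this fault apart
from other segfaults by checking si_code. The following is a minimal
sketch, not part of this patch: describe_segv(), on_segv() and
install_segv_handler() are hypothetical names, and the fallback
definition of SEGV_CPERR assumes the value 10 that this patch adds to
siginfo.h, for libc headers that do not define it yet.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#ifndef SEGV_CPERR
#define SEGV_CPERR 10	/* assumed: the value this patch adds to siginfo.h */
#endif

/* Map a SIGSEGV si_code to a short description. */
static const char *describe_segv(int si_code)
{
	switch (si_code) {
	case SEGV_MAPERR:
		return "address not mapped";
	case SEGV_ACCERR:
		return "invalid permissions";
	case SEGV_CPERR:
		return "control protection fault";
	default:
		return "other SIGSEGV cause";
	}
}

/* Report the cause and exit; returning would re-execute the fault. */
static void on_segv(int sig, siginfo_t *info, void *ucontext)
{
	const char *msg = describe_segv(info->si_code);

	(void)sig;
	(void)ucontext;
	if (write(STDERR_FILENO, msg, strlen(msg)) < 0)
		_exit(2);
	if (write(STDERR_FILENO, "\n", 1) < 0)
		_exit(2);
	_exit(1);
}

/* Install the handler with SA_SIGINFO so that si_code is delivered. */
static int install_segv_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}
```

Note that the handler in the patch passes a NULL fault address to
force_sig_fault(), so si_addr carries no information for this si_code.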

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>

---

v2:
 - Integrate with kernel IBT fault handler
 - Update printed messages. (Dave)
 - Remove array_index_nospec() usage. (Dave)
 - Remove IBT messages. (Dave)
 - Add enclave error code bit processing in case it can get triggered
   somehow.
 - Add extra "unknown" in control_protection_err.

v1:
 - Update static asserts for NSIGSEGV

Yu-cheng v29:
 - Remove pr_emerg() since it is followed by die().
 - Change boot_cpu_has() to cpu_feature_enabled().

Yu-cheng v25:
 - Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
 - Change X86_FEATURE_CET to X86_FEATURE_SHSTK.

 arch/arm/kernel/signal.c           |  2 +-
 arch/arm64/kernel/signal.c         |  2 +-
 arch/arm64/kernel/signal32.c       |  2 +-
 arch/sparc/kernel/signal32.c       |  2 +-
 arch/sparc/kernel/signal_64.c      |  2 +-
 arch/x86/include/asm/idtentry.h    |  2 +-
 arch/x86/kernel/idt.c              |  2 +-
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 98 ++++++++++++++++++++++++++----
 arch/x86/xen/enlighten_pv.c        |  2 +-
 arch/x86/xen/xen-asm.S             |  2 +-
 include/uapi/asm-generic/siginfo.h |  3 +-
 12 files changed, 97 insertions(+), 24 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index ea128e32e8ca..fa47b8754624 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 9ad911f1647c..81b13a21046e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1166,7 +1166,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..6768c9d4468c 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..90cce3614ead 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 879ef8c72f5c..d441804443d5 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d62b2cb85cea..b7dde8730236 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -211,12 +211,6 @@ DEFINE_IDTENTRY(exc_overflow)
 	do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
 }
 
-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
 enum cp_error_code {
 	CP_EC        = (1 << 15) - 1,
 
@@ -229,16 +223,74 @@ enum cp_error_code {
 	CP_ENCL	     = 1 << 15,
 };
 
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+static void do_user_control_protection_fault(struct pt_regs *regs,
+					     unsigned long error_code)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/* Read SSP before enabling interrupts. */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned int cpec;
+
+		cpec = error_code & CP_EC;
+		if (cpec >= ARRAY_SIZE(control_protection_err))
+			cpec = 0;
+
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpec],
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
 	}
 
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
-		return;
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#else
+static void do_user_control_protection_fault(struct pt_regs *regs,
+					     unsigned long error_code)
+{
+	WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
+}
+#endif
+
+#ifdef CONFIG_X86_KERNEL_IBT
+
+static __ro_after_init bool ibt_fatal = true;
+
+extern void ibt_selftest_ip(void); /* code label defined in asm below */
 
+static void do_kernel_control_protection_fault(struct pt_regs *regs)
+{
 	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
 		regs->ax = 0;
 		return;
@@ -283,9 +335,29 @@ static int __init ibt_setup(char *str)
 }
 
 __setup("ibt=", ibt_setup);
-
+#else
+static void do_kernel_control_protection_fault(struct pt_regs *regs)
+{
+	WARN_ONCE(1, "Kernel-mode control protection fault with IBT disabled\n");
+}
 #endif /* CONFIG_X86_KERNEL_IBT */
 
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
+	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pr_err("Unexpected #CP\n");
+		BUG();
+	}
+
+	if (user_mode(regs))
+		do_user_control_protection_fault(regs, error_code);
+	else
+		do_kernel_control_protection_fault(regs);
+}
+#endif /* defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK) */
+
 #ifdef CONFIG_X86_F00F_BUG
 void handle_invalid_op(struct pt_regs *regs)
 #else
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 0ed2e487a693..57faa287163f 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -628,7 +628,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 6b4fdf6b9542..e45ff6300c7d 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1



* [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (6 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 07/39] x86/cet: Add user control-protection fault handler Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 14:17   ` Kirill A . Shutemov
  2022-10-05  1:31   ` Andrew Cooper
  2022-09-29 22:29 ` [PATCH v2 09/39] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
                   ` (31 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu, Christoph Hellwig

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Processors sometimes directly create Write=0,Dirty=1 PTEs, but such PTEs
are also created by software. One such case is that kernel read-only pages
are historically set up as Dirty.

New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages. When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some
instructions can write to such supervisor memory. The kernel does not set
IA32_S_CET.SH_STK_EN, but to reduce ambiguity between shadow stack and
regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.
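
To illustrate the ambiguity, here is a small userspace sketch of the bit
arithmetic. The TOY_* constants mirror the architectural positions of the
x86 R/W (bit 1) and Dirty (bit 6) PTE bits. The function names are
hypothetical and only model the effect of the set_memory_ro() change in
this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PRESENT	(1ULL << 0)
#define TOY_RW		(1ULL << 1)
#define TOY_DIRTY	(1ULL << 6)

/* Write=0,Dirty=1 is the encoding hardware treats as shadow stack. */
static bool toy_is_shstk_encoding(uint64_t pte)
{
	return (pte & (TOY_RW | TOY_DIRTY)) == TOY_DIRTY;
}

/* Old behavior: making a page read-only only cleared the R/W bit. */
static uint64_t toy_make_ro_old(uint64_t pte)
{
	return pte & ~TOY_RW;
}

/* New behavior: clear Dirty as well, so the result is unambiguous. */
static uint64_t toy_make_ro_new(uint64_t pte)
{
	return pte & ~(TOY_RW | TOY_DIRTY);
}
```

Starting from a present, writable, dirty PTE, the old helper leaves the
shadow stack encoding behind, while the new one does not.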

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>

---

v2:
 - Normalize PTE bit descriptions between patches

 arch/x86/include/asm/pgtable_types.h | 6 +++---
 arch/x86/mm/pat/set_memory.c         | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index aa174fed3a71..ff82237e7b6b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -192,10 +192,10 @@ enum page_cache_mode {
 #define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
-#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
+#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
+#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
-#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_LARGE	 (__PP|__RW|   0|___A|__NX|___D|_PSE|___G)
 #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
 #define __PAGE_KERNEL_WP	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1abd5438f126..ed9193b469ba 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1977,7 +1977,7 @@ int set_memory_nx(unsigned long addr, int numpages)
 
 int set_memory_ro(unsigned long addr, int numpages)
 {
-	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
 }
 
 int set_memory_rw(unsigned long addr, int numpages)
-- 
2.17.1



* [PATCH v2 09/39] x86/mm: Move pmd_write(), pud_write() up in the file
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (7 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 18:06   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
                   ` (30 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

To prepare for the introduction of _PAGE_COW, move pmd_write() and
pud_write() up in the file, so that they can be used by other
helpers below.  No functional changes.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 44e2d6f1dbaa..6496ec84b953 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -159,6 +159,18 @@ static inline int pte_write(pte_t pte)
 	return pte_flags(pte) & _PAGE_RW;
 }
 
+#define pmd_write pmd_write
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define pud_write pud_write
+static inline int pud_write(pud_t pud)
+{
+	return pud_flags(pud) & _PAGE_RW;
+}
+
 static inline int pte_huge(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_PSE;
@@ -1102,12 +1114,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
 
-#define pmd_write pmd_write
-static inline int pmd_write(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_RW;
-}
-
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long addr,
 				       pmd_t *pmdp)
@@ -1137,12 +1143,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-#define pud_write pud_write
-static inline int pud_write(pud_t pud)
-{
-	return pud_flags(pud) & _PAGE_RW;
-}
-
 #ifndef pmdp_establish
 #define pmdp_establish pmdp_establish
 static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
-- 
2.17.1



* [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (8 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 09/39] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-09-30 15:16   ` Jann Horn
                     ` (5 more replies)
  2022-09-29 22:29 ` [PATCH v2 11/39] x86/mm: Update pte_modify for _PAGE_COW Rick Edgecombe
                   ` (29 subsequent siblings)
  39 siblings, 6 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

There is essentially no room left in the x86 hardware PTEs on some OSes
(not Linux). That left the hardware architects looking for a way to
represent a new memory type (shadow stack) within the existing bits.
They chose to repurpose a lightly-used state: Write=0,Dirty=1.

The reason it's lightly used is that Dirty=1 is normally set by hardware
as part of a write. A write with a Write=0 PTE would typically only
generate a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1
*and* generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware
which supports shadow stacks will no longer exhibit this oddity.

The kernel should avoid inadvertently creating shadow stack memory because
it is security sensitive. So given the above, all it needs to do is avoid
manually creating Write=0,Dirty=1 PTEs in software.

In places where Linux normally creates Write=0,Dirty=1, it can use the
software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1. This
clearly separates shadow stack from other data, and results in the
following:

(a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
    Previously when a typical anonymous writable mapping was made COW via
    fork(), the kernel would mark it Write=0,Dirty=1. Now it will instead
    use the Cow bit.
(b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
    is in a R/O VMA, and get_user_pages() needs a writable copy. The page
    fault handler creates a copy of the page and sets the new copy's PTE
    as Write=0 and Cow=1.
(c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
(d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack
    page is being shared among processes (this happens at fork()), its PTE
    is made Dirty=0, so the next shadow stack access causes a fault, and
    the page is duplicated and Dirty=1 is set again. This is the COW
    equivalent for shadow stack pages, even though it's copy-on-access
    rather than copy-on-write.
(e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without
    shadow stack support set Dirty=1.
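
The combinations listed above can be summarized as a small decision
function. This is a sketch for illustration only: the enum and
toy_classify() are hypothetical names, and the booleans stand in for the
PTE bits and for X86_FEATURE_SHSTK being enabled.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_pte_kind {
	TOY_KIND_WRITABLE,	/* Write=1: ordinary writable mapping */
	TOY_KIND_SHSTK,		/* case (c) */
	TOY_KIND_RO_DIRTY,	/* cases (a), (b), (d) and (e) */
	TOY_KIND_RO_CLEAN,	/* Write=0 with neither Cow nor Dirty */
};

static enum toy_pte_kind toy_classify(bool write, bool cow, bool dirty,
				      bool cpu_has_shstk)
{
	if (write)
		return TOY_KIND_WRITABLE;
	if (dirty)
		/* (c) with shadow stack support, (e) without it. */
		return cpu_has_shstk ? TOY_KIND_SHSTK : TOY_KIND_RO_DIRTY;
	if (cow)
		/* (a), (b) and (d): dirtiness is kept in the software bit. */
		return TOY_KIND_RO_DIRTY;
	return TOY_KIND_RO_CLEAN;
}
```

Cases (a), (b) and (d) share one encoding. They are distinguished by the
VMA type and fault path, not by the PTE bits alone.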

Define _PAGE_COW and update pte_*() helpers and apply the same changes to
pmd and pud.
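
As a rough userspace model of how the updated helpers move the dirty
state between the hardware and software bits, consider the sketch below.
It mirrors the logic of pte_wrprotect(), pte_mkwrite() and pte_mkdirty()
in this patch but leaves out _PAGE_SOFT_DIRTY. All toy_* names are
hypothetical, the feature check is a plain variable, and the position
chosen for TOY_COW (bit 58) is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RW		(1ULL << 1)
#define TOY_DIRTY	(1ULL << 6)
#define TOY_COW		(1ULL << 58)	/* assumed software bit position */

static bool toy_has_shstk = true;	/* stands in for X86_FEATURE_SHSTK */

static bool toy_dirty(uint64_t pte)
{
	return (pte & (TOY_DIRTY | TOY_COW)) != 0;
}

static bool toy_shstk(uint64_t pte)
{
	return toy_has_shstk && (pte & (TOY_RW | TOY_DIRTY)) == TOY_DIRTY;
}

static bool toy_write(uint64_t pte)
{
	return (pte & TOY_RW) || toy_shstk(pte);
}

/* Move Dirty to the software bit. */
static uint64_t toy_mkcow(uint64_t pte)
{
	if (!toy_has_shstk)
		return pte;
	return (pte & ~TOY_DIRTY) | TOY_COW;
}

/* Move the software bit back to Dirty. */
static uint64_t toy_clear_cow(uint64_t pte)
{
	if (!toy_has_shstk)
		return pte;
	return (pte | TOY_DIRTY) & ~TOY_COW;
}

static uint64_t toy_wrprotect(uint64_t pte)
{
	pte &= ~TOY_RW;
	/* Never leave Write=0,Dirty=1 behind by accident. */
	return toy_dirty(pte) ? toy_mkcow(pte) : pte;
}

static uint64_t toy_mkwrite(uint64_t pte)
{
	pte |= TOY_RW;
	return toy_dirty(pte) ? toy_clear_cow(pte) : pte;
}

static uint64_t toy_mkdirty(uint64_t pte)
{
	uint64_t bit = TOY_DIRTY;

	if (toy_has_shstk && !toy_write(pte))
		bit = TOY_COW;
	return pte | bit;
}
```

Write-protecting a dirty writable entry and then making it writable again
returns the original bits, so no dirty information is lost on the round
trip.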

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
because shadow stacks are not enabled there.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Update commit log with comments (Dave Hansen)
 - Add comments in code to explain pte modification code better (Dave)
 - Clarify info on the meaning of various Write,Cow,Dirty combinations

 arch/x86/include/asm/pgtable.h       | 210 ++++++++++++++++++++++++---
 arch/x86/include/asm/pgtable_types.h |  42 +++++-
 2 files changed, 231 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6496ec84b953..ad201dae7316 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -124,9 +124,17 @@ extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pte_young(pte_t pte)
@@ -134,9 +142,17 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -144,9 +160,9 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -156,13 +172,21 @@ static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
 }
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
 }
 
 #define pud_write pud_write
@@ -300,6 +324,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+/*
+ * Normally the Dirty bit is used to denote COW memory on x86. But
+ * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
+ * since the Dirty=1,Write=0 will result in the memory being treated
+ * as shadow stack by the HW. So when creating COW memory, a software
+ * bit, _PAGE_BIT_COW, is used. The following functions pte_mkcow() and
+ * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
+ * transition it to the shadow stack compatible version of COW (Cow=1).
+ */
+
+static inline pte_t pte_mkcow(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	pte = pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_set_flags(pte, _PAGE_COW);
+}
+
+static inline pte_t pte_clear_cow(pte_t pte)
+{
+	/*
+	 * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
+	 * See the _PAGE_COW definition for more details.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	/*
+	 * PTE is getting copied-on-write, so it will be dirtied
+	 * if writable, or made shadow stack if shadow stack and
+	 * being copied on access. Set the dirty bit for both
+	 * cases.
+	 */
+	pte = pte_set_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
@@ -319,7 +381,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -329,7 +391,16 @@ static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_RW);
+	pte = pte_clear_flags(pte, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pte_dirty(pte))
+		pte = pte_mkcow(pte);
+	return pte;
 }
 
 static inline pte_t pte_mkexec(pte_t pte)
@@ -339,7 +410,19 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pteval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating Dirty=1,Write=0 PTEs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_COW;
+
+	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	/* pte_clear_cow() also sets Dirty=1 */
+	return pte_clear_cow(pte);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -349,7 +432,12 @@ static inline pte_t pte_mkyoung(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_RW);
+	pte = pte_set_flags(pte, _PAGE_RW);
+
+	if (pte_dirty(pte))
+		pte = pte_clear_cow(pte);
+
+	return pte;
 }
 
 static inline pte_t pte_mkhuge(pte_t pte)
@@ -396,6 +484,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_mkcow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_clear_cow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pmd_uffd_wp(pmd_t pmd)
 {
@@ -420,17 +528,36 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_RW);
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pmd_dirty(pmd))
+		pmd = pmd_mkcow(pmd);
+	return pmd;
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pmdval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pmd_write(pmd))
+		dirty = _PAGE_COW;
+
+	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	return pmd_clear_cow(pmd);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -450,7 +577,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_RW);
+	pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+	if (pmd_dirty(pmd))
+		pmd = pmd_clear_cow(pmd);
+	return pmd;
 }
 
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
@@ -467,6 +598,26 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
 	return native_make_pud(v & ~clear);
 }
 
+/* See comments above pte_mkcow() */
+static inline pud_t pud_mkcow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_set_flags(pud, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pud_t pud_clear_cow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_set_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_COW);
+}
+
 static inline pud_t pud_mkold(pud_t pud)
 {
 	return pud_clear_flags(pud, _PAGE_ACCESSED);
@@ -474,17 +625,32 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_RW);
+	pud = pud_clear_flags(pud, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pud_dirty(pud))
+		pud = pud_mkcow(pud);
+	return pud;
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pud_write(pud))
+		dirty = _PAGE_COW;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -504,7 +670,11 @@ static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_RW);
+	pud = pud_set_flags(pud, _PAGE_RW);
+
+	if (pud_dirty(pud))
+		pud = pud_clear_cow(pud);
+	return pud;
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ff82237e7b6b..85d88c0f9618 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * Indicates a copy-on-write page.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_BIT_COW		_PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW		0
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -117,6 +127,36 @@
 #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
+ * from shadow stack PTEs:
+ *  (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
+ *	Previously when a typical anonymous writable mapping was made COW via
+ *	fork(), the kernel would mark it Write=0,Dirty=1. Now it will instead
+ *	use the Cow bit.
+ *  (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
+ *	is in a R/O VMA, and get_user_pages() needs a writable copy. The page
+ *	fault handler creates a copy of the page and sets the new copy's PTE
+ *	as Write=0 and Cow=1.
+ *  (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
+ *  (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack
+ *	page is being shared among processes (this happens at fork()), its PTE
+ *	is made Dirty=0, so the next shadow stack access causes a fault, and
+ *	the page is duplicated and Dirty=1 is set again. This is the COW
+ *	equivalent for shadow stack pages, even though it's copy-on-access
+ *	rather than copy-on-write.
+ *  (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without
+ *	shadow stack support set Dirty=1.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_COW	(_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*
-- 
2.17.1



* [PATCH v2 11/39] x86/mm: Update pte_modify for _PAGE_COW
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (9 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-09-29 22:29 ` [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Rick Edgecombe
                   ` (28 subsequent siblings)
  39 siblings, 0 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The Write=0,Dirty=1 PTE has been used to indicate copy-on-write pages.
However, newer x86 processors also regard a Write=0,Dirty=1 PTE as a
shadow stack page. In order to separate the two, the hardware
_PAGE_DIRTY bit is replaced with the software-defined _PAGE_COW for the
copy-on-write case, and pte_*() are updated to do this.

pte_modify() takes a "raw" pgprot_t which was not necessarily created
with any of the existing PTE bit helpers. That means that it can return a
pte_t with Write=0,Dirty=1, a shadow stack PTE, when it did not intend to
create one.

pte_modify() changes a PTE to 'newprot' without going through the
pte_*() helpers. Modify it to also move _PAGE_DIRTY to _PAGE_COW. Apply
the same changes to pmd_modify().
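
As a rough illustration of the masking described above, here is a
user-space sketch, not the kernel code. It assumes shadow stack is
enabled (the cpu_feature_enabled() check is omitted), uses plain
uint64_t values instead of pte_t, and the PG_PFN/CHG_MASK values are
hypothetical toy masks rather than the real _PAGE_CHG_MASK. The Dirty
bit is dropped from the preserved mask and then re-applied through
pte_mkdirty(), which picks Dirty or Cow depending on Write.

```c
#include <assert.h>
#include <stdint.h>

/* RW (bit 1) and Dirty (bit 6) are architectural x86 positions.
 * COW at bit 58 follows _PAGE_BIT_SOFTW5 from patch 10. */
#define PG_RW     (UINT64_C(1) << 1)
#define PG_DIRTY  (UINT64_C(1) << 6)
#define PG_COW    (UINT64_C(1) << 58)

/* Hypothetical toy masks, NOT the kernel's _PAGE_CHG_MASK. */
#define PG_PFN    UINT64_C(0x000ffffffffff000)
#define CHG_MASK  (PG_PFN | PG_DIRTY)

/* Dirty in either the hardware bit or the software Cow bit. */
static int pte_dirty(uint64_t pte)
{
	return (pte & (PG_DIRTY | PG_COW)) != 0;
}

/* Never produce Write=0,Dirty=1: use Cow when the PTE is read-only. */
static uint64_t pte_mkdirty(uint64_t pte)
{
	return pte | ((pte & PG_RW) ? PG_DIRTY : PG_COW);
}

static uint64_t pte_modify(uint64_t pte, uint64_t newprot)
{
	/* Preserve everything in the change mask except the HW Dirty bit. */
	const uint64_t mask_no_dirty = CHG_MASK & ~PG_DIRTY;
	uint64_t val = (pte & mask_no_dirty) | (newprot & ~mask_no_dirty);

	/* Re-apply dirtiness in the form that matches the new Write bit. */
	if (pte_dirty(pte))
		val = pte_mkdirty(val);
	return val;
}

/* Usage: a dirty writable PTE made read-only ends up Cow=1, Dirty=0. */
static void pte_modify_example(void)
{
	uint64_t pte = UINT64_C(0x1000) | PG_RW | PG_DIRTY;

	assert(pte_modify(pte, 0) == (UINT64_C(0x1000) | PG_COW));
	assert(pte_modify(pte, PG_RW) == pte);
}
```

The point of the sketch is only that the result never carries
Write=0,Dirty=1 unless the caller explicitly asks for it in newprot.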

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Update commit log with text and suggestions from (Dave Hansen)
 - Drop fixup_dirty_pte() in favor of clearing the HW dirty bit along
   with the _PAGE_CHG_MASK masking, then calling pte_mkdirty() (Dave
   Hansen)

 arch/x86/include/asm/pgtable.h | 41 +++++++++++++++++++++++++++++-----
 1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ad201dae7316..2f2963429f48 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -790,26 +790,55 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
+	pteval_t _page_chg_mask_no_dirty = _PAGE_CHG_MASK & ~_PAGE_DIRTY;
 	pteval_t val = pte_val(pte), oldval = val;
+	pte_t pte_result;
 
 	/*
 	 * Chop off the NX bit (if present), and add the NX portion of
 	 * the newprot (if present):
 	 */
-	val &= _PAGE_CHG_MASK;
-	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+	val &= _page_chg_mask_no_dirty;
+	val |= check_pgprot(newprot) & ~_page_chg_mask_no_dirty;
 	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
-	return __pte(val);
+
+	pte_result = __pte(val);
+
+	/*
+	 * Dirty bit is not preserved above so it can be done
+	 * in a special way for the shadow stack case, where it
+	 * needs to set _PAGE_COW. pte_mkdirty() will do this in
+	 * the case of shadow stack.
+	 */
+	if (pte_dirty(pte))
+		pte_result = pte_mkdirty(pte_result);
+
+	return pte_result;
 }
 
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
+	pteval_t _hpage_chg_mask_no_dirty = _HPAGE_CHG_MASK & ~_PAGE_DIRTY;
 	pmdval_t val = pmd_val(pmd), oldval = val;
+	pmd_t pmd_result;
 
-	val &= _HPAGE_CHG_MASK;
-	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+	val &= _hpage_chg_mask_no_dirty;
+	val |= check_pgprot(newprot) & ~_hpage_chg_mask_no_dirty;
 	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
-	return __pmd(val);
+
+
+	pmd_result = __pmd(val);
+
+	/*
+	 * Dirty bit is not preserved above so it can be done
+	 * specially for the shadow stack case. It needs to move
+	 * the HW dirty bit to the software COW bit. Set in the
+	 * result if it was set in the original value.
+	 */
+	if (pmd_dirty(pmd))
+		pmd_result = pmd_mkdirty(pmd_result);
+
+	return pmd_result;
 }
 
 /*
-- 
2.17.1



* [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (10 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 11/39] x86/mm: Update pte_modify for _PAGE_COW Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 17:43   ` Kirill A . Shutemov
  2022-10-03 18:11   ` Nadav Amit
  2022-09-29 22:29 ` [PATCH v2 13/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
                   ` (27 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When Shadow Stack is in use, Write=0,Dirty=1 PTEs are reserved for shadow
stack. Copy-on-write PTEs then have Write=0,Cow=1.

When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
become a transient shadow stack PTE in two cases:

The first case is that some processors can start a write but end up seeing
a Write=0 PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE. However, this will not occur on processors supporting
Shadow Stack, and a TLB flush is not necessary.

The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
atomically, a transient shadow stack PTE can be created as a result.
Thus, prevent that with cmpxchg.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue.  Jann Horn provided the cmpxchg solution.
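
To make the second case concrete, here is a user-space sketch of the
retry loop using C11 atomics in place of the kernel's try_cmpxchg().
It is an illustration under assumptions, not the patch itself: PTEs
are plain uint64_t values, the feature check is omitted, and the COW
bit position (58) is taken from patch 10. Like try_cmpxchg(),
atomic_compare_exchange_weak() refreshes the expected value on
failure, so a concurrent Dirty=1 update is folded into the next
attempt instead of being lost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* RW (bit 1) and Dirty (bit 6) are architectural; COW=58 per patch 10. */
#define PG_RW     (UINT64_C(1) << 1)
#define PG_DIRTY  (UINT64_C(1) << 6)
#define PG_COW    (UINT64_C(1) << 58)

/* Clear Write and move a hardware Dirty bit to the software Cow bit. */
static uint64_t pte_wrprotect(uint64_t pte)
{
	pte &= ~PG_RW;
	if (pte & PG_DIRTY)
		pte = (pte & ~PG_DIRTY) | PG_COW;
	return pte;
}

/* Apply pte_wrprotect() as one atomic transition of *ptep. */
static void ptep_set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t old_pte = atomic_load(ptep);
	uint64_t new_pte;

	do {
		/* On failure old_pte holds the current value; recompute. */
		new_pte = pte_wrprotect(old_pte);
	} while (!atomic_compare_exchange_weak(ptep, &old_pte, new_pte));
}

/* Usage: a dirty writable PTE becomes Write=0,Cow=1 in a single step. */
static void wrprotect_example(void)
{
	_Atomic uint64_t pte = UINT64_C(0x1000) | PG_RW | PG_DIRTY;

	ptep_set_wrprotect(&pte);
	assert(atomic_load(&pte) == (UINT64_C(0x1000) | PG_COW));
}
```

Because the whole value is swapped at once, no observer can see the
intermediate Write=0,Dirty=1 state that a separate clear and set
would expose.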

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Compile out some code due to clang build error
 - Clarify commit log (dhansen)
 - Normalize PTE bit descriptions between patches (dhansen)
 - Update comment with text from (dhansen)

Yu-cheng v30:
 - Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELS > 2 (Borislav Petkov).

 arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2f2963429f48..58c7bf9d7392 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1287,6 +1287,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+#ifdef CONFIG_X86_SHADOW_STACK
+	/*
+	 * Avoid accidentally creating shadow stack PTEs
+	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
+	 * the hardware setting Dirty=1.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pte_t old_pte, new_pte;
+
+		old_pte = READ_ONCE(*ptep);
+		do {
+			new_pte = pte_wrprotect(old_pte);
+		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+		return;
+	}
+#endif
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
@@ -1339,6 +1356,25 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+#ifdef CONFIG_X86_SHADOW_STACK
+	/*
+	 * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
+	 * to _PAGE_COW (see comments at pmd_wrprotect()).
+	 * After a thread reads a RW=1, Dirty=0 PMD, but before it changes
+	 * the PMD to RW=0, Dirty=0, another thread could write to the page
+	 * and leave the PMD as RW=1, Dirty=1.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pmd_t old_pmd, new_pmd;
+
+		old_pmd = READ_ONCE(*pmdp);
+		do {
+			new_pmd = pmd_wrprotect(old_pmd);
+		} while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
+
+		return;
+	}
+#endif
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-- 
2.17.1



* [PATCH v2 13/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (11 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 18:11   ` Kees Cook
  2022-10-03 18:24   ` Peter Xu
  2022-09-29 22:29 ` [PATCH v2 14/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
                   ` (26 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu, Peter Xu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT_5 (37), and make all
VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to 38.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..be80fc827212 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -365,7 +365,7 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT	37
+# define VM_UFFD_MINOR_BIT	38
 # define VM_UFFD_MINOR		BIT(VM_UFFD_MINOR_BIT)	/* UFFD minor faults */
 #else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 # define VM_UFFD_MINOR		VM_NONE
-- 
2.17.1



* [PATCH v2 14/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (12 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 13/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 17:47   ` Kirill A . Shutemov
  2022-10-03 18:17   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 15/39] x86/mm: Check Shadow Stack page fault errors Rick Edgecombe
                   ` (25 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A shadow stack PTE must be read-only and have _PAGE_DIRTY set.  However,
read-only and Dirty PTEs also exist for copy-on-write (COW) pages.  These
two cases are handled differently for page faults. Introduce
VM_SHADOW_STACK to track shadow stack VMAs.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
 Documentation/filesystems/proc.rst | 1 +
 arch/x86/mm/mmap.c                 | 2 ++
 fs/proc/task_mmu.c                 | 3 +++
 include/linux/mm.h                 | 8 ++++++++
 4 files changed, 14 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e7aafc82be99..d54ff397947a 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -560,6 +560,7 @@ encoded manner. The codes are the following:
     mt    arm64 MTE allocation tags are enabled
     um    userfaultfd missing tracking
     uw    userfaultfd wr-protect tracking
+    ss    shadow stack page
     ==    =======================================
 
 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index c90c20904a60..f3f52c5e2fd6 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
 
 const char *arch_vma_name(struct vm_area_struct *vma)
 {
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return "[shadow stack]";
 	return NULL;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4e0023643f8b..a20899392c8d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -700,6 +700,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
+		[ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index be80fc827212..8cd413c5a329 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -314,11 +314,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -334,6 +336,12 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_X86_SHADOW_STACK
+# define VM_SHADOW_STACK	VM_HIGH_ARCH_5
+#else
+# define VM_SHADOW_STACK	VM_NONE
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
-- 
2.17.1



* [PATCH v2 15/39] x86/mm: Check Shadow Stack page fault errors
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (13 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 14/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 18:20   ` Kees Cook
  2022-10-14 10:07   ` Peter Zijlstra
  2022-09-29 22:29 ` [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack Rick Edgecombe
                   ` (24 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The CPU performs "shadow stack accesses" when it expects to encounter
shadow stack mappings. These accesses can be implicit (via CALL/RET
instructions) or explicit (instructions like WRSS).

Shadow stack accesses to shadow-stack mappings can see faults in normal,
valid operation just like regular accesses to regular mappings. Shadow
stacks need some of the same features like delayed allocation, swap and
copy-on-write. The kernel needs to use faults to implement those features.

The architecture has concepts of both shadow stack reads and shadow stack
writes. Any shadow stack access to non-shadow stack memory will generate
a fault with the shadow stack error code bit set.

This means that, unlike normal write protection, the fault handler needs
to create a type of memory that can be written to (with instructions that
generate shadow stack writes), even to fulfill a read access. So in the
case of COW memory, the COW needs to take place even with a shadow stack
read. Otherwise the page will be left (shadow stack) writable in
userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
for shadow stack accesses, even if the access was a shadow stack read.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping. Also, generate the errors for invalid shadow stack accesses.
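
The rules above can be sketched as a small decision function. This is
a simplified user-space model, not the fault handler: the VMF_* flag
values are hypothetical stand-ins for the VM_* flags, the protection
key and instruction fetch checks are left out, and the read path is
reduced to "the VMA has some access permission". PF_WRITE (bit 1)
and PF_SHSTK (bit 6) match the error code bits in trap_pf.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Page fault error code bits, as in trap_pf.h. */
#define PF_WRITE   (1u << 1)
#define PF_SHSTK   (1u << 6)

/* Hypothetical VMA flag values, for this sketch only. */
#define VMF_READ   (1u << 0)
#define VMF_WRITE  (1u << 1)
#define VMF_SHSTK  (1u << 2)

/* Returns true when the access must be rejected. */
static bool access_error(unsigned int error_code, unsigned int vm_flags)
{
	/* Shadow stack accesses are only valid on writable shadow stack VMAs. */
	if (error_code & PF_SHSTK)
		return !(vm_flags & VMF_SHSTK) || !(vm_flags & VMF_WRITE);

	/* Normal writes may never target a shadow stack VMA. */
	if (error_code & PF_WRITE)
		return (vm_flags & VMF_SHSTK) || !(vm_flags & VMF_WRITE);

	/* Simplified read path: any access permission is enough. */
	return !(vm_flags & (VMF_READ | VMF_WRITE));
}

/* Every shadow stack access is handled as a write, even a read. */
static bool fault_is_write(unsigned int error_code)
{
	return (error_code & (PF_SHSTK | PF_WRITE)) != 0;
}

/* Usage: the four combinations discussed in the commit log. */
static void access_error_example(void)
{
	assert(!access_error(PF_SHSTK, VMF_WRITE | VMF_SHSTK));
	assert(access_error(PF_SHSTK, VMF_READ | VMF_WRITE));
	assert(access_error(PF_WRITE, VMF_WRITE | VMF_SHSTK));
	assert(!access_error(PF_WRITE, VMF_READ | VMF_WRITE));
}
```

fault_is_write() mirrors the FAULT_FLAG_WRITE decision: COW has to be
broken on a shadow stack read as well, since the resulting mapping is
(shadow stack) writable.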

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Update commit log with verbiage/feedback from Dave Hansen
 - Clarify reasoning for FAULT_FLAG_WRITE for all shadow stack accesses
 - Update comments with some verbiage from Dave Hansen

Yu-cheng v30:
 - Update Subject line and add a verb

 arch/x86/include/asm/trap_pf.h |  2 ++
 arch/x86/mm/fault.c            | 21 +++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..afa524325e55 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -11,6 +11,7 @@
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
  *   bit 5 ==				1: protection keys block access
+ *   bit 6 ==				1: shadow stack access fault
  *   bit 15 ==				1: SGX MMU page-fault
  */
 enum x86_pf_error_code {
@@ -20,6 +21,7 @@ enum x86_pf_error_code {
 	X86_PF_RSVD	=		1 << 3,
 	X86_PF_INSTR	=		1 << 4,
 	X86_PF_PK	=		1 << 5,
+	X86_PF_SHSTK	=		1 << 6,
 	X86_PF_SGX	=		1 << 15,
 };
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fa71a5d12e87..e5697b393069 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1107,8 +1107,22 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
+	/*
+	 * Shadow stack accesses (PF_SHSTK=1) are only permitted to
+	 * shadow stack VMAs. All other accesses result in an error.
+	 */
+	if (error_code & X86_PF_SHSTK) {
+		if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
+			return 1;
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return 1;
+		return 0;
+	}
+
 	if (error_code & X86_PF_WRITE) {
 		/* write, present and write, not present: */
+		if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
+			return 1;
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
 			return 1;
 		return 0;
@@ -1300,6 +1314,13 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
+	/*
+	 * In order to fulfill a shadow stack access, the page needs
+	 * to be made (shadow stack) writable. So treat all shadow stack
+	 * accesses as writes.
+	 */
+	if (error_code & X86_PF_SHSTK)
+		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_INSTR)
-- 
2.17.1



* [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (14 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 15/39] x86/mm: Check Shadow Stack page fault errors Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 18:22   ` Kees Cook
                     ` (2 more replies)
  2022-09-29 22:29 ` [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
                   ` (23 subsequent siblings)
  39 siblings, 3 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When serving a page fault, maybe_mkwrite() makes a PTE writable if there is
a write access to it, and its vma has VM_WRITE. Shadow stack accesses to
shadow stack vma's are also treated as write accesses by the fault handler.
This is because mapping memory as shadow stack makes it writable via some
instructions, so COW has to happen even for shadow stack reads.

So maybe_mkwrite() should continue to set VM_WRITE vma's as normally
writable, but also set VM_WRITE|VM_SHADOW_STACK vma's as shadow stack.

Do this by adding a pte_mkwrite_shstk() and a cross-arch stub. Check for
VM_SHADOW_STACK in maybe_mkwrite() and call pte_mkwrite_shstk()
accordingly.

Apply the same changes to maybe_pmd_mkwrite().
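
The dispatch can be pictured with the following user-space sketch. It
is a model under assumptions rather than the kernel helpers: PTEs are
plain uint64_t values, the VMF_* flags are hypothetical stand-ins for
VM_WRITE and VM_SHADOW_STACK, the feature check inside
pte_clear_cow() is omitted, and the COW bit position (58) comes from
patch 10.

```c
#include <assert.h>
#include <stdint.h>

/* RW (bit 1) and Dirty (bit 6) are architectural; COW=58 per patch 10. */
#define PG_RW      (UINT64_C(1) << 1)
#define PG_DIRTY   (UINT64_C(1) << 6)
#define PG_COW     (UINT64_C(1) << 58)

/* Hypothetical VMA flag values, for this sketch only. */
#define VMF_WRITE  (1u << 1)
#define VMF_SHSTK  (1u << 2)

/* Move a software Cow bit back to the hardware Dirty bit. */
static uint64_t pte_clear_cow(uint64_t pte)
{
	return (pte | PG_DIRTY) & ~PG_COW;
}

/* Normally writable: Write=1, and Cow is folded back into Dirty. */
static uint64_t pte_mkwrite(uint64_t pte)
{
	pte |= PG_RW;
	if (pte & (PG_DIRTY | PG_COW))
		pte = pte_clear_cow(pte);
	return pte;
}

/* Shadow stack writable: Write stays 0, Dirty=1, Cow=0. */
static uint64_t pte_mkwrite_shstk(uint64_t pte)
{
	return pte_clear_cow(pte);
}

static uint64_t maybe_mkwrite(uint64_t pte, unsigned int vm_flags)
{
	if (!(vm_flags & VMF_WRITE))
		return pte;
	if (vm_flags & VMF_SHSTK)
		return pte_mkwrite_shstk(pte);
	return pte_mkwrite(pte);
}

/* Usage: the same Cow PTE ends up different per VMA type. */
static void maybe_mkwrite_example(void)
{
	assert(maybe_mkwrite(PG_COW, VMF_WRITE | VMF_SHSTK) == PG_DIRTY);
	assert(maybe_mkwrite(PG_COW, VMF_WRITE) == (PG_RW | PG_DIRTY));
	assert(maybe_mkwrite(PG_COW, 0) == PG_COW);
}
```

A VMA without VMF_WRITE leaves the PTE untouched, which matches the
get_user_pages() case mentioned in the maybe_mkwrite() comment.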

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Change to handle shadow stacks that are VM_WRITE|VM_SHADOW_STACK
 - Ditch arch specific maybe_mkwrite(), and make the code generic

Yu-cheng v29:
 - Remove likely()'s.

 arch/x86/include/asm/pgtable.h |  2 ++
 include/linux/mm.h             | 14 +++++++++++++-
 include/linux/pgtable.h        | 14 ++++++++++++++
 mm/huge_memory.c               |  9 ++++++++-
 mm/memory.c                    |  3 +--
 5 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 58c7bf9d7392..7a769c4dbc1c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -419,6 +419,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
 	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
 }
 
+#define pte_mkwrite_shstk pte_mkwrite_shstk
 static inline pte_t pte_mkwrite_shstk(pte_t pte)
 {
 	/* pte_clear_cow() also sets Dirty=1 */
@@ -555,6 +556,7 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
 }
 
+#define pmd_mkwrite_shstk pmd_mkwrite_shstk
 static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
 {
 	return pmd_clear_cow(pmd);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8cd413c5a329..fef14ab3abcb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -981,13 +981,25 @@ void free_compound_page(struct page *page);
  * servicing faults for write access.  In the normal case, do always want
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
  * that do not have writing enabled, when used by access_process_vm.
+ *
+ * If a vma is shadow stack (a type of writable memory), mark the pte shadow
+ * stack.
  */
+#ifndef maybe_mkwrite
 static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
-	if (likely(vma->vm_flags & VM_WRITE))
+	if (!(vma->vm_flags & VM_WRITE))
+		goto out;
+
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		pte = pte_mkwrite_shstk(pte);
+	else
 		pte = pte_mkwrite(pte);
+
+out:
 	return pte;
 }
+#endif
 
 vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 014ee8f0fbaa..21115b4895ca 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -480,6 +480,13 @@ static inline pte_t pte_sw_mkyoung(pte_t pte)
 #define pte_mk_savedwrite pte_mkwrite
 #endif
 
+#ifndef pte_mkwrite_shstk
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	return pte;
+}
+#endif
+
 #ifndef pte_clear_savedwrite
 #define pte_clear_savedwrite pte_wrprotect
 #endif
@@ -488,6 +495,13 @@ static inline pte_t pte_sw_mkyoung(pte_t pte)
 #define pmd_savedwrite pmd_write
 #endif
 
+#ifndef pmd_mkwrite_shstk
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+
 #ifndef pmd_mk_savedwrite
 #define pmd_mk_savedwrite pmd_mkwrite
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9414ee57c5b..11fc69eb4717 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -554,8 +554,15 @@ __setup("transparent_hugepage=", setup_transparent_hugepage);
 
 pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
-	if (likely(vma->vm_flags & VM_WRITE))
+	if (!(vma->vm_flags & VM_WRITE))
+		goto out;
+
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		pmd = pmd_mkwrite_shstk(pmd);
+	else
 		pmd = pmd_mkwrite(pmd);
+
+out:
 	return pmd;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 4ba73f5aa8bb..6e8379f6793c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4098,8 +4098,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 
 	entry = mk_pte(page, vma->vm_page_prot);
 	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
-- 
2.17.1


* [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (15 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 18:24   ` Kees Cook
                     ` (3 more replies)
  2022-09-29 22:29 ` [PATCH v2 18/39] mm: Add guard pages around a shadow stack Rick Edgecombe
                   ` (22 subsequent siblings)
  39 siblings, 4 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

With the introduction of shadow stack memory there are two ways a pte can
be writable: regular writable memory and shadow stack memory.

In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
where a PTE is made writable. However, there are places where pte_mkwrite()
is called directly and the logic should now also create a shadow stack PTE
in the case of a shadow stack VMA.

 - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
   directly and call pte_mkwrite(), which is the same as maybe_mkwrite()
   in logic and intention. Just change them to maybe_mkwrite().

 - When userfaultfd is creating a PTE after userspace handles the fault
   it calls pte_mkwrite() directly. Teach it about pte_mkwrite_shstk().

In other cases where pte_mkwrite() is called directly, the VMA will not
be VM_SHADOW_STACK, and so shadow stack memory should not be created.
 - In the case of pte_savedwrite(), shadow stack VMAs are excluded.
 - In the case of the "dirty_accountable" optimization in mprotect(),
   shadow stack VMAs won't be VM_SHARED, so it is not necessary.
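
As a rough illustration of the decision described above (not part of the
patch), here is a small userspace model. All MODEL_* names and bit values
are hypothetical stand-ins for the kernel definitions, and the Write=0,
Dirty=1 shadow stack encoding is assumed from the x86 description
elsewhere in this series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel definitions; bit values are illustrative. */
#define MODEL_VM_WRITE        0x1u
#define MODEL_VM_SHADOW_STACK 0x2u
#define MODEL_PTE_RW          0x1u /* plays the role of _PAGE_RW */
#define MODEL_PTE_DIRTY       0x2u /* plays the role of _PAGE_DIRTY */

typedef uint32_t model_pte_t;

/* Regular writable memory: set the write bit. */
static model_pte_t model_pte_mkwrite(model_pte_t pte)
{
	return pte | MODEL_PTE_RW;
}

/* Shadow stack memory, assuming the Write=0, Dirty=1 encoding. */
static model_pte_t model_pte_mkwrite_shstk(model_pte_t pte)
{
	return (pte & ~MODEL_PTE_RW) | MODEL_PTE_DIRTY;
}

/* Same decision order as the updated maybe_mkwrite(). */
static model_pte_t model_maybe_mkwrite(model_pte_t pte, uint32_t vm_flags)
{
	if (!(vm_flags & MODEL_VM_WRITE))
		return pte;
	if (vm_flags & MODEL_VM_SHADOW_STACK)
		return model_pte_mkwrite_shstk(pte);
	return model_pte_mkwrite(pte);
}

/* Usage: one call per kind of VMA. */
void model_maybe_mkwrite_example(void)
{
	/* Read-only VMA: PTE is left alone. */
	assert(model_maybe_mkwrite(0, 0) == 0);
	/* Normal writable VMA: write bit set. */
	assert(model_maybe_mkwrite(0, MODEL_VM_WRITE) == MODEL_PTE_RW);
	/* Shadow stack VMA: dirty without the write bit. */
	assert(model_maybe_mkwrite(0, MODEL_VM_WRITE | MODEL_VM_SHADOW_STACK) ==
	       MODEL_PTE_DIRTY);
}
```

The point of funnelling callers through one helper is that the VMA flag
check lives in a single place instead of at every pte_mkwrite() call site.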

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Updated commit log with comments from Dave Hansen
 - Dave also suggested (as I understood it) tweaking vm_get_page_prot()
   to avoid having to call maybe_mkwrite(). After playing around with
   this I opted *not* to do this. Shadow stack memory is effectively
   writable, so having the default permissions be writable ended up
   mapping the zero page as writable, among other surprises. So
   creating shadow stack memory needs to be done with manual logic
   like pte_mkwrite().
 - Drop change in change_pte_range() because it couldn't actually trigger
   for shadow stack VMAs.
 - Clarify reasoning for skipped cases of pte_mkwrite().

Yu-cheng v25:
 - Apply same changes to do_huge_pmd_numa_page() as to do_numa_page().

 mm/migrate_device.c |  3 +--
 mm/userfaultfd.c    | 10 +++++++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d65476..eba3164736b3 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -606,8 +606,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 			goto abort;
 		}
 		entry = mk_pte(page, vma->vm_page_prot);
-		if (vma->vm_flags & VM_WRITE)
-			entry = pte_mkwrite(pte_mkdirty(entry));
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7327b2573f7c..b49372c7de41 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	int ret;
 	pte_t _dst_pte, *dst_pte;
 	bool writable = dst_vma->vm_flags & VM_WRITE;
+	bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
 	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
 	bool page_in_cache = page->mapping;
 	spinlock_t *ptl;
@@ -83,9 +84,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		writable = false;
 	}
 
-	if (writable)
-		_dst_pte = pte_mkwrite(_dst_pte);
-	else
+	if (writable) {
+		if (shstk)
+			_dst_pte = pte_mkwrite_shstk(_dst_pte);
+		else
+			_dst_pte = pte_mkwrite(_dst_pte);
+	} else
 		/*
 		 * We need this to make sure write bit removed; as mk_pte()
 		 * could return a pte with write bit set.
-- 
2.17.1


* [PATCH v2 18/39] mm: Add guard pages around a shadow stack.
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (16 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 18:30   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 19/39] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
                   ` (21 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

The architecture of shadow stack constrains the ability of userspace to
move the shadow stack pointer (SSP) in order to prevent corrupting or
switching to other shadow stacks. The RSTORSSP instruction can move the
SSP to different shadow stacks, but it requires a specially placed token
in order to do this. However, the architecture does not prevent
incrementing the stack pointer to wander onto an adjacent shadow stack.
To prevent this in software, enforce guard pages at the beginning of
shadow stack vmas, such that there will always be a gap between adjacent
shadow stacks.

Make the gap big enough so that no userspace SSP changing operations
(besides RSTORSSP) can move the SSP from one stack to the next. The
SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and
RET can move the SSP by a maximum of 8 bytes, at which point the shadow
stack would be accessed.

The INCSSP instruction can also increment the shadow stack pointer. It
is the shadow stack analog of an instruction like:

	addq    $0x80, %rsp

However, there is one important difference between an ADD on %rsp and
INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
of the first and last elements that were "popped". It can be thought of
as acting like this:

	READ_ONCE(ssp);       // read+discard top element on stack
	ssp += nr_to_pop * 8; // move the shadow stack
	READ_ONCE(ssp-8);     // read+discard last popped stack element

INCSSP can move the SSP by at most 2040 bytes (255 * 8) before it reads
memory. Therefore a single page gap will be enough to prevent any
operation from shifting the SSP to an adjacent stack, since it would
have to land in the gap at least once, causing a fault.
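
To make the arithmetic concrete, here is a small userspace model (an
illustration only, not part of the patch). It assumes the INCSSP
behavior described above: one read at the old SSP and one read at the
last popped element, with any read inside the gap treated as a fault.
The function name and the base address are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INCSSPQ_MAX_SLOTS 255u /* largest count, per the comment in the patch */
#define SLOT_BYTES        8u   /* one 64-bit shadow stack entry */

static bool in_gap(uint64_t addr, uint64_t lo, uint64_t hi)
{
	return addr >= lo && addr < hi;
}

/*
 * Returns true if a single modeled INCSSPQ, started at or below the gap,
 * can leave the SSP at or beyond the end of the gap without either of
 * its two reads touching the gap.
 */
bool incsspq_can_skip_gap(uint64_t gap_bytes)
{
	const uint64_t gap_lo = UINT64_C(1) << 20; /* arbitrary aligned address */
	const uint64_t gap_hi = gap_lo + gap_bytes;
	const uint64_t reach = (uint64_t)INCSSPQ_MAX_SLOTS * SLOT_BYTES;

	assert(gap_bytes % SLOT_BYTES == 0);

	for (uint64_t ssp = gap_lo - reach; ssp <= gap_lo; ssp += SLOT_BYTES) {
		for (uint64_t nr = 1; nr <= INCSSPQ_MAX_SLOTS; nr++) {
			uint64_t new_ssp = ssp + nr * SLOT_BYTES;

			/* Either read touching the gap is modeled as a fault. */
			if (in_gap(ssp, gap_lo, gap_hi) ||
			    in_gap(new_ssp - SLOT_BYTES, gap_lo, gap_hi))
				continue;
			if (new_ssp >= gap_hi)
				return true;
		}
	}
	return false;
}
```

In this model, incsspq_can_skip_gap(4096) is false while
incsspq_can_skip_gap(1024) is true, which matches the reasoning that a
full page is comfortably larger than the 2040 byte reach.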

This could be accomplished by using VM_GROWSDOWN, but this has a
downside. The behavior would allow shadow stacks to grow, which is
unneeded and differs from how most regular stacks work.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Use __weak instead of #ifdef (Dave Hansen)
 - Only have start gap on shadow stack (Andy Luto)
 - Create stack_guard_start_gap() to not duplicate code
   in an arch version of vm_start_gap() (Dave Hansen)
 - Improve commit log, partly with verbiage from Dave Hansen

Yu-cheng v25:
 - Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.

Yu-cheng v24:
 - Instead of changing vm_*_gap(), create x86-specific versions.

 arch/x86/mm/mmap.c | 23 +++++++++++++++++++++++
 include/linux/mm.h | 11 ++++++-----
 mm/mmap.c          |  7 +++++++
 3 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index f3f52c5e2fd6..b0427bd2da30 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -250,3 +250,26 @@ bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
 		return false;
 	return true;
 }
+
+unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_GROWSDOWN)
+		return stack_guard_gap;
+
+	/*
+	 * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).
+	 * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
+	 * (~1KB for INCSSPD) and touches the first and the last element
+	 * in the range, which triggers a page fault if the range is not
+	 * in a shadow stack. Because of this, creating 4-KB guard pages
+	 * around a shadow stack prevents these instructions from going
+	 * beyond.
+	 *
+	 * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
+	 * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
+	 */
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return PAGE_SIZE;
+
+	return 0;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fef14ab3abcb..09458e77bf52 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2775,15 +2775,16 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
 	return vma;
 }
 
+unsigned long stack_guard_start_gap(struct vm_area_struct *vma);
+
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
+	unsigned long gap = stack_guard_start_gap(vma);
 	unsigned long vm_start = vma->vm_start;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
+	vm_start -= gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
 	return vm_start;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d780f415be3..f0d2e9143bd0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -247,6 +247,13 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	return origbrk;
 }
 
+unsigned long __weak stack_guard_start_gap(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_GROWSDOWN)
+		return stack_guard_gap;
+	return 0;
+}
+
 static inline unsigned long vma_compute_gap(struct vm_area_struct *vma)
 {
 	unsigned long gap, prev_end;
-- 
2.17.1


* [PATCH v2 19/39] mm/mmap: Add shadow stack pages to memory accounting
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (17 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 18/39] mm: Add guard pages around a shadow stack Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 18:31   ` Kees Cook
  2022-10-04  0:03   ` Kirill A . Shutemov
  2022-09-29 22:29 ` [PATCH v2 20/39] mm/mprotect: Exclude shadow stack from preserve_write Rick Edgecombe
                   ` (20 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Account shadow stack pages to stack memory.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Remove is_shadow_stack_mapping() and just test the flags directly with
   a bitwise AND against VM_SHADOW_STACK.

Yu-cheng v26:
 - Remove redundant #ifdef CONFIG_MMU.

Yu-cheng v25:
 - Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().

 mm/mmap.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index f0d2e9143bd0..8569ef09614c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1682,6 +1682,9 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	if (file && is_file_hugepages(file))
 		return 0;
 
+	if (vm_flags & VM_SHADOW_STACK)
+		return 1;
+
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
@@ -3289,6 +3292,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 		mm->exec_vm += npages;
 	else if (is_stack_mapping(flags))
 		mm->stack_vm += npages;
+	else if (flags & VM_SHADOW_STACK)
+		mm->stack_vm += npages;
 	else if (is_data_mapping(flags))
 		mm->data_vm += npages;
 }
-- 
2.17.1


* [PATCH v2 20/39] mm/mprotect: Exclude shadow stack from preserve_write
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (18 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 19/39] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-09-29 22:29 ` [PATCH v2 21/39] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
                   ` (19 subsequent siblings)
  39 siblings, 0 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

In change_pte_range(), when a PTE is changed for prot_numa, _PAGE_RW is
preserved to avoid the additional write fault after the NUMA hinting fault.
However, pte_write() now includes both normal writable and shadow stack
(Write=0, Dirty=1) PTEs, but the latter does not have _PAGE_RW and has no
need to preserve it.

Exclude shadow stack from preserve_write test, and apply the same change to
change_huge_pmd().
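
For illustration only (not part of the patch), the check can be modeled
in userspace like this. The MODEL_* names and bit values are
hypothetical; the model assumes pte_write() reports true for both
regular writable PTEs and shadow stack (Write=0, Dirty=1) PTEs, as
described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; bit values are illustrative only. */
#define MODEL_VM_SHADOW_STACK 0x1u
#define MODEL_PTE_RW          0x1u /* plays the role of _PAGE_RW */
#define MODEL_PTE_DIRTY       0x2u /* plays the role of _PAGE_DIRTY */

/* Writable means RW set, or the shadow stack pattern Write=0, Dirty=1. */
static bool model_pte_write(uint32_t pte)
{
	return (pte & MODEL_PTE_RW) ||
	       (pte & (MODEL_PTE_RW | MODEL_PTE_DIRTY)) == MODEL_PTE_DIRTY;
}

/* Mirrors the preserve_write computation plus the new exclusion. */
static bool model_preserve_write(bool prot_numa, uint32_t pte, uint32_t vm_flags)
{
	bool preserve_write = prot_numa && model_pte_write(pte);

	/* Shadow stack PTEs have no RW bit, so there is nothing to preserve. */
	if (vm_flags & MODEL_VM_SHADOW_STACK)
		preserve_write = false;
	return preserve_write;
}

/* Usage: compare a normal writable PTE with a shadow stack PTE. */
void model_preserve_write_example(void)
{
	assert(model_preserve_write(true, MODEL_PTE_RW, 0));
	assert(!model_preserve_write(true, MODEL_PTE_DIRTY, MODEL_VM_SHADOW_STACK));
	assert(!model_preserve_write(false, MODEL_PTE_RW, 0));
}
```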

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

Yu-cheng v25:
 - Move is_shadow_stack_mapping() to a separate line.

Yu-cheng v24:
 - Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().

 mm/huge_memory.c | 7 +++++++
 mm/mprotect.c    | 7 +++++++
 2 files changed, 14 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 11fc69eb4717..492c4f190f55 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1800,6 +1800,13 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		return 0;
 
 	preserve_write = prot_numa && pmd_write(*pmd);
+
+	/*
+	 * Preserve only normal writable huge PMD, but not shadow
+	 * stack (RW=0, Dirty=1).
+	 */
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		preserve_write = false;
 	ret = 1;
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
diff --git a/mm/mprotect.c b/mm/mprotect.c
index bc6bddd156ca..983206529dce 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -114,6 +114,13 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			pte_t ptent;
 			bool preserve_write = prot_numa && pte_write(oldpte);
 
+			/*
+			 * Preserve only normal writable PTE, but not shadow
+			 * stack (RW=0, Dirty=1).
+			 */
+			if (vma->vm_flags & VM_SHADOW_STACK)
+				preserve_write = false;
+
 			/*
 			 * Avoid trapping faults against the zero or KSM
 			 * pages. See similar comment in change_huge_pmd.
-- 
2.17.1


* [PATCH v2 21/39] mm: Re-introduce vm_flags to do_mmap()
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (19 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 20/39] mm/mprotect: Exclude shadow stack from preserve_write Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-09-29 22:29 ` [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
                   ` (18 subsequent siblings)
  39 siblings, 0 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu, Andrew Morton

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

There were no more callers passing vm_flags to do_mmap(), and vm_flags was
removed from the function's parameters by:

    commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now.  Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap().  Thus, re-introduce vm_flags to do_mmap().
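
A minimal sketch of the resulting flag computation, as a userspace model
rather than kernel code. The MODEL_* values are made up, and
prot_and_flag_bits stands in for the combined results of
calc_vm_prot_bits() and calc_vm_flag_bits(). The only point shown is
that caller-supplied bits are now ORed in instead of being overwritten.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; bit values are illustrative only. */
#define MODEL_VM_READ         UINT64_C(0x001)
#define MODEL_VM_WRITE        UINT64_C(0x002)
#define MODEL_VM_MAYREAD      UINT64_C(0x010)
#define MODEL_VM_MAYWRITE     UINT64_C(0x020)
#define MODEL_VM_MAYEXEC      UINT64_C(0x040)
#define MODEL_VM_SHADOW_STACK UINT64_C(0x100)

/* Models "vm_flags |= ..." in do_mmap() after this change. */
uint64_t model_do_mmap_vm_flags(uint64_t caller_vm_flags,
				uint64_t prot_and_flag_bits,
				uint64_t def_flags)
{
	uint64_t vm_flags = caller_vm_flags;

	vm_flags |= prot_and_flag_bits | def_flags |
		    MODEL_VM_MAYREAD | MODEL_VM_MAYWRITE | MODEL_VM_MAYEXEC;
	return vm_flags;
}

/* Usage: existing callers pass 0, a shadow stack caller passes its flag. */
void model_do_mmap_example(void)
{
	uint64_t plain = model_do_mmap_vm_flags(0, MODEL_VM_READ, 0);
	uint64_t shstk = model_do_mmap_vm_flags(MODEL_VM_SHADOW_STACK,
						MODEL_VM_READ, 0);

	assert(!(plain & MODEL_VM_SHADOW_STACK));
	assert(shstk & MODEL_VM_SHADOW_STACK);
	/* Apart from the extra bit, both results are identical. */
	assert((shstk & ~MODEL_VM_SHADOW_STACK) == plain);
}
```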

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: linux-mm@kvack.org
---
 fs/aio.c           |  2 +-
 include/linux/mm.h |  3 ++-
 ipc/shm.c          |  2 +-
 mm/mmap.c          | 10 +++++-----
 mm/nommu.c         |  4 ++--
 mm/util.c          |  2 +-
 6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 606613e9d1f4..a54b5ee72f1c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -554,7 +554,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
 				 PROT_READ | PROT_WRITE,
-				 MAP_SHARED, 0, &unused, NULL);
+				 MAP_SHARED, 0, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 09458e77bf52..6aa0ffe3666c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2667,7 +2667,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf);
 extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
diff --git a/ipc/shm.c b/ipc/shm.c
index b3048ebd5c31..f236b3e14ec4 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1646,7 +1646,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 8569ef09614c..e1006c41b1cc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1375,11 +1375,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
  */
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
-			unsigned long flags, unsigned long pgoff,
-			unsigned long *populate, struct list_head *uf)
+			unsigned long flags, vm_flags_t vm_flags,
+			unsigned long pgoff, unsigned long *populate,
+			struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
-	vm_flags_t vm_flags;
 	int pkey = 0;
 
 	*populate = 0;
@@ -1439,7 +1439,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
@@ -2964,7 +2964,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 
 	file = get_file(vma->vm_file);
 	ret = do_mmap(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+			prot, flags, 0, pgoff, &populate, NULL);
 	fput(file);
 out:
 	mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index e819cbc21b39..85b41107a192 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1059,6 +1059,7 @@ unsigned long do_mmap(struct file *file,
 			unsigned long len,
 			unsigned long prot,
 			unsigned long flags,
+			vm_flags_t vm_flags,
 			unsigned long pgoff,
 			unsigned long *populate,
 			struct list_head *uf)
@@ -1066,7 +1067,6 @@ unsigned long do_mmap(struct file *file,
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	struct rb_node *rb;
-	vm_flags_t vm_flags;
 	unsigned long capabilities, result;
 	int ret;
 
@@ -1085,7 +1085,7 @@ unsigned long do_mmap(struct file *file,
 
 	/* we've determined that we can make the mapping, now translate what we
 	 * now know into VMA flags */
-	vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+	vm_flags |= determine_vm_flags(file, prot, flags, capabilities);
 
 	/* we're going to need to record the mapping */
 	region = kmem_cache_zalloc(vm_region_jar, GFP_KERNEL);
diff --git a/mm/util.c b/mm/util.c
index c9439c66d8cf..f15929f2c5bd 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -549,7 +549,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	if (!ret) {
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
-		ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
 			      &uf);
 		mmap_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
-- 
2.17.1


* [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (20 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 21/39] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-09-30 19:16   ` Dave Hansen
  2022-10-03 18:39   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 23/39] x86: Introduce userspace API for CET enabling Rick Edgecombe
                   ` (17 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

Shadow stack memory is writable only in very specific, controlled ways.
However, since it is writable, the kernel treats it as such. As a result
there remain many ways for userspace to trigger the kernel to write to
shadow stacks via get_user_pages(, FOLL_WRITE) operations. To make this a
little less exposed, block writable GUPs for shadow stack VMAs.

Still allow FOLL_FORCE to write through shadow stack protections, as it
does for read-only protections.
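
The resulting write check can be sketched as a userspace model
(illustration only). The MODEL_* flag values are hypothetical, and the
additional checks the kernel performs on the FOLL_FORCE path are
deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins; bit values are illustrative only. */
#define MODEL_VM_WRITE        0x1u
#define MODEL_VM_SHADOW_STACK 0x2u
#define MODEL_FOLL_WRITE      0x1u
#define MODEL_FOLL_FORCE      0x2u

/* Models only the write branch of check_vma_flags() after this patch. */
static int model_check_write_gup(uint32_t vm_flags, uint32_t gup_flags)
{
	if (!(gup_flags & MODEL_FOLL_WRITE))
		return 0;

	if (!(vm_flags & MODEL_VM_WRITE) || (vm_flags & MODEL_VM_SHADOW_STACK)) {
		if (!(gup_flags & MODEL_FOLL_FORCE))
			return -EFAULT;
		/* The real code applies further checks here; omitted. */
	}
	return 0;
}

/* Usage: a write GUP on a shadow stack VMA with and without FOLL_FORCE. */
void model_check_write_gup_example(void)
{
	uint32_t shstk_vma = MODEL_VM_WRITE | MODEL_VM_SHADOW_STACK;

	assert(model_check_write_gup(shstk_vma, MODEL_FOLL_WRITE) == -EFAULT);
	assert(model_check_write_gup(shstk_vma,
				     MODEL_FOLL_WRITE | MODEL_FOLL_FORCE) == 0);
	assert(model_check_write_gup(MODEL_VM_WRITE, MODEL_FOLL_WRITE) == 0);
}
```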

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v2:
 - New patch

 arch/x86/include/asm/pgtable.h | 3 +++
 mm/gup.c                       | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7a769c4dbc1c..2e6a5ee70034 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1633,6 +1633,9 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
 {
 	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
 
+	if (write && (pteval & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY)
+		return 0;
+
 	if (write)
 		need_pte_bits |= _PAGE_RW;
 
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..56da98f3335c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1043,7 +1043,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		return -EFAULT;
 
 	if (write) {
-		if (!(vm_flags & VM_WRITE)) {
+		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
 			/*
-- 
2.17.1


* [PATCH v2 23/39] x86: Introduce userspace API for CET enabling
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (21 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 19:01   ` Kees Cook
  2022-10-10 10:56   ` Florian Weimer
  2022-09-29 22:29 ` [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support Rick Edgecombe
                   ` (16 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Add three new arch_prctl() handles:

 - ARCH_CET_ENABLE/DISABLE enables or disables the specified
   feature. Returns 0 on success or an error.

 - ARCH_CET_LOCK prevents future disabling or enabling of the
   specified feature. Returns 0 on success or an error.

The features are handled per-thread and inherited over fork(2)/clone(2),
but reset on exec().
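
The intended semantics can be sketched with a small userspace model
(illustration only). The ARCH_CET_* values are the ones from this patch;
struct thread_model, MODEL_FEATURE_A/B and the helper names are
hypothetical. Note one loud assumption: in this preparation patch the
enable and disable paths are stubs that return -EINVAL, whereas the
model below assumes they simply set or clear the feature bit.

```c
#include <assert.h>
#include <errno.h>

#define ARCH_CET_ENABLE  0x4001
#define ARCH_CET_DISABLE 0x4002
#define ARCH_CET_LOCK    0x4003

/* Hypothetical feature bits, for illustration only. */
#define MODEL_FEATURE_A 0x1ul
#define MODEL_FEATURE_B 0x2ul

/* Models the two fields added to struct thread_struct. */
struct thread_model {
	unsigned long features;
	unsigned long features_locked;
};

/* Portable stand-in for hweight_long(). */
static int model_hweight(unsigned long v)
{
	int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Same check order as cet_prctl(), minus the ptrace check. */
long model_cet_prctl(struct thread_model *t, int option, unsigned long features)
{
	if (option == ARCH_CET_LOCK) {
		t->features_locked |= features;
		return 0;
	}
	if (features & t->features_locked)
		return -EPERM;
	if (model_hweight(features) > 1)
		return -EINVAL;
	if (option == ARCH_CET_DISABLE) {
		t->features &= ~features; /* assumption: stub in this patch */
		return 0;
	}
	if (option == ARCH_CET_ENABLE) {
		t->features |= features;  /* assumption: stub in this patch */
		return 0;
	}
	return -EINVAL;
}

/* Models the reset done in arch_setup_new_exec(). */
void model_exec_reset(struct thread_model *t)
{
	t->features = 0;
	t->features_locked = 0;
}

/* Usage: enable, lock, then observe that the locked bit cannot change. */
void model_cet_prctl_example(void)
{
	struct thread_model t = { 0, 0 };

	assert(model_cet_prctl(&t, ARCH_CET_ENABLE, MODEL_FEATURE_A) == 0);
	assert(model_cet_prctl(&t, ARCH_CET_LOCK, MODEL_FEATURE_A) == 0);
	assert(model_cet_prctl(&t, ARCH_CET_DISABLE, MODEL_FEATURE_A) == -EPERM);
	model_exec_reset(&t);
	assert(t.features == 0 && t.features_locked == 0);
}
```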

This is a preparation patch. It does not implement any features.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[tweaked with feedback from tglx]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Only allow one enable/disable per call (tglx)
 - Return error code like a normal arch_prctl() (Alexander Potapenko)
 - Make CET only (tglx)

 arch/x86/include/asm/cet.h        | 20 ++++++++++++++++
 arch/x86/include/asm/processor.h  |  3 +++
 arch/x86/include/uapi/asm/prctl.h |  6 +++++
 arch/x86/kernel/process.c         |  4 ++++
 arch/x86/kernel/process_64.c      |  5 +++-
 arch/x86/kernel/shstk.c           | 38 +++++++++++++++++++++++++++++++
 6 files changed, 75 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/kernel/shstk.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
new file mode 100644
index 000000000000..0fa4dbc98c49
--- /dev/null
+++ b/arch/x86/include/asm/cet.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CET_H
+#define _ASM_X86_CET_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_X86_SHADOW_STACK
+long cet_prctl(struct task_struct *task, int option,
+		      unsigned long features);
+#else
+static inline long cet_prctl(struct task_struct *task, int option,
+		      unsigned long features) { return -EINVAL; }
+#endif /* CONFIG_X86_SHADOW_STACK */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 356308c73951..a92bf76edafe 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -530,6 +530,9 @@ struct thread_struct {
 	 */
 	u32			pkru;
 
+	unsigned long		features;
+	unsigned long		features_locked;
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..028158e35269 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,10 @@
 #define ARCH_MAP_VDSO_32		0x2002
 #define ARCH_MAP_VDSO_64		0x2003
 
+/* Don't use 0x3001-0x3004 because of old glibcs */
+
+#define ARCH_CET_ENABLE			0x4001
+#define ARCH_CET_DISABLE		0x4002
+#define ARCH_CET_LOCK			0x4003
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 58a6ea472db9..034880311e6b 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -367,6 +367,10 @@ void arch_setup_new_exec(void)
 		task_clear_spec_ssb_noexec(current);
 		speculation_ctrl_update(read_thread_flags());
 	}
+
+	/* Reset thread features on exec */
+	current->thread.features = 0;
+	current->thread.features_locked = 0;
 }
 
 #ifdef CONFIG_X86_IOPL_IOPERM
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1962008fe743..8fa2c2b7de65 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -829,7 +829,10 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_MAP_VDSO_64:
 		return prctl_map_vdso(&vdso_image_64, arg2);
 #endif
-
+	case ARCH_CET_ENABLE:
+	case ARCH_CET_DISABLE:
+	case ARCH_CET_LOCK:
+		return cet_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
new file mode 100644
index 000000000000..e3276ac9e9b9
--- /dev/null
+++ b/arch/x86/kernel/shstk.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * shstk.c - Intel shadow stack support
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Yu-cheng Yu <yu-cheng.yu@intel.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <asm/prctl.h>
+
+long cet_prctl(struct task_struct *task, int option, unsigned long features)
+{
+	if (option == ARCH_CET_LOCK) {
+		task->thread.features_locked |= features;
+		return 0;
+	}
+
+	/* Don't allow via ptrace */
+	if (task != current)
+		return -EINVAL;
+
+	/* Do not allow changing locked features */
+	if (features & task->thread.features_locked)
+		return -EPERM;
+
+	/* Only support enabling/disabling one feature at a time. */
+	if (hweight_long(features) > 1)
+		return -EINVAL;
+
+	if (option == ARCH_CET_DISABLE) {
+		return -EINVAL;
+	}
+
+	/* Handle ARCH_CET_ENABLE */
+	return -EINVAL;
+}
-- 
2.17.1



* [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (22 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 23/39] x86: Introduce userspace API for CET enabling Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 19:43   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack Rick Edgecombe
                   ` (15 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
and has a fixed size of min(RLIMIT_STACK, 4GB).
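
As a rough illustration only (not part of the patch), the sizing rule above
can be sketched in userspace C. The sketch_* names and the 4 KiB page size
are assumptions made for this example; the kernel code uses PAGE_ALIGN(),
min_t() and SZ_4G instead:

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */
#define SKETCH_SZ_4G	 (1ULL << 32)

/* Round up to the next page boundary, similar in spirit to PAGE_ALIGN(). */
static uint64_t sketch_page_align(uint64_t x)
{
	return (x + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* min(RLIMIT_STACK, 4GB), then rounded up to a whole number of pages. */
static uint64_t sketch_shstk_size(uint64_t rlimit_stack)
{
	uint64_t size = rlimit_stack < SKETCH_SZ_4G ? rlimit_stack : SKETCH_SZ_4G;

	return sketch_page_align(size);
}
```

With these assumptions an 8 MB stack limit maps to an 8 MB shadow stack,
while an unlimited RLIMIT_STACK is capped at 4 GB.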

Keep the task's shadow stack address and size in thread_struct. This will
be copied when cloning new threads, but needs to be cleared during exec,
so add a function to do this.

Do not support IA32 emulation.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Get rid of unnecessary shstk->base checks
 - Don't support IA32 emulation

v1:
 - Switch to xsave helpers.
 - Expand commit log.

Yu-cheng v30:
 - Remove superfluous comments for struct thread_shstk.
 - Replace 'populate' with 'unused'.

Yu-cheng v28:
 - Update shstk_setup() with wrmsrl_safe(), returns success when shadow
   stack feature is not present (since this is a setup function).

 arch/x86/include/asm/cet.h        |  13 +++
 arch/x86/include/asm/msr.h        |  11 +++
 arch/x86/include/asm/processor.h  |   5 ++
 arch/x86/include/uapi/asm/prctl.h |   2 +
 arch/x86/kernel/Makefile          |   2 +
 arch/x86/kernel/process_64.c      |   2 +
 arch/x86/kernel/shstk.c           | 143 ++++++++++++++++++++++++++++++
 7 files changed, 178 insertions(+)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 0fa4dbc98c49..a4a1f4c0089b 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -7,12 +7,25 @@
 
 struct task_struct;
 
+struct thread_shstk {
+	u64	base;
+	u64	size;
+};
+
 #ifdef CONFIG_X86_SHADOW_STACK
 long cet_prctl(struct task_struct *task, int option,
 		      unsigned long features);
+int shstk_setup(void);
+void shstk_free(struct task_struct *p);
+int shstk_disable(void);
+void reset_thread_shstk(void);
 #else
 static inline long cet_prctl(struct task_struct *task, int option,
 		      unsigned long features) { return -EINVAL; }
+static inline int shstk_setup(void) { return -EOPNOTSUPP; }
+static inline void shstk_free(struct task_struct *p) {}
+static inline int shstk_disable(void) { return -EOPNOTSUPP; }
+static inline void reset_thread_shstk(void) {}
 #endif /* CONFIG_X86_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..a9cb4c434e60 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -310,6 +310,17 @@ void msrs_free(struct msr *msrs);
 int msr_set_bit(u32 msr, u8 bit);
 int msr_clear_bit(u32 msr, u8 bit);
 
+static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
+{
+	u64 val, new_val;
+
+	rdmsrl(msr, val);
+	new_val = (val & ~clear) | set;
+
+	if (new_val != val)
+		wrmsrl(msr, new_val);
+}
+
 #ifdef CONFIG_SMP
 int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
 int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a92bf76edafe..3a0c9d9d4d1d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -27,6 +27,7 @@ struct vm86;
 #include <asm/unwind_hints.h>
 #include <asm/vmxfeatures.h>
 #include <asm/vdso/processor.h>
+#include <asm/cet.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -533,6 +534,10 @@ struct thread_struct {
 	unsigned long		features;
 	unsigned long		features_locked;
 
+#ifdef CONFIG_X86_SHADOW_STACK
+	struct thread_shstk	shstk;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 028158e35269..41af3a8c4fa4 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -26,4 +26,6 @@
 #define ARCH_CET_DISABLE		0x4002
 #define ARCH_CET_LOCK			0x4003
 
+#define CET_SHSTK			0x1
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a20a5ebfacd7..8950d1f71226 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -139,6 +139,8 @@ obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)		+= sev.o
 
+obj-$(CONFIG_X86_SHADOW_STACK)		+= shstk.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 8fa2c2b7de65..be544b4b4c8b 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -514,6 +514,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 		load_gs_index(__USER_DS);
 	}
 
+	reset_thread_shstk();
+
 	loadsegment(fs, 0);
 	loadsegment(es, _ds);
 	loadsegment(ds, _ds);
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index e3276ac9e9b9..a0b8d4adb2bf 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -8,8 +8,151 @@
 
 #include <linux/sched.h>
 #include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <asm/msr.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/cet.h>
+#include <asm/special_insns.h>
+#include <asm/fpu/api.h>
 #include <asm/prctl.h>
 
+static bool feature_enabled(unsigned long features)
+{
+	return current->thread.features & features;
+}
+
+static void feature_set(unsigned long features)
+{
+	current->thread.features |= features;
+}
+
+static void feature_clr(unsigned long features)
+{
+	current->thread.features &= ~features;
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, unused;
+
+	mmap_write_lock(mm);
+	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
+		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+
+	mmap_write_unlock(mm);
+
+	return addr;
+}
+
+static void unmap_shadow_stack(u64 base, u64 size)
+{
+	while (1) {
+		int r;
+
+		r = vm_munmap(base, size);
+
+		/*
+		 * vm_munmap() returns -EINTR when mmap_lock is held by
+		 * something else, and that lock should not be held for a
+		 * long time.  Retry in that case.
+		 */
+		if (r == -EINTR) {
+			cond_resched();
+			continue;
+		}
+
+		/*
+		 * For all other types of vm_munmap() failure, either the
+		 * system is out of memory or there is a bug.
+		 */
+		WARN_ON_ONCE(r);
+		break;
+	}
+}
+
+int shstk_setup(void)
+{
+	struct thread_shstk *shstk = &current->thread.shstk;
+	unsigned long addr, size;
+
+	/* Already enabled */
+	if (feature_enabled(CET_SHSTK))
+		return 0;
+
+	/* Also not supported for 32 bit */
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) || in_ia32_syscall())
+		return -EOPNOTSUPP;
+
+	size = PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	fpu_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+	wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
+	fpregs_unlock();
+
+	shstk->base = addr;
+	shstk->size = size;
+	feature_set(CET_SHSTK);
+
+	return 0;
+}
+
+void reset_thread_shstk(void)
+{
+	memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
+	current->thread.features = 0;
+	current->thread.features_locked = 0;
+}
+
+void shstk_free(struct task_struct *tsk)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+	    !feature_enabled(CET_SHSTK))
+		return;
+
+	if (!tsk->mm)
+		return;
+
+	unmap_shadow_stack(shstk->base, shstk->size);
+}
+
+int shstk_disable(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return -EOPNOTSUPP;
+
+	/* Already disabled? */
+	if (!feature_enabled(CET_SHSTK))
+		return 0;
+
+	fpu_lock_and_load();
+	/* Disable WRSS too when disabling shadow stack */
+	set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN);
+	wrmsrl(MSR_IA32_PL3_SSP, 0);
+	fpregs_unlock();
+
+	shstk_free(current);
+	feature_clr(CET_SHSTK);
+
+	return 0;
+}
+
 long cet_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_CET_LOCK) {
-- 
2.17.1



* [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (23 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 10:36   ` Mike Rapoport
  2022-10-03 20:29   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk Rick Edgecombe
                   ` (14 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-CET case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as long as
the child doesn't return from the vfork()/CLONE_VFORK calling function and
sticks to a limited set of operations, the parent and child can share the
same stack.

For shadow stack, these scenarios present similar sharing problems. For the
CLONE_VM case, the child and the parent must have separate shadow stacks.
Instead of changing clone to take a shadow stack, have the kernel just
allocate one and switch to it.

Use the stack_size passed from the clone3() syscall for the thread shadow
stack size. A compat-mode thread shadow stack size is further reduced to 1/4.
This allows more threads to run in a 32-bit address space. clone() does not
pass stack_size, which was added with clone3(). In that case, use the
RLIMIT_STACK size, capped at 4 GB.

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing, as long as it doesn't return from the vfork()
calling function and overwrite entries further up the shadow stack. The
child can safely overwrite entries further down the shadow stack, as the
parent can just overwrite them later. So CET does not add any additional
limitations for vfork().
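
To illustrate the cases above, the decision of whether a child needs its own
shadow stack can be sketched as follows. This is a userspace sketch, not the
patch code: the sketch_* names are made up for the example, and the flag
values are the ones from the Linux UAPI <linux/sched.h>:

```c
#include <stdbool.h>

/* Same values as CLONE_VM and CLONE_VFORK in the Linux UAPI headers. */
#define SKETCH_CLONE_VM	   0x00000100UL
#define SKETCH_CLONE_VFORK 0x00004000UL

/*
 * A new shadow stack is only needed when the child shares the address
 * space (CLONE_VM) and the parent keeps running (no CLONE_VFORK).
 */
static bool sketch_needs_new_shstk(unsigned long clone_flags)
{
	unsigned long mask = SKETCH_CLONE_VFORK | SKETCH_CLONE_VM;

	return (clone_flags & mask) == SKETCH_CLONE_VM;
}
```

A plain fork() (no CLONE_VM) gets a copy of the address space and therefore
of the shadow stack, and a vfork() child borrows the parent's, so both return
false here. Only a thread-like clone returns true.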

Userspace implementing POSIX vfork() can actually prevent the child from
returning from the vfork() calling function, using CET. Glibc does this
by adjusting the shadow stack pointer in the child, so that the child
receives a #CP if it tries to return from the vfork() calling function.

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared with the
parent.

During this operation, the shadow stack pointer of the new thread needs
to be updated to point to the newly allocated shadow stack. Since the
ability to do this is confined to the FPU subsystem, change
fpu_clone() to take the new shadow stack pointer, and update it
internally inside the FPU subsystem. This part was suggested by Thomas
Gleixner.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Have fpu_clone() take new shadow stack pointer and update SSP in
   xsave buffer for new task. (tglx)

v1:
 - Expand commit log.
 - Add more comments.
 - Switch to xsave helpers.

Yu-cheng v30:
 - Update comments about clone()/clone3(). (Borislav Petkov)

Yu-cheng v29:
 - WARN_ON_ONCE() when get_xsave_addr() returns NULL, and update comments.
   (Dave Hansen)

 arch/x86/include/asm/cet.h         |  7 +++++
 arch/x86/include/asm/fpu/sched.h   |  3 +-
 arch/x86/include/asm/mmu_context.h |  2 ++
 arch/x86/kernel/fpu/core.c         | 40 ++++++++++++++++++++++++-
 arch/x86/kernel/process.c          | 17 ++++++++++-
 arch/x86/kernel/shstk.c            | 48 +++++++++++++++++++++++++++++-
 6 files changed, 113 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index a4a1f4c0089b..924de99e0c61 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -16,6 +16,9 @@ struct thread_shstk {
 long cet_prctl(struct task_struct *task, int option,
 		      unsigned long features);
 int shstk_setup(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+			     unsigned long stack_size,
+			     unsigned long *shstk_addr);
 void shstk_free(struct task_struct *p);
 int shstk_disable(void);
 void reset_thread_shstk(void);
@@ -23,6 +26,10 @@ void reset_thread_shstk(void);
 static inline long cet_prctl(struct task_struct *task, int option,
 		      unsigned long features) { return -EINVAL; }
 static inline int shstk_setup(void) { return -EOPNOTSUPP; }
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+					   unsigned long clone_flags,
+					   unsigned long stack_size,
+					   unsigned long *shstk_addr) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
 static inline int shstk_disable(void) { return -EOPNOTSUPP; }
 static inline void reset_thread_shstk(void) {}
diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index b2486b2cbc6e..54c9c2fd1907 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -11,7 +11,8 @@
 
 extern void save_fpregs_to_fpstate(struct fpu *fpu);
 extern void fpu__drop(struct fpu *fpu);
-extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal);
+extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+		      unsigned long shstk_addr);
 extern void fpu_flush_thread(void);
 
 /*
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index b8d40ddeab00..d29988cbdf20 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -146,6 +146,8 @@ do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		shstk_free(tsk);		\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 778d3054ccc7..f332e9b42b6d 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -555,8 +555,40 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
 	}
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+	struct cet_user_state *xstate;
+
+	/* If ssp update is not needed. */
+	if (!ssp)
+		return 0;
+
+	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+				XFEATURE_CET_USER);
+
+	/*
+	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+	 * stack and the fpu state should be up to date since it was just copied
+	 * from the parent in fpu_clone(). So there must be a valid non-init CET
+	 * state location in the buffer.
+	 */
+	if (WARN_ON_ONCE(!xstate))
+		return 1;
+
+	xstate->user_ssp = (u64)ssp;
+
+	return 0;
+}
+#else
+/* Without shadow stack support there is no SSP to update. */
+static inline int update_fpu_shstk(struct task_struct *dst,
+				   unsigned long shstk_addr) { return 0; }
+#endif
+
 /* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+	      unsigned long ssp)
 {
 	struct fpu *src_fpu = &current->thread.fpu;
 	struct fpu *dst_fpu = &dst->thread.fpu;
@@ -616,6 +648,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
 	if (use_xsave())
 		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
 
+	/*
+	 * Update shadow stack pointer, in case it changed during clone.
+	 */
+	if (update_fpu_shstk(dst, ssp))
+		return 1;
+
 	trace_x86_fpu_copy_src(src_fpu);
 	trace_x86_fpu_copy_dst(dst_fpu);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 034880311e6b..5e63d190becd 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -47,6 +47,7 @@
 #include <asm/frame.h>
 #include <asm/unwind.h>
 #include <asm/tdx.h>
+#include <asm/cet.h>
 
 #include "process.h"
 
@@ -118,6 +119,7 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
@@ -139,6 +141,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
+	unsigned long shstk_addr = 0;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -173,7 +176,12 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-	fpu_clone(p, clone_flags, args->fn);
+	/* Allocate a new shadow stack for pthread if needed */
+	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size, &shstk_addr);
+	if (ret)
+		return ret;
+
+	fpu_clone(p, clone_flags, args->fn, shstk_addr);
 
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
@@ -219,6 +227,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
+	/*
+	 * If copy_thread() is failing, don't leak the shadow stack possibly
+	 * allocated in shstk_alloc_thread_stack() above.
+	 */
+	if (ret)
+		shstk_free(p);
+
 	return ret;
 }
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index a0b8d4adb2bf..db4e53f9fdaf 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -118,6 +118,46 @@ void reset_thread_shstk(void)
 	current->thread.features_locked = 0;
 }
 
+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+			     unsigned long stack_size, unsigned long *shstk_addr)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long addr;
+
+	/*
+	 * If shadow stack is not enabled on the new thread, skip any
+	 * switch to a new shadow stack.
+	 */
+	if (!feature_enabled(CET_SHSTK))
+		return 0;
+
+	/*
+	 * clone() does not pass stack_size, which was added to clone3().
+	 * Use RLIMIT_STACK and cap to 4 GB.
+	 */
+	if (!stack_size)
+		stack_size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
+
+	/*
+	 * For CLONE_VM, except vfork, the child needs a separate shadow
+	 * stack.
+	 */
+	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+		return 0;
+
+	stack_size = PAGE_ALIGN(stack_size);
+	addr = alloc_shstk(stack_size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	shstk->base = addr;
+	shstk->size = stack_size;
+
+	*shstk_addr = addr + stack_size;
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -126,7 +166,13 @@ void shstk_free(struct task_struct *tsk)
 	    !feature_enabled(CET_SHSTK))
 		return;
 
-	if (!tsk->mm)
+	/*
+	 * When fork() with CLONE_VM fails, the child (tsk) already has a
+	 * shadow stack allocated, and exit_thread() calls this function to
+	 * free it.  In this case the parent (current) and the child share
+	 * the same mm struct.
+	 */
+	if (!tsk->mm || tsk->mm != current->mm)
 		return;
 
 	unmap_shadow_stack(shstk->base, shstk->size);
-- 
2.17.1



* [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (24 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 20:44   ` Kees Cook
  2022-10-05  2:43   ` Andrew Cooper
  2022-09-29 22:29 ` [PATCH v2 27/39] x86/cet/shstk: Handle signals for shadow stack Rick Edgecombe
                   ` (13 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Shadow stacks are normally written to via CALL/RET or specific CET
instructions like RSTORSSP/SAVEPREVSSP. However, during some Linux
operations the kernel will need to write to them directly using the
ring-0 only WRUSS instruction.

A shadow stack restore token marks a restore point of the shadow stack, and
the address in a token must point directly above the token, which is within
the same shadow stack. This is distinctly different from other pointers
on the shadow stack, since those pointers point into executable code.
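
The token layout described above can be sketched in plain C. This is only an
illustration under assumptions: the sketch_* names are hypothetical, and the
checks mirror the ones in this patch (mode bit 0 set, busy bit 1 clear,
8-byte alignment, token value pointing directly above the token) without any
access to a real shadow stack:

```c
#include <stdint.h>
#include <stdbool.h>

#define SKETCH_TOKEN_MODE_64 (1ULL << 0)	/* token is for 64-bit mode */
#define SKETCH_TOKEN_BUSY    (1ULL << 1)	/* token is currently in use */

/* Value of the token that would be stored at address ssp - 8. */
static uint64_t sketch_make_token(uint64_t ssp)
{
	return ssp | SKETCH_TOKEN_MODE_64;
}

/* Check a token value that was read from address token_addr. */
static bool sketch_token_valid(uint64_t token, uint64_t token_addr)
{
	if (!(token & SKETCH_TOKEN_MODE_64))
		return false;
	if (token & SKETCH_TOKEN_BUSY)
		return false;

	token &= ~3ULL;			/* mask out the flag bits */
	if (token & 7)			/* restore address must be aligned */
		return false;

	/* The token must point directly above its own location. */
	return token - 8 == token_addr;
}
```

A return address fails this check because it points into code rather than to
the slot directly above itself.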

Introduce token setup and verify routines. Also introduce WRUSS, which is
a kernel-mode instruction but writes directly to user shadow stack.

In future patches that enable shadow stack to work with signals, the kernel
will need something to denote the point in the stack where sigreturn may be
called. This will prevent attackers calling sigreturn at arbitrary places
in the stack, in order to help prevent SROP attacks.

To do this, something that can only be written by the kernel needs to be
placed on the shadow stack. This can be accomplished by setting bit 63 in
the frame written to the shadow stack. Userspace return addresses can't
have this bit set as it is in the kernel range. It also can't be a
valid restore token.
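
A minimal sketch of this data format, for illustration only: the sketch_*
names are assumptions, and unlike put_shstk_data() in this patch (which only
warns), the encoder here rejects input that already has bit 63 set:

```c
#include <stdint.h>

#define SKETCH_DATA_BIT (1ULL << 63)

/* Tag a value as kernel-written data; user addresses never have bit 63. */
static int sketch_encode_data(uint64_t data, uint64_t *frame)
{
	if (data & SKETCH_DATA_BIT)
		return -1;

	*frame = data | SKETCH_DATA_BIT;
	return 0;
}

/* Accept only frames carrying the tag, and strip it to recover the value. */
static int sketch_decode_data(uint64_t frame, uint64_t *data)
{
	if (!(frame & SKETCH_DATA_BIT))
		return -1;

	*data = frame & ~SKETCH_DATA_BIT;
	return 0;
}
```

An ordinary userspace return address such as 0x401000 fails to decode, which
is what lets the kernel tell its own frames apart from return addresses.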

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Add data helpers for writing to shadow stack.

v1:
 - Use xsave helpers.

Yu-cheng v30:
 - Update commit log, remove description about signals.
 - Update various comments.
 - Remove variable 'ssp' init and adjust return value accordingly.
 - Check get_user_shstk_addr() return value.
 - Replace 'ia32' with 'proc32'.

Yu-cheng v29:
 - Update comments for the use of get_xsave_addr().

 arch/x86/include/asm/special_insns.h |  13 ++++
 arch/x86/kernel/shstk.c              | 108 +++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 35f709f619fb..f096f52bd059 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
 		: [pax] "a" (p));
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static inline int write_user_shstk_64(u64 __user *addr, u64 val)
+{
+	asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: [addr] "r" (addr), [val] "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EFAULT;
+}
+#endif /* CONFIG_X86_SHADOW_STACK */
+
 #define nop() asm volatile ("nop")
 
 static inline void serialize(void)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index db4e53f9fdaf..8904aef487bf 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -25,6 +25,8 @@
 #include <asm/fpu/api.h>
 #include <asm/prctl.h>
 
+#define SS_FRAME_SIZE 8
+
 static bool feature_enabled(unsigned long features)
 {
 	return current->thread.features & features;
@@ -40,6 +42,31 @@ static void feature_clr(unsigned long features)
 	current->thread.features &= ~features;
 }
 
+/*
+ * Create a restore token on the shadow stack.  A token is always 8 bytes
+ * in size and 8-byte aligned.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+	unsigned long addr;
+
+	/* Token must be aligned */
+	if (!IS_ALIGNED(ssp, 8))
+		return -EINVAL;
+
+	addr = ssp - SS_FRAME_SIZE;
+
+	/* Mark the token 64-bit */
+	ssp |= BIT(0);
+
+	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+		return -EFAULT;
+
+	*token_addr = addr;
+
+	return 0;
+}
+
 static unsigned long alloc_shstk(unsigned long size)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
@@ -158,6 +185,87 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 	return 0;
 }
 
+static unsigned long get_user_shstk_addr(void)
+{
+	unsigned long long ssp;
+
+	fpu_lock_and_load();
+
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	fpregs_unlock();
+
+	return ssp;
+}
+
+static int put_shstk_data(u64 __user *addr, u64 data)
+{
+	WARN_ON(data & BIT(63));
+
+	/*
+	 * Mark the high bit so that the sigframe can't be processed as a
+	 * return address.
+	 */
+	if (write_user_shstk_64(addr, data | BIT(63)))
+		return -EFAULT;
+	return 0;
+}
+
+static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
+{
+	unsigned long ldata;
+
+	if (unlikely(get_user(ldata, addr)))
+		return -EFAULT;
+
+	if (!(ldata & BIT(63)))
+		return -EINVAL;
+
+	*data = ldata & ~BIT(63);
+
+	return 0;
+}
+
+/*
+ * Verify the user shadow stack has a valid token on it, and then set
+ * *new_ssp according to the token.
+ */
+static int shstk_check_rstor_token(unsigned long *new_ssp)
+{
+	unsigned long token_addr;
+	unsigned long token;
+
+	token_addr = get_user_shstk_addr();
+	if (!token_addr)
+		return -EINVAL;
+
+	if (get_user(token, (unsigned long __user *)token_addr))
+		return -EFAULT;
+
+	/* Is mode flag correct? */
+	if (!(token & BIT(0)))
+		return -EINVAL;
+
+	/* Is busy flag set? */
+	if (token & BIT(1))
+		return -EINVAL;
+
+	/* Mask out flags */
+	token &= ~3UL;
+
+	/* Restore address aligned? */
+	if (!IS_ALIGNED(token, 8))
+		return -EINVAL;
+
+	/* Token placed properly? */
+	if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
+		return -EINVAL;
+
+	*new_ssp = token;
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
-- 
2.17.1



* [PATCH v2 27/39] x86/cet/shstk: Handle signals for shadow stack
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (25 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 20:52   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
                   ` (12 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When a signal is handled normally, the context is pushed to the stack
before handling it. For shadow stacks, since the shadow stack only tracks
return addresses, there isn't any state that needs to be pushed. However,
there are still a few things that need to be done. These things are
visible to userspace and will be kernel ABI for shadow stacks.

One is to make sure the restorer address is written to shadow stack, since
the signal handler (if not changing ucontext) returns to the restorer, and
the restorer calls sigreturn. So add the restorer on the shadow stack
before handling the signal, so there is not a conflict when the signal
handler returns to the restorer.

The other thing to do is to place some type of checkable token on the
thread's shadow stack before handling the signal and check it during
sigreturn. This is an extra layer of protection to hamper attackers
calling sigreturn manually as in SROP-like attacks.

For this token we can use the shadow stack data format defined earlier.
Have the data pushed be the previous SSP. Storing the SSP itself (instead
of a restore offset or something) leaves room for future functionality in
which sigreturn restores to a different stack.

So, when handling a signal, push:
 - the previous SSP, in the shadow stack data format
 - the restorer address, below the SSP entry

In sigreturn, verify that the SSP is stored in the data format and pop the
shadow stack.
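
As a rough illustration of the ordering described above, here is a
userspace model that uses an array as the shadow stack. It is only a
sketch: the sim_* names are hypothetical, and the marker bit (bit 63) is
an assumption standing in for the data format defined earlier in the
series. It shows the push order at delivery, the handler's return
consuming the restorer entry, and sigreturn rejecting an entry that is
not in the data format.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_SLOTS	16
/* Assumption: bit 63 tags a "data" entry; the real format is set elsewhere. */
#define SIM_DATA_BIT	(1ULL << 63)

struct sim_shstk {
	uint64_t slot[SIM_SLOTS];	/* higher index == higher address */
	size_t ssp;			/* index of top entry; SIM_SLOTS == empty */
};

/* Signal delivery: push the previous SSP as tagged data, then the restorer. */
static int sim_setup_signal(struct sim_shstk *s, uint64_t restorer)
{
	size_t old_ssp = s->ssp;

	if (!restorer || s->ssp < 2)
		return -1;

	s->slot[--s->ssp] = (uint64_t)old_ssp | SIM_DATA_BIT;
	s->slot[--s->ssp] = restorer;
	return 0;
}

/* The handler's RET pops the restorer address off the shadow stack. */
static uint64_t sim_handler_return(struct sim_shstk *s)
{
	return s->slot[s->ssp++];
}

/* sigreturn: the entry at SSP must be tagged data holding a sane SSP. */
static int sim_sigreturn(struct sim_shstk *s)
{
	uint64_t entry = s->slot[s->ssp];
	uint64_t saved = entry & ~SIM_DATA_BIT;

	if (!(entry & SIM_DATA_BIT))
		return -1;	/* looks like a return address, not a token */
	if (saved > SIM_SLOTS)
		return -1;	/* saved SSP points outside the stack */

	s->ssp = (size_t)saved;
	return 0;
}
```

In this model a sigreturn issued while SSP points at an ordinary return
address fails, which is the SROP-hampering property the token is meant to
provide.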

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>

---

v2:
 - Switch to new shstk signal format

v1:
 - Use xsave helpers.
 - Expand commit log.

Yu-cheng v27:
 - Eliminate saving shadow stack pointer to signal context.

Yu-cheng v25:
 - Update commit log/comments for the sc_ext struct.
 - Use restorer address already calculated.
 - Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
 - Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
 - Eliminate writing to MSR_IA32_U_CET for shadow stack.
 - Change wrmsrl() to wrmsrl_safe() and handle error.

 arch/x86/ia32/ia32_signal.c |   1 +
 arch/x86/include/asm/cet.h  |   5 ++
 arch/x86/kernel/shstk.c     | 126 ++++++++++++++++++++++++++++++------
 arch/x86/kernel/signal.c    |  10 +++
 4 files changed, 123 insertions(+), 19 deletions(-)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index c9c3859322fa..88d71b9de616 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -34,6 +34,7 @@
 #include <asm/sigframe.h>
 #include <asm/sighandling.h>
 #include <asm/smap.h>
+#include <asm/cet.h>
 
 static inline void reload_segments(struct sigcontext_32 *sc)
 {
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 924de99e0c61..8c6fab9f402a 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct ksignal;
 
 struct thread_shstk {
 	u64	base;
@@ -22,6 +23,8 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
 void shstk_free(struct task_struct *p);
 int shstk_disable(void);
 void reset_thread_shstk(void);
+int setup_signal_shadow_stack(struct ksignal *ksig);
+int restore_signal_shadow_stack(void);
 #else
 static inline long cet_prctl(struct task_struct *task, int option,
 		      unsigned long features) { return -EINVAL; }
@@ -33,6 +36,8 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
 static inline void shstk_free(struct task_struct *p) {}
 static inline int shstk_disable(void) { return -EOPNOTSUPP; }
 static inline void reset_thread_shstk(void) {}
+static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
+static inline int restore_signal_shadow_stack(void) { return 0; }
 #endif /* CONFIG_X86_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 8904aef487bf..04442134aadd 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -227,41 +227,129 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
 }
 
 /*
- * Verify the user shadow stack has a valid token on it, and then set
- * *new_ssp according to the token.
+ * Create a restore token on shadow stack, and then push the user-mode
+ * function return address.
  */
-static int shstk_check_rstor_token(unsigned long *new_ssp)
+static int shstk_setup_rstor_token(unsigned long ret_addr, unsigned long *new_ssp)
 {
-	unsigned long token_addr;
-	unsigned long token;
+	unsigned long ssp, token_addr;
+	int err;
+
+	if (!ret_addr)
+		return -EINVAL;
+
+	ssp = get_user_shstk_addr();
+	if (!ssp)
+		return -EINVAL;
+
+	err = create_rstor_token(ssp, &token_addr);
+	if (err)
+		return err;
+
+	ssp = token_addr - sizeof(u64);
+	err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
+
+	if (!err)
+		*new_ssp = ssp;
+
+	return err;
+}
+
+static int shstk_push_sigframe(unsigned long *ssp)
+{
+	unsigned long target_ssp = *ssp;
+
+	/* Token must be aligned */
+	if (!IS_ALIGNED(*ssp, 8))
+		return -EINVAL;
 
-	token_addr = get_user_shstk_addr();
-	if (!token_addr)
+	if (!IS_ALIGNED(target_ssp, 8))
 		return -EINVAL;
 
-	if (get_user(token, (unsigned long __user *)token_addr))
+	*ssp -= SS_FRAME_SIZE;
+	if (put_shstk_data((void *__user)*ssp, target_ssp))
 		return -EFAULT;
 
-	/* Is mode flag correct? */
-	if (!(token & BIT(0)))
+	return 0;
+}
+
+
+static int shstk_pop_sigframe(unsigned long *ssp)
+{
+	unsigned long token_addr;
+	int err;
+
+	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
+	if (unlikely(err))
+		return err;
+
+	/* Restore SSP aligned? */
+	if (unlikely(!IS_ALIGNED(token_addr, 8)))
 		return -EINVAL;
 
-	/* Is busy flag set? */
-	if (token & BIT(1))
+	/* SSP in userspace? */
+	if (unlikely(token_addr >= TASK_SIZE_MAX))
 		return -EINVAL;
 
-	/* Mask out flags */
-	token &= ~3UL;
+	*ssp = token_addr;
+
+	return 0;
+}
+
+int setup_signal_shadow_stack(struct ksignal *ksig)
+{
+	void __user *restorer = ksig->ka.sa.sa_restorer;
+	unsigned long ssp;
+	int err;
 
-	/* Restore address aligned? */
-	if (!IS_ALIGNED(token, 8))
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+	    !feature_enabled(CET_SHSTK))
+		return 0;
+
+	if (!restorer)
 		return -EINVAL;
 
-	/* Token placed properly? */
-	if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
+	ssp = get_user_shstk_addr();
+	if (unlikely(!ssp))
+		return -EINVAL;
+
+	err = shstk_push_sigframe(&ssp);
+	if (unlikely(err))
+		return err;
+
+	/* Push restorer address */
+	ssp -= SS_FRAME_SIZE;
+	err = write_user_shstk_64((u64 __user *)ssp, (u64)restorer);
+	if (unlikely(err))
+		return -EFAULT;
+
+	fpu_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
+
+	return 0;
+}
+
+int restore_signal_shadow_stack(void)
+{
+	unsigned long ssp;
+	int err;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+	    !feature_enabled(CET_SHSTK))
+		return 0;
+
+	ssp = get_user_shstk_addr();
+	if (unlikely(!ssp))
 		return -EINVAL;
 
-	*new_ssp = token;
+	err = shstk_pop_sigframe(&ssp);
+	if (unlikely(err))
+		return err;
+
+	fpu_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
 
 	return 0;
 }
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9c7265b524c7..d2081305f698 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -47,6 +47,7 @@
 #include <asm/syscall.h>
 #include <asm/sigframe.h>
 #include <asm/signal.h>
+#include <asm/cet.h>
 
 #ifdef CONFIG_X86_64
 /*
@@ -472,6 +473,9 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig,
 	frame = get_sigframe(&ksig->ka, regs, sizeof(struct rt_sigframe), &fp);
 	uc_flags = frame_uc_flags(regs);
 
+	if (setup_signal_shadow_stack(ksig))
+		return -EFAULT;
+
 	if (!user_access_begin(frame, sizeof(*frame)))
 		return -EFAULT;
 
@@ -675,6 +679,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
 		goto badframe;
 
+	if (restore_signal_shadow_stack())
+		goto badframe;
+
 	if (restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
@@ -992,6 +999,9 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn)
 	if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
 		goto badframe;
 
+	if (restore_signal_shadow_stack())
+		goto badframe;
+
 	if (compat_restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
-- 
2.17.1



* [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (26 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 27/39] x86/cet/shstk: Handle signals for shadow stack Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 22:23   ` Kees Cook
  2022-10-10 11:13   ` Florian Weimer
  2022-09-29 22:29 ` [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace Rick Edgecombe
                   ` (11 subsequent siblings)
  39 siblings, 2 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads. However, in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace to allocate and
pivot to userspace-managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be set up
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace: how can it provision this special data without
allowing the shadow stack to be generally writable?

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a few downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
   ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
   restore tokens being written into the middle of pre-used shadow stacks.
   It is ideal to prevent restore tokens being added at arbitrary
   locations, so the check was to make sure the shadow stack had never been
   written to.
3. It stood out from the rest of the madvise flags, as more of a direct
   action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and set up new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
set up its own shadow stacks using the WRSS instruction. To this end,
provide a flag so that stacks can be optionally set up securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
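
Since no libc wrapper exists for the new syscall, the one-liner above
could be backed by something like the following sketch. The syscall
number (451) and flag value (0x1) are taken from this patch and are
assumptions about what ends up in installed headers; they may differ if
the numbers change before merging. shstk_map() is a hypothetical helper
name, not part of any library.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: values from this patch, used only if headers lack them. */
#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack	451
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN	0x1
#endif

/*
 * Hypothetical wrapper: returns the mapped address, or MAP_FAILED with
 * errno set by syscall() on failure.
 */
static void *shstk_map(void *addr, size_t size, unsigned int flags)
{
	long ret = syscall(__NR_map_shadow_stack, addr, size, flags);

	return ret == -1 ? MAP_FAILED : (void *)ret;
}
```

A caller would check the result against MAP_FAILED, for example after
shstk_map(NULL, stack_size, SHADOW_STACK_SET_TOKEN), and should expect
failure on kernels or CPUs without shadow stack support.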

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Change syscall to take address like mmap() for CRIU's usage

v1:
 - New patch (replaces PROT_SHADOW_STACK).

 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 arch/x86/include/uapi/asm/mman.h       |  2 ++
 arch/x86/kernel/shstk.c                | 48 +++++++++++++++++++++-----
 include/linux/syscalls.h               |  1 +
 include/uapi/asm-generic/unistd.h      |  2 +-
 kernel/sys_ni.c                        |  1 +
 6 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..d9639e3e0a33 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	map_shadow_stack	sys_map_shadow_stack
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 775dbd3aff73..c9fc57c88fcc 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -12,6 +12,8 @@
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
+#define SHADOW_STACK_SET_TOKEN	0x1	/* Set up a restore token in the shadow stack */
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 04442134aadd..873830d63adc 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -17,6 +17,7 @@
 #include <linux/compat.h>
 #include <linux/sizes.h>
 #include <linux/user.h>
+#include <linux/syscalls.h>
 #include <asm/msr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/types.h>
@@ -62,24 +63,34 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
 	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
 		return -EFAULT;
 
-	*token_addr = addr;
+	if (token_addr)
+		*token_addr = addr;
 
 	return 0;
 }
 
-static unsigned long alloc_shstk(unsigned long size)
+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
+				 unsigned long token_offset, bool set_res_tok)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
 	struct mm_struct *mm = current->mm;
-	unsigned long addr, unused;
+	unsigned long mapped_addr, unused;
 
 	mmap_write_lock(mm);
-	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
-		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
-
+	mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+			      VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 
-	return addr;
+	if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
+		goto out;
+
+	if (create_rstor_token(mapped_addr + token_offset, NULL)) {
+		vm_munmap(mapped_addr, size);
+		return -EINVAL;
+	}
+
+out:
+	return mapped_addr;
 }
 
 static void unmap_shadow_stack(u64 base, u64 size)
@@ -122,7 +133,7 @@ int shstk_setup(void)
 		return -EOPNOTSUPP;
 
 	size = PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, size, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -174,6 +185,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 
 
 	stack_size = PAGE_ALIGN(stack_size);
+	addr = alloc_shstk(0, stack_size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -395,6 +407,26 @@ int shstk_disable(void)
 	return 0;
 }
 
+
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+	unsigned long aligned_size;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return -ENOSYS;
+
+	/*
+	 * An overflow would result in attempting to write the restore token
+	 * to the wrong location. Not catastrophic, but just return the right
+	 * error code and block it.
+	 */
+	aligned_size = PAGE_ALIGN(size);
+	if (aligned_size < size)
+		return -EOVERFLOW;
+
+	return alloc_shstk(addr, aligned_size, size, flags & SHADOW_STACK_SET_TOKEN);
+}
+
 long cet_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_CET_LOCK) {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..3ae05cbdea5b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b12940ec5926 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..cb9aebd34646 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -381,6 +381,7 @@ COND_SYSCALL(vm86old);
 COND_SYSCALL(modify_ldt);
 COND_SYSCALL(vm86);
 COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);
 
 /* s390 */
 COND_SYSCALL(s390_pci_mmio_read);
-- 
2.17.1



* [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (27 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 22:28   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 30/39] x86: Expose thread features status in /proc/$PID/arch_status Rick Edgecombe
                   ` (10 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

For the current shadow stack implementation, shadow stack contents cannot
easily be arbitrarily provisioned with data. This property helps apps
protect themselves better, but also restricts any potential apps that may
want to do exotic things at the expense of a little security.

The x86 shadow stack feature introduces a new instruction, wrss, which
can be enabled to write directly to shadow stack permissioned memory from
userspace. Allow it to get enabled via the prctl interface.

Only enable the userspace wrss instruction, which allows writes to
userspace shadow stacks from userspace. Do not allow it to be enabled
independently of shadow stack, as HW does not support using WRSS when
shadow stack is disabled.

From a fault handler perspective, WRSS will behave very similarly to WRUSS,
which is treated like a user access from a #PF err code perspective.
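
The dependency between the two features can be modeled in plain C. This
is only a sketch of the rules stated above, not the kernel code: the
model_* names are hypothetical, the bit values mirror CET_SHSTK and
CET_WRSS from the uapi header in this series, and the CPU feature check
and MSR writes are left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirror of the uapi bits in this series (CET_SHSTK, CET_WRSS). */
#define MODEL_CET_SHSTK	0x1UL
#define MODEL_CET_WRSS	0x2UL

/* WRSS may only be toggled while shadow stack is enabled. */
static int model_wrss_control(unsigned long *features, bool enable)
{
	if (!(*features & MODEL_CET_SHSTK))
		return -EPERM;

	if (enable)
		*features |= MODEL_CET_WRSS;
	else
		*features &= ~MODEL_CET_WRSS;

	return 0;
}

/* Disabling shadow stack always takes WRSS down with it. */
static void model_shstk_disable(unsigned long *features)
{
	*features &= ~(MODEL_CET_SHSTK | MODEL_CET_WRSS);
}
```

The point of the model is that WRSS never remains set after shadow stack
is disabled, matching the HW restriction described above.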

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Add some commit log verbiage (from Dave Hansen)

v1:
 - New patch.

 arch/x86/include/asm/cet.h        |  2 ++
 arch/x86/include/uapi/asm/prctl.h |  1 +
 arch/x86/kernel/shstk.c           | 34 +++++++++++++++++++++++++++++--
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 8c6fab9f402a..edf681d4843a 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -25,6 +25,7 @@ int shstk_disable(void);
 void reset_thread_shstk(void);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
+int wrss_control(bool enable);
 #else
 static inline long cet_prctl(struct task_struct *task, int option,
 		      unsigned long features) { return -EINVAL; }
@@ -38,6 +39,7 @@ static inline int shstk_disable(void) { return -EOPNOTSUPP; }
 static inline void reset_thread_shstk(void) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
+static inline int wrss_control(bool enable) { return -EOPNOTSUPP; }
 #endif /* CONFIG_X86_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 41af3a8c4fa4..d811f0c5fc4f 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -27,5 +27,6 @@
 #define ARCH_CET_LOCK			0x4003
 
 #define CET_SHSTK			0x1
+#define CET_WRSS			0x2
 
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 873830d63adc..fc64a04366aa 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -386,6 +386,36 @@ void shstk_free(struct task_struct *tsk)
 	unmap_shadow_stack(shstk->base, shstk->size);
 }
 
+int wrss_control(bool enable)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return -EOPNOTSUPP;
+
+	/*
+	 * Only enable wrss if shadow stack is enabled. If shadow stack is not
+	 * enabled, wrss will already be disabled, so don't bother clearing it
+	 * when disabling.
+	 */
+	if (!feature_enabled(CET_SHSTK))
+		return -EPERM;
+
+	/* Already enabled/disabled? */
+	if (feature_enabled(CET_WRSS) == enable)
+		return 0;
+
+	fpu_lock_and_load();
+	if (enable) {
+		set_clr_bits_msrl(MSR_IA32_U_CET, CET_WRSS_EN, 0);
+		feature_set(CET_WRSS);
+	} else {
+		set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_WRSS_EN);
+		feature_clr(CET_WRSS);
+	}
+	fpregs_unlock();
+
+	return 0;
+}
+
 int shstk_disable(void)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
@@ -397,12 +427,12 @@ int shstk_disable(void)
 
 	fpu_lock_and_load();
 	/* Disable WRSS too when disabling shadow stack */
-	set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN);
+	set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN | CET_WRSS_EN);
 	wrmsrl(MSR_IA32_PL3_SSP, 0);
 	fpregs_unlock();
 
 	shstk_free(current);
-	feature_clr(CET_SHSTK);
+	feature_clr(CET_SHSTK | CET_WRSS);
 
 	return 0;
 }
-- 
2.17.1



* [PATCH v2 30/39] x86: Expose thread features status in /proc/$PID/arch_status
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (28 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 22:37   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 31/39] x86/cet/shstk: Wire in CET interface Rick Edgecombe
                   ` (9 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Applications and loaders can have logic to decide whether to enable CET.
They usually don't report whether CET has been enabled or not, so there
is no way to verify whether an application actually is protected by CET
features.

Add two lines in /proc/$PID/arch_status to report enabled and locked
features.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[Switched to CET, added to commit log]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - New patch

 arch/x86/kernel/Makefile     |  2 ++
 arch/x86/kernel/fpu/xstate.c | 47 ---------------------------
 arch/x86/kernel/proc.c       | 63 ++++++++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+), 47 deletions(-)
 create mode 100644 arch/x86/kernel/proc.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8950d1f71226..b87b8a0a3749 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -141,6 +141,8 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT)		+= sev.o
 
 obj-$(CONFIG_X86_SHADOW_STACK)		+= shstk.o
 
+obj-$(CONFIG_PROC_FS)			+= proc.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5e6a4867fd05..9258fc1169cc 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -10,8 +10,6 @@
 #include <linux/mman.h>
 #include <linux/nospec.h>
 #include <linux/pkeys.h>
-#include <linux/seq_file.h>
-#include <linux/proc_fs.h>
 #include <linux/vmalloc.h>
 
 #include <asm/fpu/api.h>
@@ -1746,48 +1744,3 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
 		return -EINVAL;
 	}
 }
-
-#ifdef CONFIG_PROC_PID_ARCH_STATUS
-/*
- * Report the amount of time elapsed in millisecond since last AVX512
- * use in the task.
- */
-static void avx512_status(struct seq_file *m, struct task_struct *task)
-{
-	unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
-	long delta;
-
-	if (!timestamp) {
-		/*
-		 * Report -1 if no AVX512 usage
-		 */
-		delta = -1;
-	} else {
-		delta = (long)(jiffies - timestamp);
-		/*
-		 * Cap to LONG_MAX if time difference > LONG_MAX
-		 */
-		if (delta < 0)
-			delta = LONG_MAX;
-		delta = jiffies_to_msecs(delta);
-	}
-
-	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
-	seq_putc(m, '\n');
-}
-
-/*
- * Report architecture specific information
- */
-int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
-			struct pid *pid, struct task_struct *task)
-{
-	/*
-	 * Report AVX512 state if the processor and build option supported.
-	 */
-	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
-		avx512_status(m, task);
-
-	return 0;
-}
-#endif /* CONFIG_PROC_PID_ARCH_STATUS */
diff --git a/arch/x86/kernel/proc.c b/arch/x86/kernel/proc.c
new file mode 100644
index 000000000000..fa9cbe13c298
--- /dev/null
+++ b/arch/x86/kernel/proc.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <uapi/asm/prctl.h>
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+	unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+	long delta;
+
+	if (!timestamp) {
+		/*
+		 * Report -1 if no AVX512 usage
+		 */
+		delta = -1;
+	} else {
+		delta = (long)(jiffies - timestamp);
+		/*
+		 * Cap to LONG_MAX if time difference > LONG_MAX
+		 */
+		if (delta < 0)
+			delta = LONG_MAX;
+		delta = jiffies_to_msecs(delta);
+	}
+
+	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+	seq_putc(m, '\n');
+}
+
+static void dump_features(struct seq_file *m, unsigned long features)
+{
+	if (features & CET_SHSTK)
+		seq_puts(m, "shstk ");
+	if (features & CET_WRSS)
+		seq_puts(m, "wrss ");
+}
+
+/*
+ * Report architecture specific information
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+			struct pid *pid, struct task_struct *task)
+{
+	/*
+	 * Report AVX512 state if the processor and build option supported.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+		avx512_status(m, task);
+
+	seq_puts(m, "Thread_features:\t");
+	dump_features(m, task->thread.features);
+	seq_putc(m, '\n');
+
+	seq_puts(m, "Thread_features_locked:\t");
+	dump_features(m, task->thread.features_locked);
+	seq_putc(m, '\n');
+
+	return 0;
+}
-- 
2.17.1



* [PATCH v2 31/39] x86/cet/shstk: Wire in CET interface
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (29 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 30/39] x86: Expose thread features status in /proc/$PID/arch_status Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 22:41   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 32/39] selftests/x86: Add shadow stack test Rick Edgecombe
                   ` (8 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

The kernel now has the main CET functionality to support applications.
Wire in the WRSS and shadow stack enable/disable functions into the
existing CET API skeleton.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Split from other patches

 arch/x86/kernel/shstk.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index fc64a04366aa..0efec02dbe6b 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -477,9 +477,17 @@ long cet_prctl(struct task_struct *task, int option, unsigned long features)
 		return -EINVAL;
 
 	if (option == ARCH_CET_DISABLE) {
+		if (features & CET_WRSS)
+			return wrss_control(false);
+		if (features & CET_SHSTK)
+			return shstk_disable();
 		return -EINVAL;
 	}
 
 	/* Handle ARCH_CET_ENABLE */
+	if (features & CET_SHSTK)
+		return shstk_setup();
+	if (features & CET_WRSS)
+		return wrss_control(true);
 	return -EINVAL;
 }
-- 
2.17.1



* [PATCH v2 32/39] selftests/x86: Add shadow stack test
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (30 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 31/39] x86/cet/shstk: Wire in CET interface Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 23:56   ` Kees Cook
  2022-09-29 22:29 ` [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs Rick Edgecombe
                   ` (7 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

Add a simple selftest for exercising some shadow stack behavior:
 - map_shadow_stack syscall and pivot
 - Faulting in shadow stack memory
 - Handling shadow stack violations
 - GUP of shadow stack memory
 - mprotect() of shadow stack memory
 - Userfaultfd on shadow stack memory

Since this test exercises a recently added syscall manually, it needs
to find the automatically created __NR_foo defines. Per the selftest
documentation, KHDR_INCLUDES can be used to help the selftest Makefiles
find the headers from the kernel source. This way the new selftest can
be built inside the kernel source tree without installing the headers
to the system. So also add KHDR_INCLUDES as described in the selftest
docs, to facilitate this.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Change print statements to align more closely with other selftests
 - Add more tests
 - Add KHDR_INCLUDES to Makefile

v1:
 - New patch.

 tools/testing/selftests/x86/Makefile          |   4 +-
 .../testing/selftests/x86/test_shadow_stack.c | 571 ++++++++++++++++++
 2 files changed, 573 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 0388c4d60af0..cfc8a26ad151 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
 TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
-			corrupt_xstate_header amx
+			corrupt_xstate_header amx test_shadow_stack
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
@@ -34,7 +34,7 @@ BINARIES_64 := $(TARGETS_C_64BIT_ALL:%=%_64)
 BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
 BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64))
 
-CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
+CFLAGS := -O2 -g -std=gnu99 -pthread -Wall $(KHDR_INCLUDES)
 
 # call32_from_64 in thunks.S uses absolute addresses.
 ifeq ($(CAN_BUILD_WITH_NOPIE),1)
diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
new file mode 100644
index 000000000000..249397736d0d
--- /dev/null
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -0,0 +1,571 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program tests basic kernel shadow stack support. It enables shadow
+ * stack manually via arch_prctl(), instead of relying on glibc. Its
+ * Makefile doesn't compile with shadow stack support, so it doesn't rely on
+ * any particular glibc. As a result it can't do any operations that require
+ * special glibc shadow stack support (longjmp(), swapcontext(), etc). Just
+ * stick to the basics and hope the compiler doesn't do anything strange.
+ */
+
+#define _GNU_SOURCE
+
+#include <sys/syscall.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <x86intrin.h>
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#include <stdint.h>
+#include <signal.h>
+#include <pthread.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+
+#define SS_SIZE 0x200000
+
+#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
+int main(int argc, char *argv[])
+{
+	printf("[SKIP]\tCompiler does not support CET.\n");
+	return 0;
+}
+#else
+void write_shstk(unsigned long *addr, unsigned long val)
+{
+	asm volatile("wrssq %[val], (%[addr])\n"
+		     : "+m" (*addr)
+		     : [addr] "r" (addr), [val] "r" (val));
+}
+
+static inline unsigned long __attribute__((always_inline)) get_ssp(void)
+{
+	unsigned long ret = 0;
+
+	asm volatile("xor %0, %0; rdsspq %0" : "=r" (ret));
+	return ret;
+}
+
+/*
+ * For use in inline enablement of shadow stack.
+ *
+ * The program can't return from the point where shadow stack gets enabled
+ * because there will be no address on the shadow stack. So it can't use
+ * syscall() for enablement, since it is a function.
+ *
+ * Based on code from nolibc.h. Keep a copy here because this can't pull in all
+ * of nolibc.h.
+ */
+#define ARCH_PRCTL(arg1, arg2)					\
+({								\
+	long _ret;						\
+	register long _num  asm("eax") = __NR_arch_prctl;	\
+	register long _arg1 asm("rdi") = (long)(arg1);		\
+	register long _arg2 asm("rsi") = (long)(arg2);		\
+								\
+	asm volatile (						\
+		"syscall\n"					\
+		: "=a"(_ret)					\
+		: "r"(_arg1), "r"(_arg2),			\
+		  "0"(_num)					\
+		: "rcx", "r11", "memory", "cc"			\
+	);							\
+	_ret;							\
+})
+
+void *create_shstk(void *addr)
+{
+	return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
+}
+
+void *create_normal_mem(void *addr)
+{
+	return mmap(addr, SS_SIZE, PROT_READ | PROT_WRITE,
+		    MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+}
+
+void free_shstk(void *shstk)
+{
+	munmap(shstk, SS_SIZE);
+}
+
+int reset_shstk(void *shstk)
+{
+	return madvise(shstk, SS_SIZE, MADV_DONTNEED);
+}
+
+void try_shstk(unsigned long new_ssp)
+{
+	unsigned long ssp;
+
+	printf("[INFO]\tnew_ssp = %lx, *new_ssp = %lx\n",
+		new_ssp, *((unsigned long *)new_ssp));
+
+	ssp = get_ssp();
+	printf("[INFO]\tchanging ssp from %lx to %lx\n", ssp, new_ssp);
+
+	asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
+	asm volatile("saveprevssp");
+	printf("[INFO]\tssp is now %lx\n", get_ssp());
+
+	/* Switch back to original shadow stack */
+	ssp -= 8;
+	asm volatile("rstorssp (%0)\n":: "r" (ssp));
+	asm volatile("saveprevssp");
+}
+
+int test_shstk_pivot(void)
+{
+	void *shstk = create_shstk(0);
+
+	if (shstk == MAP_FAILED) {
+		printf("[FAIL]\tError creating shadow stack: %d\n", errno);
+		return 1;
+	}
+	try_shstk((unsigned long)shstk + SS_SIZE - 8);
+	free_shstk(shstk);
+
+	printf("[OK]\tShadow stack pivot\n");
+	return 0;
+}
+
+int test_shstk_faults(void)
+{
+	unsigned long *shstk = create_shstk(0);
+
+	/* Read shadow stack, test if it's zero to not get read optimized out */
+	if (*shstk != 0)
+		goto err;
+
+	/* Wrss memory that was already read. */
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	/* Page out memory, so we can wrss it again. */
+	if (reset_shstk((void *)shstk))
+		goto err;
+
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	printf("[OK]\tShadow stack faults\n");
+	return 0;
+
+err:
+	return 1;
+}
+
+unsigned long saved_ssp;
+unsigned long saved_ssp_val;
+volatile bool segv_triggered;
+
+void __attribute__((noinline)) violate_ss(void)
+{
+	saved_ssp = get_ssp();
+	saved_ssp_val = *(unsigned long *)saved_ssp;
+
+	/* Corrupt shadow stack */
+	printf("[INFO]\tCorrupting shadow stack\n");
+	write_shstk((void *)saved_ssp, 0);
+}
+
+void segv_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tGenerated shadow stack violation successfully\n");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	write_shstk((void *)saved_ssp, saved_ssp_val);
+}
+
+int test_shstk_violation(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = segv_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* Make sure segv_triggered is set before violate_ss() */
+	asm volatile("" : : : "memory");
+
+	violate_ss();
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow stack violation test\n");
+
+	return !segv_triggered;
+}
+
+/* Gup test state */
+#define MAGIC_VAL 0x12345678
+bool is_shstk_access;
+void *shstk_ptr;
+int fd;
+
+void reset_test_shstk(void *addr)
+{
+	if (shstk_ptr != NULL)
+		free_shstk(shstk_ptr);
+	shstk_ptr = create_shstk(addr);
+}
+
+void test_access_fix_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tViolation from %s\n", is_shstk_access ? "shstk access" : "normal write");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	if (is_shstk_access) {
+		reset_test_shstk(shstk_ptr);
+		return;
+	}
+
+	free_shstk(shstk_ptr);
+	create_normal_mem(shstk_ptr);
+}
+
+bool test_shstk_access(void *ptr)
+{
+	is_shstk_access = true;
+	segv_triggered = false;
+	write_shstk(ptr, MAGIC_VAL);
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool test_write_access(void *ptr)
+{
+	is_shstk_access = false;
+	segv_triggered = false;
+	*(unsigned long *)ptr = MAGIC_VAL;
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool gup_write(void *ptr)
+{
+	unsigned long val;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (write(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+bool gup_read(void *ptr)
+{
+	unsigned long val;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (read(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+int test_gup(void)
+{
+	struct sigaction sa = {};
+	int status;
+	pid_t pid;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	fd = open("/proc/self/mem", O_RDWR);
+	if (fd == -1)
+		return 1;
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> write access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> write access success\n");
+
+	close(fd);
+
+	/* COW/gup test */
+	reset_test_shstk(0);
+	pid = fork();
+	if (!pid) {
+		fd = open("/proc/self/mem", O_RDWR);
+		if (fd == -1)
+			exit(1);
+
+		if (gup_write(shstk_ptr)) {
+			close(fd);
+			exit(1);
+		}
+		close(fd);
+		exit(0);
+	}
+	waitpid(pid, &status, 0);
+	if (WEXITSTATUS(status)) {
+		printf("[FAIL]\tWrite in child failed\n");
+		return 1;
+	}
+	if (*(unsigned long *)shstk_ptr == MAGIC_VAL) {
+		printf("[FAIL]\tWrite in child wrote through to shared memory\n");
+		return 1;
+	}
+
+	printf("[INFO]\tCow gup write -> write access success\n");
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow gup test\n");
+
+	return 0;
+}
+
+int test_mprotect(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* mprotect a shadow stack as read only */
+	reset_test_shstk(0);
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+		return 1;
+	}
+
+	/* try to wrss it and fail */
+	if (!test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to read-only memory succeeded\n");
+		return 1;
+	}
+
+	/* then back to writable */
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_WRITE | PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_WRITE) failed\n");
+		return 1;
+	}
+
+	/* then wrss it and succeed */
+	if (test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to mprotect() writable memory failed\n");
+		return 1;
+	}
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tmprotect() test\n");
+
+	return 0;
+}
+
+char zero[4096];
+
+static void *uffd_thread(void *arg)
+{
+	struct uffdio_copy req;
+	int uffd = *(int *)arg;
+	struct uffd_msg msg;
+
+	if (read(uffd, &msg, sizeof(msg)) <= 0)
+		return (void *)1;
+
+	req.dst = msg.arg.pagefault.address;
+	req.src = (__u64)zero;
+	req.len = 4096;
+	req.mode = 0;
+
+	if (ioctl(uffd, UFFDIO_COPY, &req))
+		return (void *)1;
+
+	return (void *)0;
+}
+
+int test_userfaultfd(void)
+{
+	struct uffdio_register uffdio_register;
+	struct uffdio_api uffdio_api;
+	struct sigaction sa = {};
+	pthread_t thread;
+	void *res;
+	int uffd;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0) {
+		printf("[SKIP]\tUserfaultfd unavailable.\n");
+		return 0;
+	}
+
+	reset_test_shstk(0);
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+		goto err;
+
+	uffdio_register.range.start = (__u64)shstk_ptr;
+	uffdio_register.range.len = 4096;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		goto err;
+
+	if (pthread_create(&thread, NULL, &uffd_thread, &uffd))
+		goto err;
+
+	test_shstk_access(shstk_ptr);
+
+	if (pthread_join(thread, &res))
+		goto err;
+
+	if (test_shstk_access(shstk_ptr))
+		goto err;
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tUserfaultfd test\n");
+	return !!res;
+err:
+	free_shstk(shstk_ptr);
+	close(uffd);
+	signal(SIGSEGV, SIG_DFL);
+	return 1;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = 0;
+
+	if (ARCH_PRCTL(ARCH_CET_ENABLE, CET_SHSTK)) {
+		printf("[SKIP]\tCould not enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_CET_DISABLE, CET_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	if (ARCH_PRCTL(ARCH_CET_ENABLE, CET_SHSTK)) {
+		printf("[SKIP]\tCould not re-enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_CET_ENABLE, CET_WRSS)) {
+		printf("[SKIP]\tCould not enable WRSS\n");
+		ret = 1;
+		goto out;
+	}
+
+	/* Should have succeeded if here, but this is a test, so double check. */
+	if (!get_ssp()) {
+		printf("[FAIL]\tShadow stack disabled\n");
+		return 1;
+	}
+
+	if (test_shstk_pivot()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack pivot\n");
+		goto out;
+	}
+
+	if (test_shstk_faults()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack fault test\n");
+		goto out;
+	}
+
+	if (test_shstk_violation()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack violation test\n");
+		goto out;
+	}
+
+	if (test_gup()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack gup test\n");
+	}
+
+	if (test_mprotect()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack mprotect test\n");
+	}
+
+	if (test_userfaultfd()) {
+		ret = 1;
+		printf("[FAIL]\tUserfaultfd test\n");
+	}
+
+out:
+	/*
+	 * Disable shadow stack before the function returns, or there will be a
+	 * shadow stack violation.
+	 */
+	if (ARCH_PRCTL(ARCH_CET_DISABLE, CET_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	return ret;
+}
+#endif
-- 
2.17.1



* [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (31 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 32/39] selftests/x86: Add shadow stack test Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 23:57   ` Kees Cook
  2022-09-29 22:29 ` [OPTIONAL/CLEANUP v2 34/39] x86: Separate out x86_regset for 32 and 64 bit Rick Edgecombe
                   ` (6 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

Shadow stack is supported on newer AMD processors, but the kernel
implementation has not been tested on them. Prevent basic issues from
showing up for normal users by disabling shadow stack on all CPUs except
Intel until it has been tested, at which point the limitation should be
removed.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v1:
 - New patch.

 arch/x86/kernel/cpu/common.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d7415bb556b2..f7cacc5698d5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -606,6 +606,14 @@ static __always_inline void setup_cet(struct cpuinfo_x86 *c)
 	if (!kernel_ibt && !user_shstk)
 		return;
 
+	/*
+	 * Shadow stack is supported on AMD processors, but has not been
+	 * tested. Only support it on Intel processors until this is done,
+	 * at which point this vendor check should be removed.
+	 */
+	if (c->x86_vendor != X86_VENDOR_INTEL)
+		setup_clear_cpu_cap(X86_FEATURE_SHSTK);
+
 	if (kernel_ibt)
 		msr = CET_ENDBR_EN;
 
-- 
2.17.1



* [OPTIONAL/CLEANUP v2 34/39] x86: Separate out x86_regset for 32 and 64 bit
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (32 preceding siblings ...)
  2022-09-29 22:29 ` [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-09-29 22:29 ` [OPTIONAL/CLEANUP v2 35/39] x86: Improve formatting of user_regset arrays Rick Edgecombe
                   ` (5 subsequent siblings)
  39 siblings, 0 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

In fill_thread_core_info() the ptrace accessible registers are collected
for a core file to be written out as notes. The note array is allocated
from a size calculated by iterating the user regset view, and counting the
regsets that have a non-zero core_note_type. However, this only works if
every regset with a zero core_note_type sits at the end of the regset
view. If there are any such gaps in the middle, fill_thread_core_info()
will overflow the note allocation, as it iterates over the size of the
view and the allocation would be smaller than that.
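
As an illustration only (not part of this patch), the overflow condition
can be modelled in a few lines of userspace C. The struct and function
names below are hypothetical, and the model assumes, as described above,
that the allocation counts only regsets with a non-zero core_note_type
while the notes are indexed by each regset's position in the view:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_regset: only the field that matters. */
struct model_regset {
	unsigned int core_note_type;	/* 0 means "no core dump note" */
};

/* Size of the note allocation: one slot per regset that produces a note. */
static size_t count_note_slots(const struct model_regset *sets, size_t n)
{
	size_t i, slots = 0;

	assert(sets != NULL);
	for (i = 0; i < n; i++)
		if (sets[i].core_note_type)
			slots++;
	return slots;
}

/* Slots required when each note is stored at its regset's view index. */
static size_t slots_needed_by_view_index(const struct model_regset *sets,
					 size_t n)
{
	size_t i, needed = 0;

	assert(sets != NULL);
	for (i = 0; i < n; i++)
		if (sets[i].core_note_type)
			needed = i + 1;
	return needed;
}

/* Non-zero if indexing by view position would run past the allocation. */
static int view_would_overflow(const struct model_regset *sets, size_t n)
{
	return slots_needed_by_view_index(sets, n) > count_note_slots(sets, n);
}
```

In this model, note types {1, 2, 0} need two slots and get two, while
{1, 0, 2} need three slots but only two are allocated.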

Apparently to avoid this problem, x86_32_regsets and x86_64_regsets need
to be constructed in a special way. They both draw their indices from a
shared enum x86_regset, but 32 bit and 64 bit don't all support the same
regsets and can be compiled in at the same time in the case of
IA32_EMULATION. So this enum has to be laid out in a special way such that
there are no gaps in either x86_32_regsets or x86_64_regsets. This
involves ordering them just right by creating aliases for enums that
are only in one view or the other, or creating multiple versions like
REGSET_IOPERM32/REGSET_IOPERM64.

So the collection of the registers tries to minimize the size of the
allocation, but it doesn’t quite work. Then the x86 ptrace side works
around it by constructing the enum just right to avoid a problem. In the
end there is no functional problem, but it is somewhat strange and
fragile.

It could also be improved like this [1], by better utilizing the smaller
array, but this still wastes space in the regset arrays if they are not
carefully crafted to avoid gaps. Instead, just fully separate out the
enums and give them separate 32 and 64 enum names. Add some bitsize-free
defines for REGSET_GENERAL and REGSET_FP since they are the only two
referred to in bitsize generic code.

This should have no functional change and is only changing how constants
are generated and referred to.

[1] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - New patch

 arch/x86/kernel/ptrace.c | 61 ++++++++++++++++++++++++++--------------
 1 file changed, 40 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 37c12fb92906..1a4df5fbc5e9 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -44,16 +44,35 @@
 
 #include "tls.h"
 
-enum x86_regset {
-	REGSET_GENERAL,
-	REGSET_FP,
-	REGSET_XFP,
-	REGSET_IOPERM64 = REGSET_XFP,
-	REGSET_XSTATE,
-	REGSET_TLS,
+enum x86_regset_32 {
+	REGSET_GENERAL32,
+	REGSET_FP32,
+	REGSET_XFP32,
+	REGSET_XSTATE32,
+	REGSET_TLS32,
 	REGSET_IOPERM32,
 };
 
+enum x86_regset_64 {
+	REGSET_GENERAL64,
+	REGSET_FP64,
+	REGSET_IOPERM64,
+	REGSET_XSTATE64,
+};
+
+#define REGSET_GENERAL \
+({ \
+	BUILD_BUG_ON((int)REGSET_GENERAL32 != (int)REGSET_GENERAL64); \
+	REGSET_GENERAL32; \
+})
+
+#define REGSET_FP \
+({ \
+	BUILD_BUG_ON((int)REGSET_FP32 != (int)REGSET_FP64); \
+	REGSET_FP32; \
+})
+
+
 struct pt_regs_offset {
 	const char *name;
 	int offset;
@@ -788,13 +807,13 @@ long arch_ptrace(struct task_struct *child, long request,
 #ifdef CONFIG_X86_32
 	case PTRACE_GETFPXREGS:	/* Get the child extended FPU state. */
 		return copy_regset_to_user(child, &user_x86_32_view,
-					   REGSET_XFP,
+					   REGSET_XFP32,
 					   0, sizeof(struct user_fxsr_struct),
 					   datap) ? -EIO : 0;
 
 	case PTRACE_SETFPXREGS:	/* Set the child extended FPU state. */
 		return copy_regset_from_user(child, &user_x86_32_view,
-					     REGSET_XFP,
+					     REGSET_XFP32,
 					     0, sizeof(struct user_fxsr_struct),
 					     datap) ? -EIO : 0;
 #endif
@@ -1086,13 +1105,13 @@ static long ia32_arch_ptrace(struct task_struct *child, compat_long_t request,
 
 	case PTRACE_GETFPXREGS:	/* Get the child extended FPU state. */
 		return copy_regset_to_user(child, &user_x86_32_view,
-					   REGSET_XFP, 0,
+					   REGSET_XFP32, 0,
 					   sizeof(struct user32_fxsr_struct),
 					   datap);
 
 	case PTRACE_SETFPXREGS:	/* Set the child extended FPU state. */
 		return copy_regset_from_user(child, &user_x86_32_view,
-					     REGSET_XFP, 0,
+					     REGSET_XFP32, 0,
 					     sizeof(struct user32_fxsr_struct),
 					     datap);
 
@@ -1215,19 +1234,19 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 #ifdef CONFIG_X86_64
 
 static struct user_regset x86_64_regsets[] __ro_after_init = {
-	[REGSET_GENERAL] = {
+	[REGSET_GENERAL64] = {
 		.core_note_type = NT_PRSTATUS,
 		.n = sizeof(struct user_regs_struct) / sizeof(long),
 		.size = sizeof(long), .align = sizeof(long),
 		.regset_get = genregs_get, .set = genregs_set
 	},
-	[REGSET_FP] = {
+	[REGSET_FP64] = {
 		.core_note_type = NT_PRFPREG,
 		.n = sizeof(struct fxregs_state) / sizeof(long),
 		.size = sizeof(long), .align = sizeof(long),
 		.active = regset_xregset_fpregs_active, .regset_get = xfpregs_get, .set = xfpregs_set
 	},
-	[REGSET_XSTATE] = {
+	[REGSET_XSTATE64] = {
 		.core_note_type = NT_X86_XSTATE,
 		.size = sizeof(u64), .align = sizeof(u64),
 		.active = xstateregs_active, .regset_get = xstateregs_get,
@@ -1256,31 +1275,31 @@ static const struct user_regset_view user_x86_64_view = {
 
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
 static struct user_regset x86_32_regsets[] __ro_after_init = {
-	[REGSET_GENERAL] = {
+	[REGSET_GENERAL32] = {
 		.core_note_type = NT_PRSTATUS,
 		.n = sizeof(struct user_regs_struct32) / sizeof(u32),
 		.size = sizeof(u32), .align = sizeof(u32),
 		.regset_get = genregs32_get, .set = genregs32_set
 	},
-	[REGSET_FP] = {
+	[REGSET_FP32] = {
 		.core_note_type = NT_PRFPREG,
 		.n = sizeof(struct user_i387_ia32_struct) / sizeof(u32),
 		.size = sizeof(u32), .align = sizeof(u32),
 		.active = regset_fpregs_active, .regset_get = fpregs_get, .set = fpregs_set
 	},
-	[REGSET_XFP] = {
+	[REGSET_XFP32] = {
 		.core_note_type = NT_PRXFPREG,
 		.n = sizeof(struct fxregs_state) / sizeof(u32),
 		.size = sizeof(u32), .align = sizeof(u32),
 		.active = regset_xregset_fpregs_active, .regset_get = xfpregs_get, .set = xfpregs_set
 	},
-	[REGSET_XSTATE] = {
+	[REGSET_XSTATE32] = {
 		.core_note_type = NT_X86_XSTATE,
 		.size = sizeof(u64), .align = sizeof(u64),
 		.active = xstateregs_active, .regset_get = xstateregs_get,
 		.set = xstateregs_set
 	},
-	[REGSET_TLS] = {
+	[REGSET_TLS32] = {
 		.core_note_type = NT_386_TLS,
 		.n = GDT_ENTRY_TLS_ENTRIES, .bias = GDT_ENTRY_TLS_MIN,
 		.size = sizeof(struct user_desc),
@@ -1311,10 +1330,10 @@ u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 void __init update_regset_xstate_info(unsigned int size, u64 xstate_mask)
 {
 #ifdef CONFIG_X86_64
-	x86_64_regsets[REGSET_XSTATE].n = size / sizeof(u64);
+	x86_64_regsets[REGSET_XSTATE64].n = size / sizeof(u64);
 #endif
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
-	x86_32_regsets[REGSET_XSTATE].n = size / sizeof(u64);
+	x86_32_regsets[REGSET_XSTATE32].n = size / sizeof(u64);
 #endif
 	xstate_fx_sw_bytes[USER_XSTATE_XCR0_WORD] = xstate_mask;
 }
-- 
2.17.1



* [OPTIONAL/CLEANUP v2 35/39] x86: Improve formatting of user_regset arrays
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (33 preceding siblings ...)
  2022-09-29 22:29 ` [OPTIONAL/CLEANUP v2 34/39] x86: Separate out x86_regset for 32 and 64 bit Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 36/39] x86/fpu: Add helper for initing features Rick Edgecombe
                   ` (4 subsequent siblings)
  39 siblings, 0 replies; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

Back in 2018, Ingo Molnar suggested[0] improving the formatting of the
struct user_regset arrays. They have multiple member initializations per
line and some lines exceed 100 chars. Reformat them like he suggested.

[0] https://lore.kernel.org/lkml/20180711102035.GB8574@gmail.com/

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - New patch

 arch/x86/kernel/ptrace.c | 107 ++++++++++++++++++++++++---------------
 1 file changed, 65 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 1a4df5fbc5e9..eed8a65d335d 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1235,28 +1235,37 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 
 static struct user_regset x86_64_regsets[] __ro_after_init = {
 	[REGSET_GENERAL64] = {
-		.core_note_type = NT_PRSTATUS,
-		.n = sizeof(struct user_regs_struct) / sizeof(long),
-		.size = sizeof(long), .align = sizeof(long),
-		.regset_get = genregs_get, .set = genregs_set
+		.core_note_type	= NT_PRSTATUS,
+		.n		= sizeof(struct user_regs_struct) / sizeof(long),
+		.size		= sizeof(long),
+		.align		= sizeof(long),
+		.regset_get	= genregs_get,
+		.set		= genregs_set
 	},
 	[REGSET_FP64] = {
-		.core_note_type = NT_PRFPREG,
-		.n = sizeof(struct fxregs_state) / sizeof(long),
-		.size = sizeof(long), .align = sizeof(long),
-		.active = regset_xregset_fpregs_active, .regset_get = xfpregs_get, .set = xfpregs_set
+		.core_note_type	= NT_PRFPREG,
+		.n		= sizeof(struct fxregs_state) / sizeof(long),
+		.size		= sizeof(long),
+		.align		= sizeof(long),
+		.active		= regset_xregset_fpregs_active,
+		.regset_get	= xfpregs_get,
+		.set		= xfpregs_set
 	},
 	[REGSET_XSTATE64] = {
-		.core_note_type = NT_X86_XSTATE,
-		.size = sizeof(u64), .align = sizeof(u64),
-		.active = xstateregs_active, .regset_get = xstateregs_get,
-		.set = xstateregs_set
+		.core_note_type	= NT_X86_XSTATE,
+		.size		= sizeof(u64),
+		.align		= sizeof(u64),
+		.active		= xstateregs_active,
+		.regset_get	= xstateregs_get,
+		.set		= xstateregs_set
 	},
 	[REGSET_IOPERM64] = {
-		.core_note_type = NT_386_IOPERM,
-		.n = IO_BITMAP_LONGS,
-		.size = sizeof(long), .align = sizeof(long),
-		.active = ioperm_active, .regset_get = ioperm_get
+		.core_note_type	= NT_386_IOPERM,
+		.n		= IO_BITMAP_LONGS,
+		.size		= sizeof(long),
+		.align		= sizeof(long),
+		.active		= ioperm_active,
+		.regset_get	= ioperm_get
 	},
 };
 
@@ -1276,42 +1285,56 @@ static const struct user_regset_view user_x86_64_view = {
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
 static struct user_regset x86_32_regsets[] __ro_after_init = {
 	[REGSET_GENERAL32] = {
-		.core_note_type = NT_PRSTATUS,
-		.n = sizeof(struct user_regs_struct32) / sizeof(u32),
-		.size = sizeof(u32), .align = sizeof(u32),
-		.regset_get = genregs32_get, .set = genregs32_set
+		.core_note_type	= NT_PRSTATUS,
+		.n		= sizeof(struct user_regs_struct32) / sizeof(u32),
+		.size		= sizeof(u32),
+		.align		= sizeof(u32),
+		.regset_get	= genregs32_get,
+		.set		= genregs32_set
 	},
 	[REGSET_FP32] = {
-		.core_note_type = NT_PRFPREG,
-		.n = sizeof(struct user_i387_ia32_struct) / sizeof(u32),
-		.size = sizeof(u32), .align = sizeof(u32),
-		.active = regset_fpregs_active, .regset_get = fpregs_get, .set = fpregs_set
+		.core_note_type	= NT_PRFPREG,
+		.n		= sizeof(struct user_i387_ia32_struct) / sizeof(u32),
+		.size		= sizeof(u32),
+		.align		= sizeof(u32),
+		.active		= regset_fpregs_active,
+		.regset_get	= fpregs_get,
+		.set		= fpregs_set
 	},
 	[REGSET_XFP32] = {
-		.core_note_type = NT_PRXFPREG,
-		.n = sizeof(struct fxregs_state) / sizeof(u32),
-		.size = sizeof(u32), .align = sizeof(u32),
-		.active = regset_xregset_fpregs_active, .regset_get = xfpregs_get, .set = xfpregs_set
+		.core_note_type	= NT_PRXFPREG,
+		.n		= sizeof(struct fxregs_state) / sizeof(u32),
+		.size		= sizeof(u32),
+		.align		= sizeof(u32),
+		.active		= regset_xregset_fpregs_active,
+		.regset_get	= xfpregs_get,
+		.set		= xfpregs_set
 	},
 	[REGSET_XSTATE32] = {
-		.core_note_type = NT_X86_XSTATE,
-		.size = sizeof(u64), .align = sizeof(u64),
-		.active = xstateregs_active, .regset_get = xstateregs_get,
-		.set = xstateregs_set
+		.core_note_type	= NT_X86_XSTATE,
+		.size		= sizeof(u64),
+		.align		= sizeof(u64),
+		.active		= xstateregs_active,
+		.regset_get	= xstateregs_get,
+		.set		= xstateregs_set
 	},
 	[REGSET_TLS32] = {
-		.core_note_type = NT_386_TLS,
-		.n = GDT_ENTRY_TLS_ENTRIES, .bias = GDT_ENTRY_TLS_MIN,
-		.size = sizeof(struct user_desc),
-		.align = sizeof(struct user_desc),
-		.active = regset_tls_active,
-		.regset_get = regset_tls_get, .set = regset_tls_set
+		.core_note_type	= NT_386_TLS,
+		.n		= GDT_ENTRY_TLS_ENTRIES,
+		.bias		= GDT_ENTRY_TLS_MIN,
+		.size		= sizeof(struct user_desc),
+		.align		= sizeof(struct user_desc),
+		.active		= regset_tls_active,
+		.regset_get	= regset_tls_get,
+		.set		= regset_tls_set
 	},
 	[REGSET_IOPERM32] = {
-		.core_note_type = NT_386_IOPERM,
-		.n = IO_BITMAP_BYTES / sizeof(u32),
-		.size = sizeof(u32), .align = sizeof(u32),
-		.active = ioperm_active, .regset_get = ioperm_get
+		.core_note_type	= NT_386_IOPERM,
+		.n		= IO_BITMAP_BYTES / sizeof(u32),
+		.size		= sizeof(u32),
+		.align		= sizeof(u32),
+		.active		= ioperm_active,
+		.regset_get	= ioperm_get
 	},
 };
 
-- 
2.17.1



* [OPTIONAL/RFC v2 36/39] x86/fpu: Add helper for initing features
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (34 preceding siblings ...)
  2022-09-29 22:29 ` [OPTIONAL/CLEANUP v2 35/39] x86: Improve formatting of user_regset arrays Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 19:07   ` Chang S. Bae
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 37/39] x86/cet: Add PTRACE interface for CET Rick Edgecombe
                   ` (3 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

If an xfeature is saved in a buffer, the xfeature's bit will be set in
xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
is in its init state. In this case the xfeature buffer address cannot
be retrieved with get_xsave_addr().

Future patches will need to handle the case of writing to an xfeature
that may not be saved. So provide helpers to init an xfeature in an
xsave buffer.

This could of course be done directly by reaching into the xsave buffer,
but that would not be robust against future changes that optimize the
xsave buffer by compacting it. In that case the xsave buffer would need
to be re-arranged as well. So the logic belongs in a helper, where it
can be unified.
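
As a rough illustration of the bookkeeping described above, here is a
minimal userspace sketch. It is not the kernel code: struct
mock_xsave_header and the mock_* helpers are hypothetical stand-ins, and
only the header bitmap is modeled (no feature data, no compaction).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the xsave header; only the bitmap is modeled. */
struct mock_xsave_header {
	uint64_t xfeatures;	/* bit N set => xfeature N is saved in the buffer */
};

/* Same idea as xfeature_saved(): is the feature's bit set in the header? */
static int mock_xfeature_saved(const struct mock_xsave_header *hdr, int nr)
{
	assert(nr >= 0 && nr < 64);
	return (int)((hdr->xfeatures >> nr) & 1);
}

/* Same idea as init_xfeature(): mark the feature as present in the buffer. */
static void mock_init_xfeature(struct mock_xsave_header *hdr, int nr)
{
	assert(nr >= 0 && nr < 64);
	hdr->xfeatures |= (uint64_t)1 << nr;
}
```

A caller would check mock_xfeature_saved() first and call
mock_init_xfeature() only when it is about to overwrite the feature's
data, since setting the bit alone says nothing about the contents.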

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - New patch

 arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++-------
 arch/x86/kernel/fpu/xstate.h |  6 ++++
 2 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9258fc1169cc..82cee1f2f0c8 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -942,6 +942,24 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
 }
 
+static int xsave_buffer_access_checks(int xfeature_nr)
+{
+	/*
+	 * Do we even *have* xsave state?
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVE))
+		return 1;
+
+	/*
+	 * We should not ever be requesting features that we
+	 * have not enabled.
+	 */
+	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+		return 1;
+
+	return 0;
+}
+
 /*
  * Given the xsave area and a state inside, this function returns the
  * address of the state.
@@ -962,17 +980,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
  */
 void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 {
-	/*
-	 * Do we even *have* xsave state?
-	 */
-	if (!boot_cpu_has(X86_FEATURE_XSAVE))
-		return NULL;
-
-	/*
-	 * We should not ever be requesting features that we
-	 * have not enabled.
-	 */
-	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+	if (xsave_buffer_access_checks(xfeature_nr))
 		return NULL;
 
 	/*
@@ -992,6 +1000,34 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	return __raw_xsave_addr(xsave, xfeature_nr);
 }
 
+/*
+ * Given the xsave area and a state inside, this function
+ * initializes an xfeature in the buffer.
+ *
+ * get_xsave_addr() will return NULL if the feature bit is
+ * not present in the header. This function will make it so
+ * the xfeature buffer address is ready to be retrieved by
+ * get_xsave_addr().
+ *
+ * Inputs:
+ *	xsave: the thread's storage area for all FPU data
+ *	xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
+ *	XFEATURE_SSE, etc...)
+ * Output:
+ *	1 if the feature cannot be inited, 0 on success
+ */
+int init_xfeature(struct xregs_state *xsave, int xfeature_nr)
+{
+	if (xsave_buffer_access_checks(xfeature_nr))
+		return 1;
+
+	/*
+	 * Mark the feature inited.
+	 */
+	xsave->header.xfeatures |= BIT_ULL(xfeature_nr);
+	return 0;
+}
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 
 /*
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 5ad47031383b..fb8aae678e9f 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -54,6 +54,12 @@ extern void fpu__init_cpu_xstate(void);
 extern void fpu__init_system_xstate(unsigned int legacy_size);
 
 extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+extern int init_xfeature(struct xregs_state *xsave, int xfeature_nr);
+
+static inline int xfeature_saved(struct xregs_state *xsave, int xfeature_nr)
+{
+	return xsave->header.xfeatures & BIT_ULL(xfeature_nr);
+}
 
 static inline u64 xfeatures_mask_supervisor(void)
 {
-- 
2.17.1


* [OPTIONAL/RFC v2 37/39] x86/cet: Add PTRACE interface for CET
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (35 preceding siblings ...)
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 36/39] x86/fpu: Add helper for initing features Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 23:59   ` Kees Cook
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 38/39] x86/cet/shstk: Add ARCH_CET_UNLOCK Rick Edgecombe
                   ` (2 subsequent siblings)
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Some applications (like GDB and CRIU) would like to tweak CET state via
ptrace. This allows for existing functionality to continue to work for
seized CET applications. Provide an interface based on the xsave buffer
format of CET, but filter unneeded states to make the kernel’s job
easier.

There is already ptrace functionality for accessing xstate, but this
does not include supervisor xfeatures. So there is not a completely
clear place for where to put the CET state. Adding it to the user
xfeatures regset would complicate that code, as it currently shares
logic with signals which should not have supervisor features.

Don’t add a general supervisor xfeature regset like the user one,
because it is better to maintain flexibility for other supervisor
xfeatures to define their own interface. For example, an xfeature may
decide not to expose all of its state to userspace. A lot of enum
values remain to be used, so just put it in a dedicated CET regset.

The only downside to not having a generic supervisor xfeature regset,
is that apps need to be enlightened of any new supervisor xfeature
exposed this way (i.e. they can’t try to have generic save/restore
logic). But maybe that is a good thing, because they have to think
through each new xfeature instead of encountering issues when a new
supervisor xfeature is added.

By adding a CET regset, it also has the effect of including the CET state
in a core dump, which could be useful for debugging.

Inside the CET regset setter, filter out invalid state. Today this
includes states disallowed by the HW and states involving Indirect Branch
Tracking, which the kernel does not currently support for userspace.

So this leaves three pieces of data that can be set: shadow stack
enablement, WRSS enablement and the shadow stack pointer. It is worth
noting that this is separate from enabling shadow stack via the
arch_prctl()s. Enabling shadow stack involves more than just flipping the
bit. The kernel is made aware that it has to do extra things when cloning
or handling signals. That logic is triggered off of separate feature
enablement state kept in the task struct. So flipping on HW shadow
stack enforcement without notifying the kernel to change its behavior
would severely limit what an application could do without crashing. Since
there is likely no use for this, only allow the CET registers to be set
if shadow stack is already enabled via the arch_prctl()s. This will let
apps like GDB toggle shadow stack enforcement for apps that already have
shadow stack enabled, and minimize scenarios the kernel has to worry
about.
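
The filtering described above can be sketched as a pure function. This
is a simplified userspace illustration, not the kernel setter: the
SKETCH_* names are hypothetical, the bit positions for shadow stack and
WRSS enablement (bits 0 and 1) and the address limit are assumptions,
and every other bit is rejected outright rather than being split into
reserved and IBT masks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions for the two settable enable bits. */
#define SKETCH_CET_SHSTK_EN	(UINT64_C(1) << 0)
#define SKETCH_CET_WRSS_EN	(UINT64_C(1) << 1)
/* Assumed upper bound for user addresses (stand-in for TASK_SIZE_MAX). */
#define SKETCH_TASK_SIZE_MAX	UINT64_C(0x00007ffffffff000)

/* Return true if the requested CET state would be accepted. */
static bool sketch_cet_state_ok(uint64_t user_cet, uint64_t user_ssp, bool ia32)
{
	uint64_t align = ia32 ? 4 : 8;

	/* The shadow stack pointer must be a user address... */
	if (user_ssp >= SKETCH_TASK_SIZE_MAX)
		return false;
	/* ...aligned to the word size of the traced task. */
	if (user_ssp & (align - 1))
		return false;
	/* Only shadow stack and WRSS enablement may be set. */
	if (user_cet & ~(SKETCH_CET_SHSTK_EN | SKETCH_CET_WRSS_EN))
		return false;
	return true;
}
```

In the patch the same kind of check runs before the xsave buffer is
touched, so a rejected request leaves the saved state unchanged.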

Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

---

v2:
 - Check alignment on ssp.
 - Block IBT bits.
 - Handle init states instead of returning error.
 - Add verbose commit log justifying the design.

Yu-Cheng v12:
 - Return -ENODEV when CET registers are in INIT state.
 - Check reserved/unsupported bits from user input.

 arch/x86/include/asm/fpu/regset.h |  7 ++-
 arch/x86/include/asm/msr-index.h  |  5 ++
 arch/x86/kernel/fpu/regset.c      | 95 +++++++++++++++++++++++++++++++
 arch/x86/kernel/ptrace.c          | 20 +++++++
 include/uapi/linux/elf.h          |  1 +
 5 files changed, 125 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/regset.h b/arch/x86/include/asm/fpu/regset.h
index 4f928d6a367b..8622184d87f5 100644
--- a/arch/x86/include/asm/fpu/regset.h
+++ b/arch/x86/include/asm/fpu/regset.h
@@ -7,11 +7,12 @@
 
 #include <linux/regset.h>
 
-extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active;
+extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active,
+				cetregs_active;
 extern user_regset_get2_fn fpregs_get, xfpregs_get, fpregs_soft_get,
-				 xstateregs_get;
+				 xstateregs_get, cetregs_get;
 extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
-				 xstateregs_set;
+				 xstateregs_set, cetregs_set;
 
 /*
  * xstateregs_active == regset_fpregs_active. Please refer to the comment
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 6674bdb096f3..fbc319682664 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -432,6 +432,11 @@
 #define CET_RESERVED			(BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
 #define CET_SUPPRESS			BIT_ULL(10)
 #define CET_WAIT_ENDBR			BIT_ULL(11)
+#define CET_EG_LEG_BITMAP_BASE_MASK	GENMASK_ULL(63, 13)
+
+#define CET_U_IBT_MASK			(CET_ENDBR_EN | CET_LEG_IW_EN | CET_NO_TRACK_EN | \
+					 CET_SUPPRESS_DISABLE | CET_SUPPRESS | \
+					 CET_WAIT_ENDBR | CET_EG_LEG_BITMAP_BASE_MASK)
 
 #define MSR_IA32_PL0_SSP		0x000006a4 /* ring-0 shadow stack pointer */
 #define MSR_IA32_PL1_SSP		0x000006a5 /* ring-1 shadow stack pointer */
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 75ffaef8c299..440dc1921ee4 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -174,6 +174,101 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	return ret;
 }
 
+int cetregs_active(struct task_struct *target, const struct user_regset *regset)
+{
+#ifdef CONFIG_X86_SHADOW_STACK
+	if (target->thread.shstk.size)
+		return regset->n;
+#endif
+	return 0;
+}
+
+int cetregs_get(struct task_struct *target, const struct user_regset *regset,
+		struct membuf to)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct cet_user_state *cetregs;
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		return -ENODEV;
+
+	sync_fpstate(fpu);
+	cetregs = get_xsave_addr(&fpu->fpstate->regs.xsave, XFEATURE_CET_USER);
+	if (!cetregs) {
+		/*
+		 * The registers are in the init state. The init values for
+		 * these regs are zero, so just zero the output buffer.
+		 */
+		membuf_zero(&to, sizeof(struct cet_user_state));
+		return 0;
+	}
+
+	return membuf_write(&to, cetregs, sizeof(struct cet_user_state));
+}
+
+int cetregs_set(struct task_struct *target, const struct user_regset *regset,
+		  unsigned int pos, unsigned int count,
+		  const void *kbuf, const void __user *ubuf)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct xregs_state *xsave = &fpu->fpstate->regs.xsave;
+	struct cet_user_state *cetregs, tmp;
+	bool ia32;
+	int r;
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK) ||
+	    !cetregs_active(target, regset))
+		return -ENODEV;
+
+	ia32 = IS_ENABLED(CONFIG_IA32_EMULATION) &&
+	       target->thread_info.status & TS_COMPAT;
+
+	r = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &tmp, 0, -1);
+	if (r)
+		return r;
+
+	/*
+	 * Some kernel instructions (IRET, etc) can cause exceptions in the case
+	 * of disallowed CET register values. Just prevent invalid values.
+	 */
+	if ((tmp.user_ssp >= TASK_SIZE_MAX) ||
+	    (ia32 && !IS_ALIGNED(tmp.user_ssp, 4)) ||
+	    (!ia32 && !IS_ALIGNED(tmp.user_ssp, 8)))
+		return -EINVAL;
+
+	/*
+	 * Don't allow any IBT bits to be set because it is not supported by
+	 * the kernel yet. Also don't allow reserved bits.
+	 */
+	if ((tmp.user_cet & CET_RESERVED) || (tmp.user_cet & CET_U_IBT_MASK))
+		return -EINVAL;
+
+	fpu_force_restore(fpu);
+
+	/*
+	 * Don't want to init the xfeature until the kernel will definitely
+	 * overwrite it, otherwise if it inits and then fails out, it would
+	 * end up initing it to random data.
+	 */
+	if (!xfeature_saved(xsave, XFEATURE_CET_USER) &&
+	    WARN_ON(init_xfeature(xsave, XFEATURE_CET_USER)))
+		return -ENODEV;
+
+	cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
+	if (WARN_ON(!cetregs)) {
+		/*
+		 * This shouldn't ever be NULL because it was successfully
+		 * inited above if needed. The only scenario would be if an
+		 * xfeature was somehow saved in a buffer, but not enabled in
+		 * xsave.
+		 */
+		return -ENODEV;
+	}
+
+	memmove(cetregs, &tmp, sizeof(tmp));
+	return 0;
+}
+
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
 
 /*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index eed8a65d335d..f9e6635b69ce 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -51,6 +51,7 @@ enum x86_regset_32 {
 	REGSET_XSTATE32,
 	REGSET_TLS32,
 	REGSET_IOPERM32,
+	REGSET_CET32,
 };
 
 enum x86_regset_64 {
@@ -58,6 +59,7 @@ enum x86_regset_64 {
 	REGSET_FP64,
 	REGSET_IOPERM64,
 	REGSET_XSTATE64,
+	REGSET_CET64,
 };
 
 #define REGSET_GENERAL \
@@ -1267,6 +1269,15 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
 		.active		= ioperm_active,
 		.regset_get	= ioperm_get
 	},
+	[REGSET_CET64] = {
+		.core_note_type	= NT_X86_CET,
+		.n		= sizeof(struct cet_user_state) / sizeof(u64),
+		.size		= sizeof(u64),
+		.align		= sizeof(u64),
+		.active		= cetregs_active,
+		.regset_get	= cetregs_get,
+		.set		= cetregs_set
+	},
 };
 
 static const struct user_regset_view user_x86_64_view = {
@@ -1336,6 +1347,15 @@ static struct user_regset x86_32_regsets[] __ro_after_init = {
 		.active		= ioperm_active,
 		.regset_get	= ioperm_get
 	},
+	[REGSET_CET32] = {
+		.core_note_type = NT_X86_CET,
+		.n		= sizeof(struct cet_user_state) / sizeof(u64),
+		.size		= sizeof(u64),
+		.align		= sizeof(u64),
+		.active		= cetregs_active,
+		.regset_get	= cetregs_get,
+		.set		= cetregs_set
+	},
 };
 
 static const struct user_regset_view user_x86_32_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index c7b056af9ef0..11089731e2e9 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -406,6 +406,7 @@ typedef struct elf64_shdr {
 #define NT_386_TLS	0x200		/* i386 TLS slots (struct user_desc) */
 #define NT_386_IOPERM	0x201		/* x86 io permission bitmap (1=deny) */
 #define NT_X86_XSTATE	0x202		/* x86 extended state using xsave */
+#define NT_X86_CET	0x203		/* x86 CET state */
 #define NT_S390_HIGH_GPRS	0x300	/* s390 upper register halves */
 #define NT_S390_TIMER	0x301		/* s390 timer register */
 #define NT_S390_TODCMP	0x302		/* s390 TOD clock comparator register */
-- 
2.17.1


* [OPTIONAL/RFC v2 38/39] x86/cet/shstk: Add ARCH_CET_UNLOCK
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (36 preceding siblings ...)
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 37/39] x86/cet: Add PTRACE interface for CET Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-04  0:00   ` Kees Cook
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 39/39] x86: Add alt shadow stack support Rick Edgecombe
  2022-10-03 17:04 ` [PATCH v2 00/39] Shadowstacks for userspace Kees Cook
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

Userspace loaders may lock features before a CRIU restore operation has
the chance to set them to whatever state is required by the process
being restored. Allow a way for CRIU to unlock features. Add it as an
arch_prctl() like the other CET operations, but restrict it to being
called via the ptrace arch_prctl() interface.
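
The lock/unlock bookkeeping can be pictured with a small userspace
sketch. The names below are hypothetical, and modeling the lock
operation as a bitwise OR is an assumption; only the unlock step
(clearing the requested bits) and the "locked features cannot be
changed" check are taken from the diff.

```c
#include <stdint.h>

#define SKETCH_CET_SHSTK	0x1
#define SKETCH_CET_WRSS		0x2

/* Hypothetical stand-in for the per-thread lock state. */
struct sketch_thread {
	uint64_t features_locked;
};

/* Assumed lock behavior: remember the features as locked. */
static void sketch_lock(struct sketch_thread *t, uint64_t features)
{
	t->features_locked |= features;
}

/* Unlock: clear the requested bits, leaving other locks in place. */
static void sketch_unlock(struct sketch_thread *t, uint64_t features)
{
	t->features_locked &= ~features;
}

/* Enable/disable requests are refused while any requested bit is locked. */
static int sketch_may_change(const struct sketch_thread *t, uint64_t features)
{
	return !(features & t->features_locked);
}
```

In this picture a restorer would unlock only the features it needs to
change, set them, and leave the remaining locks in place.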

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
[Merged into recent API changes, added commit log and docs]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v2:
 - New patch

 Documentation/x86/cet.rst         | 3 +++
 arch/x86/include/uapi/asm/prctl.h | 1 +
 arch/x86/kernel/process_64.c      | 1 +
 arch/x86/kernel/shstk.c           | 9 +++++++--
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/cet.rst b/Documentation/x86/cet.rst
index 4a0dfb6830f9..6b270a24ebc3 100644
--- a/Documentation/x86/cet.rst
+++ b/Documentation/x86/cet.rst
@@ -81,6 +81,9 @@ arch_prctl(ARCH_CET_DISABLE, unsigned int feature)
 arch_prctl(ARCH_CET_LOCK, unsigned int features)
     Lock in features at their current enabled or disabled status.
 
+arch_prctl(ARCH_CET_UNLOCK, unsigned int features)
+    Unlock features.
+
 The return values are as following:
     On success, return 0. On error, errno can be::
 
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index d811f0c5fc4f..2f4d81ab4849 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -25,6 +25,7 @@
 #define ARCH_CET_ENABLE			0x4001
 #define ARCH_CET_DISABLE		0x4002
 #define ARCH_CET_LOCK			0x4003
+#define ARCH_CET_UNLOCK			0x4004
 
 #define CET_SHSTK			0x1
 #define CET_WRSS			0x2
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index be544b4b4c8b..fbb2062dd0d2 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -834,6 +834,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_CET_ENABLE:
 	case ARCH_CET_DISABLE:
 	case ARCH_CET_LOCK:
+	case ARCH_CET_UNLOCK:
 		return cet_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 0efec02dbe6b..af1255164f0c 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -464,9 +464,14 @@ long cet_prctl(struct task_struct *task, int option, unsigned long features)
 		return 0;
 	}
 
-	/* Don't allow via ptrace */
-	if (task != current)
+	/* Only allow via ptrace */
+	if (task != current) {
+		if (option == ARCH_CET_UNLOCK) {
+			task->thread.features_locked &= ~features;
+			return 0;
+		}
 		return -EINVAL;
+	}
 
 	/* Do not allow to change locked features */
 	if (features & task->thread.features_locked)
-- 
2.17.1


* [OPTIONAL/RFC v2 39/39] x86: Add alt shadow stack support
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (37 preceding siblings ...)
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 38/39] x86/cet/shstk: Add ARCH_CET_UNLOCK Rick Edgecombe
@ 2022-09-29 22:29 ` Rick Edgecombe
  2022-10-03 23:21   ` Andy Lutomirski
  2022-10-03 17:04 ` [PATCH v2 00/39] Shadowstacks for userspace Kees Cook
  39 siblings, 1 reply; 241+ messages in thread
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe

To handle stack overflows, applications can register a separate alt
stack on which to handle signals. To handle shadow stack overflows, the
kernel can similarly provide the ability to have an alt shadow stack.

Signals push information about the execution context to the stack that
will handle the signal. The data pushed is used to restore registers
and other state after the signal. In the case of handling the signal on
a normal stack, the stack just needs to be unwound over the stack frame,
but in the case of alt stacks, the saved stack pointer is important for
the sigreturn to find its way back to the thread. With shadow stack
there is a new type of stack pointer, the shadow stack pointer (SSP), that
needs to be restored. Just like the regular stack pointer, it needs to be
saved somewhere in order to implement shadow alt stacks. This is already
done as part of the token placed to prevent SROP attacks, so on sigreturn
from an alt shadow stack, the kernel can easily know which SSP to restore.

But to enable SS_AUTODISARM-like functionality, the kernel also needs to
push the alt shadow stack base and size somewhere, as happens for regular
alt stacks. So push this data using the same format. In the end the
shadow stack sigframe looks like this:
|1...old SSP|1...alt stack size|1...alt stack base| 0|

In the future, any other data could come between the alt stack base and
the guard zero. The guard zero is to prevent tricking the kernel into
processing half of one frame and half of the adjacent frame.
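
To make the frame layout concrete, here is a small userspace sketch
that builds the four words in an ordinary array. It is illustrative
only: the sketch_* names are hypothetical, and marking data words by
setting bit 63 is an assumption read off the leading "1..." in the
diagram above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed marker for data words (the leading "1..." in the diagram). */
#define SKETCH_DATA_BIT	(UINT64_C(1) << 63)

/* Push one word; the stack grows toward lower indices. */
static void sketch_push(uint64_t *stack, size_t *top, uint64_t val)
{
	assert(*top > 0);
	stack[--(*top)] = val;
}

/* Build the sigframe: guard zero first, old SSP last (lowest address). */
static void sketch_push_sigframe(uint64_t *stack, size_t *top,
				 uint64_t old_ssp, uint64_t alt_base,
				 uint64_t alt_size)
{
	sketch_push(stack, top, 0);
	sketch_push(stack, top, alt_base | SKETCH_DATA_BIT);
	sketch_push(stack, top, alt_size | SKETCH_DATA_BIT);
	sketch_push(stack, top, old_ssp | SKETCH_DATA_BIT);
}
```

Reading the array from the new top upward gives the order shown in the
diagram: old SSP, alt stack size, alt stack base, then the guard zero.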

In past designs for userspace shadow stacks, shadow alt stacks were not
supported. Since there was only one shadow stack, longjmp() could jump out
of a signal by using incssp to unwind the SSP to the place where the
setjmp() was called. Since alt shadow stacks are a new thing, simply don't
support longjmp()ing from an alt shadow stack.

Introduce a new syscall "sigaltshstk" that behaves similarly to
sigaltstack. Have it take new and old stack_t's to specify the base and
length of the alt shadow stack. Don't have it adopt the same flag
semantics though, because not all alt stack flags will necessarily apply
to alt shadow stacks. As long as the syscall is getting new flag meanings,
make SS_AUTODISARM the default behavior for sigaltshstk(), and not require
a flag. Today the only flag supported is SS_DISABLE, and a !SS_AUTODISARM
mode is not supported.

So when a signal hits it will jump to the location specified in
sigaltshstk(). Currently (without WRSS), userspace doesn’t have the
ability to arbitrarily set the SSP. But telling the kernel to set the
SSP to an arbitrary point on signal is kind of like that. So there would
be a weakening of the shadow stack protections unless additional checks
are made. With the SS_AUTODISARM-style behavior, the SSP will only jump to
the alt shadow stack if the SSP is not already on the alt shadow stack,
otherwise it will just push the SSP. So have the kernel check for a token
whenever transitioning to the alt stack from a place other than the alt
stack. This token can be written by the kernel during shadow stack
allocation, using the map_shadow_stack syscall.
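
The token check can be sketched as a pure function over the token value
and the address it was read from. This is a simplified userspace
illustration following the checks in the diff below (mode flag set,
busy flag clear, aligned, token pointing just above its own slot). The
function name is hypothetical and the user address range check is left
out.

```c
#include <stdint.h>

/*
 * Validate a restore token read from token_addr. On success, store the
 * SSP encoded in the token in *new_ssp and return 0; otherwise return -1.
 */
static int sketch_check_rstor_token(uint64_t token, uint64_t token_addr,
				    uint64_t *new_ssp)
{
	/* Mode flag (bit 0) must be set. */
	if (!(token & 1))
		return -1;
	/* Busy flag (bit 1) must be clear. */
	if (token & 2)
		return -1;

	/* Mask out the flags to recover the encoded address. */
	token &= ~UINT64_C(3);

	/* The encoded address must be 8-byte aligned. */
	if (token & 7)
		return -1;
	/* The token must point just above the slot it is stored in. */
	if (token - 8 != token_addr)
		return -1;

	*new_ssp = token;
	return 0;
}
```

For example, a token stored at 0x7ff8 is only accepted if it encodes
0x8000 with the mode flag set and the busy flag clear.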

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - New patch

 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/x86/include/asm/cet.h                    |   2 +
 arch/x86/include/asm/processor.h              |   3 +
 arch/x86/kernel/process.c                     |   3 +
 arch/x86/kernel/shstk.c                       | 178 +++++++++++++++---
 include/linux/syscalls.h                      |   1 +
 kernel/sys_ni.c                               |   1 +
 .../testing/selftests/x86/test_shadow_stack.c |  75 ++++++++
 8 files changed, 240 insertions(+), 24 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d9639e3e0a33..a2dd5d56caa4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -373,6 +373,7 @@
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
 451	common	map_shadow_stack	sys_map_shadow_stack
+452	common	sigaltshstk		sys_sigaltshstk
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index edf681d4843a..52119b913ed6 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -26,6 +26,7 @@ void reset_thread_shstk(void);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
 int wrss_control(bool enable);
+void reset_alt_shstk(void);
 #else
 static inline long cet_prctl(struct task_struct *task, int option,
 		      unsigned long features) { return -EINVAL; }
@@ -40,6 +41,7 @@ static inline void reset_thread_shstk(void) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
 static inline int wrss_control(bool enable) { return -EOPNOTSUPP; }
+static inline void reset_alt_shstk(void) {}
 #endif /* CONFIG_X86_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3a0c9d9d4d1d..b9fb966edec7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -536,6 +536,9 @@ struct thread_struct {
 
 #ifdef CONFIG_X86_SHADOW_STACK
 	struct thread_shstk	shstk;
+	unsigned long			sas_shstk_sp;
+	size_t				sas_shstk_size;
+	unsigned int			sas_shstk_flags;
 #endif
 
 	/* Floating point and extended processor state */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 5e63d190becd..b71eb2d6a20f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -176,6 +176,9 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
+	if ((clone_flags & (CLONE_VM|CLONE_VFORK)) == CLONE_VM)
+		reset_alt_shstk();
+
 	/* Allocate a new shadow stack for pthread if needed */
 	ret = shstk_alloc_thread_stack(p, clone_flags, args->flags, &shstk_addr);
 	if (ret)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index af1255164f0c..05ee3793b60f 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -25,6 +25,7 @@
 #include <asm/special_insns.h>
 #include <asm/fpu/api.h>
 #include <asm/prctl.h>
+#include <asm/signal.h>
 
 #define SS_FRAME_SIZE 8
 
@@ -149,11 +150,18 @@ int shstk_setup(void)
 	return 0;
 }
 
+void reset_alt_shstk(void)
+{
+	current->thread.sas_shstk_sp = 0;
+	current->thread.sas_shstk_size = 0;
+}
+
 void reset_thread_shstk(void)
 {
 	memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
 	current->thread.features = 0;
 	current->thread.features_locked = 0;
+	reset_alt_shstk();
 }
 
 int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
@@ -238,39 +246,67 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
 	return 0;
 }
 
+static bool on_alt_shstk(unsigned long ssp)
+{
+	unsigned long alt_ss_start = current->thread.sas_shstk_sp;
+	unsigned long alt_ss_end = alt_ss_start + current->thread.sas_shstk_size;
+
+	return ssp >= alt_ss_start && ssp < alt_ss_end;
+}
+
+static bool alt_shstk_active(void)
+{
+	return current->thread.sas_shstk_sp;
+}
+
+static int alt_shstk_valid(unsigned long ssp, size_t size)
+{
+	if (ssp && (size < PAGE_SIZE || size >= TASK_SIZE_MAX))
+		return -EINVAL;
+
+	if (ssp >= TASK_SIZE_MAX)
+		return -EINVAL;
+
+	return 0;
+}
+
 /*
- * Create a restore token on shadow stack, and then push the user-mode
- * function return address.
+ * Verify the user shadow stack has a valid token on it, and then set
+ * *new_ssp according to the token.
  */
-static int shstk_setup_rstor_token(unsigned long ret_addr, unsigned long *new_ssp)
+static int shstk_check_rstor_token(unsigned long token_addr, unsigned long *new_ssp)
 {
-	unsigned long ssp, token_addr;
-	int err;
+	unsigned long token;
 
-	if (!ret_addr)
+	if (get_user(token, (unsigned long __user *)token_addr))
+		return -EFAULT;
+
+	/* Is mode flag correct? */
+	if (!(token & BIT(0)))
 		return -EINVAL;
 
-	ssp = get_user_shstk_addr();
-	if (!ssp)
+	/* Is busy flag set? */
+	if (token & BIT(1))
 		return -EINVAL;
 
-	err = create_rstor_token(ssp, &token_addr);
-	if (err)
-		return err;
+	/* Mask out flags */
+	token &= ~3UL;
+
+	/* Restore address aligned? */
+	if (!IS_ALIGNED(token, 8))
+		return -EINVAL;
 
-	ssp = token_addr - sizeof(u64);
-	err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
+	/* Token placed properly? */
+	if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
+		return -EINVAL;
 
-	if (!err)
-		*new_ssp = ssp;
+	*new_ssp = token;
 
-	return err;
+	return 0;
 }
 
-static int shstk_push_sigframe(unsigned long *ssp)
+static int shstk_push_sigframe(unsigned long *ssp, unsigned long target_ssp)
 {
-	unsigned long target_ssp = *ssp;
-
 	/* Token must be aligned */
 	if (!IS_ALIGNED(*ssp, 8))
 		return -EINVAL;
@@ -278,17 +314,32 @@ static int shstk_push_sigframe(unsigned long *ssp)
 	if (!IS_ALIGNED(target_ssp, 8))
 		return -EINVAL;
 
+	*ssp -= SS_FRAME_SIZE;
+	if (write_user_shstk_64((u64 __user *)*ssp, 0))
+		return -EFAULT;
+
+	*ssp -= SS_FRAME_SIZE;
+	if (put_shstk_data((u64 __user *)*ssp, current->thread.sas_shstk_sp))
+		return -EFAULT;
+
+	*ssp -= SS_FRAME_SIZE;
+	if (put_shstk_data((u64 __user *)*ssp, current->thread.sas_shstk_size))
+		return -EFAULT;
+
 	*ssp -= SS_FRAME_SIZE;
 	if (put_shstk_data((u64 __user *)*ssp, target_ssp))
 		return -EFAULT;
 
+	current->thread.sas_shstk_sp = 0;
+	current->thread.sas_shstk_size = 0;
+
 	return 0;
 }
 
 
 static int shstk_pop_sigframe(unsigned long *ssp)
 {
-	unsigned long token_addr;
+	unsigned long token_addr, shstk_sp, shstk_size;
 	int err;
 
 	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
@@ -303,7 +354,38 @@ static int shstk_pop_sigframe(unsigned long *ssp)
 	if (unlikely(token_addr >= TASK_SIZE_MAX))
 		return -EINVAL;
 
+	*ssp += SS_FRAME_SIZE;
+	err = get_shstk_data(&shstk_size, (void __user *)*ssp);
+	if (unlikely(err))
+		return err;
+
+	*ssp += SS_FRAME_SIZE;
+	err = get_shstk_data(&shstk_sp, (void __user *)*ssp);
+	if (unlikely(err))
+		return err;
+
+	if (unlikely(alt_shstk_valid((unsigned long)shstk_sp, shstk_size)))
+		return -EINVAL;
+
 	*ssp = token_addr;
+	current->thread.sas_shstk_sp = shstk_sp;
+	current->thread.sas_shstk_size = shstk_size;
+
+	return 0;
+}
+
+static int get_sig_start_ssp(unsigned long orig_ssp, unsigned long *ssp)
+{
+	unsigned long sp_end = (current->thread.sas_shstk_sp +
+				current->thread.sas_shstk_size) - SS_FRAME_SIZE;
+
+	if (!alt_shstk_active() || on_alt_shstk(*ssp)) {
+		*ssp = orig_ssp;
+		return 0;
+	}
+
+	if (shstk_check_rstor_token(sp_end, ssp))
+		return -EINVAL;
 
 	return 0;
 }
@@ -311,7 +393,7 @@ static int shstk_pop_sigframe(unsigned long *ssp)
 int setup_signal_shadow_stack(struct ksignal *ksig)
 {
 	void __user *restorer = ksig->ka.sa.sa_restorer;
-	unsigned long ssp;
+	unsigned long ssp, orig_ssp;
 	int err;
 
 	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
@@ -321,11 +403,15 @@ int setup_signal_shadow_stack(struct ksignal *ksig)
 	if (!restorer)
 		return -EINVAL;
 
-	ssp = get_user_shstk_addr();
-	if (unlikely(!ssp))
+	orig_ssp = get_user_shstk_addr();
+	if (unlikely(!orig_ssp))
 		return -EINVAL;
 
-	err = shstk_push_sigframe(&ssp);
+	err = get_sig_start_ssp(orig_ssp, &ssp);
+	if (unlikely(err))
+		return err;
+
+	err = shstk_push_sigframe(&ssp, orig_ssp);
 	if (unlikely(err))
 		return err;
 
@@ -496,3 +582,47 @@ long cet_prctl(struct task_struct *task, int option, unsigned long features)
 		return wrss_control(true);
 	return -EINVAL;
 }
+
+SYSCALL_DEFINE2(sigaltshstk, const stack_t __user *, uss, stack_t __user *, uoss)
+{
+	unsigned long ssp;
+	stack_t new, old;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return -ENOSYS;
+
+	ssp = get_user_shstk_addr();
+
+	if (unlikely(!ssp || on_alt_shstk(ssp)))
+		return -EPERM;
+
+	if (uss) {
+		if (unlikely(copy_from_user(&new, uss, sizeof(stack_t))))
+			return -EFAULT;
+
+		if (unlikely(alt_shstk_valid((unsigned long)new.ss_sp,
+					     new.ss_size)))
+			return -EINVAL;
+
+		if (new.ss_flags & SS_DISABLE) {
+			current->thread.sas_shstk_sp = 0;
+			current->thread.sas_shstk_size = 0;
+			return 0;
+		}
+
+		current->thread.sas_shstk_sp = (unsigned long) new.ss_sp;
+		current->thread.sas_shstk_size = new.ss_size;
+		/* No saved flags for now */
+	}
+
+	if (!uoss)
+		return 0;
+
+	memset(&old, 0, sizeof(stack_t));
+	old.ss_sp = (void __user *)current->thread.sas_shstk_sp;
+	old.ss_size = current->thread.sas_shstk_size;
+	if (copy_to_user(uoss, &old, sizeof(stack_t)))
+		return -EFAULT;
+
+	return 0;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3ae05cbdea5b..7b7e7bb992c2 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1057,6 +1057,7 @@ asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long l
 					    unsigned long home_node,
 					    unsigned long flags);
 asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
+asmlinkage long sys_sigaltshstk(const struct sigaltstack *uss, struct sigaltstack *uoss);
 
 /*
  * Architecture-specific system calls
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index cb9aebd34646..3a5f8b76e7a4 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -382,6 +382,7 @@ COND_SYSCALL(modify_ldt);
 COND_SYSCALL(vm86);
 COND_SYSCALL(kexec_file_load);
 COND_SYSCALL(map_shadow_stack);
+COND_SYSCALL(sigaltshstk);
 
 /* s390 */
 COND_SYSCALL(s390_pci_mmio_read);
diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
index 249397736d0d..22b856de5cdd 100644
--- a/tools/testing/selftests/x86/test_shadow_stack.c
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -492,6 +492,76 @@ int test_userfaultfd(void)
 	return 1;
 }
 
+volatile bool segv_pass;
+
+long sigaltshstk(stack_t *uss, stack_t *ouss)
+{
+	return syscall(__NR_sigaltshstk, uss, ouss);
+}
+
+void segv_alt_handler(int signum, siginfo_t *si, void *uc)
+{
+	unsigned long min = (unsigned long)shstk_ptr;
+	unsigned long max = (unsigned long)shstk_ptr + SS_SIZE;
+	unsigned long ssp = get_ssp();
+	stack_t alt_shstk_stackt;
+
+	if (sigaltshstk(NULL, &alt_shstk_stackt))
+		goto fail;
+
+	if (alt_shstk_stackt.ss_sp || alt_shstk_stackt.ss_size)
+		goto fail;
+
+	if (ssp < min || ssp > max - 8)
+		goto fail;
+
+	segv_pass = true;
+	return;
+fail:
+	segv_pass = false;
+}
+
+int test_shstk_alt_stack(void)
+{
+	stack_t alt_shstk_stackt = { 0 };
+	struct sigaction sa = { 0 };
+	int ret = 1;
+
+	sa.sa_sigaction = segv_alt_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGUSR1, &sa, NULL))
+		return 1;
+
+	shstk_ptr = create_shstk(0);
+	if (shstk_ptr == MAP_FAILED)
+		goto err_sig;
+
+	alt_shstk_stackt.ss_sp = shstk_ptr;
+	alt_shstk_stackt.ss_size = SS_SIZE;
+	if (sigaltshstk(&alt_shstk_stackt, NULL) == -1)
+		goto err_shstk;
+
+	segv_pass = false;
+
+	/* Make sure segv_pass is cleared before the signal is raised */
+	asm volatile("" : : : "memory");
+
+	raise(SIGUSR1);
+
+	if (segv_pass) {
+		printf("[OK]\tAlt shadow stack test.\n");
+		ret = 0;
+	}
+
+err_shstk:
+	alt_shstk_stackt.ss_flags = SS_DISABLE;
+	sigaltshstk(&alt_shstk_stackt, NULL);
+	free_shstk(shstk_ptr);
+err_sig:
+	signal(SIGUSR1, SIG_DFL);
+	return ret;
+}
+
 int main(int argc, char *argv[])
 {
 	int ret = 0;
@@ -556,6 +626,11 @@ int main(int argc, char *argv[])
 		printf("[FAIL]\tUserfaultfd test\n");
 	}
 
+	if (test_shstk_alt_stack()) {
+		ret = 1;
+		printf("[FAIL]\tAlt shadow stack test\n");
+	}
+
 out:
 	/*
 	 * Disable shadow stack before the function returns, or there will be a
-- 
2.17.1



* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-09-29 22:28 ` [PATCH v2 01/39] Documentation/x86: Add CET description Rick Edgecombe
@ 2022-09-30  3:41   ` Bagas Sanjaya
  2022-09-30 13:33     ` Jonathan Corbet
  2022-10-03 19:35     ` John Hubbard
  2022-10-03 17:18   ` Kees Cook
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 241+ messages in thread
From: Bagas Sanjaya @ 2022-09-30  3:41 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:28:58PM -0700, Rick Edgecombe wrote:
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========================================
> +Control-flow Enforcement Technology (CET)
> +=========================================
> +
> +Overview
> +========
> +
> +Control-flow Enforcement Technology (CET) is term referring to several
> +related x86 processor features that provides protection against control
> +flow hijacking attacks. The HW feature itself can be set up to protect
> +both applications and the kernel. Only user-mode protection is implemented
> +in the 64-bit kernel.
> +
> +CET introduces Shadow Stack and Indirect Branch Tracking. Shadow stack is
> +a secondary stack allocated from memory and cannot be directly modified by
> +applications. When executing a CALL instruction, the processor pushes the
> +return address to both the normal stack and the shadow stack. Upon
> +function return, the processor pops the shadow stack copy and compares it
> +to the normal stack copy. If the two differ, the processor raises a
> +control-protection fault. Indirect branch tracking verifies indirect
> +CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
> +opcodes. Not all CPU's have both Shadow Stack and Indirect Branch Tracking
> +and only Shadow Stack is currently supported in the kernel.
> +
> +The Kconfig options is X86_SHADOW_STACK, and it can be disabled with
> +the kernel parameter clearcpuid, like this: "clearcpuid=shstk".
> +
> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or LLVM v10.0.1
> +or later are required. To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.
> +
> +At run time, /proc/cpuinfo shows CET features if the processor supports
> +CET.
> +
> +Application Enabling
> +====================
> +
> +An application's CET capability is marked in its ELF header and can be
> +verified from readelf/llvm-readelf output:
> +
> +    readelf -n <application> | grep -a SHSTK
> +        properties: x86 feature: SHSTK
> +
> +The kernel does not process these applications directly. Applications must
> +enable them using the interface descriped in section 4. Typically this
> +would be done in dynamic loader or static runtime objects, as is the case
> +in glibc.
> +
> +Backward Compatibility
> +======================
> +
> +GLIBC provides a few CET tunables via the GLIBC_TUNABLES environment
> +variable:
> +
> +GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-WRSS
> +    Turn off SHSTK/WRSS.
> +
> +GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
> +    This controls how dlopen() handles SHSTK legacy libraries::
> +
> +        on         - continue with SHSTK enabled;
> +        permissive - continue with SHSTK off.
> +
> +Details can be found in the GLIBC manual pages.
> +
> +CET arch_prctl()'s
> +==================
> +
> +Elf features should be enabled by the loader using the below arch_prctl's.
> +
> +arch_prctl(ARCH_CET_ENABLE, unsigned int feature)
> +    Enable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_CET_DISABLE, unsigned int feature)
> +    Disable features specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_CET_LOCK, unsigned int features)
> +    Lock in features at their current enabled or disabled status.
> +
> +The return values are as following:
> +    On success, return 0. On error, errno can be::
> +
> +        -EPERM if any of the passed feature are locked.
> +        -EOPNOTSUPP if the feature is not supported by the hardware or
> +         disabled by kernel parameter.
> +        -EINVAL arguments (non existing feature, etc)
> +
> +Currently shadow stack and WRSS are supported via this interface. WRSS
> +can only be enabled with shadow stack, and is automatically disabled
> +if shadow stack is disabled.
> +
> +Proc status
> +===========
> +To check if an application is actually running with shadow stack, the
> +user can read the /proc/$PID/arch_status. It will report "wrss" or
> +"shstk" depending on what is enabled.
> +
> +The implementation of the Shadow Stack
> +======================================
> +
> +Shadow Stack size
> +-----------------
> +
> +A task's shadow stack is allocated from memory to a fixed size of
> +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
> +the maximum size of the normal stack, but capped to 4 GB. However,
> +a compat-mode application's address space is smaller, each of its thread's
> +shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).
> +
> +Signal
> +------
> +
> +By default, the main program and its signal handlers use the same shadow
> +stack. Because the shadow stack stores only return addresses, a large
> +shadow stack covers the condition that both the program stack and the
> +signal alternate stack run out.
> +
> +The kernel creates a restore token for the shadow stack and pushes the
> +restorer address to the shadow stack. Then verifies that token when
> +restoring from the signal handler.
> +
> +Fork
> +----
> +
> +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
> +to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
> +shadow access triggers a page fault with the shadow stack access bit set
> +in the page fault error code.
> +
> +When a task forks a child, its shadow stack PTEs are copied and both the
> +parent's and the child's shadow stack PTEs are cleared of the dirty bit.
> +Upon the next shadow stack access, the resulting shadow stack page fault
> +is handled by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new shadow stack
> +for the new thread.

The documentation above can be improved (both grammar and formatting):

---- >8 ----

diff --git a/Documentation/x86/cet.rst b/Documentation/x86/cet.rst
index 6b270a24ebc3a2..f691f7995cf088 100644
--- a/Documentation/x86/cet.rst
+++ b/Documentation/x86/cet.rst
@@ -15,92 +15,101 @@ in the 64-bit kernel.
 
 CET introduces Shadow Stack and Indirect Branch Tracking. Shadow stack is
 a secondary stack allocated from memory and cannot be directly modified by
-applications. When executing a CALL instruction, the processor pushes the
+applications. When executing a ``CALL`` instruction, the processor pushes the
 return address to both the normal stack and the shadow stack. Upon
 function return, the processor pops the shadow stack copy and compares it
 to the normal stack copy. If the two differ, the processor raises a
 control-protection fault. Indirect branch tracking verifies indirect
-CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
-opcodes. Not all CPU's have both Shadow Stack and Indirect Branch Tracking
-and only Shadow Stack is currently supported in the kernel.
+``CALL``/``JMP`` targets are intended as marked by the compiler with ``ENDBR``
+opcodes. Not all CPUs have both Shadow Stack and Indirect Branch Tracking
+and only Shadow Stack is currently supported by the kernel.
 
-The Kconfig options is X86_SHADOW_STACK, and it can be disabled with
-the kernel parameter clearcpuid, like this: "clearcpuid=shstk".
+The Kconfig option is ``X86_SHADOW_STACK`` and it can be overridden with
+the kernel command-line parameter ``clearcpuid`` (for example
+``clearcpuid=shstk``).
 
 To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or LLVM v10.0.1
-or later are required. To build a CET-enabled application, GLIBC v2.28 or
+or later are required. To build a CET-enabled application, glibc v2.28 or
 later is also required.
 
-At run time, /proc/cpuinfo shows CET features if the processor supports
-CET.
+At run time, ``/proc/cpuinfo`` shows CET features if the processor supports
+them.
 
-Application Enabling
-====================
+Enabling CET in applications
+============================
 
-An application's CET capability is marked in its ELF header and can be
-verified from readelf/llvm-readelf output:
+The CET capability of an application is marked in its ELF header and can be
+verified from ``readelf``/``llvm-readelf`` output::
 
     readelf -n <application> | grep -a SHSTK
         properties: x86 feature: SHSTK
 
 The kernel does not process these applications directly. Applications must
-enable them using the interface descriped in section 4. Typically this
+enable them using :ref:`cet-arch_prctl`. Typically this
 would be done in dynamic loader or static runtime objects, as is the case
 in glibc.
 
 Backward Compatibility
 ======================
 
-GLIBC provides a few CET tunables via the GLIBC_TUNABLES environment
+glibc provides a few CET tunables via the ``GLIBC_TUNABLES`` environment
 variable:
 
-GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-WRSS
+  * ``GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-WRSS``
+
     Turn off SHSTK/WRSS.
 
-GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
-    This controls how dlopen() handles SHSTK legacy libraries::
+  * ``GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>``
 
-        on         - continue with SHSTK enabled;
-        permissive - continue with SHSTK off.
+    This controls how :manpage:`dlopen(3)` handles SHSTK legacy libraries.
+    Possible values are:
 
-Details can be found in the GLIBC manual pages.
+    * ``on``         - continue with SHSTK enabled;
+    * ``permissive`` - continue with SHSTK off.
 
-CET arch_prctl()'s
-==================
+.. _cet-arch_prctl:
 
-Elf features should be enabled by the loader using the below arch_prctl's.
+CET arch_prctl() interface
+==========================
 
-arch_prctl(ARCH_CET_ENABLE, unsigned int feature)
-    Enable a single feature specified in 'feature'. Can only operate on
+ELF features should be enabled by the loader using the following
+:manpage:`arch_prctl(2)` subfunctions:
+
+  * ``arch_prctl(ARCH_CET_ENABLE, unsigned int feature)``
+
+    Enable a single feature specified in ``feature``. Can only operate on
     one feature at a time.
 
-arch_prctl(ARCH_CET_DISABLE, unsigned int feature)
-    Disable features specified in 'feature'. Can only operate on
+  * ``arch_prctl(ARCH_CET_DISABLE, unsigned int feature)``
+
+    Disable features specified in ``feature``. Can only operate on
     one feature at a time.
 
-arch_prctl(ARCH_CET_LOCK, unsigned int features)
-    Lock in features at their current enabled or disabled status.
+  * ``arch_prctl(ARCH_CET_LOCK, unsigned int features)``
+
+    Lock in features at their current status.
+
+  * ``arch_prctl(ARCH_CET_UNLOCK, unsigned int features)``
 
-arch_prctl(ARCH_CET_UNLOCK, unsigned int features)
     Unlock features.
 
-The return values are as following:
-    On success, return 0. On error, errno can be::
+On success, :manpage:`arch_prctl(2)` returns 0, otherwise the errno
+can be:
 
-        -EPERM if any of the passed feature are locked.
-        -EOPNOTSUPP if the feature is not supported by the hardware or
-         disabled by kernel parameter.
-        -EINVAL arguments (non existing feature, etc)
+  - ``EPERM`` if any of the passed features are locked.
+  - ``EOPNOTSUPP`` if the feature is not supported by the hardware or
+    disabled by the kernel command-line parameter.
+  - ``EINVAL`` if the arguments are invalid (nonexistent feature, etc).
 
 Currently shadow stack and WRSS are supported via this interface. WRSS
 can only be enabled with shadow stack, and is automatically disabled
 if shadow stack is disabled.
 
-Proc status
+proc status
 ===========
-To check if an application is actually running with shadow stack, the
-user can read the /proc/$PID/arch_status. It will report "wrss" or
-"shstk" depending on what is enabled.
+To check if an application is actually running with shadow stack, users can
+read ``/proc/$PID/arch_status``. It will report ``wrss`` or
+``shstk`` depending on what is enabled.
 
 The implementation of the Shadow Stack
 ======================================
@@ -108,11 +117,11 @@ The implementation of the Shadow Stack
 Shadow Stack size
 -----------------
 
-A task's shadow stack is allocated from memory to a fixed size of
-MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+The shadow stack of a task is allocated from memory to a fixed size of
+``MIN(RLIMIT_STACK, 4 GB)``. In other words, the shadow stack is allocated to
 the maximum size of the normal stack, but capped to 4 GB. However,
-a compat-mode application's address space is smaller, each of its thread's
-shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).
+the address space of a compat-mode application is smaller; the shadow stack
+size of each of its threads is ``MIN(1/4 RLIMIT_STACK, 4 GB)``.
 
 Signal
 ------
@@ -123,19 +132,19 @@ shadow stack covers the condition that both the program stack and the
 signal alternate stack run out.
 
 The kernel creates a restore token for the shadow stack and pushes the
-restorer address to the shadow stack. Then verifies that token when
-restoring from the signal handler.
+restorer address to it. Then the kernel verifies that token when restoring
+from the signal handler.
 
 Fork
 ----
 
-The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
-to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+The shadow stack vma has ``VM_SHADOW_STACK`` flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not read-only and dirty, a
 shadow access triggers a page fault with the shadow stack access bit set
 in the page fault error code.
 
 When a task forks a child, its shadow stack PTEs are copied and both the
-parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+shadow stack PTEs of parent and child are cleared of the dirty bit.
 Upon the next shadow stack access, the resulting shadow stack page fault
 is handled by page copy/re-use.
 
Thanks.

-- 
An old man doll... just what I always wanted! - Clara


* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-09-30  3:41   ` Bagas Sanjaya
@ 2022-09-30 13:33     ` Jonathan Corbet
  2022-09-30 13:41       ` Bagas Sanjaya
  2022-10-03 19:35     ` John Hubbard
  1 sibling, 1 reply; 241+ messages in thread
From: Jonathan Corbet @ 2022-09-30 13:33 UTC (permalink / raw)
  To: Bagas Sanjaya, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

Bagas Sanjaya <bagasdotme@gmail.com> writes:

> The documentation above can be improved (both grammar and formatting):
>
> ---- >8 ----
>
> diff --git a/Documentation/x86/cet.rst b/Documentation/x86/cet.rst
> index 6b270a24ebc3a2..f691f7995cf088 100644
> --- a/Documentation/x86/cet.rst
> +++ b/Documentation/x86/cet.rst
> @@ -15,92 +15,101 @@ in the 64-bit kernel.
>  
>  CET introduces Shadow Stack and Indirect Branch Tracking. Shadow stack is
>  a secondary stack allocated from memory and cannot be directly modified by
> -applications. When executing a CALL instruction, the processor pushes the
> +applications. When executing a ``CALL`` instruction, the processor pushes the

Just to be clear, not everybody is fond of sprinkling lots of ``literal
text`` throughout the documentation in this way.  Heavy use of it will
certainly clutter the plain-text file and can be a net negative overall.

Thanks,

jon


* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-09-30 13:33     ` Jonathan Corbet
@ 2022-09-30 13:41       ` Bagas Sanjaya
  2022-10-03 16:56         ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Bagas Sanjaya @ 2022-09-30 13:41 UTC (permalink / raw)
  To: Jonathan Corbet, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On 9/30/22 20:33, Jonathan Corbet wrote:
>>  CET introduces Shadow Stack and Indirect Branch Tracking. Shadow stack is
>>  a secondary stack allocated from memory and cannot be directly modified by
>> -applications. When executing a CALL instruction, the processor pushes the
>> +applications. When executing a ``CALL`` instruction, the processor pushes the
> 
> Just to be clear, not everybody is fond of sprinkling lots of ``literal
> text`` throughout the documentation in this way.  Heavy use of it will
> certainly clutter the plain-text file and can be a net negative overall.
> 

Actually there is a trade-off between semantic correctness and plain-text
clarity. With regard to inline code samples (like identifiers), I fall
into the former camp. But when I'm reviewing patches where the
surrounding documentation falls into the latter camp (code samples left
without markup), I can adapt to that style as long as it causes no
warnings whatsoever.

-- 
An old man doll... just what I always wanted! - Clara


* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-09-29 22:29 ` [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
@ 2022-09-30 15:16   ` Jann Horn
  2022-10-06 16:10     ` Edgecombe, Rick P
  2022-10-03 16:26   ` Kirill A . Shutemov
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 241+ messages in thread
From: Jann Horn @ 2022-09-30 15:16 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Fri, Sep 30, 2022 at 12:30 AM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
> The reason it's lightly used is that Dirty=1 is normally set _before_ a
> write. A write with a Write=0 PTE would typically only generate a fault,
> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
> stacks will no longer exhibit this oddity.

Stupid question, since I just recently learned that IOMMUv2 is a
thing: I assume this also holds for IOMMUs that implement IOMMUv2/SVA,
where the IOMMU directly walks the userspace page tables, and not just
for the CPU core?


* Re: [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-09-29 22:29 ` [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
@ 2022-09-30 19:16   ` Dave Hansen
  2022-09-30 20:30     ` Edgecombe, Rick P
  2022-09-30 23:00     ` Jann Horn
  2022-10-03 18:39   ` Kees Cook
  1 sibling, 2 replies; 241+ messages in thread
From: Dave Hansen @ 2022-09-30 19:16 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On 9/29/22 15:29, Rick Edgecombe wrote:
> @@ -1633,6 +1633,9 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
>  {
>  	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
>  
> +	if (write && (pteval & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY)
> +		return 0;

Do we not have a helper for this?  Seems a bit messy to open-code these
shadow-stack permissions.  Definitely at least needs a comment.


* Re: [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-09-30 19:16   ` Dave Hansen
@ 2022-09-30 20:30     ` Edgecombe, Rick P
  2022-09-30 20:37       ` Dave Hansen
  2022-09-30 23:00     ` Jann Horn
  1 sibling, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-09-30 20:30 UTC (permalink / raw)
  To: Shankar, Ravi V, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, keescook, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, rppt, mingo, Hansen, Dave,
	corbet, linux-kernel, linux-api, gorcunov

On Fri, 2022-09-30 at 12:16 -0700, Dave Hansen wrote:
> On 9/29/22 15:29, Rick Edgecombe wrote:
> > @@ -1633,6 +1633,9 @@ static inline bool
> > __pte_access_permitted(unsigned long pteval, bool write)
> >   {
> >        unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
> >   
> > +     if (write && (pteval & (_PAGE_RW | _PAGE_DIRTY)) ==
> > _PAGE_DIRTY)
> > +             return 0;
> 
> Do we not have a helper for this?  Seems a bit messy to open-code
> these
> shadow-stack permissions.  Definitely at least needs a comment.

It's because pteval is an unsigned long. We could create a pte_t, and
use the helpers, but then we would be using pte_foo() on pmd's, etc. So
probably comment is the better option?


* Re: [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-09-30 20:30     ` Edgecombe, Rick P
@ 2022-09-30 20:37       ` Dave Hansen
  0 siblings, 0 replies; 241+ messages in thread
From: Dave Hansen @ 2022-09-30 20:37 UTC (permalink / raw)
  To: Edgecombe, Rick P, Shankar, Ravi V, bsingharora, hpa,
	Syromiatnikov, Eugene, peterz, rdunlap, keescook, dave.hansen,
	kirill.shutemov, Eranian, Stephane, linux-mm, fweimer,
	nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg, hjl.tools,
	Yang, Weijiang, Lutomirski, Andy, pavel, arnd, Moreira, Joao,
	tglx, mike.kravetz, x86, linux-doc, jamorris, john.allen, rppt,
	mingo, corbet, linux-kernel, linux-api, gorcunov

On 9/30/22 13:30, Edgecombe, Rick P wrote:
> On Fri, 2022-09-30 at 12:16 -0700, Dave Hansen wrote:
>> On 9/29/22 15:29, Rick Edgecombe wrote:
>>> @@ -1633,6 +1633,9 @@ static inline bool
>>> __pte_access_permitted(unsigned long pteval, bool write)
>>>   {
>>>        unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
>>>
>>> +     if (write && (pteval & (_PAGE_RW | _PAGE_DIRTY)) ==
>>> _PAGE_DIRTY)
>>> +             return 0;
>> Do we not have a helper for this?  Seems a bit messy to open-code
>> these
>> shadow-stack permissions.  Definitely at least needs a comment.
> It's because pteval is an unsigned long. We could create a pte_t, and
> use the helpers, but then we would be using pte_foo() on pmd's, etc. So
> probably comment is the better option?

Yeah, a comment is probably best.

This is one of those "generic" page table functions that doesn't work
well with the p{te,md,ud}_* types.  It's either this or cast over to a
pteval_t for pmd/pud and pretend this is a pte-only function.



* Re: [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-09-30 19:16   ` Dave Hansen
  2022-09-30 20:30     ` Edgecombe, Rick P
@ 2022-09-30 23:00     ` Jann Horn
  2022-09-30 23:02       ` Jann Horn
  2022-09-30 23:04       ` Edgecombe, Rick P
  1 sibling, 2 replies; 241+ messages in thread
From: Jann Horn @ 2022-09-30 23:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Fri, Sep 30, 2022 at 9:16 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 9/29/22 15:29, Rick Edgecombe wrote:
> > @@ -1633,6 +1633,9 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
> >  {
> >       unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
> >
> > +     if (write && (pteval & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY)
> > +             return 0;
>
> Do we not have a helper for this?  Seems a bit messy to open-code these
> shadow-stack permissions.  Definitely at least needs a comment.

FWIW, if you look at more context around this diff, the function looks
like this:

 static inline bool __pte_access_permitted(unsigned long pteval, bool write)
 {
        unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;

+       if (write && (pteval & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY)
+               return 0;
+
        if (write)
                need_pte_bits |= _PAGE_RW;

        if ((pteval & need_pte_bits) != need_pte_bits)
                return 0;

        return __pkru_allows_pkey(pte_flags_pkey(pteval), write);
 }

So I think this change is actually a no-op - the only thing it does is
to return 0 if write==1, !_PAGE_RW, and _PAGE_DIRTY. But the check
below will always return 0 if !_PAGE_RW, unless I'm misreading it? And
this is the only patch in the series that touches this function, so
it's not like this becomes necessary with a later patch in the series
either.

Should this check go in anyway for clarity reasons, or should this
instead be a comment explaining that __pte_access_permitted() behaves
just like the hardware access check, which means shadow pages are
treated as readonly?
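
To double-check the no-op claim, here is a small userspace model of the
function with and without the new check. It is only a sketch: the bit
values follow the x86 PTE layout, the pkey check is assumed to pass, and
the model_* names are made up for illustration.

```c
#include <stdbool.h>

/* x86 PTE flag bits relevant here (present, writable, user, dirty). */
#define _PAGE_PRESENT	0x001UL
#define _PAGE_RW	0x002UL
#define _PAGE_USER	0x004UL
#define _PAGE_DIRTY	0x040UL

/* Model of the function before the patch; the pkey check is assumed to pass. */
static bool model_access_permitted_old(unsigned long pteval, bool write)
{
	unsigned long need_pte_bits = _PAGE_PRESENT | _PAGE_USER;

	if (write)
		need_pte_bits |= _PAGE_RW;

	if ((pteval & need_pte_bits) != need_pte_bits)
		return false;

	return true;
}

/* Same model with the proposed shadow stack check added on top. */
static bool model_access_permitted_new(unsigned long pteval, bool write)
{
	if (write && (pteval & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY)
		return false;

	return model_access_permitted_old(pteval, write);
}

/* Returns true if both models agree for every combination of the four bits. */
static bool models_agree(void)
{
	const unsigned long bits[] = {
		_PAGE_PRESENT, _PAGE_RW, _PAGE_USER, _PAGE_DIRTY
	};

	for (unsigned int mask = 0; mask < 16; mask++) {
		unsigned long pteval = 0;

		for (unsigned int i = 0; i < 4; i++)
			if (mask & (1u << i))
				pteval |= bits[i];

		for (int write = 0; write <= 1; write++)
			if (model_access_permitted_old(pteval, write) !=
			    model_access_permitted_new(pteval, write))
				return false;
	}
	return true;
}
```

In this model, whenever the added check fires (write set, _PAGE_RW clear),
the existing need_pte_bits comparison already rejects the access.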

* Re: [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-09-30 23:00     ` Jann Horn
@ 2022-09-30 23:02       ` Jann Horn
  2022-09-30 23:04       ` Edgecombe, Rick P
  1 sibling, 0 replies; 241+ messages in thread
From: Jann Horn @ 2022-09-30 23:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Sat, Oct 1, 2022 at 1:00 AM Jann Horn <jannh@google.com> wrote:
> So I think this change is actually a no-op - the only thing it does is
> to return 0 if write==1, !_PAGE_RW, and _PAGE_DIRTY. But the check
> below will always return 0 if !_PAGE_RW, unless I'm misreading it?

Er, to be precise, it will always return 0 if write==1 and !_PAGE_RW.

* Re: [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-09-30 23:00     ` Jann Horn
  2022-09-30 23:02       ` Jann Horn
@ 2022-09-30 23:04       ` Edgecombe, Rick P
  1 sibling, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-09-30 23:04 UTC (permalink / raw)
  To: jannh, Hansen, Dave
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, dethoma, linux-arch, kcc, bp,
	oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Sat, 2022-10-01 at 01:00 +0200, Jann Horn wrote:
> On Fri, Sep 30, 2022 at 9:16 PM Dave Hansen <dave.hansen@intel.com>
> wrote:
> > On 9/29/22 15:29, Rick Edgecombe wrote:
> > > @@ -1633,6 +1633,9 @@ static inline bool
> > > __pte_access_permitted(unsigned long pteval, bool write)
> > >   {
> > >        unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
> > > 
> > > +     if (write && (pteval & (_PAGE_RW | _PAGE_DIRTY)) ==
> > > _PAGE_DIRTY)
> > > +             return 0;
> > 
> > Do we not have a helper for this?  Seems a bit messy to open-code
> > these
> > shadow-stack permissions.  Definitely at least needs a comment.
> 
> FWIW, if you look at more context around this diff, the function
> looks
> like this:
> 
>  static inline bool __pte_access_permitted(unsigned long pteval, bool
> write)
>  {
>         unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
> 
> +       if (write && (pteval & (_PAGE_RW | _PAGE_DIRTY)) ==
> _PAGE_DIRTY)
> +               return 0;
> +
>         if (write)
>                 need_pte_bits |= _PAGE_RW;
> 
>         if ((pteval & need_pte_bits) != need_pte_bits)
>                 return 0;
> 
>         return __pkru_allows_pkey(pte_flags_pkey(pteval), write);
>  }
> 
> So I think this change is actually a no-op - the only thing it does
> is
> to return 0 if write==1, !_PAGE_RW, and _PAGE_DIRTY. But the check
> below will always return 0 if !_PAGE_RW, unless I'm misreading it?
> And
> this is the only patch in the series that touches this function, so
> it's not like this becomes necessary with a later patch in the series
> either.
> 
> Should this check go in anyway for clarity reasons, or should this
> instead be a comment explaining that __pte_access_permitted() behaves
> just like the hardware access check, which means shadow pages are
> treated as readonly?

Thanks Jann, I was just realizing the same thing. Yes, I think it
doesn't do anything. I can add a comment explaining why there is no
check, but otherwise the check seems like unnecessary work.

* Re: [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack
  2022-09-29 22:29 ` [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack Rick Edgecombe
@ 2022-10-03 10:36   ` Mike Rapoport
  2022-10-03 16:57     ` Edgecombe, Rick P
  2022-10-03 20:29   ` Kees Cook
  1 sibling, 1 reply; 241+ messages in thread
From: Mike Rapoport @ 2022-10-03 10:36 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:22PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When a process is duplicated, but the child shares the address space with
> the parent, there is potential for the threads sharing a single stack to
> cause conflicts for each other. In the normal non-cet case this is handled
> in two ways.
> 
> With regular CLONE_VM a new stack is provided by userspace such that the
> parent and child have different stacks.
> 
> For vfork, the parent is suspended until the child exits. So as long as
> the child doesn't return from the vfork()/CLONE_VFORK calling function and
> sticks to a limited set of operations, the parent and child can share the
> same stack.
> 
> For shadow stack, these scenarios present similar sharing problems. For the
> CLONE_VM case, the child and the parent must have separate shadow stacks.
> Instead of changing clone to take a shadow stack, have the kernel just
> allocate one and switch to it.
> 
> Use stack_size passed from clone3() syscall for thread shadow stack size. A
> compat-mode thread shadow stack size is further reduced to 1/4. This
> allows more threads to run in a 32-bit address space. The clone() does not
> pass stack_size, which was added to clone3(). In that case, use
> RLIMIT_STACK size and cap to 4 GB.
> 
> For shadow stack enabled vfork(), the parent and child can share the same
> shadow stack, like they can share a normal stack. Since the parent is
> suspended until the child terminates, the child will not interfere with
> the parent while executing as long as it doesn't return from the vfork()
> and overwrite up the shadow stack. The child can safely overwrite down
> the shadow stack, as the parent can just overwrite this later. So CET does
> not add any additional limitations for vfork().
> 
> Userspace implementing posix vfork() can actually prevent the child from
> returning from the vfork() calling function, using CET. Glibc does this
> by adjusting the shadow stack pointer in the child, so that the child
> receives a #CP if it tries to return from vfork() calling function.
> 
> Free the shadow stack on thread exit by doing it in mm_release(). Skip
> this when exiting a vfork() child since the stack is shared in the
> parent.
> 
> During this operation, the shadow stack pointer of the new thread needs
> to be updated to point to the newly allocated shadow stack. Since the
> ability to do this is confined to the FPU subsystem, change
> fpu_clone() to take the new shadow stack pointer, and update it
> internally inside the FPU subsystem. This part was suggested by Thomas
> Gleixner.
> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> 
> v2:
>  - Have fpu_clone() take new shadow stack pointer and update SSP in
>    xsave buffer for new task. (tglx)
> 
> v1:
>  - Expand commit log.
>  - Add more comments.
>  - Switch to xsave helpers.
> 
> Yu-cheng v30:
>  - Update comments about clone()/clone3(). (Borislav Petkov)
> 
> Yu-cheng v29:
>  - WARN_ON_ONCE() when get_xsave_addr() returns NULL, and update comments.
>    (Dave Hansen)
> 
>  arch/x86/include/asm/cet.h         |  7 +++++
>  arch/x86/include/asm/fpu/sched.h   |  3 +-
>  arch/x86/include/asm/mmu_context.h |  2 ++
>  arch/x86/kernel/fpu/core.c         | 40 ++++++++++++++++++++++++-
>  arch/x86/kernel/process.c          | 17 ++++++++++-
>  arch/x86/kernel/shstk.c            | 48 +++++++++++++++++++++++++++++-
>  6 files changed, 113 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 778d3054ccc7..f332e9b42b6d 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -555,8 +555,40 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
>  	}
>  }
>  
> +#ifdef CONFIG_X86_SHADOW_STACK
> +static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
> +{
> +	struct cet_user_state *xstate;
> +
> +	/* If ssp update is not needed. */
> +	if (!ssp)
> +		return 0;
> +
> +	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
> +				XFEATURE_CET_USER);
> +
> +	/*
> +	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
> +	 * stack and the fpu state should be up to date since it was just copied
> +	 * from the parent in fpu_clone(). So there must be a valid non-init CET
> +	 * state location in the buffer.
> +	 */
> +	if (WARN_ON_ONCE(!xstate))
> +		return 1;
> +
> +	xstate->user_ssp = (u64)ssp;
> +
> +	return 0;
> +}
> +#else
> +static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
> +{

return 0; ?

> +}
> +#endif
> +

-- 
Sincerely yours,
Mike.

* Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-09-29 22:28 ` [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack Rick Edgecombe
@ 2022-10-03 13:40   ` Kirill A . Shutemov
  2022-10-03 19:53     ` Edgecombe, Rick P
  2022-10-03 17:25   ` Kees Cook
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 241+ messages in thread
From: Kirill A . Shutemov @ 2022-10-03 13:40 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, joao.moreira, John Allen, kcc,
	eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:28:59PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Shadow Stack provides protection against function return address
> corruption. It is active when the processor supports it, the kernel has
> CONFIG_X86_SHADOW_STACK enabled, and the application is built for the
> feature. This is only implemented for the 64-bit kernel. When it is
> enabled, legacy non-Shadow Stack applications continue to work, but without
> protection.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> 
> ---
> 
> v2:
>  - Remove already wrong kernel size increase info (tglx)
>  - Change prompt to remove "Intel" (tglx)
>  - Update line about what CPUs are supported (Dave)
> 
> Yu-cheng v25:
>  - Remove X86_CET and use X86_SHADOW_STACK directly.
> 
> Yu-cheng v24:
>  - Update for the splitting X86_CET to X86_SHADOW_STACK and X86_IBT.
> 
>  arch/x86/Kconfig           | 18 ++++++++++++++++++
>  arch/x86/Kconfig.assembler |  5 +++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index f9920f1341c8..b68eb75887b8 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -26,6 +26,7 @@ config X86_64
>  	depends on 64BIT
>  	# Options that are inherently 64-bit kernel only:
>  	select ARCH_HAS_GIGANTIC_PAGE
> +	select ARCH_HAS_SHADOW_STACK
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>  	select ARCH_USE_CMPXCHG_LOCKREF
>  	select HAVE_ARCH_SOFT_DIRTY
> @@ -1936,6 +1937,23 @@ config X86_SGX
>  
>  	  If unsure, say N.
>  
> +config ARCH_HAS_SHADOW_STACK
> +	def_bool n

Hm. Shouldn't ARCH_HAS_SHADOW_STACK definition be in arch/Kconfig, not
under arch/x86?

Also, I think "def_bool n" has the same meaning as just "bool", no?

> +
> +config X86_SHADOW_STACK
> +	prompt "X86 Shadow Stack"
> +	def_bool n

Maybe just

	bool "X86 Shadow Stack"

?

> +	depends on ARCH_HAS_SHADOW_STACK
> +	select ARCH_USES_HIGH_VMA_FLAGS
> +	help
> +	  Shadow Stack protection is a hardware feature that detects function
> +	  return address corruption. Today the kernel's support is limited to
> +	  virtualizing it in KVM guests.
> +
> +	  CPUs supporting shadow stacks were first released in 2020.
> +
> +	  If unsure, say N.
> +
>  config EFI
>  	bool "EFI runtime service support"
>  	depends on ACPI
> diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
> index 26b8c08e2fc4..00c79dd93651 100644
> --- a/arch/x86/Kconfig.assembler
> +++ b/arch/x86/Kconfig.assembler
> @@ -19,3 +19,8 @@ config AS_TPAUSE
>  	def_bool $(as-instr,tpause %ecx)
>  	help
>  	  Supported by binutils >= 2.31.1 and LLVM integrated assembler >= V7
> +
> +config AS_WRUSS
> +	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
> +	help
> +	  Supported by binutils >= 2.31 and LLVM integrated assembler
> -- 
> 2.17.1
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-09-29 22:29 ` [PATCH v2 07/39] x86/cet: Add user control-protection fault handler Rick Edgecombe
@ 2022-10-03 14:01   ` Kirill A . Shutemov
  2022-10-03 18:12     ` Edgecombe, Rick P
  2022-10-03 18:04   ` Kees Cook
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 241+ messages in thread
From: Kirill A . Shutemov @ 2022-10-03 14:01 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, joao.moreira, John Allen, kcc,
	eranian, rppt, jamorris, dethoma, Yu-cheng Yu, Michael Kerrisk

On Thu, Sep 29, 2022 at 03:29:04PM -0700, Rick Edgecombe wrote:
> +#else
> +static void do_user_control_protection_fault(struct pt_regs *regs,
> +					     unsigned long error_code)
> +{
> +	WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");

Why is this a warning, but runtime check for !X86_FEATURE_IBT and
!X86_FEATURE_SHSTK below is fatal?

> +}
> +#endif
> +
> +#ifdef CONFIG_X86_KERNEL_IBT
> +
> +static __ro_after_init bool ibt_fatal = true;
> +
> +extern void ibt_selftest_ip(void); /* code label defined in asm below */
>  
> +static void do_kernel_control_protection_fault(struct pt_regs *regs)
> +{
>  	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
>  		regs->ax = 0;
>  		return;
> @@ -283,9 +335,29 @@ static int __init ibt_setup(char *str)
>  }
>  
>  __setup("ibt=", ibt_setup);
> -
> +#else
> +static void do_kernel_control_protection_fault(struct pt_regs *regs)
> +{
> +	WARN_ONCE(1, "Kernel-mode control protection fault with IBT disabled\n");

Ditto.

> +}
>  #endif /* CONFIG_X86_KERNEL_IBT */
>  
> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
> +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
> +	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pr_err("Unexpected #CP\n");
> +		BUG();
> +	}
> +
> +	if (user_mode(regs))
> +		do_user_control_protection_fault(regs, error_code);
> +	else
> +		do_kernel_control_protection_fault(regs);
> +}
> +#endif /* defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK) */
> +
>  #ifdef CONFIG_X86_F00F_BUG
>  void handle_invalid_op(struct pt_regs *regs)
>  #else

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2022-09-29 22:29 ` [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
@ 2022-10-03 14:17   ` Kirill A . Shutemov
  2022-10-05  1:31   ` Andrew Cooper
  1 sibling, 0 replies; 241+ messages in thread
From: Kirill A . Shutemov @ 2022-10-03 14:17 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, joao.moreira, John Allen, kcc,
	eranian, rppt, jamorris, dethoma, Yu-cheng Yu, Christoph Hellwig

On Thu, Sep 29, 2022 at 03:29:05PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Processors sometimes directly create Write=0,Dirty=1 PTEs. These PTEs are
> created by software. One such case is that kernel read-only pages are
> historically set up as Dirty.
> 
> New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
> shadow stack pages. When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some
> instructions can write to such supervisor memory. The kernel does not set
> IA32_S_CET.SH_STK_EN, but to reduce ambiguity between shadow stack and
> regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> 
> ---
> 
> v2:
>  - Normalize PTE bit descriptions between patches
> 
>  arch/x86/include/asm/pgtable_types.h | 6 +++---
>  arch/x86/mm/pat/set_memory.c         | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index aa174fed3a71..ff82237e7b6b 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -192,10 +192,10 @@ enum page_cache_mode {
>  #define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
>  #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
>  #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
> -#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
> -#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
> +#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
> +#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
>  #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
> -#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|___D|   0|___G)
> +#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
>  #define __PAGE_KERNEL_LARGE	 (__PP|__RW|   0|___A|__NX|___D|_PSE|___G)
>  #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
>  #define __PAGE_KERNEL_WP	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __WP)
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 1abd5438f126..ed9193b469ba 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1977,7 +1977,7 @@ int set_memory_nx(unsigned long addr, int numpages)
>  
>  int set_memory_ro(unsigned long addr, int numpages)
>  {
> -	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
> +	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
>  }

Hm. Do we need to modify also *_wrprotect() helpers to clear dirty bit?

I guess not (at least without a lot of audit), as we risk losing the dirty
bit on page cache pages. But why is it safe? Do we only care about
kernel PTEs here? Are userspace Write=0,Dirty=1 PTEs handled as before?

>  int set_memory_rw(unsigned long addr, int numpages)
> -- 
> 2.17.1
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-09-29 22:29 ` [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
  2022-09-30 15:16   ` Jann Horn
@ 2022-10-03 16:26   ` Kirill A . Shutemov
  2022-10-03 21:36     ` Edgecombe, Rick P
  2022-10-05  2:17   ` Andrew Cooper
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 241+ messages in thread
From: Kirill A . Shutemov @ 2022-10-03 16:26 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, joao.moreira, John Allen, kcc,
	eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> +/*
> + * Normally the Dirty bit is used to denote COW memory on x86. But
> + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> + * since the Dirty=1,Write=0 will result in the memory being treated
> + * as shadow stack by the HW. So when creating COW memory, a software
> + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() and
> + * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
> + * transition it to the shadow stack compatible version of COW (Cow=1).
> + */
> +
> +static inline pte_t pte_mkcow(pte_t pte)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return pte;
> +
> +	pte = pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_set_flags(pte, _PAGE_COW);
> +}
> +
> +static inline pte_t pte_clear_cow(pte_t pte)
> +{
> +	/*
> +	 * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
> +	 * See the _PAGE_COW definition for more details.
> +	 */
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return pte;
> +
> +	/*
> +	 * PTE is getting copied-on-write, so it will be dirtied
> +	 * if writable, or made shadow stack if shadow stack and
> +	 * being copied on access. Set the dirty bit for both
> +	 * cases.
> +	 */
> +	pte = pte_set_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_COW);
> +}

These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the _PAGE_COW
logic for all machines with 64-bit entries. It will get you much more
coverage and more universal rules.

> +
>  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>  static inline int pte_uffd_wp(pte_t pte)
>  {
> @@ -319,7 +381,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
>  
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
> -	return pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
>  }
>  
>  static inline pte_t pte_mkold(pte_t pte)
> @@ -329,7 +391,16 @@ static inline pte_t pte_mkold(pte_t pte)
>  
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
> -	return pte_clear_flags(pte, _PAGE_RW);
> +	pte = pte_clear_flags(pte, _PAGE_RW);
> +
> +	/*
> +	 * Blindly clearing _PAGE_RW might accidentally create
> +	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
> +	 * dirty value to the software bit.
> +	 */
> +	if (pte_dirty(pte))
> +		pte = pte_mkcow(pte);
> +	return pte;
>  }

Hm. What about ptep/pmdp_set_wrprotect()? They clear _PAGE_RW blindly.
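
To illustrate the concern, here is a rough userspace model (not kernel
code). The RW and Dirty positions follow the x86 PTE layout, but the
_PAGE_COW position and the model_* names are assumptions made for
illustration, and X86_FEATURE_SHSTK is assumed to be enabled.

```c
#include <stdbool.h>
#include <stdint.h>

#define _PAGE_RW	(1ULL << 1)
#define _PAGE_DIRTY	(1ULL << 6)
/* Assumed software bit for illustration; the real position is set by the series. */
#define _PAGE_COW	(1ULL << 58)

/* A PTE that hardware would treat as shadow stack: Write=0,Dirty=1. */
static bool model_is_shstk(uint64_t pte)
{
	return (pte & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}

/* Model of pte_mkcow() from the patch, with X86_FEATURE_SHSTK assumed enabled. */
static uint64_t model_pte_mkcow(uint64_t pte)
{
	pte &= ~_PAGE_DIRTY;
	return pte | _PAGE_COW;
}

/* Model of the patched pte_wrprotect(): move Dirty to the software bit. */
static uint64_t model_pte_wrprotect(uint64_t pte)
{
	pte &= ~_PAGE_RW;
	if (pte & _PAGE_DIRTY)
		pte = model_pte_mkcow(pte);
	return pte;
}

/* Model of a helper that only clears _PAGE_RW and leaves Dirty alone. */
static uint64_t model_blind_wrprotect(uint64_t pte)
{
	return pte & ~_PAGE_RW;
}
```

Starting from a Write=1,Dirty=1 PTE, model_blind_wrprotect() leaves
Write=0,Dirty=1 behind, while model_pte_wrprotect() ends up with Dirty
clear and the software COW bit set.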

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-09-30 13:41       ` Bagas Sanjaya
@ 2022-10-03 16:56         ` Edgecombe, Rick P
  2022-10-04  2:16           ` Bagas Sanjaya
  2022-10-05  9:10           ` Peter Zijlstra
  0 siblings, 2 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 16:56 UTC (permalink / raw)
  To: corbet, bagasdotme
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, bp, Lutomirski, Andy,
	linux-doc, arnd, Moreira, Joao, tglx, mike.kravetz, x86, Yang,
	Weijiang, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	linux-kernel, linux-api, gorcunov

On Fri, 2022-09-30 at 20:41 +0700, Bagas Sanjaya wrote:
> On 9/30/22 20:33, Jonathan Corbet wrote:
> > >   CET introduces Shadow Stack and Indirect Branch Tracking.
> > > Shadow stack is
> > >   a secondary stack allocated from memory and cannot be directly
> > > modified by
> > > -applications. When executing a CALL instruction, the processor
> > > pushes the
> > > +applications. When executing a ``CALL`` instruction, the
> > > processor pushes the
> > 
> > Just to be clear, not everybody is fond of sprinkling lots of
> > ``literal
> > text`` throughout the documentation in this way.  Heavy use of it
> > will
> > certainly clutter the plain-text file and can be a net negative
> > overall.
> > 
> 
> Actually there is a trade-off between semantic correctness and plain-
> text
> clarity. With regards to inline code samples (like identifiers), I
> fall
> into the former camp. But when I'm reviewing patches for which the
> surrounding documentation go latter camp (leave code samples alone
> without
> markup), I can adapt to that style as long as it causes no warnings
> whatsoever.

Thanks. Unless anyone has any objections, I think I'll take all these
changes, except for the literal-izing of the instructions. They are not
really being used as code samples in this case.

Bagas, can you reply with your sign-off and I'll just apply it?

* Re: [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack
  2022-10-03 10:36   ` Mike Rapoport
@ 2022-10-03 16:57     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 16:57 UTC (permalink / raw)
  To: rppt
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 13:36 +0300, Mike Rapoport wrote:
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long
> > shstk_addr)
> > +{
> 
> return 0; ?
> 
> > +}
> > +#endif
> > +

Oops. It was a last minute change to have update_fpu_shstk() return
int. Thanks.
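
For clarity, the !CONFIG_X86_SHADOW_STACK stub would presumably end up
looking like the sketch below. The forward declaration is only there so
the snippet compiles on its own; it is not part of the patch.

```c
struct task_struct;	/* forward declaration so the snippet stands alone */

/* Stub for !CONFIG_X86_SHADOW_STACK: nothing to update, report success. */
static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
{
	return 0;
}
```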

* Re: [PATCH v2 00/39] Shadowstacks for userspace
  2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
                   ` (38 preceding siblings ...)
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 39/39] x86: Add alt shadow stack support Rick Edgecombe
@ 2022-10-03 17:04 ` Kees Cook
  2022-10-03 17:25   ` Jann Horn
  2022-10-03 18:33   ` Edgecombe, Rick P
  39 siblings, 2 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 17:04 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Thu, Sep 29, 2022 at 03:28:57PM -0700, Rick Edgecombe wrote:
> This is an overdue followup to the “Shadow stacks for userspace” CET series. 
> Thanks for all the comments on the first version [0]. They drove a decent 
> amount of changes for v2. Since it has been awhile, I’ll try to summarize the 
> areas that got major changes since last time. Smaller changes are listed in 
> each patch.

Thanks for the write-up!

> [...]
>         GUP
>         ---
>         Shadow stack memory is generally treated as writable by the kernel, but
>         it behaves differently than other writable memory with respect to GUP.
>         FOLL_WRITE will not GUP shadow stack memory unless FOLL_FORCE is also
>         set. Shadow stack memory is writable from the perspective of being
>         changeable by userspace, but it is also protected memory from
>         userspace’s perspective. So preventing it from being writable via
>         FOLL_WRITE helps make it harder for userspace to arbitrarily write to
>         it. However, like read-only memory, FOLL_FORCE can still write through
>         it. This means shadow stacks can be written to via things like
>         “/proc/self/mem”. Apps that want extra security will have to prevent
>         access to kernel features that can write with FOLL_FORCE.

This seems like a problem to me -- the point of SS is that there cannot be
a way to write to them without specific instruction sequences. The fact
that /proc/self/mem bypasses memory protections was an old design mistake
that keeps leading to surprising behaviors. It would be much nicer to
draw the line somewhere and just say that FOLL_FORCE doesn't work on
VM_SHADOW_STACK. Why must FOLL_FORCE be allowed to write to SS?

> [...]
> Shadow stack signal format
> --------------------------
> So to handle alt shadow stacks we need to push some data onto a stack. To 
> prevent SROP we need to push something to the shadow stack that the kernel can 
> [...]
> shadow stack return address or a shadow stack token. To make sure it can’t be 
> used, data is pushed with the high bit (bit 63) set. This bit is a linear 
> address bit in both the token format and a normal return address, so it should 
> not conflict with anything. It puts any return address in the kernel half of 
> the address space, so would never be created naturally by a userspace program. 
> It will not be a valid restore token either, as the kernel address will never 
> be pointing to the previous frame in the shadow stack.
> 
> When a signal hits, the format pushed to the stack that is handling the signal 
> is four 8 byte values (since we are 64 bit only):
> |1...old SSP|1...alt stack size|1...alt stack base|0|

Do these end up being non-canonical addresses? (To avoid confusion with
"real" kernel addresses?)

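For illustration only (not from the series): a small userspace sketch
of the arithmetic behind that question. It assumes 4-level paging
(48 implemented virtual-address bits); is_canonical() and
tag_frame_data() are hypothetical names, not kernel helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4-level paging, so bits 63..47 must all match bit 47. */
#define VA_BITS 48

/* True when the upper 17 bits are all zeros or all ones. */
static bool is_canonical(uint64_t addr)
{
	uint64_t top = addr >> (VA_BITS - 1);

	return top == 0 || top == 0x1FFFF;
}

/* Hypothetical helper: mark a value as signal-frame data via bit 63. */
static uint64_t tag_frame_data(uint64_t val)
{
	return val | (1ULL << 63);
}
```

Under that assumption, a user address with only bit 63 added has bit 63
set while bits 62..47 stay clear, so the check above rejects it.
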
-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-09-29 22:28 ` [PATCH v2 01/39] Documentation/x86: Add CET description Rick Edgecombe
  2022-09-30  3:41   ` Bagas Sanjaya
@ 2022-10-03 17:18   ` Kees Cook
  2022-10-03 19:46     ` Edgecombe, Rick P
  2022-10-05  0:02   ` Andrew Cooper
  2022-10-10 12:19   ` Florian Weimer
  3 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 17:18 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:28:58PM -0700, Rick Edgecombe wrote:
> [...]
> +Overview
> +========
> +
> +Control-flow Enforcement Technology (CET) is term referring to several
> +related x86 processor features that provides protection against control
> +flow hijacking attacks. The HW feature itself can be set up to protect
> +both applications and the kernel. Only user-mode protection is implemented
> +in the 64-bit kernel.

This likely needs rewording, since it's not strictly true any more:
IBT is supported in kernel-mode now (CONFIG_X86_IBT).

> +CET introduces Shadow Stack and Indirect Branch Tracking. Shadow stack is
> +a secondary stack allocated from memory and cannot be directly modified by
> +applications. When executing a CALL instruction, the processor pushes the
> +return address to both the normal stack and the shadow stack. Upon
> +function return, the processor pops the shadow stack copy and compares it
> +to the normal stack copy. If the two differ, the processor raises a
> +control-protection fault. Indirect branch tracking verifies indirect
> +CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
> +opcodes. Not all CPU's have both Shadow Stack and Indirect Branch Tracking
> +and only Shadow Stack is currently supported in the kernel.
> +
> +The Kconfig options is X86_SHADOW_STACK, and it can be disabled with
> +the kernel parameter clearcpuid, like this: "clearcpuid=shstk".
> +
> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or LLVM v10.0.1
> +or later are required. To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.
> +
> +At run time, /proc/cpuinfo shows CET features if the processor supports
> +CET.

Maybe call them out by name: shstk ibt
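
A rough userspace sketch of what checking for those names could look
like, assuming the flags show up as space-separated words on the
"flags" line of /proc/cpuinfo. has_flag() is a hypothetical helper and
the flag names are simply the ones suggested above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if 'flag' appears as a whole whitespace-delimited word in 'line'. */
static bool has_flag(const char *line, const char *flag)
{
	size_t n = strlen(flag);
	const char *p = line;

	if (n == 0)
		return false;

	while ((p = strstr(p, flag)) != NULL) {
		/* Reject substring hits such as "foo_shstk" or "shstk2". */
		bool start_ok = (p == line) || p[-1] == ' ' || p[-1] == '\t';
		bool end_ok = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';

		if (start_ok && end_ok)
			return true;
		p++;
	}
	return false;
}
```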

> +CET arch_prctl()'s
> +==================
> +
> +Elf features should be enabled by the loader using the below arch_prctl's.
> +
> +arch_prctl(ARCH_CET_ENABLE, unsigned int feature)
> +    Enable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.

Does this mean only 1 bit out of the 32 may be specified?
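
If that is the intent, the check would presumably be a plain
power-of-two test. A minimal sketch, assuming 'feature' is a bitmask
with one bit per feature; is_single_feature() is a made-up name, not
something from the patch:

```c
#include <stdbool.h>

/* Hypothetical check: true only when exactly one bit is set. */
static bool is_single_feature(unsigned int feature)
{
	return feature != 0 && (feature & (feature - 1)) == 0;
}
```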

> +
> +arch_prctl(ARCH_CET_DISABLE, unsigned int feature)
> +    Disable features specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_CET_LOCK, unsigned int features)
> +    Lock in features at their current enabled or disabled status.

How is the "features" argument processed here?

> [...]
> +Proc status
> +===========
> +To check if an application is actually running with shadow stack, the
> +user can read the /proc/$PID/arch_status. It will report "wrss" or
> +"shstk" depending on what is enabled.

TIL about "arch_status". :) Why is this a separate file? "status"
already has unique field names.

> +Fork
> +----
> +
> +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
> +to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
> +shadow access triggers a page fault with the shadow stack access bit set
> +in the page fault error code.
> +
> +When a task forks a child, its shadow stack PTEs are copied and both the
> +parent's and the child's shadow stack PTEs are cleared of the dirty bit.
> +Upon the next shadow stack access, the resulting shadow stack page fault
> +is handled by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new shadow stack
> +for the new thread.

Perhaps speak to the ASLR characteristics of the shstk here?

Also, it seems if there is a "Fork" section, there should be an "Exec"
section? I suspect it would be short: shstk is disabled when execve() is
called and must be re-enabled from userspace, yes?

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 00/39] Shadowstacks for userspace
  2022-10-03 17:04 ` [PATCH v2 00/39] Shadowstacks for userspace Kees Cook
@ 2022-10-03 17:25   ` Jann Horn
  2022-10-04  5:01     ` Kees Cook
  2022-10-03 18:33   ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Jann Horn @ 2022-10-03 17:25 UTC (permalink / raw)
  To: Kees Cook
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma

On Mon, Oct 3, 2022 at 7:04 PM Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 29, 2022 at 03:28:57PM -0700, Rick Edgecombe wrote:
> > This is an overdue followup to the “Shadow stacks for userspace” CET series.
> > Thanks for all the comments on the first version [0]. They drove a decent
> > amount of changes for v2. Since it has been awhile, I’ll try to summarize the
> > areas that got major changes since last time. Smaller changes are listed in
> > each patch.
>
> Thanks for the write-up!
>
> > [...]
> >         GUP
> >         ---
> >         Shadow stack memory is generally treated as writable by the kernel, but
> >         it behaves differently then other writable memory with respect to GUP.
> >         FOLL_WRITE will not GUP shadow stack memory unless FOLL_FORCE is also
> >         set. Shadow stack memory is writable from the perspective of being
> >         changeable by userspace, but it is also protected memory from
> >         userspace’s perspective. So preventing it from being writable via
> >         FOLL_WRITE help’s make it harder for userspace to arbitrarily write to
> >         it. However, like read-only memory, FOLL_FORCE can still write through
> >         it. This means shadow stacks can be written to via things like
> >         “/proc/self/mem”. Apps that want extra security will have to prevent
> >         access to kernel features that can write with FOLL_FORCE.
>
> This seems like a problem to me -- the point of SS is that there cannot be
> a way to write to them without specific instruction sequences. The fact
> that /proc/self/mem bypasses memory protections was an old design mistake
> that keeps leading to surprising behaviors. It would be much nicer to
> draw the line somewhere and just say that FOLL_FORCE doesn't work on
> VM_SHADOW_STACK. Why must FOLL_FORCE be allowed to write to SS?

But once you have FOLL_FORCE, you can also just write over stuff like
executable code instead of writing over the stack. I don't think
allowing FOLL_FORCE writes over shadow stacks from /proc/$pid/mem is
making things worse in any way, and it's probably helpful for stuff
like debuggers.

If you don't want /proc/$pid/mem to be able to do stuff like that,
then IMO the way to go is to change when /proc/$pid/mem uses
FOLL_FORCE, or to limit overall write access to /proc/$pid/mem.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-09-29 22:28 ` [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack Rick Edgecombe
  2022-10-03 13:40   ` Kirill A . Shutemov
@ 2022-10-03 17:25   ` Kees Cook
  2022-10-03 19:52     ` Edgecombe, Rick P
  2022-10-03 19:42   ` Dave Hansen
  2022-10-12 20:04   ` Borislav Petkov
  3 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 17:25 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:28:59PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Shadow Stack provides protection against function return address
> corruption. It is active when the processor supports it, the kernel has
> CONFIG_X86_SHADOW_STACK enabled, and the application is built for the
> feature. This is only implemented for the 64-bit kernel. When it is
> enabled, legacy non-Shadow Stack applications continue to work, but without
> protection.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> 
> ---
> 
> v2:
>  - Remove already wrong kernel size increase info (tlgx)
>  - Change prompt to remove "Intel" (tglx)
>  - Update line about what CPUs are supported (Dave)
> 
> Yu-cheng v25:
>  - Remove X86_CET and use X86_SHADOW_STACK directly.
> 
> Yu-cheng v24:
>  - Update for the splitting X86_CET to X86_SHADOW_STACK and X86_IBT.
> 
>  arch/x86/Kconfig           | 18 ++++++++++++++++++
>  arch/x86/Kconfig.assembler |  5 +++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index f9920f1341c8..b68eb75887b8 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -26,6 +26,7 @@ config X86_64
>  	depends on 64BIT
>  	# Options that are inherently 64-bit kernel only:
>  	select ARCH_HAS_GIGANTIC_PAGE
> +	select ARCH_HAS_SHADOW_STACK
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>  	select ARCH_USE_CMPXCHG_LOCKREF
>  	select HAVE_ARCH_SOFT_DIRTY
> @@ -1936,6 +1937,23 @@ config X86_SGX
>  
>  	  If unsure, say N.
>  
> +config ARCH_HAS_SHADOW_STACK
> +	def_bool n
> +
> +config X86_SHADOW_STACK
> +	prompt "X86 Shadow Stack"
> +	def_bool n

I hope we can switch this to "default y" soon, given it's a hardware
feature that is disabled at runtime when not available.

> +	depends on ARCH_HAS_SHADOW_STACK

Doesn't this depend on AS_WRUSS too?

> +	select ARCH_USES_HIGH_VMA_FLAGS
> +	help
> +	  Shadow Stack protection is a hardware feature that detects function
> +	  return address corruption. Today the kernel's support is limited to
> +	  virtualizing it in KVM guests.
> +
> +	  CPUs supporting shadow stacks were first released in 2020.
> +
> +	  If unsure, say N.
> +
>  config EFI
>  	bool "EFI runtime service support"
>  	depends on ACPI
> diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
> index 26b8c08e2fc4..00c79dd93651 100644
> --- a/arch/x86/Kconfig.assembler
> +++ b/arch/x86/Kconfig.assembler
> @@ -19,3 +19,8 @@ config AS_TPAUSE
>  	def_bool $(as-instr,tpause %ecx)
>  	help
>  	  Supported by binutils >= 2.31.1 and LLVM integrated assembler >= V7
> +
> +config AS_WRUSS
> +	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
> +	help
> +	  Supported by binutils >= 2.31 and LLVM integrated assembler

Otherwise, I don't see anything else using CONFIG_AS_WRUSS:

$ git grep AS_WRUSS
arch/x86/Kconfig.assembler:config AS_WRUSS

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks
  2022-09-29 22:29 ` [PATCH v2 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
@ 2022-10-03 17:26   ` Kees Cook
  2022-10-14 16:20   ` Borislav Petkov
  1 sibling, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 17:26 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:00PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The Control-Flow Enforcement Technology contains two related features,
> one of which is Shadow Stacks. Future patches will utilize this feature
> for shadow stack support in KVM, so add a CPU feature flags for Shadow
> Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
> 
> To protect shadow stack state from malicious modification, the registers
> are only accessible in supervisor mode. This implementation
> context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
> on XSAVES.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2022-09-29 22:29 ` [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
@ 2022-10-03 17:31   ` Kees Cook
  2022-10-05  0:55   ` Andrew Cooper
  2022-10-14 17:12   ` Borislav Petkov
  2 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 17:31 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:01PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Utilizing CET features requires a CR4 bit to be enabled as well as bits
> to be set in CET MSRs. Setting the CR4 bit does two things:
>  1. Enables the usage of WRUSS instruction, which the kernel can use to
>     write to userspace shadow stacks.
>  2. Allows those individual aspects of CET to be enabled later via the MSR.
>  3. Allows CET to be enabled in guests
> 
> While future patches will allow the MSR values to be saved and restored
> per task, the CR4 bit will allow for WRUSS to be used regardless of if a
> tasks CET MSRs have been restored.
> 
> Kernel IBT already enables the CET CR4 bit when it detects IBT HW support
> and is configured with kernel IBT. However future patches that enable
> userspace shadow stack support will need the bit set as well. So change
> the logic to enable it in either case.
> 
> Clear MSR_IA32_U_CET in cet_disable() so that it can't live to see
> userspace in a new kexec-ed kernel that has CR4.CET set from kernel IBT.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2022-09-29 22:29 ` [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
@ 2022-10-03 17:40   ` Kees Cook
  2022-10-15  9:46   ` Borislav Petkov
  1 sibling, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 17:40 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:02PM -0700, Rick Edgecombe wrote:
> [...]
> xfeatures. So refactor these check's by having XCHECK_SZ() set a bool when
> it actually check's the xfeature. This ends up exceeding 80 chars, but was

Spelling nit throughout all patches: possessive used for plurals. E.g.
the above "check's" instances should be "checks". Please review all the
documentation and commit logs; it shows up a lot. :)

> [...]
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index c8340156bfd2..5e6a4867fd05 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -39,26 +39,26 @@
>   */
>  static const char *xfeature_names[] =
>  {
> -	"x87 floating point registers"	,
> -	"SSE registers"			,
> -	"AVX registers"			,
> -	"MPX bounds registers"		,
> -	"MPX CSR"			,
> -	"AVX-512 opmask"		,
> -	"AVX-512 Hi256"			,
> -	"AVX-512 ZMM_Hi256"		,
> -	"Processor Trace (unused)"	,
> -	"Protection Keys User registers",
> -	"PASID state",
> -	"unknown xstate feature"	,
> -	"unknown xstate feature"	,
> -	"unknown xstate feature"	,
> -	"unknown xstate feature"	,
> -	"unknown xstate feature"	,
> -	"unknown xstate feature"	,
> -	"AMX Tile config"		,
> -	"AMX Tile data"			,
> -	"unknown xstate feature"	,
> +	"x87 floating point registers"			,
> +	"SSE registers"					,
> +	"AVX registers"					,
> +	"MPX bounds registers"				,
> +	"MPX CSR"					,
> +	"AVX-512 opmask"				,
> +	"AVX-512 Hi256"					,
> +	"AVX-512 ZMM_Hi256"				,
> +	"Processor Trace (unused)"			,
> +	"Protection Keys User registers"		,
> +	"PASID state"					,
> +	"Control-flow User registers"			,
> +	"Control-flow Kernel registers (unused)"	,
> +	"unknown xstate feature"			,
> +	"unknown xstate feature"			,
> +	"unknown xstate feature"			,
> +	"unknown xstate feature"			,
> +	"AMX Tile config"				,
> +	"AMX Tile data"					,
> +	"unknown xstate feature"			,

What a strange style. Why not just leave the commas after the " ? Then
these kinds of multi-line updates aren't needed in the future.

> [...]
> -	/*
> -	 * Make *SURE* to add any feature numbers in below if
> -	 * there are "holes" in the xsave state component
> -	 * numbers.
> -	 */
> -	if ((nr < XFEATURE_YMM) ||
> -	    (nr >= XFEATURE_MAX) ||
> -	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
> -	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
> +	if (!chked) {
>  		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
>  		XSTATE_WARN_ON(1);
>  		return false;

This clean-up feels like it should be part of a separate patch, but
okay. :)

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-09-29 22:29 ` [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Rick Edgecombe
@ 2022-10-03 17:43   ` Kirill A . Shutemov
  2022-10-03 18:11   ` Nadav Amit
  1 sibling, 0 replies; 241+ messages in thread
From: Kirill A . Shutemov @ 2022-10-03 17:43 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, joao.moreira, John Allen, kcc,
	eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:09PM -0700, Rick Edgecombe wrote:
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2f2963429f48..58c7bf9d7392 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1287,6 +1287,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pte_t *ptep)
>  {
> +#ifdef CONFIG_X86_SHADOW_STACK
> +	/*
> +	 * Avoid accidentally creating shadow stack PTEs
> +	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
> +	 * the hardware setting Dirty=1.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pte_t old_pte, new_pte;
> +
> +		old_pte = READ_ONCE(*ptep);
> +		do {
> +			new_pte = pte_wrprotect(old_pte);
> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> +
> +		return;
> +	}
> +#endif
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>  }

Okay, this addresses my previous question. The need for cmpxchg is
unfortunate, but oh well.
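
For readers following along, a userspace sketch of the same retry-loop
pattern using C11 atomics. The bit positions are illustrative
stand-ins, and wrprotect() only approximates what the series'
pte_wrprotect() is described as doing (moving Dirty to the software
COW bit); it is not the kernel code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative bit positions; PTE_COW stands in for a software bit. */
#define PTE_RW    (1ULL << 1)
#define PTE_DIRTY (1ULL << 6)
#define PTE_COW   (1ULL << 58)

/*
 * Clear Write. If Dirty was set, move it to the COW bit so the result
 * is never Write=0,Dirty=1 (the shadow stack encoding).
 */
static uint64_t wrprotect(uint64_t pte)
{
	pte &= ~PTE_RW;
	if (pte & PTE_DIRTY)
		pte = (pte & ~PTE_DIRTY) | PTE_COW;
	return pte;
}

/*
 * Retry until the value we computed from is still the value in memory.
 * On failure, 'old' is refreshed with the current contents, so a
 * concurrent Dirty=1 update is picked up on the next iteration.
 */
static void set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t val;

	do {
		val = wrprotect(old);
	} while (!atomic_compare_exchange_weak(ptep, &old, val));
}
```

A plain clear_bit() on the RW bit cannot give this guarantee, since the
Dirty bit could be set between reading the entry and clearing Write.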

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 14/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2022-09-29 22:29 ` [PATCH v2 14/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
@ 2022-10-03 17:47   ` Kirill A . Shutemov
  2022-10-04  0:29     ` Edgecombe, Rick P
  2022-10-03 18:17   ` Kees Cook
  1 sibling, 1 reply; 241+ messages in thread
From: Kirill A . Shutemov @ 2022-10-03 17:47 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, joao.moreira, John Allen, kcc,
	eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:11PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> A shadow stack PTE must be read-only and have _PAGE_DIRTY set.  However,
> read-only and Dirty PTEs also exist for copy-on-write (COW) pages.  These
> two cases are handled differently for page faults. Introduce
> VM_SHADOW_STACK to track shadow stack VMAs.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>  Documentation/filesystems/proc.rst | 1 +
>  arch/x86/mm/mmap.c                 | 2 ++
>  fs/proc/task_mmu.c                 | 3 +++
>  include/linux/mm.h                 | 8 ++++++++
>  4 files changed, 14 insertions(+)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index e7aafc82be99..d54ff397947a 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -560,6 +560,7 @@ encoded manner. The codes are the following:
>      mt    arm64 MTE allocation tags are enabled
>      um    userfaultfd missing tracking
>      uw    userfaultfd wr-protect tracking
> +    ss    shadow stack page
>      ==    =======================================
>  
>  Note that there is no guarantee that every flag and associated mnemonic will
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index c90c20904a60..f3f52c5e2fd6 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
>  
>  const char *arch_vma_name(struct vm_area_struct *vma)
>  {
> +	if (vma->vm_flags & VM_SHADOW_STACK)
> +		return "[shadow stack]";
>  	return NULL;
>  }
>  

But why here?

CONFIG_ARCH_HAS_SHADOW_STACK implies that there will be more than one arch
that supports shadow stack. The name has to come from generic code too, no?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate
  2022-09-29 22:29 ` [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate Rick Edgecombe
@ 2022-10-03 17:48   ` Kees Cook
  2022-10-03 20:05     ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 17:48 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Thu, Sep 29, 2022 at 03:29:03PM -0700, Rick Edgecombe wrote:
> Just like user xfeatures, supervisor xfeatures can be active in the
> registers or present in the task FPU buffer. If the registers are
> active, the registers can be modified directly. If the registers are
> not active, the modification must be performed on the task FPU buffer.
> 
> When the state is not active, the kernel could perform modifications
> directly to the buffer. But in order for it to do that, it needs
> to know where in the buffer the specific state it wants to modify is
> located. Doing this is not robust against optimizations that compact
> the FPU buffer, as each access would require computing where in the
> buffer it is.
> 
> The easiest way to modify supervisor xfeature data is to force restore
> the registers and write directly to the MSRs. Often times this is just fine
> anyway as the registers need to be restored before returning to userspace.
> Do this for now, leaving buffer writing optimizations for the future.

Just for my own clarity, does this mean lock/load _needs_ to happen
before MSR access, or is it just a convenient place to do it? From later
patches it seems it's a requirement during MSR access, which might be a
good idea to detail here. It answers the question "when is this function
needed?"

> 
> Add a new function fpregs_lock_and_load() that can simultaneously call
> fpregs_lock() and do this restore. Also perform some extra sanity
> checks in this function since this will be used in non-fpu focused code.

Nit: this is called "fpu_lock_and_load" in the patch itself.

> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-09-29 22:29 ` [PATCH v2 07/39] x86/cet: Add user control-protection fault handler Rick Edgecombe
  2022-10-03 14:01   ` Kirill A . Shutemov
@ 2022-10-03 18:04   ` Kees Cook
  2022-10-03 20:33     ` Edgecombe, Rick P
  2022-10-03 22:51   ` Andy Lutomirski
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:04 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu, Michael Kerrisk

On Thu, Sep 29, 2022 at 03:29:04PM -0700, Rick Edgecombe wrote:
> [...]
> -#ifdef CONFIG_X86_KERNEL_IBT
> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)

This pattern is repeated several times. Perhaps there needs to be a
CONFIG_X86_CET to make this more readable? Really just a style question.

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b68eb75887b8..6cb52616e0cf 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1836,6 +1836,11 @@ config CC_HAS_IBT
 		  (CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
 		  $(as-instr,endbr64)
 
+config X86_CET
+	def_bool n
+	help
+	  CET features are enabled (IBT and/or Shadow Stack)
+
 config X86_KERNEL_IBT
 	prompt "Indirect Branch Tracking"
 	bool
@@ -1843,6 +1848,7 @@ config X86_KERNEL_IBT
 	# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
 	depends on !LD_IS_LLD || LLD_VERSION >= 140000
 	select OBJTOOL
+	select X86_CET
 	help
 	  Build the kernel with support for Indirect Branch Tracking, a
 	  hardware support course-grain forward-edge Control Flow Integrity
@@ -1945,6 +1951,7 @@ config X86_SHADOW_STACK
 	def_bool n
 	depends on ARCH_HAS_SHADOW_STACK
 	select ARCH_USES_HIGH_VMA_FLAGS
+	select X86_CET
 	help
 	  Shadow Stack protection is a hardware feature that detects function
 	  return address corruption. Today the kernel's support is limited to

> [...]
> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
> +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
> +	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pr_err("Unexpected #CP\n");
> +		BUG();
> +	}

I second Kirill's question here. This seems an entirely survivable
(but highly unexpected) state. I think this whole "if" could just be
replaced with:

	WARN_ONCE(!cpu_feature_enabled(X86_FEATURE_IBT) &&
		  !cpu_feature_enabled(X86_FEATURE_SHSTK),
		  "Unexpected #CP\n");

Otherwise this looks good to me.

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply related	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 09/39] x86/mm: Move pmd_write(), pud_write() up in the file
  2022-09-29 22:29 ` [PATCH v2 09/39] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
@ 2022-10-03 18:06   ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:06 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:06PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> To prepare the introduction of _PAGE_COW, move pmd_write() and
> pud_write() up in the file, so that they can be used by other
> helpers below.  No functional changes.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Hey, a PTE patch I'm able to review! ;)

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-09-29 22:29 ` [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Rick Edgecombe
  2022-10-03 17:43   ` Kirill A . Shutemov
@ 2022-10-03 18:11   ` Nadav Amit
  2022-10-03 18:51     ` Dave Hansen
  2022-10-03 22:28     ` Edgecombe, Rick P
  1 sibling, 2 replies; 241+ messages in thread
From: Nadav Amit @ 2022-10-03 18:11 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: X86 ML, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	linux-doc, Linux MM, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, Mike Rapoport, jamorris,
	dethoma, Yu-cheng Yu

On Sep 29, 2022, at 3:29 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When Shadow Stack is in use, Write=0,Dirty=1 PTEs are reserved for shadow
> stack. Copy-on-write PTEs then have Write=0,Cow=1.
> 
> When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
> become a transient shadow stack PTE in two cases:
> 
> The first case is that some processors can start a write but end up seeing
> a Write=0 PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE. However, this will not occur on processors supporting
> Shadow Stack, and a TLB flush is not necessary.
> 
> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
> atomically, a transient shadow stack PTE can be created as a result.
> Thus, prevent that with cmpxchg.
> 
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue.  Jann Horn provided the cmpxchg solution.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> 
> v2:
> - Compile out some code due to clang build error
> - Clarify commit log (dhansen)
> - Normalize PTE bit descriptions between patches (dhansen)
> - Update comment with text from (dhansen)
> 
> Yu-cheng v30:
> - Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELS > 2 (Borislav Petkov).
> 
> arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2f2963429f48..58c7bf9d7392 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1287,6 +1287,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> static inline void ptep_set_wrprotect(struct mm_struct *mm,
> 				      unsigned long addr, pte_t *ptep)
> {
> +#ifdef CONFIG_X86_SHADOW_STACK
> +	/*
> +	 * Avoid accidentally creating shadow stack PTEs
> +	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
> +	 * the hardware setting Dirty=1.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pte_t old_pte, new_pte;
> +
> +		old_pte = READ_ONCE(*ptep);
> +		do {
> +			new_pte = pte_wrprotect(old_pte);
> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> +
> +		return;
> +	}
> +#endif

Is there no way to use IS_ENABLED() here instead of these ifdefs?

Did you look at ptep_set_access_flags() and friends and check that they
do not need to be changed too? Perhaps you should at least add an
assertion to ensure nothing breaks.
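
For readers following the cmpxchg discussion, here is a minimal userspace
sketch of the retry loop quoted above, written with C11 atomics. It is not
the kernel code: the sk_ names and the bit positions are assumptions made up
for illustration, and the 64-bit atomic stands in for the PTE that the
hardware may update concurrently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit positions for this sketch only; not the kernel's layout. */
#define SK_PAGE_RW	(UINT64_C(1) << 1)
#define SK_PAGE_DIRTY	(UINT64_C(1) << 6)
#define SK_PAGE_COW	(UINT64_C(1) << 58)

/* Clear Write and, if Dirty was set, move it to the software Cow bit. */
static uint64_t sk_pte_wrprotect(uint64_t pte)
{
	pte &= ~SK_PAGE_RW;
	if (pte & SK_PAGE_DIRTY) {
		pte &= ~SK_PAGE_DIRTY;
		pte |= SK_PAGE_COW;
	}
	return pte;
}

/*
 * Write-protect *ptep in one atomic step. If another agent (standing in
 * for the hardware page walker) changes the entry between the load and
 * the exchange, the exchange fails, 'old' is refreshed with the current
 * value, and the new value is recomputed from it.
 */
static void sk_ptep_set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t new;

	do {
		new = sk_pte_wrprotect(old);
	} while (!atomic_compare_exchange_weak(ptep, &old, new));
}
```

Because the new value is always derived from the value that the exchange
actually replaced, no intermediate Write=0,Dirty=1 entry is ever stored.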


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 13/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  2022-09-29 22:29 ` [PATCH v2 13/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
@ 2022-10-03 18:11   ` Kees Cook
  2022-10-03 18:24   ` Peter Xu
  1 sibling, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:11 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu, Peter Xu

On Thu, Sep 29, 2022 at 03:29:10PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT (37), and make all
> VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to 38.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-10-03 14:01   ` Kirill A . Shutemov
@ 2022-10-03 18:12     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 18:12 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: mtk.manpages, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, keescook, Yu, Yu-cheng, dave.hansen, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 17:01 +0300, Kirill A . Shutemov wrote:
> On Thu, Sep 29, 2022 at 03:29:04PM -0700, Rick Edgecombe wrote:
> > +#else
> > +static void do_user_control_protection_fault(struct pt_regs *regs,
> > +                                          unsigned long
> > error_code)
> > +{
> > +     WARN_ONCE(1, "User-mode control protection fault with shadow
> > support disabled\n");
> 
> Why is this a warning, but runtime check for !X86_FEATURE_IBT and
> !X86_FEATURE_SHSTK below is fatal?

It was a BUG() in the original KERNEL_IBT focused handler IIRC. There
seems to be some renewed effort to stop doing those:

https://lore.kernel.org/all/20220923113426.52871-2-david@redhat.com/T/#u

...so I'll change it to a WARN for this. In the kernel specific portion
of the handler, it also does a BUG on endbranch violation. I'll leave
that one for this change.



^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 14/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2022-09-29 22:29 ` [PATCH v2 14/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
  2022-10-03 17:47   ` Kirill A . Shutemov
@ 2022-10-03 18:17   ` Kees Cook
  1 sibling, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:17 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:11PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> A shadow stack PTE must be read-only and have _PAGE_DIRTY set.  However,
> read-only and Dirty PTEs also exist for copy-on-write (COW) pages.  These
> two cases are handled differently for page faults. Introduce
> VM_SHADOW_STACK to track shadow stack VMAs.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>  Documentation/filesystems/proc.rst | 1 +
>  arch/x86/mm/mmap.c                 | 2 ++
>  fs/proc/task_mmu.c                 | 3 +++
>  include/linux/mm.h                 | 8 ++++++++
>  4 files changed, 14 insertions(+)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index e7aafc82be99..d54ff397947a 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -560,6 +560,7 @@ encoded manner. The codes are the following:
>      mt    arm64 MTE allocation tags are enabled
>      um    userfaultfd missing tracking
>      uw    userfaultfd wr-protect tracking
> +    ss    shadow stack page
>      ==    =======================================
>  
>  Note that there is no guarantee that every flag and associated mnemonic will
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index c90c20904a60..f3f52c5e2fd6 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
>  
>  const char *arch_vma_name(struct vm_area_struct *vma)
>  {
> +	if (vma->vm_flags & VM_SHADOW_STACK)
> +		return "[shadow stack]";
>  	return NULL;
>  }

I agree with Kirill: this should be in the arch-agnostic code.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 15/39] x86/mm: Check Shadow Stack page fault errors
  2022-09-29 22:29 ` [PATCH v2 15/39] x86/mm: Check Shadow Stack page fault errors Rick Edgecombe
@ 2022-10-03 18:20   ` Kees Cook
  2022-10-14 10:07   ` Peter Zijlstra
  1 sibling, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:20 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:12PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The CPU performs "shadow stack accesses" when it expects to encounter
> shadow stack mappings. These accesses can be implicit (via CALL/RET
> instructions) or explicit (instructions like WRSS).
> 
> Shadow stack accesses to shadow-stack mappings can see faults in normal,
> valid operation just like regular accesses to regular mappings. Shadow
> stacks need some of the same features like delayed allocation, swap and
> copy-on-write. The kernel needs to use faults to implement those features.
> 
> The architecture has concepts of both shadow stack reads and shadow stack
> writes. Any shadow stack access to non-shadow stack memory will generate
> a fault with the shadow stack error code bit set.
> 
> This means that, unlike normal write protection, the fault handler needs
> to create a type of memory that can be written to (with instructions that
> generate shadow stack writes), even to fulfill a read access. So in the
> case of COW memory, the COW needs to take place even with a shadow stack
> read. Otherwise the page will be left (shadow stack) writable in
> userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
> for shadow stack accesses, even if the access was a shadow stack read.
> 
> Shadow stack accesses can also result in errors, such as when a shadow
> stack overflows, or if a shadow stack access occurs to a non-shadow-stack
> mapping. Also, generate the errors for invalid shadow stack accesses.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
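
The rule described in the commit message (treat every shadow stack access
as a write for fault-handling purposes) can be sketched as below. This is an
illustrative model only: the sk_ names and the bit values are assumptions
for this sketch and are not taken from the patch.

```c
#include <assert.h>

/* Hypothetical error-code and fault-flag bits for this sketch. */
#define SK_PF_WRITE		(1u << 1)
#define SK_PF_SHSTK		(1u << 6)
#define SK_FAULT_FLAG_WRITE	(1u << 0)

/*
 * Translate a page fault error code into fault flags. A shadow stack
 * access sets the write flag even when it was a read, so that COW is
 * resolved before the page becomes shadow-stack writable.
 */
static unsigned int sk_fault_flags(unsigned int error_code)
{
	unsigned int flags = 0;

	if (error_code & SK_PF_WRITE)
		flags |= SK_FAULT_FLAG_WRITE;
	if (error_code & SK_PF_SHSTK)
		flags |= SK_FAULT_FLAG_WRITE;

	return flags;
}
```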

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack
  2022-09-29 22:29 ` [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack Rick Edgecombe
@ 2022-10-03 18:22   ` Kees Cook
  2022-10-03 23:53   ` Kirill A . Shutemov
  2022-10-14 15:32   ` Peter Zijlstra
  2 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:22 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:13PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When serving a page fault, maybe_mkwrite() makes a PTE writable if there is
> a write access to it, and its vma has VM_WRITE. Shadow stack accesses to
> shadow stack vma's are also treated as write accesses by the fault handler.
> This is because setting shadow stack memory makes it writable via some
> instructions, so COW has to happen even for shadow stack reads.
> 
> So maybe_mkwrite() should continue to set VM_WRITE vma's as normally
> writable, but also set VM_WRITE|VM_SHADOW_STACK vma's as shadow stack.
> 
> Do this by adding a pte_mkwrite_shstk() and a cross-arch stub. Check for
> VM_SHADOW_STACK in maybe_mkwrite() and call pte_mkwrite_shstk()
> accordingly.
> 
> Apply the same changes to maybe_pmd_mkwrite().
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>
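
A rough model of the maybe_mkwrite() behaviour the commit message describes
is sketched below. The helper bodies, sk_ names and bit positions are
assumptions for illustration; only the decision structure (VM_WRITE gives a
normally writable PTE, VM_WRITE|VM_SHADOW_STACK gives a shadow stack PTE) is
taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical PTE bits and VMA flags for this sketch. */
#define SK_PAGE_RW		(UINT64_C(1) << 1)
#define SK_PAGE_DIRTY		(UINT64_C(1) << 6)
#define SK_PAGE_COW		(UINT64_C(1) << 58)
#define SK_VM_WRITE		0x2ul
#define SK_VM_SHADOW_STACK	0x2000ul

/* Regular writable memory: Write=1. */
static uint64_t sk_pte_mkwrite(uint64_t pte)
{
	return pte | SK_PAGE_RW;
}

/* Shadow stack memory: Write=0,Dirty=1, with the software Cow bit dropped. */
static uint64_t sk_pte_mkwrite_shstk(uint64_t pte)
{
	pte &= ~(SK_PAGE_RW | SK_PAGE_COW);
	return pte | SK_PAGE_DIRTY;
}

/* Pick the kind of "writable" that matches the VMA. */
static uint64_t sk_maybe_mkwrite(uint64_t pte, unsigned long vm_flags)
{
	if (!(vm_flags & SK_VM_WRITE))
		return pte;
	if (vm_flags & SK_VM_SHADOW_STACK)
		return sk_pte_mkwrite_shstk(pte);
	return sk_pte_mkwrite(pte);
}
```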

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly
  2022-09-29 22:29 ` [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
@ 2022-10-03 18:24   ` Kees Cook
  2022-10-03 23:56   ` Kirill A . Shutemov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:24 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> With the introduction of shadow stack memory there are two ways a pte can
> be writable: regular writable memory and shadow stack memory.
> 
> In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
> or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
> where a PTE is made writable. However, there are places where pte_mkwrite()
> is called directly and the logic should now also create a shadow stack PTE
> in the case of a shadow stack VMA.
> 
>  - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
>    directly and call pte_mkwrite(), which is the same as maybe_mkwrite()
>    in logic and intention. Just change them to maybe_mkwrite().
> 
>  - When userfaultfd is creating a PTE after userspace handles the fault
>    it calls pte_mkwrite() directly. Teach it about pte_mkwrite_shstk()
> 
> In other cases where pte_mkwrite() is called directly, the VMA will not
> be VM_SHADOW_STACK, and so shadow stack memory should not be created.
>  - In the case of pte_savedwrite(), shadow stack VMA's are excluded.
>  - In the case of the "dirty_accountable" optimization in mprotect(),
>    shadow stack VMA's won't be VM_SHARED, so it is not necessary.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 13/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  2022-09-29 22:29 ` [PATCH v2 13/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
  2022-10-03 18:11   ` Kees Cook
@ 2022-10-03 18:24   ` Peter Xu
  1 sibling, 0 replies; 241+ messages in thread
From: Peter Xu @ 2022-10-03 18:24 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:10PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT (37), and make all
> VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to 38.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 18/39] mm: Add guard pages around a shadow stack.
  2022-09-29 22:29 ` [PATCH v2 18/39] mm: Add guard pages around a shadow stack Rick Edgecombe
@ 2022-10-03 18:30   ` Kees Cook
  2022-10-05  2:30     ` Andrew Cooper
  0 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:30 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:15PM -0700, Rick Edgecombe wrote:
> [...]
> +unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_flags & VM_GROWSDOWN)
> +		return stack_guard_gap;
> +
> +	/*
> +	 * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).
> +	 * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
> +	 * (~1KB for INCSSPD) and touches the first and the last element
> +	 * in the range, which triggers a page fault if the range is not
> +	 * in a shadow stack. Because of this, creating 4-KB guard pages
> +	 * around a shadow stack prevents these instructions from going
> +	 * beyond.
> +	 *
> +	 * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
> +	 * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
> +	 */

Thank you for the details on how the size choice is made here! :)
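
The arithmetic in the quoted comment can be checked with a few lines of C.
This is only a worked example of the numbers given there (a maximum count of
255, 8-byte slots for INCSSPQ, 4-byte slots for INCSSPD, one 4-KB guard
page); the sk_ names are made up for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SK_GUARD_SIZE	4096u	/* one 4-KB guard page */
#define SK_INCSSP_MAX	255u	/* largest count, per the quoted comment */

/* Largest distance in bytes that a single INCSSP can move the SSP. */
static uint32_t sk_incssp_max_stride(uint32_t slot_bytes)
{
	return SK_INCSSP_MAX * slot_bytes;
}

/* Nonzero if one guard page is wider than the largest single stride. */
static int sk_guard_covers(uint32_t slot_bytes)
{
	return sk_incssp_max_stride(slot_bytes) < SK_GUARD_SIZE;
}
```

Since the largest stride (2040 bytes) is smaller than the guard page, an
INCSSP that leaves the shadow stack has to touch the guard page and fault.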

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fef14ab3abcb..09458e77bf52 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2775,15 +2775,16 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
>  	return vma;
>  }
>  
> +unsigned long stack_guard_start_gap(struct vm_area_struct *vma);
> +
>  static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
>  {
> +	unsigned long gap = stack_guard_start_gap(vma);
>  	unsigned long vm_start = vma->vm_start;
>  
> -	if (vma->vm_flags & VM_GROWSDOWN) {
> -		vm_start -= stack_guard_gap;
> -		if (vm_start > vma->vm_start)
> -			vm_start = 0;
> -	}
> +	vm_start -= gap;
> +	if (vm_start > vma->vm_start)
> +		vm_start = 0;
>  	return vm_start;
>  }
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9d780f415be3..f0d2e9143bd0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -247,6 +247,13 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
>  	return origbrk;
>  }
>  

I feel like something could be done with these definitions to make them
inline, instead of __weak:

#ifndef stack_guard_start_gap
> +unsigned long __weak stack_guard_start_gap(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_flags & VM_GROWSDOWN)
> +		return stack_guard_gap;
> +	return 0;
> +}
#endif

And then move the x86 stack_guard_start_gap to a header?
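
A self-contained sketch of that header-override pattern follows. Everything
here is a stand-in: the sk_ names, flag values and the 4-KB shadow stack gap
are assumptions for illustration (the x86 return value is elided in the
quote above), and both "headers" are collapsed into one file.

```c
#include <assert.h>

/* Hypothetical stand-ins; values are illustrative only. */
#define SK_VM_GROWSDOWN		0x0100ul
#define SK_VM_SHADOW_STACK	0x2000ul
#define SK_PAGE_SIZE		4096ul

struct sk_vma {
	unsigned long vm_start;
	unsigned long vm_flags;
};

static unsigned long sk_stack_guard_gap = 256ul * SK_PAGE_SIZE;

/* --- what an arch header might provide --- */
static inline unsigned long sk_stack_guard_start_gap(const struct sk_vma *vma)
{
	if (vma->vm_flags & SK_VM_GROWSDOWN)
		return sk_stack_guard_gap;
	if (vma->vm_flags & SK_VM_SHADOW_STACK)
		return SK_PAGE_SIZE;	/* assumed: one guard page */
	return 0;
}
#define sk_stack_guard_start_gap sk_stack_guard_start_gap

/* --- generic header: fallback only if the arch did not define one --- */
#ifndef sk_stack_guard_start_gap
static inline unsigned long sk_stack_guard_start_gap(const struct sk_vma *vma)
{
	if (vma->vm_flags & SK_VM_GROWSDOWN)
		return sk_stack_guard_gap;
	return 0;
}
#endif

/* Same shape as the reworked vm_start_gap(): subtract, clamp on wrap. */
static inline unsigned long sk_vm_start_gap(const struct sk_vma *vma)
{
	unsigned long vm_start = vma->vm_start - sk_stack_guard_start_gap(vma);

	if (vm_start > vma->vm_start)	/* wrapped below zero */
		vm_start = 0;
	return vm_start;
}
```

With both definitions inline, the compiler can fold the gap lookup into
sk_vm_start_gap() instead of calling through a weak symbol.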

It's not exactly fast-path, but it feels a little weird. Regardless:

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 19/39] mm/mmap: Add shadow stack pages to memory accounting
  2022-09-29 22:29 ` [PATCH v2 19/39] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
@ 2022-10-03 18:31   ` Kees Cook
  2022-10-04  0:03   ` Kirill A . Shutemov
  1 sibling, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:31 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:16PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Account shadow stack pages to stack memory.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 00/39] Shadowstacks for userspace
  2022-10-03 17:04 ` [PATCH v2 00/39] Shadowstacks for userspace Kees Cook
  2022-10-03 17:25   ` Jann Horn
@ 2022-10-03 18:33   ` Edgecombe, Rick P
  2022-10-04  3:59     ` Kees Cook
  1 sibling, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 18:33 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	dave.hansen, kirill.shutemov, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2022-10-03 at 10:04 -0700, Kees Cook wrote:
> > Shadow stack signal format
> > --------------------------
> > So to handle alt shadow stacks we need to push some data onto a
> > stack. To 
> > prevent SROP we need to push something to the shadow stack that the
> > kernel can 
> > [...]
> > shadow stack return address or a shadow stack token. To make sure
> > it can’t be 
> > used, data is pushed with the high bit (bit 63) set. This bit is a
> > linear 
> > address bit in both the token format and a normal return address,
> > so it should 
> > not conflict with anything. It puts any return address in the
> > kernel half of 
> > the address space, so would never be created naturally by a
> > userspace program. 
> > It will not be a valid restore token either, as the kernel address
> > will never 
> > be pointing to the previous frame in the shadow stack.
> > 
> > When a signal hits, the format pushed to the stack that is handling
> > the signal 
> > is four 8 byte values (since we are 64 bit only):
> > > 1...old SSP|1...alt stack size|1...alt stack base|0|
> 
> Do these end up being non-canonical addresses? (To avoid confusion
> with
> "real" kernel addresses?)

Usually, but not necessarily with LAM. LAM cannot mask bit 63 though.
So hypothetically they could become "real" kernel addresses some day.
To keep them in the user half but still make sure they are not usable,
you would either have to encode the bits over a lot of entries which
would use extra space, or shrink the available address space, which
could cause compatibility problems.

Do you think it's an issue?
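
To make the frame format under discussion concrete, here is a small model
of the four 8-byte values with bit 63 set on the first three. It is a sketch
only: the struct, the sk_ function names and the slot order in memory are
assumptions, not the series' implementation.

```c
#include <assert.h>
#include <stdint.h>

#define SK_FRAME_MARK	(UINT64_C(1) << 63)

/* Four 8-byte values: old SSP, alt stack size, alt stack base, zero. */
struct sk_sigframe {
	uint64_t slot[4];
};

/* Tag each data word with bit 63 so it cannot pass as a return address. */
static struct sk_sigframe sk_frame_encode(uint64_t old_ssp, uint64_t alt_size,
					  uint64_t alt_base)
{
	struct sk_sigframe f = { {
		old_ssp | SK_FRAME_MARK,
		alt_size | SK_FRAME_MARK,
		alt_base | SK_FRAME_MARK,
		0,
	} };

	return f;
}

/* Return 0 and fill the outputs, or -1 if the frame is not well formed. */
static int sk_frame_decode(const struct sk_sigframe *f, uint64_t *old_ssp,
			   uint64_t *alt_size, uint64_t *alt_base)
{
	int i;

	for (i = 0; i < 3; i++)
		if (!(f->slot[i] & SK_FRAME_MARK))
			return -1;
	if (f->slot[3] != 0)
		return -1;

	*old_ssp = f->slot[0] & ~SK_FRAME_MARK;
	*alt_size = f->slot[1] & ~SK_FRAME_MARK;
	*alt_base = f->slot[2] & ~SK_FRAME_MARK;
	return 0;
}
```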

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-09-29 22:29 ` [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
  2022-09-30 19:16   ` Dave Hansen
@ 2022-10-03 18:39   ` Kees Cook
  2022-10-03 22:49     ` Andy Lutomirski
  1 sibling, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 18:39 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Thu, Sep 29, 2022 at 03:29:19PM -0700, Rick Edgecombe wrote:
> [...]
> Still allow FOLL_FORCE to write through shadow stack protections, as it
> does for read-only protections.

As I asked in the cover letter: why do we need to add this for shstk? It
was a mistake for general memory. :P

> [...]
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..56da98f3335c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1043,7 +1043,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  		return -EFAULT;
>  
>  	if (write) {
> -		if (!(vm_flags & VM_WRITE)) {
> +		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
>  			if (!(gup_flags & FOLL_FORCE))
>  				return -EFAULT;
>  			/*

How about this instead:

  		return -EFAULT;
  
 	if (write) {
+		if (vm_flags & VM_SHADOW_STACK)
+			return -EFAULT;
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
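
The difference between the two variants can be modelled in userspace as
below. This is a simplified sketch: the sk_ names and flag values are
illustrative assumptions, and the further checks that follow the FOLL_FORCE
test in the real check_vma_flags() are left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values for this sketch. */
#define SK_VM_WRITE		0x2ul
#define SK_VM_SHADOW_STACK	0x2000ul
#define SK_FOLL_WRITE		0x01u
#define SK_FOLL_FORCE		0x10u

/* As in the patch: FOLL_FORCE may still write to a shadow stack VMA. */
static int sk_check_patch(unsigned long vm_flags, unsigned int gup_flags)
{
	if (gup_flags & SK_FOLL_WRITE) {
		if (!(vm_flags & SK_VM_WRITE) ||
		    (vm_flags & SK_VM_SHADOW_STACK)) {
			if (!(gup_flags & SK_FOLL_FORCE))
				return -EFAULT;
		}
	}
	return 0;
}

/* As in the suggestion: write GUPs to shadow stack VMAs always fail. */
static int sk_check_suggested(unsigned long vm_flags, unsigned int gup_flags)
{
	if (gup_flags & SK_FOLL_WRITE) {
		if (vm_flags & SK_VM_SHADOW_STACK)
			return -EFAULT;
		if (!(vm_flags & SK_VM_WRITE)) {
			if (!(gup_flags & SK_FOLL_FORCE))
				return -EFAULT;
		}
	}
	return 0;
}
```

The two functions agree everywhere except for a forced write to a shadow
stack VMA, which only the first one lets through.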


-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-10-03 18:11   ` Nadav Amit
@ 2022-10-03 18:51     ` Dave Hansen
  2022-10-03 22:28     ` Edgecombe, Rick P
  1 sibling, 0 replies; 241+ messages in thread
From: Dave Hansen @ 2022-10-03 18:51 UTC (permalink / raw)
  To: Nadav Amit, Rick Edgecombe
  Cc: X86 ML, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	linux-doc, Linux MM, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, Mike Rapoport, jamorris,
	dethoma, Yu-cheng Yu

On 10/3/22 11:11, Nadav Amit wrote:
>> +#ifdef CONFIG_X86_SHADOW_STACK
>> +	/*
>> +	 * Avoid accidentally creating shadow stack PTEs
>> +	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
>> +	 * the hardware setting Dirty=1.
>> +	 */
>> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
>> +		pte_t old_pte, new_pte;
>> +
>> +		old_pte = READ_ONCE(*ptep);
>> +		do {
>> +			new_pte = pte_wrprotect(old_pte);
>> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
>> +
>> +		return;
>> +	}
>> +#endif
> There is no way of using IS_ENABLED() here instead of these ifdefs?

Actually, both the existing #ifdef and an IS_ENABLED() check would be
superfluous as-is.

Adding X86_FEATURE_SHSTK to disabled-features.h gives cpu_feature_enabled()
compile-time optimizations for free.  No need for *any* additional
CONFIG_* checks.

The only issue would be if the #ifdef'd code won't even compile with
X86_FEATURE_SHSTK disabled.
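
The compile-time effect being described can be illustrated with a small
stand-alone model. The sk_ names, the feature number and the config symbol
are invented for this sketch; it only mirrors the idea of a disabled-features
mask that turns the check into a constant when the option is off.

```c
#include <assert.h>

#define SK_FEATURE_SHSTK	3	/* hypothetical feature bit number */

/* Mimics a disabled-features header: bit is set when the option is off. */
#ifdef SK_CONFIG_SHADOW_STACK
# define SK_DISABLE_SHSTK	0u
#else
# define SK_DISABLE_SHSTK	(1u << SK_FEATURE_SHSTK)
#endif
#define SK_DISABLED_MASK	(SK_DISABLE_SHSTK)

/* Runtime state: pretend the CPU reports the feature. */
static unsigned int sk_boot_caps = 1u << SK_FEATURE_SHSTK;

static inline int sk_runtime_has(unsigned int bit)
{
	return (sk_boot_caps >> bit) & 1u;
}

/* Constant 0 when compiled out, otherwise a runtime test. */
#define sk_cpu_feature_enabled(bit) \
	((SK_DISABLED_MASK & (1u << (bit))) ? 0 : sk_runtime_has(bit))

/* No #ifdef needed: the branch is dead code when the option is off. */
static int sk_wrprotect_path(void)
{
	if (sk_cpu_feature_enabled(SK_FEATURE_SHSTK))
		return 1;	/* would take the cmpxchg path */
	return 0;		/* regular path */
}
```

Built without SK_CONFIG_SHADOW_STACK, the condition is a constant zero, so
the compiler can drop the guarded branch even though the CPU has the bit.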

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 23/39] x86: Introduce userspace API for CET enabling
  2022-09-29 22:29 ` [PATCH v2 23/39] x86: Introduce userspace API for CET enabling Rick Edgecombe
@ 2022-10-03 19:01   ` Kees Cook
  2022-10-03 22:51     ` Edgecombe, Rick P
  2022-10-10 10:56   ` Florian Weimer
  1 sibling, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 19:01 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Thu, Sep 29, 2022 at 03:29:20PM -0700, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Add three new arch_prctl() handles:
> 
>  - ARCH_CET_ENABLE/DISABLE enables or disables the specified
>    feature. Returns 0 on success or an error.
> 
>  - ARCH_CET_LOCK prevents future disabling or enabling of the
>    specified feature. Returns 0 on success or an error
> 
> The features are handled per-thread and inherited over fork(2)/clone(2),
> but reset on exec().
> 
> This is preparation patch. It does not impelement any features.

typo: "implement"

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> [tweaked with feedback from tglx]
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> 
> v2:
>  - Only allow one enable/disable per call (tglx)
>  - Return error code like a normal arch_prctl() (Alexander Potapenko)
>  - Make CET only (tglx)
> 
>  arch/x86/include/asm/cet.h        | 20 ++++++++++++++++
>  arch/x86/include/asm/processor.h  |  3 +++
>  arch/x86/include/uapi/asm/prctl.h |  6 +++++
>  arch/x86/kernel/process.c         |  4 ++++
>  arch/x86/kernel/process_64.c      |  5 +++-
>  arch/x86/kernel/shstk.c           | 38 +++++++++++++++++++++++++++++++
>  6 files changed, 75 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/include/asm/cet.h
>  create mode 100644 arch/x86/kernel/shstk.c
> 
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> new file mode 100644
> index 000000000000..0fa4dbc98c49
> --- /dev/null
> +++ b/arch/x86/include/asm/cet.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_CET_H
> +#define _ASM_X86_CET_H
> +
> +#ifndef __ASSEMBLY__
> +#include <linux/types.h>
> +
> +struct task_struct;
> +
> +#ifdef CONFIG_X86_SHADOW_STACK
> +long cet_prctl(struct task_struct *task, int option,
> +		      unsigned long features);
> +#else
> +static inline long cet_prctl(struct task_struct *task, int option,
> +		      unsigned long features) { return -EINVAL; }
> +#endif /* CONFIG_X86_SHADOW_STACK */
> +
> +#endif /* __ASSEMBLY__ */
> +
> +#endif /* _ASM_X86_CET_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 356308c73951..a92bf76edafe 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -530,6 +530,9 @@ struct thread_struct {
>  	 */
>  	u32			pkru;
>  
> +	unsigned long		features;
> +	unsigned long		features_locked;

Should these be wrapped in #ifdef CONFIG_X86_SHADOW_STACK (or
CONFIG_X86_CET) ?

Also, just named "features"? Is this expected to be more than CET?

> +
>  	/* Floating point and extended processor state */
>  	struct fpu		fpu;
>  	/*
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 500b96e71f18..028158e35269 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -20,4 +20,10 @@
>  #define ARCH_MAP_VDSO_32		0x2002
>  #define ARCH_MAP_VDSO_64		0x2003
>  
> +/* Don't use 0x3001-0x3004 because of old glibcs */
> +
> +#define ARCH_CET_ENABLE			0x4001
> +#define ARCH_CET_DISABLE		0x4002
> +#define ARCH_CET_LOCK			0x4003
> +
>  #endif /* _ASM_X86_PRCTL_H */
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 58a6ea472db9..034880311e6b 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -367,6 +367,10 @@ void arch_setup_new_exec(void)
>  		task_clear_spec_ssb_noexec(current);
>  		speculation_ctrl_update(read_thread_flags());
>  	}
> +
> +	/* Reset thread features on exec */
> +	current->thread.features = 0;
> +	current->thread.features_locked = 0;

Same ifdef question here.

>  }
>  
>  #ifdef CONFIG_X86_IOPL_IOPERM
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 1962008fe743..8fa2c2b7de65 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -829,7 +829,10 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
>  	case ARCH_MAP_VDSO_64:
>  		return prctl_map_vdso(&vdso_image_64, arg2);
>  #endif
> -
> +	case ARCH_CET_ENABLE:
> +	case ARCH_CET_DISABLE:
> +	case ARCH_CET_LOCK:
> +		return cet_prctl(task, option, arg2);
>  	default:
>  		ret = -EINVAL;
>  		break;

I remain annoyed that prctl interfaces didn't use -ENOTSUP for "unknown
option". :P

> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> new file mode 100644
> index 000000000000..e3276ac9e9b9
> --- /dev/null
> +++ b/arch/x86/kernel/shstk.c

I think the Makefile addition should be moved from "x86/cet/shstk:
Add user-mode shadow stack support" to here, yes? Otherwise, there is a
bisectability randconfig-with-CONFIG_X86_SHADOW_STACK risk here (nothing
will implement "cet_prctl").

> @@ -0,0 +1,38 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * shstk.c - Intel shadow stack support
> + *
> + * Copyright (c) 2021, Intel Corporation.
> + * Yu-cheng Yu <yu-cheng.yu@intel.com>
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/bitops.h>
> +#include <asm/prctl.h>
> +
> +long cet_prctl(struct task_struct *task, int option, unsigned long features)
> +{
> +	if (option == ARCH_CET_LOCK) {
> +		task->thread.features_locked |= features;
> +		return 0;
> +	}
> +
> +	/* Don't allow via ptrace */
> +	if (task != current)
> +		return -EINVAL;

... but locking _is_ allowed via ptrace? If that is intended, it should be
explicitly mentioned in the commit log and in a comment here.

Also, perhaps -ESRCH ?

> +
> +	/* Do not allow to change locked features */
> +	if (features & task->thread.features_locked)
> +		return -EPERM;
> +
> +	/* Only support enabling/disabling one feature at a time. */
> +	if (hweight_long(features) > 1)
> +		return -EINVAL;

Perhaps -E2BIG ?

> +	if (option == ARCH_CET_DISABLE) {
> +		return -EINVAL;
> +	}
> +
> +	/* Handle ARCH_CET_ENABLE */
> +	return -EINVAL;
> +}
> -- 
> 2.17.1
> 

-- 
Kees Cook

* Re: [OPTIONAL/RFC v2 36/39] x86/fpu: Add helper for initing features
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 36/39] x86/fpu: Add helper for initing features Rick Edgecombe
@ 2022-10-03 19:07   ` Chang S. Bae
  2022-10-04 23:05     ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Chang S. Bae @ 2022-10-03 19:07 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On 9/29/2022 3:29 PM, Rick Edgecombe wrote:
> If an xfeature is saved in a buffer, the xfeature's bit will be set in
> xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
> is in it's init state. In this case the xfeature buffer address cannot
> be retrieved with get_xsave_addr().
> 
> Future patches will need to handle the case of writing to an xfeature
> that may not be saved. So provide helpers to init an xfeature in an
> xsave buffer.
> 
> This could of course be done directly by reaching into the xsave buffer,
> however this would not be robust against future changes to optimize the
> xsave buffer by compacting it. In that case the xsave buffer would need
> to be re-arranged as well. So the logic properly belongs encapsulated
> in a helper where the logic can be unified.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> 
> v2:
>   - New patch
> 
>   arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++-------
>   arch/x86/kernel/fpu/xstate.h |  6 ++++
>   2 files changed, 53 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 9258fc1169cc..82cee1f2f0c8 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -942,6 +942,24 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
>   	return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
>   }
>   
> +static int xsave_buffer_access_checks(int xfeature_nr)
> +{
> +	/*
> +	 * Do we even *have* xsave state?
> +	 */
> +	if (!boot_cpu_has(X86_FEATURE_XSAVE))
> +		return 1;
> +
> +	/*
> +	 * We should not ever be requesting features that we
> +	 * have not enabled.
> +	 */
> +	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
> +		return 1;
> +
> +	return 0;
> +}
> +
>   /*
>    * Given the xsave area and a state inside, this function returns the
>    * address of the state.
> @@ -962,17 +980,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
>    */
>   void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
>   {
> -	/*
> -	 * Do we even *have* xsave state?
> -	 */
> -	if (!boot_cpu_has(X86_FEATURE_XSAVE))
> -		return NULL;
> -
> -	/*
> -	 * We should not ever be requesting features that we
> -	 * have not enabled.
> -	 */
> -	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
> +	if (xsave_buffer_access_checks(xfeature_nr))
>   		return NULL;
>   
>   	/*
> @@ -992,6 +1000,34 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
>   	return __raw_xsave_addr(xsave, xfeature_nr);
>   }
>   
> +/*
> + * Given the xsave area and a state inside, this function
> + * initializes an xfeature in the buffer.

But, this function sets XSTATE_BV bits in the buffer. That does not 
*initialize* the state, right?

> + *
> + * get_xsave_addr() will return NULL if the feature bit is
> + * not present in the header. This function will make it so
> + * the xfeature buffer address is ready to be retrieved by
> + * get_xsave_addr().

Looks like this is used in the next patch to help ptracer().

We have the state copy function -- copy_uabi_to_xstate() that retrieves 
the address using __raw_xsave_addr() instead of get_xsave_addr(), copies 
the state, and then updates XSTATE_BV.

__raw_xsave_addr() also ensures whether the state is in the compacted 
format or not. I think you can use it.

Also, I'm curious about the reason why you want to update XSTATE_BV 
first with this new helper.

Overall, I'm not sure these new helpers are necessary.
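
To make the ordering concrete, here is a toy userspace model of the point above. It is only a sketch: none of the toy_* names are kernel APIs, the layout ignores compaction entirely, and the feature number is assumed to be in range.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_FEATURES 4
#define TOY_STATE_SIZE  16

/* Simplified stand-in for an xsave buffer: a bitmap plus fixed-size slots. */
struct toy_xsave {
	uint64_t xstate_bv;	/* models xsave->header.xfeatures */
	uint8_t  state[TOY_NR_FEATURES][TOY_STATE_SIZE];
};

/* Models __raw_xsave_addr(): the slot address, regardless of xstate_bv. */
void *toy_raw_addr(struct toy_xsave *x, int nr)
{
	return x->state[nr];
}

/* Models get_xsave_addr(): NULL while the feature bit is clear. */
void *toy_get_addr(struct toy_xsave *x, int nr)
{
	if (!(x->xstate_bv & (1ULL << nr)))
		return NULL;
	return toy_raw_addr(x, nr);
}

/* Copy the state first, then mark the feature as present. */
void toy_write_feature(struct toy_xsave *x, int nr, const void *src)
{
	memcpy(toy_raw_addr(x, nr), src, TOY_STATE_SIZE);
	x->xstate_bv |= 1ULL << nr;
}
```

In this model a writer never needs the bit to be set up front: it takes the raw address, copies, and only then updates the bitmap, so a reader using toy_get_addr() sees either nothing or fully written state.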

Thanks,
Chang

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-09-30  3:41   ` Bagas Sanjaya
  2022-09-30 13:33     ` Jonathan Corbet
@ 2022-10-03 19:35     ` John Hubbard
  2022-10-03 19:39       ` Dave Hansen
  2022-10-04  2:13       ` Bagas Sanjaya
  1 sibling, 2 replies; 241+ messages in thread
From: John Hubbard @ 2022-10-03 19:35 UTC (permalink / raw)
  To: Bagas Sanjaya, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On 9/29/22 20:41, Bagas Sanjaya wrote:
...
> The documentation above can be improved (both grammar and formatting):
> 
> ---- >8 ----
> 
> diff --git a/Documentation/x86/cet.rst b/Documentation/x86/cet.rst
> index 6b270a24ebc3a2..f691f7995cf088 100644
> --- a/Documentation/x86/cet.rst
> +++ b/Documentation/x86/cet.rst
> @@ -15,92 +15,101 @@ in the 64-bit kernel.
>   
>   CET introduces Shadow Stack and Indirect Branch Tracking. Shadow stack is
>   a secondary stack allocated from memory and cannot be directly modified by
> -applications. When executing a CALL instruction, the processor pushes the
> +applications. When executing a ``CALL`` instruction, the processor pushes the

It's always a judgment call, as to whether to use something like ``CALL``
or just plain CALL. Here, I'd like to opine that the benefits of
``CALL`` are very small, whereas plain text in cet.rst has been made
significantly worse. So the result is, "this is not worth it".

The same is true of pretty much all of the other literalizing changes
below, IMHO.

Just so you have some additional input on this. I tend to spend time
fussing a lot (too much, yes) over readability issues, so this jumps
right out at me. :)

thanks,
-- 
John Hubbard
NVIDIA

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-03 19:35     ` John Hubbard
@ 2022-10-03 19:39       ` Dave Hansen
  2022-10-04  2:13       ` Bagas Sanjaya
  1 sibling, 0 replies; 241+ messages in thread
From: Dave Hansen @ 2022-10-03 19:39 UTC (permalink / raw)
  To: John Hubbard, Bagas Sanjaya, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On 10/3/22 12:35, John Hubbard wrote:
> It's always a judgment call, as to whether to use something like ``CALL``
> or just plain CALL. Here, I'd like to opine that the benefits of
> ``CALL`` are very small, whereas plain text in cet.rst has been made
> significantly worse. So the result is, "this is not worth it".

I'm definitely in this camp as well.  Unless the markup *really* adds to
readability, just leave it alone.

* Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-09-29 22:28 ` [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack Rick Edgecombe
  2022-10-03 13:40   ` Kirill A . Shutemov
  2022-10-03 17:25   ` Kees Cook
@ 2022-10-03 19:42   ` Dave Hansen
  2022-10-03 19:50     ` Edgecombe, Rick P
  2022-10-12 20:04   ` Borislav Petkov
  3 siblings, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2022-10-03 19:42 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: Yu-cheng Yu

On 9/29/22 15:28, Rick Edgecombe wrote:
> +config X86_SHADOW_STACK
> +	prompt "X86 Shadow Stack"
> +	def_bool n
> +	depends on ARCH_HAS_SHADOW_STACK
> +	select ARCH_USES_HIGH_VMA_FLAGS
> +	help
> +	  Shadow Stack protection is a hardware feature that detects function
> +	  return address corruption. Today the kernel's support is limited to
> +	  virtualizing it in KVM guests.
> +

Is this help text up to date?  It seems a bit at odds with the series title.

* Re: [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-09-29 22:29 ` [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support Rick Edgecombe
@ 2022-10-03 19:43   ` Kees Cook
  2022-10-03 20:04     ` Dave Hansen
  2022-10-20 21:29     ` Edgecombe, Rick P
  0 siblings, 2 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 19:43 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:21PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Introduce basic shadow stack enabling/disabling/allocation routines.
> A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
> and has a fixed size of min(RLIMIT_STACK, 4GB).
> 
> Keep the task's shadow stack address and size in thread_struct. This will
> be copied when cloning new threads, but needs to be cleared during exec,
> so add a function to do this.
> 
> Do not support IA32 emulation.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> 
> ---
> 
> v2:
>  - Get rid of unnessary shstk->base checks
>  - Don't support IA32 emulation
> 
> v1:
>  - Switch to xsave helpers.
>  - Expand commit log.
> 
> Yu-cheng v30:
>  - Remove superfluous comments for struct thread_shstk.
>  - Replace 'populate' with 'unused'.
> 
> Yu-cheng v28:
>  - Update shstk_setup() with wrmsrl_safe(), returns success when shadow
>    stack feature is not present (since this is a setup function).
> 
>  arch/x86/include/asm/cet.h        |  13 +++
>  arch/x86/include/asm/msr.h        |  11 +++
>  arch/x86/include/asm/processor.h  |   5 ++
>  arch/x86/include/uapi/asm/prctl.h |   2 +
>  arch/x86/kernel/Makefile          |   2 +
>  arch/x86/kernel/process_64.c      |   2 +
>  arch/x86/kernel/shstk.c           | 143 ++++++++++++++++++++++++++++++
>  7 files changed, 178 insertions(+)
> 
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index 0fa4dbc98c49..a4a1f4c0089b 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -7,12 +7,25 @@
>  
>  struct task_struct;
>  
> +struct thread_shstk {
> +	u64	base;
> +	u64	size;
> +};
> +
>  #ifdef CONFIG_X86_SHADOW_STACK
>  long cet_prctl(struct task_struct *task, int option,
>  		      unsigned long features);
> +int shstk_setup(void);
> +void shstk_free(struct task_struct *p);
> +int shstk_disable(void);
> +void reset_thread_shstk(void);
>  #else
>  static inline long cet_prctl(struct task_struct *task, int option,
>  		      unsigned long features) { return -EINVAL; }
> +static inline int shstk_setup(void) { return -EOPNOTSUPP; }
> +static inline void shstk_free(struct task_struct *p) {}
> +static inline int shstk_disable(void) { return -EOPNOTSUPP; }
> +static inline void reset_thread_shstk(void) {}
>  #endif /* CONFIG_X86_SHADOW_STACK */

shstk_setup() and shstk_disable() are not called outside of shstk.c, so
they can be removed from this header entirely.

>  
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..a9cb4c434e60 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -310,6 +310,17 @@ void msrs_free(struct msr *msrs);
>  int msr_set_bit(u32 msr, u8 bit);
>  int msr_clear_bit(u32 msr, u8 bit);
>  
> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> +{
> +	u64 val, new_val;
> +
> +	rdmsrl(msr, val);
> +	new_val = (val & ~clear) | set;
> +
> +	if (new_val != val)
> +		wrmsrl(msr, new_val);
> +}

I always get uncomfortable when I see these kinds of generalized helper
functions for touching cpu bits, etc. It just begs for future attacker
abuse to muck with arbitrary bits -- even marked inline there is a risk
the compiler will ignore that in some circumstances (not as currently
used in the code, but I'm imagining future changes leading to such a
condition). Will you humor me and change this to a macro instead? That'll
force it always inline (even __always_inline isn't always inline):

/* Helper that can never get accidentally un-inlined. */
#define set_clr_bits_msrl(msr, set, clear)	do {	\
	u64 __val, __new_val;				\
							\
	rdmsrl(msr, __val);				\
	__new_val = (__val & ~(clear)) | (set);		\
							\
	if (__new_val != __val)				\
		wrmsrl(msr, __new_val);			\
} while (0)
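
For anyone who wants to poke at the set/clear logic outside the kernel, a userspace sketch of the same shape follows. The toy_msr variable, the write counter and the toy_* accessors are made up for illustration and stand in for rdmsrl()/wrmsrl(); this is not kernel code.

```c
#include <stdint.h>

/* Fake MSR backing store and a write counter, for illustration only. */
static uint64_t toy_msr;
static unsigned int toy_msr_writes;

#define toy_rdmsrl(val)	((val) = toy_msr)
#define toy_wrmsrl(val)	do { toy_msr = (val); toy_msr_writes++; } while (0)

/*
 * Same shape as the macro above, with the MSR accessors swapped for the
 * toy ones. The cast widens 'clear' before it is inverted, so a 32-bit
 * argument cannot wipe the upper half of the value.
 */
#define toy_set_clr_bits(set, clear)	do {			\
	uint64_t val_, new_val_;				\
								\
	toy_rdmsrl(val_);					\
	new_val_ = (val_ & ~(uint64_t)(clear)) | (set);	\
								\
	if (new_val_ != val_)					\
		toy_wrmsrl(new_val_);				\
} while (0)
```

One difference from the inline function worth keeping in mind: the function converts its arguments to u64 at the call boundary, while a macro sees whatever type the caller passes, which is why the sketch widens 'clear' explicitly. The write counter makes it easy to check that a no-op update skips the write.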


> +
>  #ifdef CONFIG_SMP
>  int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
>  int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index a92bf76edafe..3a0c9d9d4d1d 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -27,6 +27,7 @@ struct vm86;
>  #include <asm/unwind_hints.h>
>  #include <asm/vmxfeatures.h>
>  #include <asm/vdso/processor.h>
> +#include <asm/cet.h>
>  
>  #include <linux/personality.h>
>  #include <linux/cache.h>
> @@ -533,6 +534,10 @@ struct thread_struct {
>  	unsigned long		features;
>  	unsigned long		features_locked;
>  
> +#ifdef CONFIG_X86_SHADOW_STACK
> +	struct thread_shstk	shstk;
> +#endif
> +
>  	/* Floating point and extended processor state */
>  	struct fpu		fpu;
>  	/*
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 028158e35269..41af3a8c4fa4 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -26,4 +26,6 @@
>  #define ARCH_CET_DISABLE		0x4002
>  #define ARCH_CET_LOCK			0x4003
>  
For readability, maybe add: /* ARCH_CET_* "features" bits */

> +#define CET_SHSTK			0x1

This is UAPI, so the BIT() macro isn't available, but since this is
unsigned long, please use the form:  (1ULL <<  0)  etc...

> +
>  #endif /* _ASM_X86_PRCTL_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index a20a5ebfacd7..8950d1f71226 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -139,6 +139,8 @@ obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
>  
>  obj-$(CONFIG_AMD_MEM_ENCRYPT)		+= sev.o
>  
> +obj-$(CONFIG_X86_SHADOW_STACK)		+= shstk.o
> +
>  ###
>  # 64 bit specific files
>  ifeq ($(CONFIG_X86_64),y)
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 8fa2c2b7de65..be544b4b4c8b 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -514,6 +514,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
>  		load_gs_index(__USER_DS);
>  	}
>  
> +	reset_thread_shstk();
> +
>  	loadsegment(fs, 0);
>  	loadsegment(es, _ds);
>  	loadsegment(ds, _ds);
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index e3276ac9e9b9..a0b8d4adb2bf 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -8,8 +8,151 @@
>  
>  #include <linux/sched.h>
>  #include <linux/bitops.h>
> +#include <linux/types.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched/signal.h>
> +#include <linux/compat.h>
> +#include <linux/sizes.h>
> +#include <linux/user.h>
> +#include <asm/msr.h>
> +#include <asm/fpu/xstate.h>
> +#include <asm/fpu/types.h>
> +#include <asm/cet.h>
> +#include <asm/special_insns.h>
> +#include <asm/fpu/api.h>
>  #include <asm/prctl.h>
>  
> +static bool feature_enabled(unsigned long features)
> +{
> +	return current->thread.features & features;
> +}
> +
> +static void feature_set(unsigned long features)
> +{
> +	current->thread.features |= features;
> +}
> +
> +static void feature_clr(unsigned long features)
> +{
> +	current->thread.features &= ~features;
> +}

"feature" vs "features" here is confusing. Should these helpers enforce
the single-bit-set requirements? If so, please switch to a bit number
instead of a mask. If not, please rename these to
"features_{enabled,set,clr}", and fix "features_enabled" to check them
all:
	return (current->thread.features & features) == features;
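
The difference between the two checks only shows up when more than one bit is passed. A small standalone sketch (the toy_* names are hypothetical, and the feature word is passed in explicitly instead of being read from current->thread):

```c
#include <stdbool.h>

/* True if any bit of 'mask' is set in 'features' (what the quoted helper does). */
bool toy_any_enabled(unsigned long features, unsigned long mask)
{
	return features & mask;
}

/* True only if every bit of 'mask' is set in 'features'. */
bool toy_all_enabled(unsigned long features, unsigned long mask)
{
	return (features & mask) == mask;
}
```

With features = 0x1 and mask = 0x3, the first returns true and the second returns false. For a single-bit mask the two are equivalent.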

> +static unsigned long alloc_shstk(unsigned long size)
> +{
> +	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
> +	struct mm_struct *mm = current->mm;
> +	unsigned long addr, unused;

WARN_ON + clamp on "size" here, or perhaps move the bounds check from
shstk_setup() into here?
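
As a rough idea of what a bounds check living next to the allocation could compute, here is a userspace sketch of the min(RLIMIT_STACK, 4GB) sizing from the commit log. The toy_* names are hypothetical, and the 4096-byte page size is an assumption (the x86-64 base page size).

```c
#include <stdint.h>

#define TOY_PAGE_SIZE	4096ULL		/* assumed base page size */
#define TOY_SZ_4G	(4ULL << 30)

/* Round up to a multiple of the page size, like PAGE_ALIGN(). */
static uint64_t toy_page_align(uint64_t x)
{
	return (x + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
}

/* Clamp the stack rlimit to 4GB, then page-align the result. */
uint64_t toy_shstk_size(uint64_t stack_rlimit)
{
	uint64_t size = stack_rlimit < TOY_SZ_4G ? stack_rlimit : TOY_SZ_4G;

	return toy_page_align(size);
}
```

Clamping before aligning keeps the addition in toy_page_align() from overflowing when the rlimit is unlimited (all bits set).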

> +
> +	mmap_write_lock(mm);
> +	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
> +		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);

This will use the mmap base address offset randomization, I guess?

> +
> +	mmap_write_unlock(mm);
> +
> +	return addr;
> +}
> +
> +static void unmap_shadow_stack(u64 base, u64 size)
> +{
> +	while (1) {
> +		int r;
> +
> +		r = vm_munmap(base, size);
> +
> +		/*
> +		 * vm_munmap() returns -EINTR when mmap_lock is held by
> +		 * something else, and that lock should not be held for a
> +		 * long time.  Retry it for the case.
> +		 */
> +		if (r == -EINTR) {
> +			cond_resched();
> +			continue;
> +		}
> +
> +		/*
> +		 * For all other types of vm_munmap() failure, either the
> +		 * system is out of memory or there is bug.
> +		 */
> +		WARN_ON_ONCE(r);
> +		break;
> +	}
> +}
> +
> +int shstk_setup(void)

Only called local. Make static?

> +{
> +	struct thread_shstk *shstk = &current->thread.shstk;
> +	unsigned long addr, size;
> +
> +	/* Already enabled */
> +	if (feature_enabled(CET_SHSTK))
> +		return 0;
> +
> +	/* Also not supported for 32 bit */
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) || in_ia32_syscall())
> +		return -EOPNOTSUPP;
> +
> +	size = PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
> +	addr = alloc_shstk(size);
> +	if (IS_ERR_VALUE(addr))
> +		return PTR_ERR((void *)addr);
> +
> +	fpu_lock_and_load();
> +	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
> +	wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
> +	fpregs_unlock();
> +
> +	shstk->base = addr;
> +	shstk->size = size;
> +	feature_set(CET_SHSTK);
> +
> +	return 0;
> +}
> +
> +void reset_thread_shstk(void)
> +{
> +	memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
> +	current->thread.features = 0;
> +	current->thread.features_locked = 0;
> +}

If features is always going to be tied to shstk, why not put them in the
shstk struct?

Also, shouldn't this also be called from arch_setup_new_exec() instead
of the open-coded wipe of features there?

> +
> +void shstk_free(struct task_struct *tsk)
> +{
> +	struct thread_shstk *shstk = &tsk->thread.shstk;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> +	    !feature_enabled(CET_SHSTK))
> +		return;
> +
> +	if (!tsk->mm)
> +		return;
> +
> +	unmap_shadow_stack(shstk->base, shstk->size);

I feel like base and size should be zeroed here?

> +}
> +
> +int shstk_disable(void)

This is only called locally. static?

> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return -EOPNOTSUPP;
> +
> +	/* Already disabled? */
> +	if (!feature_enabled(CET_SHSTK))
> +		return 0;
> +
> +	fpu_lock_and_load();
> +	/* Disable WRSS too when disabling shadow stack */
> +	set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN);
> +	wrmsrl(MSR_IA32_PL3_SSP, 0);
> +	fpregs_unlock();
> +
> +	shstk_free(current);
> +	feature_clr(CET_SHSTK);
> +
> +	return 0;
> +}
> +
>  long cet_prctl(struct task_struct *task, int option, unsigned long features)
>  {
>  	if (option == ARCH_CET_LOCK) {
> -- 
> 2.17.1
> 

-- 
Kees Cook

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-03 17:18   ` Kees Cook
@ 2022-10-03 19:46     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 19:46 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 10:18 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:28:58PM -0700, Rick Edgecombe wrote:
> > [...]
> > +Overview
> > +========
> > +
> > +Control-flow Enforcement Technology (CET) is term referring to
> > several
> > +related x86 processor features that provides protection against
> > control
> > +flow hijacking attacks. The HW feature itself can be set up to
> > protect
> > +both applications and the kernel. Only user-mode protection is
> > implemented
> > +in the 64-bit kernel.
> 
> This likely needs rewording, since it's not strictly true any more:
> IBT is supported in kernel-mode now (CONFIG_X86_IBT).

Yep, thanks.

> 
> > +CET introduces Shadow Stack and Indirect Branch Tracking. Shadow
> > stack is
> > +a secondary stack allocated from memory and cannot be directly
> > modified by
> > +applications. When executing a CALL instruction, the processor
> > pushes the
> > +return address to both the normal stack and the shadow stack. Upon
> > +function return, the processor pops the shadow stack copy and
> > compares it
> > +to the normal stack copy. If the two differ, the processor raises
> > a
> > +control-protection fault. Indirect branch tracking verifies
> > indirect
> > +CALL/JMP targets are intended as marked by the compiler with
> > 'ENDBR'
> > +opcodes. Not all CPU's have both Shadow Stack and Indirect Branch
> > Tracking
> > +and only Shadow Stack is currently supported in the kernel.
> > +
> > +The Kconfig options is X86_SHADOW_STACK, and it can be disabled
> > with
> > +the kernel parameter clearcpuid, like this: "clearcpuid=shstk".
> > +
> > +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or LLVM
> > v10.0.1
> > +or later are required. To build a CET-enabled application, GLIBC
> > v2.28 or
> > +later is also required.
> > +
> > +At run time, /proc/cpuinfo shows CET features if the processor
> > supports
> > +CET.
> 
> Maybe call them out by name: shstk ibt

Ok.

> 
> > +CET arch_prctl()'s
> > +==================
> > +
> > +Elf features should be enabled by the loader using the below
> > arch_prctl's.
> > +
> > +arch_prctl(ARCH_CET_ENABLE, unsigned int feature)
> > +    Enable a single feature specified in 'feature'. Can only
> > operate on
> > +    one feature at a time.
> 
> Does this mean only 1 bit out of the 32 may be specified?

Yes, exactly.

> 
> > +
> > +arch_prctl(ARCH_CET_DISABLE, unsigned int feature)
> > +    Disable features specified in 'feature'. Can only operate on
> > +    one feature at a time.
> > +
> > +arch_prctl(ARCH_CET_LOCK, unsigned int features)
> > +    Lock in features at their current enabled or disabled status.
> 
> How is the "features" argument processed here?

Yes, this should have more info. The kernel keeps a mask of features
that are "locked". The mask is ORed with the existing value. So any
bits set here cannot be enabled or disabled afterwards. Bits unset in
the mask passed are ignored.
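
Roughly, the semantics described above can be modeled in userspace like this. It is a sketch only: the struct and the toy_* functions are invented for illustration, and the enable/disable path shows the documented intent, not the code in this series.

```c
#include <errno.h>

/* Hypothetical per-thread state: what is enabled and what is locked. */
struct toy_cet_state {
	unsigned long enabled;
	unsigned long locked;
};

/* Lock: OR the passed mask into the locked mask; unset bits change nothing. */
int toy_cet_lock(struct toy_cet_state *s, unsigned long features)
{
	s->locked |= features;
	return 0;
}

/* Enable or disable one feature, refusing if that feature is locked. */
int toy_cet_set(struct toy_cet_state *s, unsigned long feature, int enable)
{
	if (feature & s->locked)
		return -EPERM;

	if (enable)
		s->enabled |= feature;
	else
		s->enabled &= ~feature;

	return 0;
}
```

Locking a feature freezes it in whatever state it is in at that moment, enabled or disabled, and later lock calls can only add bits to the locked mask.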

> 
> > [...]
> > +Proc status
> > +===========
> > +To check if an application is actually running with shadow stack,
> > the
> > +user can read the /proc/$PID/arch_status. It will report "wrss" or
> > +"shstk" depending on what is enabled.
> 
> TIL about "arch_status". :) Why is this a separate file? "status"
> already has unique field names.

It looks like "status" only has arch-agnostic feature status today.
Maybe that's the reason? CET seems to fit there though.

> 
> > +Fork
> > +----
> > +
> > +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are
> > required
> > +to be read-only and dirty. When a shadow stack PTE is not RO and
> > dirty, a
> > +shadow access triggers a page fault with the shadow stack access
> > bit set
> > +in the page fault error code.
> > +
> > +When a task forks a child, its shadow stack PTEs are copied and
> > both the
> > +parent's and the child's shadow stack PTEs are cleared of the
> > dirty bit.
> > +Upon the next shadow stack access, the resulting shadow stack page
> > fault
> > +is handled by page copy/re-use.
> > +
> > +When a pthread child is created, the kernel allocates a new shadow
> > stack
> > +for the new thread.
> 
> Perhaps speak to the ASLR characteristics of the shstk here?

It behaves just like mmap(). I can add some info.

> 
> Also, it seems if there is a "Fork" section, there should be an
> "Exec"
> section? I suspect it would be short: shstk is disabled when execve()
> is
> called and must be re-enabled from userspace, yes?

Sure, I can add some info.

* Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-10-03 19:42   ` Dave Hansen
@ 2022-10-03 19:50     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 19:50 UTC (permalink / raw)
  To: Shankar, Ravi V, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, keescook, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, rppt, mingo, Hansen, Dave,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Mon, 2022-10-03 at 12:42 -0700, Dave Hansen wrote:
> On 9/29/22 15:28, Rick Edgecombe wrote:
> > +config X86_SHADOW_STACK
> > +     prompt "X86 Shadow Stack"
> > +     def_bool n
> > +     depends on ARCH_HAS_SHADOW_STACK
> > +     select ARCH_USES_HIGH_VMA_FLAGS
> > +     help
> > +       Shadow Stack protection is a hardware feature that detects
> > function
> > +       return address corruption. Today the kernel's support is
> > limited to
> > +       virtualizing it in KVM guests.
> > +
> 
> Is this help text up to date?  It seems a bit at odds with the series
> title.

Arg, yes. This patch got screwed up when I converted it back and forth
for the KVM series.

* Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-10-03 17:25   ` Kees Cook
@ 2022-10-03 19:52     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 19:52 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 10:25 -0700, Kees Cook wrote:
> > +config X86_SHADOW_STACK
> > +     prompt "X86 Shadow Stack"
> > +     def_bool n
> 
> I hope we can switch this to "default y" soon, given it's a hardware
> feature that is disabled at runtime when not available.

Hmm, yes. Not sure on this. I'm inclined to leave it as is for now.

> 
> > +     depends on ARCH_HAS_SHADOW_STACK
> 
> Doesn't this depend on AS_WRUSS too?

Yes, this got messed up when this patch went to and from the CET KVM
series.

Thanks!

* Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-10-03 13:40   ` Kirill A . Shutemov
@ 2022-10-03 19:53     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 19:53 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2022-10-03 at 16:40 +0300, Kirill A . Shutemov wrote:
> Hm. Shouldn't ARCH_HAS_SHADOW_STACK definition be in arch/Kconfig,
> not
> under arch/x86?
> 
> Also, I think "def_bool n" has the same meaning as just "bool", no?
> 
> > +
> > +config X86_SHADOW_STACK
> > +     prompt "X86 Shadow Stack"
> > +     def_bool n
> 
> Maybe just
> 
>         bool "X86 Shadow Stack"
> 
> ?
Yep, will change it. Thanks.

* Re: [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-10-03 19:43   ` Kees Cook
@ 2022-10-03 20:04     ` Dave Hansen
  2022-10-04  4:04       ` Kees Cook
  2022-10-04 10:17       ` David Laight
  2022-10-20 21:29     ` Edgecombe, Rick P
  1 sibling, 2 replies; 241+ messages in thread
From: Dave Hansen @ 2022-10-03 20:04 UTC (permalink / raw)
  To: Kees Cook, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On 10/3/22 12:43, Kees Cook wrote:
>> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
>> +{
>> +	u64 val, new_val;
>> +
>> +	rdmsrl(msr, val);
>> +	new_val = (val & ~clear) | set;
>> +
>> +	if (new_val != val)
>> +		wrmsrl(msr, new_val);
>> +}
> I always get uncomfortable when I see these kinds of generalized helper
> functions for touching cpu bits, etc. It just begs for future attacker
> abuse to muck with arbitrary bits -- even marked inline there is a risk
> the compiler will ignore that in some circumstances (not as currently
> used in the code, but I'm imagining future changes leading to such a
> condition). Will you humor me and change this to a macro instead? That'll
> force it always inline (even __always_inline isn't always inline):

Oh, are you thinking that this is dangerous because it's so surgical and
non-intrusive?  It's even more powerful to an attacker than, say
wrmsrl(), because there they actually have to know what the existing
value is to update it.  With this helper, it's quite easy to flip an
individual bit without disturbing the neighboring bits.

Is that it?

I don't _like_ the #defines, but doing one here doesn't seem too onerous
considering how critical MSRs are.
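
For reference, the read-modify-write behaviour under discussion can be modelled in userspace. This is only a minimal sketch, assuming a simulated register in place of rdmsrl()/wrmsrl(); `fake_msr`, `fake_msr_writes`, and the `SET_CLR_BITS_FAKE_MSR` macro name are hypothetical and do not come from the patch or from the macro proposed earlier in the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for one MSR and a count of writes to it. */
static uint64_t fake_msr;
static unsigned int fake_msr_writes;

static uint64_t fake_rdmsr(void)
{
	return fake_msr;
}

static void fake_wrmsr(uint64_t val)
{
	fake_msr = val;
	fake_msr_writes++;
}

/*
 * Macro form of the quoted helper: clear the 'clear' bits, set the 'set'
 * bits, and skip the write when the value would not change.
 */
#define SET_CLR_BITS_FAKE_MSR(set, clear)				\
do {									\
	uint64_t old_ = fake_rdmsr();					\
	uint64_t new_ = (old_ & ~(uint64_t)(clear)) | (uint64_t)(set);	\
									\
	if (new_ != old_)						\
		fake_wrmsr(new_);					\
} while (0)

/* Usage: set bit 0 and clear bit 4 while leaving bit 8 alone. */
static void set_clr_bits_selfcheck(void)
{
	fake_msr = 0x110;
	fake_msr_writes = 0;

	SET_CLR_BITS_FAKE_MSR(0x1, 0x10);
	assert(fake_msr == 0x101);
	assert(fake_msr_writes == 1);

	/* Already in the requested state, so no second write happens. */
	SET_CLR_BITS_FAKE_MSR(0x1, 0x10);
	assert(fake_msr_writes == 1);
}
```

Because a macro expands at every use site, there is no standalone function body left behind for the compiler to emit out of line.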

* Re: [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate
  2022-10-03 17:48   ` Kees Cook
@ 2022-10-03 20:05     ` Edgecombe, Rick P
  2022-10-04  4:05       ` Kees Cook
  2022-10-04 14:18       ` Dave Hansen
  0 siblings, 2 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 20:05 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	dave.hansen, kirill.shutemov, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2022-10-03 at 10:48 -0700, Kees Cook wrote:
> > The easiest way to modify supervisor xfeature data is to force
> > restore
> > the registers and write directly to the MSRs. Often times this is
> > just fine
> > anyway as the registers need to be restored before returning to
> > userspace.
> > Do this for now, leaving buffer writing optimizations for the
> > future.
> 
> Just for my own clarity, does this mean lock/load _needs_ to happen
> before MSR access, or is it just a convenient place to do it? From
> later
> patches it seems it's a requirement during MSR access, which might be
> a
> good idea to detail here. It answers the question "when is this
> function
> needed?"

The CET state is XSAVES-managed. It gets lazily restored before
returning to userspace with the rest of the FPU state. This function
force-restores all the FPU state to the registers early and locks
them against being automatically saved/restored. The task's CET state
can then be modified in the MSRs before unlocking the fpregs. Last time I
tried to modify the state directly in the xsave buffer when that was
efficient, but it had issues and Thomas suggested this approach instead.

> 
> > 
> > Add a new function fpregs_lock_and_load() that can simultaneously
> > call
> > fpregs_lock() and do this restore. Also perform some extra sanity
> > checks in this function since this will be used in non-fpu focused
> > code.
> 
> Nit: this is called "fpu_lock_and_load" in the patch itself.

Oops, thanks.

* Re: [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack
  2022-09-29 22:29 ` [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack Rick Edgecombe
  2022-10-03 10:36   ` Mike Rapoport
@ 2022-10-03 20:29   ` Kees Cook
  2022-10-04 22:09     ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 20:29 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:22PM -0700, Rick Edgecombe wrote:
> [...]
> +#ifdef CONFIG_X86_SHADOW_STACK
> +static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
> +{
> +	struct cet_user_state *xstate;
> +
> +	/* If ssp update is not needed. */
> +	if (!ssp)
> +		return 0;

My brain will work to undo the collision of Shadow Stack Pointer with
Stack Smashing Protection. ;)

> [...]
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index a0b8d4adb2bf..db4e53f9fdaf 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -118,6 +118,46 @@ void reset_thread_shstk(void)
>  	current->thread.features_locked = 0;
>  }
>  
> +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> +			     unsigned long stack_size, unsigned long *shstk_addr)

Er, arg 3 is "stack_size". From later:

> +     ret = shstk_alloc_thread_stack(p, clone_flags, args->flags, &shstk_addr);
                                                       ^^^^^^^^^^^

clone_flags and args->flags are identical ... this must be accidentally
working. I was expecting 0 there.

> +{
> +	struct thread_shstk *shstk = &tsk->thread.shstk;
> +	unsigned long addr;
> +
> +	/*
> +	 * If shadow stack is not enabled on the new thread, skip any
> +	 * switch to a new shadow stack.
> +	 */
> +	if (!feature_enabled(CET_SHSTK))
> +		return 0;
> +
> +	/*
> +	 * clone() does not pass stack_size, which was added to clone3().
> +	 * Use RLIMIT_STACK and cap to 4 GB.
> +	 */
> +	if (!stack_size)
> +		stack_size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);

Again, perhaps the clamp should happen in alloc_shstk()?

> +
> +	/*
> +	 * For CLONE_VM, except vfork, the child needs a separate shadow
> +	 * stack.
> +	 */
> +	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
> +		return 0;
> +
> +
> +	stack_size = PAGE_ALIGN(stack_size);

Uhm, I think a line went missing here. :P

"x86/cet/shstk: Introduce map_shadow_stack syscall" adds the missing:

+	addr = alloc_shstk(0, stack_size, 0, false);

Please add back the original. :)
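
The sizing and clone-flag decisions in the quoted function can be modelled in userspace. The following is a minimal sketch of the intended behaviour (with a real stack size, or 0, passed as the third argument), not the kernel code itself; the `SKETCH_*` names and both helper functions are hypothetical, and the page size is assumed to be 4 KB:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the kernel takes these from its own headers. */
#define SKETCH_PAGE_SIZE	4096ULL
#define SKETCH_SZ_4G		(1ULL << 32)
#define SKETCH_CLONE_VM		0x00000100ULL
#define SKETCH_CLONE_VFORK	0x00004000ULL

/* A separate shadow stack is needed only for CLONE_VM without CLONE_VFORK. */
static int needs_new_shstk(uint64_t clone_flags)
{
	return (clone_flags & (SKETCH_CLONE_VFORK | SKETCH_CLONE_VM)) ==
	       SKETCH_CLONE_VM;
}

/*
 * Pick the shadow stack size: use the caller's size if one was given,
 * otherwise fall back to the stack rlimit capped at 4 GB, then round up
 * to a whole page.
 */
static uint64_t shstk_size(uint64_t stack_size, uint64_t rlimit_stack)
{
	if (!stack_size)
		stack_size = rlimit_stack < SKETCH_SZ_4G ? rlimit_stack
							 : SKETCH_SZ_4G;

	return (stack_size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

static void shstk_size_selfcheck(void)
{
	assert(needs_new_shstk(SKETCH_CLONE_VM));			/* thread */
	assert(!needs_new_shstk(SKETCH_CLONE_VM | SKETCH_CLONE_VFORK));	/* vfork */
	assert(!needs_new_shstk(0));					/* fork */

	assert(shstk_size(0, 8ULL << 20) == (8ULL << 20));	/* 8 MB rlimit */
	assert(shstk_size(0, UINT64_MAX) == SKETCH_SZ_4G);	/* unlimited */
	assert(shstk_size(5000, 0) == 8192);			/* page round-up */
}
```

Passing the clone flags where the size is expected, as in the call quoted above, would make the `!stack_size` fallback unreachable for any thread creation, since CLONE_VM is always set there.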

> +	if (IS_ERR_VALUE(addr))
> +		return PTR_ERR((void *)addr);
> +
> +	shstk->base = addr;
> +	shstk->size = stack_size;
> +
> +	*shstk_addr = addr + stack_size;
> +
> +	return 0;
> +}
> +
>  void shstk_free(struct task_struct *tsk)
>  {
>  	struct thread_shstk *shstk = &tsk->thread.shstk;
> @@ -126,7 +166,13 @@ void shstk_free(struct task_struct *tsk)
>  	    !feature_enabled(CET_SHSTK))
>  		return;
>  
> -	if (!tsk->mm)
> +	/*
> +	 * When fork() with CLONE_VM fails, the child (tsk) already has a
> +	 * shadow stack allocated, and exit_thread() calls this function to
> +	 * free it.  In this case the parent (current) and the child share
> +	 * the same mm struct.
> +	 */
> +	if (!tsk->mm || tsk->mm != current->mm)
>  		return;
>  
>  	unmap_shadow_stack(shstk->base, shstk->size);
> -- 
> 2.17.1
> 

-- 
Kees Cook

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-10-03 18:04   ` Kees Cook
@ 2022-10-03 20:33     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 20:33 UTC (permalink / raw)
  To: keescook
  Cc: mtk.manpages, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	corbet, linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 11:04 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:04PM -0700, Rick Edgecombe wrote:
> > [...]
> > -#ifdef CONFIG_X86_KERNEL_IBT
> > +#if defined(CONFIG_X86_KERNEL_IBT) ||
> > defined(CONFIG_X86_SHADOW_STACK)
> 
> This pattern is repeated several times. Perhaps there needs to be a
> CONFIG_X86_CET to make this more readable? Really just a style
> question.

Hmm, good idea. Thanks.

* Re: [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk
  2022-09-29 22:29 ` [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk Rick Edgecombe
@ 2022-10-03 20:44   ` Kees Cook
  2022-10-04 22:13     ` Edgecombe, Rick P
  2022-10-05  2:43   ` Andrew Cooper
  1 sibling, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 20:44 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:23PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Shadow stacks are normally written to via CALL/RET or specific CET
> instructions like RSTORSSP/SAVEPREVSSP. However, during some Linux
> operations the kernel will need to write to them directly using the
> ring-0 only WRUSS instruction.
> 
> A shadow stack restore token marks a restore point of the shadow stack, and
> the address in a token must point directly above the token, which is within
> the same shadow stack. This is distinctively different from other pointers
> on the shadow stack, since those pointers point to executable code area.
> 
> Introduce token setup and verify routines. Also introduce WRUSS, which is
> a kernel-mode instruction but writes directly to user shadow stack.
> 
> In future patches that enable shadow stack to work with signals, the kernel
> will need something to denote the point in the stack where sigreturn may be
> called. This will prevent attackers calling sigreturn at arbitrary places
> in the stack, in order to help prevent SROP attacks.
> 
> To do this, something that can only be written by the kernel needs to be
> placed on the shadow stack. This can be accomplished by setting bit 63 in
> the frame written to the shadow stack. Userspace return addresses can't
> have this bit set as it is in the kernel range. It also can't be a
> valid restore token.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> 
> ---
> 
> v2:
>  - Add data helpers for writing to shadow stack.
> 
> v1:
>  - Use xsave helpers.
> 
> Yu-cheng v30:
>  - Update commit log, remove description about signals.
>  - Update various comments.
>  - Remove variable 'ssp' init and adjust return value accordingly.
>  - Check get_user_shstk_addr() return value.
>  - Replace 'ia32' with 'proc32'.
> 
> Yu-cheng v29:
>  - Update comments for the use of get_xsave_addr().
> 
>  arch/x86/include/asm/special_insns.h |  13 ++++
>  arch/x86/kernel/shstk.c              | 108 +++++++++++++++++++++++++++
>  2 files changed, 121 insertions(+)
> 
> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index 35f709f619fb..f096f52bd059 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
>  		: [pax] "a" (p));
>  }
>  
> +#ifdef CONFIG_X86_SHADOW_STACK
> +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> +{
> +	asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> +			  _ASM_EXTABLE(1b, %l[fail])
> +			  :: [addr] "r" (addr), [val] "r" (val)
> +			  :: fail);
> +	return 0;
> +fail:
> +	return -EFAULT;
> +}
> +#endif /* CONFIG_X86_SHADOW_STACK */
> +
>  #define nop() asm volatile ("nop")
>  
>  static inline void serialize(void)
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index db4e53f9fdaf..8904aef487bf 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -25,6 +25,8 @@
>  #include <asm/fpu/api.h>
>  #include <asm/prctl.h>
>  
> +#define SS_FRAME_SIZE 8
> +
>  static bool feature_enabled(unsigned long features)
>  {
>  	return current->thread.features & features;
> @@ -40,6 +42,31 @@ static void feature_clr(unsigned long features)
>  	current->thread.features &= ~features;
>  }
>  
> +/*
> + * Create a restore token on the shadow stack.  A token is always 8-byte
> + * and aligned to 8.
> + */
> +static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
> +{
> +	unsigned long addr;
> +
> +	/* Token must be aligned */
> +	if (!IS_ALIGNED(ssp, 8))
> +		return -EINVAL;
> +
> +	addr = ssp - SS_FRAME_SIZE;
> +
> +	/* Mark the token 64-bit */
> +	ssp |= BIT(0);

Wow, that confused me for a moment. :) SDE says:

- Bit 63:2 – Value of shadow stack pointer when this restore point was created.
- Bit 1 – Reserved. Must be zero.
- Bit 0 – Mode bit. If 0, the token is a compatibility/legacy mode
          “shadow stack restore” token. If 1, then this shadow stack restore
          token can be used with a RSTORSSP instruction in 64-bit mode.

So shouldn't this actually be:

	ssp &= ~BIT(1);	/* Reserved */
	ssp |=  BIT(0); /* RSTORSSP instruction in 64-bit mode */
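
The token layout described above can be sketched in userspace as follows. This is a minimal model under the bit definitions listed in this message, not the kernel implementation; the `TOKEN_*` names and both helpers are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define TOKEN_MODE_64	(1ULL << 0)	/* bit 0: usable by RSTORSSP in 64-bit mode */
#define TOKEN_RESERVED	(1ULL << 1)	/* bit 1: must be zero */
#define TOKEN_FLAGS	(TOKEN_MODE_64 | TOKEN_RESERVED)

/* Build the token value for an 8-byte aligned shadow stack pointer. */
static int make_rstor_token(uint64_t ssp, uint64_t *token)
{
	if (ssp & 7)
		return -1;

	*token = (ssp & ~TOKEN_RESERVED) | TOKEN_MODE_64;
	return 0;
}

/* Recover the SSP from a token, rejecting tokens with bad flag bits. */
static int parse_rstor_token(uint64_t token, uint64_t *ssp)
{
	if (!(token & TOKEN_MODE_64) || (token & TOKEN_RESERVED))
		return -1;

	*ssp = token & ~TOKEN_FLAGS;
	return 0;
}

static void rstor_token_selfcheck(void)
{
	uint64_t token = 0, ssp = 0;

	assert(make_rstor_token(0x70001000ULL, &token) == 0);
	assert(token == 0x70001001ULL);
	assert(parse_rstor_token(token, &ssp) == 0);
	assert(ssp == 0x70001000ULL);

	assert(make_rstor_token(0x70001004ULL, &token) == -1);	/* unaligned */
	assert(parse_rstor_token(0x70001003ULL, &ssp) == -1);	/* reserved set */
	assert(parse_rstor_token(0x70001000ULL, &ssp) == -1);	/* mode clear */
}
```

Since the quoted code rejects any ssp that is not 8-byte aligned before setting the mode bit, bits 2:0 are already zero at that point, so the explicit clear of bit 1 in this sketch is defensive rather than strictly required.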

> +
> +	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
> +		return -EFAULT;
> +
> +	*token_addr = addr;
> +
> +	return 0;
> +}
> +
>  static unsigned long alloc_shstk(unsigned long size)
>  {
>  	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
> @@ -158,6 +185,87 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
>  	return 0;
>  }
>  
> +static unsigned long get_user_shstk_addr(void)
> +{
> +	unsigned long long ssp;
> +
> +	fpu_lock_and_load();
> +
> +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +
> +	fpregs_unlock();
> +
> +	return ssp;
> +}
> +
> +static int put_shstk_data(u64 __user *addr, u64 data)
> +{
> +	WARN_ON(data & BIT(63));

Let's make this a bit more defensive:

	if (WARN_ON_ONCE(data & BIT(63)))
		return -EFAULT;

> +
> +	/*
> +	 * Mark the high bit so that the sigframe can't be processed as a
> +	 * return address.
> +	 */
> +	if (write_user_shstk_64(addr, data | BIT(63)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
> +{
> +	unsigned long ldata;
> +
> +	if (unlikely(get_user(ldata, addr)))
> +		return -EFAULT;
> +
> +	if (!(ldata & BIT(63)))
> +		return -EINVAL;
> +
> +	*data = ldata & ~BIT(63);
> +
> +	return 0;
> +}
> +
> +/*
> + * Verify the user shadow stack has a valid token on it, and then set
> + * *new_ssp according to the token.
> + */
> +static int shstk_check_rstor_token(unsigned long *new_ssp)
> +{
> +	unsigned long token_addr;
> +	unsigned long token;
> +
> +	token_addr = get_user_shstk_addr();
> +	if (!token_addr)
> +		return -EINVAL;
> +
> +	if (get_user(token, (unsigned long __user *)token_addr))
> +		return -EFAULT;
> +
> +	/* Is mode flag correct? */
> +	if (!(token & BIT(0)))
> +		return -EINVAL;
> +
> +	/* Is busy flag set? */

"Busy"? Not "Reserved"?

> +	if (token & BIT(1))
> +		return -EINVAL;
> +
> +	/* Mask out flags */
> +	token &= ~3UL;
> +
> +	/* Restore address aligned? */
> +	if (!IS_ALIGNED(token, 8))
> +		return -EINVAL;
> +
> +	/* Token placed properly? */
> +	if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
> +		return -EINVAL;
> +
> +	*new_ssp = token;
> +
> +	return 0;
> +}
> +
>  void shstk_free(struct task_struct *tsk)
>  {
>  	struct thread_shstk *shstk = &tsk->thread.shstk;
> -- 
> 2.17.1
> 

-- 
Kees Cook

* Re: [PATCH v2 27/39] x86/cet/shstk: Handle signals for shadow stack
  2022-09-29 22:29 ` [PATCH v2 27/39] x86/cet/shstk: Handle signals for shadow stack Rick Edgecombe
@ 2022-10-03 20:52   ` Kees Cook
  2022-10-20 22:08     ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 20:52 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:24PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When a signal is handled normally the context is pushed to the stack
> before handling it. For shadow stacks, since the shadow stack only tracks
> return addresses, there isn't any state that needs to be pushed. However,
> there are still a few things that need to be done. These things are
> userspace visible and will be kernel ABI for shadow stacks.
> 
> One is to make sure the restorer address is written to shadow stack, since
> the signal handler (if not changing ucontext) returns to the restorer, and
> the restorer calls sigreturn. So add the restorer on the shadow stack
> before handling the signal, so there is not a conflict when the signal
> handler returns to the restorer.
> 
> The other thing to do is to place some type of checkable token on the
> thread's shadow stack before handling the signal and check it during
> sigreturn. This is an extra layer of protection to hamper attackers
> calling sigreturn manually as in SROP-like attacks.
> 
> For this token we can use the shadow stack data format defined earlier.
> Have the data pushed be the previous SSP. In the future the sigreturn
> might want to return back to a different stack. Storing the SSP (instead
> of a restore offset or something) allows for future functionality that
> may want to restore to a different stack.
> 
> So, when handling a signal push
>  - the SSP pointing in the shadow stack data format
>  - the restorer address below the restore token.
> 
> In sigreturn, verify SSP is stored in the data format and pop the shadow
> stack.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Kees Cook <keescook@chromium.org>
> 
> ---
> 
> v2:
>  - Switch to new shstk signal format
> 
> v1:
>  - Use xsave helpers.
>  - Expand commit log.
> 
> Yu-cheng v27:
>  - Eliminate saving shadow stack pointer to signal context.
> 
> Yu-cheng v25:
>  - Update commit log/comments for the sc_ext struct.
>  - Use restorer address already calculated.
>  - Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
>  - Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
>  - Eliminate writing to MSR_IA32_U_CET for shadow stack.
>  - Change wrmsrl() to wrmsrl_safe() and handle error.
> 
>  arch/x86/ia32/ia32_signal.c |   1 +
>  arch/x86/include/asm/cet.h  |   5 ++
>  arch/x86/kernel/shstk.c     | 126 ++++++++++++++++++++++++++++++------
>  arch/x86/kernel/signal.c    |  10 +++
>  4 files changed, 123 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
> index c9c3859322fa..88d71b9de616 100644
> --- a/arch/x86/ia32/ia32_signal.c
> +++ b/arch/x86/ia32/ia32_signal.c
> @@ -34,6 +34,7 @@
>  #include <asm/sigframe.h>
>  #include <asm/sighandling.h>
>  #include <asm/smap.h>
> +#include <asm/cet.h>
>  
>  static inline void reload_segments(struct sigcontext_32 *sc)
>  {
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index 924de99e0c61..8c6fab9f402a 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -6,6 +6,7 @@
>  #include <linux/types.h>
>  
>  struct task_struct;
> +struct ksignal;
>  
>  struct thread_shstk {
>  	u64	base;
> @@ -22,6 +23,8 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
>  void shstk_free(struct task_struct *p);
>  int shstk_disable(void);
>  void reset_thread_shstk(void);
> +int setup_signal_shadow_stack(struct ksignal *ksig);
> +int restore_signal_shadow_stack(void);
>  #else
>  static inline long cet_prctl(struct task_struct *task, int option,
>  		      unsigned long features) { return -EINVAL; }
> @@ -33,6 +36,8 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
>  static inline void shstk_free(struct task_struct *p) {}
>  static inline int shstk_disable(void) { return -EOPNOTSUPP; }
>  static inline void reset_thread_shstk(void) {}
> +static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
> +static inline int restore_signal_shadow_stack(void) { return 0; }
>  #endif /* CONFIG_X86_SHADOW_STACK */
>  
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 8904aef487bf..04442134aadd 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -227,41 +227,129 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
>  }
>  
>  /*
> - * Verify the user shadow stack has a valid token on it, and then set
> - * *new_ssp according to the token.
> + * Create a restore token on shadow stack, and then push the user-mode
> + * function return address.
>   */
> -static int shstk_check_rstor_token(unsigned long *new_ssp)
> +static int shstk_setup_rstor_token(unsigned long ret_addr, unsigned long *new_ssp)

Oh, hrm. Prior patch defines shstk_check_rstor_token() and
doesn't call it. This patch removes it. :P Can you please remove
shstk_check_rstor_token() from the prior patch?

>  {
> -	unsigned long token_addr;
> -	unsigned long token;
> +	unsigned long ssp, token_addr;
> +	int err;
> +
> +	if (!ret_addr)
> +		return -EINVAL;
> +
> +	ssp = get_user_shstk_addr();
> +	if (!ssp)
> +		return -EINVAL;
> +
> +	err = create_rstor_token(ssp, &token_addr);
> +	if (err)
> +		return err;
> +
> +	ssp = token_addr - sizeof(u64);
> +	err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
> +
> +	if (!err)
> +		*new_ssp = ssp;
> +
> +	return err;
> +}
> +
> +static int shstk_push_sigframe(unsigned long *ssp)
> +{
> +	unsigned long target_ssp = *ssp;
> +
> +	/* Token must be aligned */
> +	if (!IS_ALIGNED(*ssp, 8))
> +		return -EINVAL;
>  
> -	token_addr = get_user_shstk_addr();
> -	if (!token_addr)
> +	if (!IS_ALIGNED(target_ssp, 8))
>  		return -EINVAL;
>  
> -	if (get_user(token, (unsigned long __user *)token_addr))
> +	*ssp -= SS_FRAME_SIZE;
> +	if (put_shstk_data((void *__user)*ssp, target_ssp))
>  		return -EFAULT;
>  
> -	/* Is mode flag correct? */
> -	if (!(token & BIT(0)))
> +	return 0;
> +}
> +
> +
> +static int shstk_pop_sigframe(unsigned long *ssp)
> +{
> +	unsigned long token_addr;
> +	int err;
> +
> +	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
> +	if (unlikely(err))
> +		return err;
> +
> +	/* Restore SSP aligned? */
> +	if (unlikely(!IS_ALIGNED(token_addr, 8)))
>  		return -EINVAL;

Why doesn't this always fail, given BIT(0) being set? I don't see it
getting cleared until the end of this function.

>  
> -	/* Is busy flag set? */
> -	if (token & BIT(1))
> +	/* SSP in userspace? */
> +	if (unlikely(token_addr >= TASK_SIZE_MAX))
>  		return -EINVAL;

BIT(63) already got cleared by here (in get_shstk_data()), but yes,
this is still a reasonable check.

>  
> -	/* Mask out flags */
> -	token &= ~3UL;
> +	*ssp = token_addr;
> +
> +	return 0;
> +}
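
The push/pop data format quoted above can be modelled in userspace. This is a minimal sketch, assuming the value pushed on signal delivery goes through put_shstk_data() as quoted (which only ORs in bit 63) rather than through create_rstor_token(); the helper names and the example address-space limit are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define SHSTK_DATA_BIT	(1ULL << 63)

/* Encode: refuse values that already use bit 63, then tag with it. */
static int encode_shstk_data(uint64_t data, uint64_t *out)
{
	if (data & SHSTK_DATA_BIT)
		return -1;

	*out = data | SHSTK_DATA_BIT;
	return 0;
}

/* Decode: require the tag, then strip it. */
static int decode_shstk_data(uint64_t raw, uint64_t *data)
{
	if (!(raw & SHSTK_DATA_BIT))
		return -1;

	*data = raw & ~SHSTK_DATA_BIT;
	return 0;
}

/* Model of the pop path: decode, then check alignment and user range. */
static int pop_sigframe(uint64_t raw, uint64_t task_size_max, uint64_t *ssp)
{
	uint64_t val;

	if (decode_shstk_data(raw, &val))
		return -1;
	if (val & 7)
		return -1;
	if (val >= task_size_max)
		return -1;

	*ssp = val;
	return 0;
}

static void sigframe_selfcheck(void)
{
	const uint64_t limit = 0x00007ffffffff000ULL;	/* illustrative only */
	uint64_t raw = 0, ssp = 0;

	assert(encode_shstk_data(0x70000ff8ULL, &raw) == 0);
	assert(raw == 0x8000000070000ff8ULL);
	assert(pop_sigframe(raw, limit, &ssp) == 0);
	assert(ssp == 0x70000ff8ULL);

	assert(pop_sigframe(0x70000ff8ULL, limit, &ssp) == -1);	/* no tag */
	assert(pop_sigframe(SHSTK_DATA_BIT | 0x70000ff9ULL, limit, &ssp) == -1);
}
```

Under that assumption the stored value never has bit 0 set, so an aligned SSP passes the alignment check once bit 63 has been stripped.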

-- 
Kees Cook

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-10-03 16:26   ` Kirill A . Shutemov
@ 2022-10-03 21:36     ` Edgecombe, Rick P
  2022-10-03 21:54       ` Jann Horn
  2022-10-03 22:14       ` Dave Hansen
  0 siblings, 2 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 21:36 UTC (permalink / raw)
  To: kirill.shutemov, Hansen, Dave
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2022-10-03 at 19:26 +0300, Kirill A . Shutemov wrote:
> On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> > +/*
> > + * Normally the Dirty bit is used to denote COW memory on x86. But
> > + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> > + * since the Dirty=1,Write=0 will result in the memory being
> > treated
> > + * as shadow stack by the HW. So when creating COW memory, a
> > software
> > + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow()
> > and
> > + * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1)
> > and
> > + * transition it to the shadow stack compatible version of COW
> > (Cow=1).
> > + */
> > +
> > +static inline pte_t pte_mkcow(pte_t pte)
> > +{
> > +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > +             return pte;
> > +
> > +     pte = pte_clear_flags(pte, _PAGE_DIRTY);
> > +     return pte_set_flags(pte, _PAGE_COW);
> > +}
> > +
> > +static inline pte_t pte_clear_cow(pte_t pte)
> > +{
> > +     /*
> > +      * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
> > +      * See the _PAGE_COW definition for more details.
> > +      */
> > +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > +             return pte;
> > +
> > +     /*
> > +      * PTE is getting copied-on-write, so it will be dirtied
> > +      * if writable, or made shadow stack if shadow stack and
> > +      * being copied on access. Set the dirty bit for both
> > +      * cases.
> > +      */
> > +     pte = pte_set_flags(pte, _PAGE_DIRTY);
> > +     return pte_clear_flags(pte, _PAGE_COW);
> > +}
> 
> These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the
> _PAGE_COW
> logic for all machines with 64-bit entries. It will get you much more
> coverage and more universal rules.

Yes, I didn't like them either at first. The reasoning originally was
that _PAGE_COW is a bit more work and it might show up for some
benchmark.

Looking at this again though, it is just a few more operations on
memory that is already getting touched either way. It must be a very
tiny amount of impact if any. I'm fine removing them. Having just one
set of logic around this would make it easier to reason about.
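
To make the unconditional variant concrete, here is a minimal userspace sketch of the two transitions with the X86_FEATURE_SHSTK checks dropped. The Write and Dirty bit positions are the architectural x86 ones; the position used for the software COW bit is only a placeholder assumption, and all `sk_*`/`SK_*` names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_RW	(1ULL << 1)	/* x86 Write bit */
#define SK_PAGE_DIRTY	(1ULL << 6)	/* x86 Dirty bit */
#define SK_PAGE_COW	(1ULL << 58)	/* placeholder software bit (assumed) */

/* Move a Dirty=1 PTE to the software COW encoding. */
static uint64_t sk_pte_mkcow(uint64_t pte)
{
	pte &= ~SK_PAGE_DIRTY;
	return pte | SK_PAGE_COW;
}

/* Move a Cow=1 PTE back to the hardware Dirty encoding. */
static uint64_t sk_pte_clear_cow(uint64_t pte)
{
	pte |= SK_PAGE_DIRTY;
	return pte & ~SK_PAGE_COW;
}

/* Hardware treats Write=0,Dirty=1 as shadow stack memory. */
static int sk_pte_is_shstk(uint64_t pte)
{
	return (pte & (SK_PAGE_RW | SK_PAGE_DIRTY)) == SK_PAGE_DIRTY;
}

static void sk_pte_selfcheck(void)
{
	uint64_t pte = SK_PAGE_DIRTY;	/* Write=0, Dirty=1 */

	assert(sk_pte_is_shstk(pte));
	assert(!sk_pte_is_shstk(sk_pte_mkcow(pte)));
	assert(sk_pte_mkcow(pte) == SK_PAGE_COW);
	assert(sk_pte_clear_cow(sk_pte_mkcow(pte)) == pte);
}
```

Each helper is two bit operations on a value that is already being modified, which is consistent with the expectation above that the extra cost would be very small.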

Dave, any thoughts on this?

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-10-03 21:36     ` Edgecombe, Rick P
@ 2022-10-03 21:54       ` Jann Horn
  2022-10-03 22:20         ` Edgecombe, Rick P
  2022-10-03 22:14       ` Dave Hansen
  1 sibling, 1 reply; 241+ messages in thread
From: Jann Horn @ 2022-10-03 21:54 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kirill.shutemov, Hansen, Dave, bsingharora, hpa, Syromiatnikov,
	Eugene, peterz, rdunlap, keescook, Yu, Yu-cheng, dave.hansen,
	Eranian, Stephane, linux-mm, fweimer, nadav.amit, dethoma,
	linux-arch, kcc, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	corbet, linux-kernel, linux-api, gorcunov

On Mon, Oct 3, 2022 at 11:36 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> On Mon, 2022-10-03 at 19:26 +0300, Kirill A . Shutemov wrote:
> > On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> > > +/*
> > > + * Normally the Dirty bit is used to denote COW memory on x86. But
> > > + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> > > + * since the Dirty=1,Write=0 will result in the memory being treated
> > > + * as shadow stack by the HW. So when creating COW memory, a software
> > > + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() and
> > > + * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
> > > + * transition it to the shadow stack compatible version of COW (Cow=1).
> > > + */
> > > +
> > > +static inline pte_t pte_mkcow(pte_t pte)
> > > +{
> > > +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > > +             return pte;
> > > +
> > > +     pte = pte_clear_flags(pte, _PAGE_DIRTY);
> > > +     return pte_set_flags(pte, _PAGE_COW);
> > > +}
> > > +
> > > +static inline pte_t pte_clear_cow(pte_t pte)
> > > +{
> > > +     /*
> > > +      * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
> > > +      * See the _PAGE_COW definition for more details.
> > > +      */
> > > +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > > +             return pte;
> > > +
> > > +     /*
> > > +      * PTE is getting copied-on-write, so it will be dirtied
> > > +      * if writable, or made shadow stack if shadow stack and
> > > +      * being copied on access. Set the dirty bit for both
> > > +      * cases.
> > > +      */
> > > +     pte = pte_set_flags(pte, _PAGE_DIRTY);
> > > +     return pte_clear_flags(pte, _PAGE_COW);
> > > +}
> >
> > These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the _PAGE_COW
> > logic for all machines with 64-bit entries. It will get you much more
> > coverage and more universal rules.
>
> Yes, I didn't like them either at first. The reasoning originally was
> that _PAGE_COW is a bit more work and it might show up for some
> benchmark.
>
> Looking at this again though, it is just a few more operations on
> memory that is already getting touched either way. It must be a very
> tiny amount of impact if any. I'm fine removing them. Having just one
> set of logic around this would make it easier to reason about.
>
> Dave, any thoughts on this?

But the rules wouldn't actually be universal - you'd still have to
look at X86_FEATURE_SHSTK in code that wants to figure out whether a
PTE is shadow stack (on a newer CPU) or readonly dirty (on an older
CPU that can set dirty bits on non-present PTEs), right?


* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-10-03 21:36     ` Edgecombe, Rick P
  2022-10-03 21:54       ` Jann Horn
@ 2022-10-03 22:14       ` Dave Hansen
  1 sibling, 0 replies; 241+ messages in thread
From: Dave Hansen @ 2022-10-03 22:14 UTC (permalink / raw)
  To: Edgecombe, Rick P, kirill.shutemov
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On 10/3/22 14:36, Edgecombe, Rick P wrote:
>>> +static inline pte_t pte_clear_cow(pte_t pte)
>>> +{
>>> +     /*
>>> +      * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
>>> +      * See the _PAGE_COW definition for more details.
>>> +      */
>>> +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
>>> +             return pte;
>>> +
>>> +     /*
>>> +      * PTE is getting copied-on-write, so it will be dirtied
>>> +      * if writable, or made shadow stack if shadow stack and
>>> +      * being copied on access. Set the dirty bit for both
>>> +      * cases.
>>> +      */
>>> +     pte = pte_set_flags(pte, _PAGE_DIRTY);
>>> +     return pte_clear_flags(pte, _PAGE_COW);
>>> +}
>> These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the _PAGE_COW
>> logic for all machines with 64-bit entries. It will get you much more
>> coverage and more universal rules.
> Yes, I didn't like them either at first. The reasoning originally was
> that _PAGE_COW is a bit more work and it might show up for some
> benchmark.
> 
> Looking at this again though, it is just a few more operations on
> memory that is already getting touched either way. It must be a very
> tiny amount of impact if any. I'm fine removing them. Having just one
> set of logic around this would make it easier to reason about.
> 
> Dave, any thoughts on this?

The cpu_feature_enabled(X86_FEATURE_SHSTK) checks enable both
compile-time and runtime optimization.  What makes this even more fun is:

+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_COW      (_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW      (_AT(pteval_t, 0))
+#endif

which I think means that the pte_clear_flags() goes away if
CONFIG_X86_SHADOW_STACK is disabled.  So, what Rick posted here ends up
doing the following with:

	  | X86_FEATURE_SHSTK=1	|  X86_FEATURE_SHSTK=0
==========+=====================+========================
CONFIG=n  |  compiled out	|  compiled out
CONFIG=y  |  set/clear		|  boot-time patched out


If we pull the cpu_feature_enabled() out, I think we end up getting
behavior like this:

	  | X86_FEATURE_SHSTK=1	|  X86_FEATURE_SHSTK=0
==========+=====================+========================
CONFIG=n  |  set _PAGE_DIRTY	|  set _PAGE_DIRTY
CONFIG=y  |  set/clear		|  set/clear

It ends up adding instruction overhead (set _PAGE_DIRTY) to two cases
where it completely compiled out before.  It also adds runtime overhead
(the "tiny amount of impact" you mentioned) to set/clear where it would
have runtime patched out before.

None of this is a deal breaker in terms of runtime overhead.  But, I do
think the benefits of the cpu_feature_enabled() are worth it, even if
it's just an optimization.  You could move it to the end of the series
and we can debate it on its own merits if you want.
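
The two tables above can be reproduced with a small userspace model. This is only a sketch, not the kernel code: the `MODEL_*` and `model_*` names are made up for illustration, the software bit position (58) is an assumption, and a plain boolean stands in for cpu_feature_enabled(X86_FEATURE_SHSTK).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_DIRTY ((pteval_t)1 << 6)   /* hardware Dirty bit on x86 */

/* Stand-in for CONFIG_X86_SHADOW_STACK; build with 0 to model CONFIG=n. */
#ifndef MODEL_CONFIG_SHSTK
#define MODEL_CONFIG_SHSTK 1
#endif

#if MODEL_CONFIG_SHSTK
#define PAGE_COW ((pteval_t)1 << 58)    /* assumed software bit position */
#else
#define PAGE_COW ((pteval_t)0)          /* flag operations become no-ops */
#endif

/* Stand-in for the boot-time CPU feature state. */
static bool model_shstk_cpu = true;

/* Models cpu_feature_enabled(): false at build time when CONFIG=n. */
static bool model_feature_enabled(void)
{
	return MODEL_CONFIG_SHSTK && model_shstk_cpu;
}

/* As posted: Dirty=1 becomes Cow=1, but only when the feature is on. */
static pteval_t model_pte_mkcow(pteval_t pte)
{
	if (!model_feature_enabled())
		return pte;
	pte &= ~PAGE_DIRTY;
	return pte | PAGE_COW;
}

/* As posted: Cow=1 becomes Dirty=1, but only when the feature is on. */
static pteval_t model_pte_clear_cow(pteval_t pte)
{
	if (!model_feature_enabled())
		return pte;
	pte |= PAGE_DIRTY;
	return pte & ~PAGE_COW;
}

/*
 * Variant with the feature check pulled out: Dirty is always set, and
 * with CONFIG=n the PAGE_COW mask is zero so only that store remains.
 */
static pteval_t model_pte_clear_cow_unconditional(pteval_t pte)
{
	pte |= PAGE_DIRTY;
	return pte & ~PAGE_COW;
}
```

With the feature off, the checked helpers return the PTE unchanged, while the unconditional variant still sets Dirty, which is the extra work the second table describes.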



* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-10-03 21:54       ` Jann Horn
@ 2022-10-03 22:20         ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 22:20 UTC (permalink / raw)
  To: jannh
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, nadav.amit, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, Moreira, Joao, linux-doc, pavel,
	mike.kravetz, x86, tglx, john.allen, rppt, mingo, Hansen, Dave,
	Shankar, Ravi V, corbet, linux-api, linux-kernel, gorcunov

On Mon, 2022-10-03 at 23:54 +0200, Jann Horn wrote:
> > > These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the _PAGE_COW
> > > logic for all machines with 64-bit entries. It will get you much more
> > > coverage and more universal rules.
> > 
> > Yes, I didn't like them either at first. The reasoning originally was
> > that _PAGE_COW is a bit more work and it might show up for some
> > benchmark.
> > 
> > Looking at this again though, it is just a few more operations on
> > memory that is already getting touched either way. It must be a very
> > tiny amount of impact if any. I'm fine removing them. Having just one
> > set of logic around this would make it easier to reason about.
> > 
> > Dave, any thoughts on this?
> 
> But the rules wouldn't actually be universal - you'd still have to
> look at X86_FEATURE_SHSTK in code that wants to figure out whether a
> PTE is shadow stack (on a newer CPU) or readonly dirty (on an older
> CPU that can set dirty bits on non-present PTEs), right?

Good point. It still would need a check in pte_shstk() or pte_write(),
so pte_write() doesn't think CPU created Write=0,Dirty=1 memory is
writable. Which then percolates to most of the other checks anyway.
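
A rough userspace model of why the feature check cannot go away entirely. The helper names and the explicit `shstk_enabled` parameter are illustrative assumptions, not the functions from the series; the point is only that the same Write=0,Dirty=1 encoding has to be read differently depending on the feature.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_RW    ((pteval_t)1 << 1)   /* hardware Write bit on x86 */
#define PAGE_DIRTY ((pteval_t)1 << 6)   /* hardware Dirty bit on x86 */

/*
 * Write=0,Dirty=1 only means "shadow stack" when the feature is on.
 * Without it, the same bits are just a read-only dirty PTE.
 */
static bool model_pte_shstk(pteval_t pte, bool shstk_enabled)
{
	if (!shstk_enabled)
		return false;
	return (pte & (PAGE_RW | PAGE_DIRTY)) == PAGE_DIRTY;
}

/* Writable means Write=1, or shadow stack memory when that exists. */
static bool model_pte_write(pteval_t pte, bool shstk_enabled)
{
	return (pte & PAGE_RW) || model_pte_shstk(pte, shstk_enabled);
}
```

For a PTE with only Dirty set, the model reports "writable" with the feature on and "not writable" with it off, so every caller of the write check inherits the dependency.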


* Re: [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall
  2022-09-29 22:29 ` [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
@ 2022-10-03 22:23   ` Kees Cook
  2022-10-04 22:56     ` Edgecombe, Rick P
  2022-10-10 11:13   ` Florian Weimer
  1 sibling, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 22:23 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Thu, Sep 29, 2022 at 03:29:25PM -0700, Rick Edgecombe wrote:
> [...]
> The following example demonstrates how to create a new shadow stack with
> map_shadow_stack:
> void *shstk = map_shadow_stack(adrr, stack_size, SHADOW_STACK_SET_TOKEN);

typo: addr

> [...]
> +451	common	map_shadow_stack	sys_map_shadow_stack

Isn't this "64", not "common"?

> [...]
> +#define SHADOW_STACK_SET_TOKEN	0x1	/* Set up a restore token in the shadow stack */

I think this should get an intro comment, like:

/* Flags for map_shadow_stack(2) */

Also, as with the other UAPI fields, please use "(1ULL << 0)" here.

> @@ -62,24 +63,34 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
>  	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
>  		return -EFAULT;
>  
> -	*token_addr = addr;
> +	if (token_addr)
> +		*token_addr = addr;
>  
>  	return 0;
>  }
>  

Can this just be collapsed into the patch that introduces create_rstor_token()?

> -static unsigned long alloc_shstk(unsigned long size)
> +static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
> +				 unsigned long token_offset, bool set_res_tok)
>  {
>  	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
>  	struct mm_struct *mm = current->mm;
> -	unsigned long addr, unused;
> +	unsigned long mapped_addr, unused;
>  
>  	mmap_write_lock(mm);
> -	addr = do_mmap(NULL, addr, size, PROT_READ, flags,

Oops, I missed in the other patch that "addr" was being passed here.
(uninitialized?)

> -		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
> -
> +	mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
> +			      VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);

I don't see do_mmap() doing anything here to avoid remapping a prior vma
as shstk. Is the intention to allow userspace to convert existing VMAs?
This has caused pain in the past, perhaps force MAP_FIXED_NOREPLACE ?
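
For reference, the no-replace behavior is easy to see from userspace with plain mmap(2). This is a sketch with a made-up helper name (`map_exactly_at`), not the map_shadow_stack path; the fallback flag value is the generic/x86 one and is an assumption on other architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000   /* generic/x86 value, Linux 4.17+ */
#endif

/*
 * Map anonymous memory at exactly `addr`, refusing to clobber an
 * existing mapping. Returns NULL if the range is taken or if the
 * kernel placed the mapping somewhere else (older kernels ignore the
 * flag and treat the address as a hint).
 */
static void *map_exactly_at(void *addr, size_t len)
{
	void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
		       -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	if (p != addr) {
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

Checking the returned address as well as MAP_FAILED keeps the helper safe on kernels that predate the flag.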

> [...]
> @@ -174,6 +185,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
>  
>  
>  	stack_size = PAGE_ALIGN(stack_size);
> +	addr = alloc_shstk(0, stack_size, 0, false);
>  	if (IS_ERR_VALUE(addr))
>  		return PTR_ERR((void *)addr);
>  

As mentioned earlier, I was expecting this patch to replace a (missing)
call to alloc_shstk. i.e. expecting:

-	addr = alloc_shstk(stack_size);

> @@ -395,6 +407,26 @@ int shstk_disable(void)
>  	return 0;
>  }
>  
> +
> +SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)

Please add kernel-doc for this, with some notes. E.g. at least one thing isn't immediately
obvious, maybe more: "addr" must be a multiple of 8.

> +{
> +	unsigned long aligned_size;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return -ENOSYS;

This needs to explicitly reject unknown flags[1], or expanding them in the
future becomes very painful:

	if (flags & ~(SHADOW_STACK_SET_TOKEN))
		return -EINVAL;


[1] https://docs.kernel.org/process/adding-syscalls.html#designing-the-api-planning-for-extension

> +
> +	/*
> +	 * An overflow would result in attempting to write the restore token
> +	 * to the wrong location. Not catastrophic, but just return the right
> +	 * error code and block it.
> +	 */
> +	aligned_size = PAGE_ALIGN(size);
> +	if (aligned_size < size)
> +		return -EOVERFLOW;

The intention here is to allow userspace to ask for _less_ than a page
size multiple, and to put the restore token there?

Is it worth adding a check for size >= 8 here? Or, I guess it would just
immediately crash on the next call?
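
Putting the flag and size checks together, a userspace sketch of the argument validation could look like the following. The `MODEL_*` names and the helper are hypothetical, the 4096-byte page size is an assumption, and the minimum-size check is only the idea floated above (applied here when a token is requested), not something the patch does.

```c
#include <errno.h>

#define MODEL_PAGE_SIZE              4096UL
#define MODEL_SHADOW_STACK_SET_TOKEN (1ULL << 0)
#define MODEL_TOKEN_SIZE             8UL   /* restore token is one u64 */

/*
 * Validate map_shadow_stack-style arguments. On success returns 0 and
 * stores the page-aligned size; otherwise returns a negative errno.
 */
static long model_check_args(unsigned long size, unsigned int flags,
			     unsigned long *aligned_size)
{
	unsigned long aligned;

	/* Reject unknown flags so they can be given meaning later. */
	if (flags & ~MODEL_SHADOW_STACK_SET_TOKEN)
		return -EINVAL;

	/* Hypothetical: a token cannot fit in fewer than 8 bytes. */
	if ((flags & MODEL_SHADOW_STACK_SET_TOKEN) && size < MODEL_TOKEN_SIZE)
		return -EINVAL;

	/* Round up to a page; a wrapped result is smaller than size. */
	aligned = (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
	if (aligned < size)
		return -EOVERFLOW;

	*aligned_size = aligned;
	return 0;
}
```

A size near the top of the address range wraps during rounding and is reported as -EOVERFLOW, matching the comment in the quoted hunk.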

> +
> +	return alloc_shstk(addr, aligned_size, size, flags & SHADOW_STACK_SET_TOKEN);
> +}

-- 
Kees Cook


* Re: [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-10-03 18:11   ` Nadav Amit
  2022-10-03 18:51     ` Dave Hansen
@ 2022-10-03 22:28     ` Edgecombe, Rick P
  2022-10-03 23:17       ` Nadav Amit
  1 sibling, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 22:28 UTC (permalink / raw)
  To: nadav.amit
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, jannh, dethoma, linux-arch, kcc, bp,
	oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
> Did you have a look at ptep_set_access_flags() and friends and
> checked they
> do not need to be changed too? 

ptep_set_access_flags() doesn't actually set any additional dirty bits
on x86, so I think it's ok.

> Perhaps you should at least add some
> assertion just to ensure nothing breaks.

You mean in ptep_set_access_flags()? I think some assertions would be
really great, I'm just not sure where.


* Re: [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace
  2022-09-29 22:29 ` [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace Rick Edgecombe
@ 2022-10-03 22:28   ` Kees Cook
  2022-10-03 23:00     ` Andy Lutomirski
  2022-10-04  8:30     ` Mike Rapoport
  0 siblings, 2 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 22:28 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Thu, Sep 29, 2022 at 03:29:26PM -0700, Rick Edgecombe wrote:
> For the current shadow stack implementation, shadow stacks contents easily
> be arbitrarily provisioned with data.

I can't parse this sentence.

> This property helps apps protect
> themselves better, but also restricts any potential apps that may want to
> do exotic things at the expense of a little security.

Is anything using this right now? Wouldn't things be safer without WRSS?
(Why can't we skip this patch?)

-- 
Kees Cook


* Re: [PATCH v2 30/39] x86: Expose thread features status in /proc/$PID/arch_status
  2022-09-29 22:29 ` [PATCH v2 30/39] x86: Expose thread features status in /proc/$PID/arch_status Rick Edgecombe
@ 2022-10-03 22:37   ` Kees Cook
  2022-10-03 22:45     ` Andy Lutomirski
  0 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 22:37 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Thu, Sep 29, 2022 at 03:29:27PM -0700, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Applications and loaders can have logic to decide whether to enable CET.
> They usually don't report whether CET has been enabled or not, so there
> is no way to verify whether an application actually is protected by CET
> features.
> 
> Add two lines in /proc/$PID/arch_status to report enabled and locked
> features.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> [Switched to CET, added to commit log]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> 
> v2:
>  - New patch
> 
>  arch/x86/kernel/Makefile     |  2 ++
>  arch/x86/kernel/fpu/xstate.c | 47 ---------------------------
>  arch/x86/kernel/proc.c       | 63 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 65 insertions(+), 47 deletions(-)
>  create mode 100644 arch/x86/kernel/proc.c

This is two patches: one to create proc.c, the other to add CET support.

I found where the "arch_status" conversation was:
https://lore.kernel.org/all/CALCETrUjF9PBmkzH1J86vw4ZW785DP7FtcT+gcSrx29=BUnjoQ@mail.gmail.com/

Andy, what did you mean "make sure that everything in it is namespaced"?
Everything already has a field name. And arch_status doesn't exactly
solve having compat fields -- it still needs to be handled manually?
Anyway... we have arch_status, so I guess it's fine.

> [...]
> +int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
> +			struct pid *pid, struct task_struct *task)
> +{
> +	/*
> +	 * Report AVX512 state if the processor and build option supported.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
> +		avx512_status(m, task);
> +
> +	seq_puts(m, "Thread_features:\t");
> +	dump_features(m, task->thread.features);
> +	seq_putc(m, '\n');
> +
> +	seq_puts(m, "Thread_features_locked:\t");
> +	dump_features(m, task->thread.features_locked);
> +	seq_putc(m, '\n');

Why are these always present instead of ifdefed?

-Kees

-- 
Kees Cook


* Re: [PATCH v2 31/39] x86/cet/shstk: Wire in CET interface
  2022-09-29 22:29 ` [PATCH v2 31/39] x86/cet/shstk: Wire in CET interface Rick Edgecombe
@ 2022-10-03 22:41   ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 22:41 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Thu, Sep 29, 2022 at 03:29:28PM -0700, Rick Edgecombe wrote:
> The kernel now has the main CET functionality to support applications.
> Wire in the WRSS and shadow stack enable/disable functions into the
> existing CET API skeleton.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


* Re: [PATCH v2 30/39] x86: Expose thread features status in /proc/$PID/arch_status
  2022-10-03 22:37   ` Kees Cook
@ 2022-10-03 22:45     ` Andy Lutomirski
  2022-10-04  4:18       ` Kees Cook
  0 siblings, 1 reply; 241+ messages in thread
From: Andy Lutomirski @ 2022-10-03 22:45 UTC (permalink / raw)
  To: Kees Cook, Rick P Edgecombe
  Cc: the arch/x86 maintainers, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Linux Kernel Mailing List, linux-doc, linux-mm,
	linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra (Intel),
	Randy Dunlap, Shankar, Ravi V, Weijiang Yang, Kirill A. Shutemov,
	Moreira, Joao, john.allen, kcc, Eranian, Stephane, Mike Rapoport,
	jamorris, dethoma



On Mon, Oct 3, 2022, at 3:37 PM, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:27PM -0700, Rick Edgecombe wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> 
>> Applications and loaders can have logic to decide whether to enable CET.
>> They usually don't report whether CET has been enabled or not, so there
>> is no way to verify whether an application actually is protected by CET
>> features.
>> 
>> Add two lines in /proc/$PID/arch_status to report enabled and locked
>> features.
>> 
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> [Switched to CET, added to commit log]
>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>> 
>> ---
>> 
>> v2:
>>  - New patch
>> 
>>  arch/x86/kernel/Makefile     |  2 ++
>>  arch/x86/kernel/fpu/xstate.c | 47 ---------------------------
>>  arch/x86/kernel/proc.c       | 63 ++++++++++++++++++++++++++++++++++++
>>  3 files changed, 65 insertions(+), 47 deletions(-)
>>  create mode 100644 arch/x86/kernel/proc.c
>
> This is two patches: one to create proc.c, the other to add CET support.
>
> I found where the "arch_status" conversation was:
> https://lore.kernel.org/all/CALCETrUjF9PBmkzH1J86vw4ZW785DP7FtcT+gcSrx29=BUnjoQ@mail.gmail.com/
>
> Andy, what did you mean "make sure that everything in it is namespaced"?
> Everything already has a field name. And arch_status doesn't exactly
> solve having compat fields -- it still needs to be handled manually?
> Anyway... we have arch_status, so I guess it's fine.

I think I meant that, since it's "arch_status" not "x86_status", the fields should have names like "x86.Thread_features".  Otherwise if another architecture adds a Thread_features field, then anything running under something like qemu userspace emulation could be confused.

Assuming that's what I meant, I think my comment still stands :)


* Re: [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-10-03 18:39   ` Kees Cook
@ 2022-10-03 22:49     ` Andy Lutomirski
  2022-10-04  4:21       ` Kees Cook
  0 siblings, 1 reply; 241+ messages in thread
From: Andy Lutomirski @ 2022-10-03 22:49 UTC (permalink / raw)
  To: Kees Cook, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma

On 10/3/22 11:39, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:19PM -0700, Rick Edgecombe wrote:
>> [...]
>> Still allow FOLL_FORCE to write through shadow stack protections, as it
>> does for read-only protections.
> 
> As I asked in the cover letter: why do we need to add this for shstk? It
> was a mistake for general memory. :P

For debuggers, which use FOLL_FORCE, quite intentionally, to modify 
text.  And once a debugger has ptrace write access to a target, shadow 
stacks provide exactly no protection -- ptrace can modify text and all 
registers.

But /proc/.../mem may be a different story, and I'd be okay with having 
FOLL_PROC_MEM for legacy compatibility via /proc/.../mem and not 
allowing that to access shadow stacks.  This does seem like it may not 
be very useful, though.



* Re: [PATCH v2 23/39] x86: Introduce userspace API for CET enabling
  2022-10-03 19:01   ` Kees Cook
@ 2022-10-03 22:51     ` Edgecombe, Rick P
  2022-10-06 18:50       ` Mike Rapoport
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 22:51 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	dave.hansen, kirill.shutemov, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	rppt, john.allen, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

CC Mike about ptrace/CRIU question.

On Mon, 2022-10-03 at 12:01 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:20PM -0700, Rick Edgecombe wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Add three new arch_prctl() handles:
> > 
> >  - ARCH_CET_ENABLE/DISABLE enables or disables the specified
> >    feature. Returns 0 on success or an error.
> > 
> >  - ARCH_CET_LOCK prevents future disabling or enabling of the
> >    specified feature. Returns 0 on success or an error
> > 
> > > The features are handled per-thread and inherited over fork(2)/clone(2),
> > > but reset on exec().
> > 
> > This is preparation patch. It does not impelement any features.
> 
> typo: "implement"

Oops, thanks.

> 
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > [tweaked with feedback from tglx]
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > ---
> > 
> > v2:
> >  - Only allow one enable/disable per call (tglx)
> > >  - Return error code like a normal arch_prctl() (Alexander Potapenko)
> >  - Make CET only (tglx)
> > 
> >  arch/x86/include/asm/cet.h        | 20 ++++++++++++++++
> >  arch/x86/include/asm/processor.h  |  3 +++
> >  arch/x86/include/uapi/asm/prctl.h |  6 +++++
> >  arch/x86/kernel/process.c         |  4 ++++
> >  arch/x86/kernel/process_64.c      |  5 +++-
> > >  arch/x86/kernel/shstk.c           | 38 +++++++++++++++++++++++++++++++
> >  6 files changed, 75 insertions(+), 1 deletion(-)
> >  create mode 100644 arch/x86/include/asm/cet.h
> >  create mode 100644 arch/x86/kernel/shstk.c
> > 
> > > diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> > new file mode 100644
> > index 000000000000..0fa4dbc98c49
> > --- /dev/null
> > +++ b/arch/x86/include/asm/cet.h
> > @@ -0,0 +1,20 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_CET_H
> > +#define _ASM_X86_CET_H
> > +
> > +#ifndef __ASSEMBLY__
> > +#include <linux/types.h>
> > +
> > +struct task_struct;
> > +
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +long cet_prctl(struct task_struct *task, int option,
> > +		      unsigned long features);
> > +#else
> > +static inline long cet_prctl(struct task_struct *task, int option,
> > +		      unsigned long features) { return -EINVAL; }
> > +#endif /* CONFIG_X86_SHADOW_STACK */
> > +
> > +#endif /* __ASSEMBLY__ */
> > +
> > +#endif /* _ASM_X86_CET_H */
> > > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> > index 356308c73951..a92bf76edafe 100644
> > --- a/arch/x86/include/asm/processor.h
> > +++ b/arch/x86/include/asm/processor.h
> > @@ -530,6 +530,9 @@ struct thread_struct {
> >  	 */
> >  	u32			pkru;
> >  
> > +	unsigned long		features;
> > +	unsigned long		features_locked;
> 
> Should these be wrapped in #ifdef CONFIG_X86_SHADOW_STACK (or
> CONFIG_X86_CET) ?
> 
> Also, just named "features"? Is this expected to be more than CET?

Sigh, there have been many ideas about how this API and features
tracking could be shared with LAM. At some point there was some
discussion about LAM using the `features` as well, even if it had a
separate arch_prctl() interface. Just checking the last LAM posting,
I'm not sure it needs it. So yes, this could go back to being CET only
for the time being.

> 
> > +
> >  	/* Floating point and extended processor state */
> >  	struct fpu		fpu;
> >  	/*
> > > diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> > index 500b96e71f18..028158e35269 100644
> > --- a/arch/x86/include/uapi/asm/prctl.h
> > +++ b/arch/x86/include/uapi/asm/prctl.h
> > @@ -20,4 +20,10 @@
> >  #define ARCH_MAP_VDSO_32		0x2002
> >  #define ARCH_MAP_VDSO_64		0x2003
> >  
> > +/* Don't use 0x3001-0x3004 because of old glibcs */
> > +
> > +#define ARCH_CET_ENABLE			0x4001
> > +#define ARCH_CET_DISABLE		0x4002
> > +#define ARCH_CET_LOCK			0x4003
> > +
> >  #endif /* _ASM_X86_PRCTL_H */
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 58a6ea472db9..034880311e6b 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -367,6 +367,10 @@ void arch_setup_new_exec(void)
> >  		task_clear_spec_ssb_noexec(current);
> >  		speculation_ctrl_update(read_thread_flags());
> >  	}
> > +
> > +	/* Reset thread features on exec */
> > +	current->thread.features = 0;
> > +	current->thread.features_locked = 0;
> 
> Same ifdef question here.
> 
> >  }
> >  
> >  #ifdef CONFIG_X86_IOPL_IOPERM
> > > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index 1962008fe743..8fa2c2b7de65 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -829,7 +829,10 @@ long do_arch_prctl_64(struct task_struct
> > *task, int option, unsigned long arg2)
> >  	case ARCH_MAP_VDSO_64:
> >  		return prctl_map_vdso(&vdso_image_64, arg2);
> >  #endif
> > -
> > +	case ARCH_CET_ENABLE:
> > +	case ARCH_CET_DISABLE:
> > +	case ARCH_CET_LOCK:
> > +		return cet_prctl(task, option, arg2);
> >  	default:
> >  		ret = -EINVAL;
> >  		break;
> 
> > I remain annoyed that prctl interfaces didn't use -ENOTSUP for "unknown
> > option". :P
> 
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > new file mode 100644
> > index 000000000000..e3276ac9e9b9
> > --- /dev/null
> > +++ b/arch/x86/kernel/shstk.c
> 
> > I think the Makefile addition should be moved from "x86/cet/shstk:
> > Add user-mode shadow stack support" to here, yes? Otherwise, there is a
> > bisectability randconfig-with-CONFIG_X86_SHADOW_STACK risk here (nothing
> > will implement "cet_prctl").

Oh, yep, good point.

> 
> > @@ -0,0 +1,38 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * shstk.c - Intel shadow stack support
> > + *
> > + * Copyright (c) 2021, Intel Corporation.
> > + * Yu-cheng Yu <yu-cheng.yu@intel.com>
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/bitops.h>
> > +#include <asm/prctl.h>
> > +
> > > +long cet_prctl(struct task_struct *task, int option, unsigned long features)
> > +{
> > +	if (option == ARCH_CET_LOCK) {
> > +		task->thread.features_locked |= features;
> > +		return 0;
> > +	}
> > +
> > +	/* Don't allow via ptrace */
> > +	if (task != current)
> > +		return -EINVAL;
> 
> > ... but locking _is_ allowed via ptrace? If that's intended, it should be
> > explicitly mentioned in the commit log and in a comment here.

I believe CRIU needs to lock via ptrace as well. Maybe Mike can
confirm.

I can mention it.

> 
> Also, perhaps -ESRCH ?
> 
> > +
> > +	/* Do not allow to change locked features */
> > +	if (features & task->thread.features_locked)
> > +		return -EPERM;
> > +
> > +	/* Only support enabling/disabling one feature at a time. */
> > +	if (hweight_long(features) > 1)
> > +		return -EINVAL;
> 
> Perhaps -E2BIG ?

Ehh, I don't know. E2MUCH maybe. It's not necessarily too big. Like if
a third bit was added for IBT some day, you could set SHSTK and WRSS,
and it would be invalid, but still be "less" than the valid passing of
just the (hypothetical) IBT bit.
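
To illustrate the point, here is a minimal userspace sketch. The bit
values (FEAT_SHSTK, FEAT_WRSS, FEAT_IBT) and the helper names are
made up for illustration and are not the actual ABI values;
count_bits() stands in for the kernel's hweight_long().

```c
/* Hypothetical feature bits, for illustration only (not the real ABI). */
#define FEAT_SHSTK (1UL << 0)
#define FEAT_WRSS  (1UL << 1)
#define FEAT_IBT   (1UL << 2)	/* the hypothetical third bit */

/* Userspace stand-in for the kernel's hweight_long(). */
static int count_bits(unsigned long v)
{
	return __builtin_popcountl(v);
}

/* Mirrors the "one feature at a time" check: more than one set bit is invalid. */
static int at_most_one_feature(unsigned long features)
{
	return count_bits(features) <= 1;
}
```

With these values, SHSTK|WRSS is numerically 3 and gets rejected, while
the lone IBT bit is 4 and passes, so the rejected argument is the
smaller number.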

> 
> > +	if (option == ARCH_CET_DISABLE) {
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* Handle ARCH_CET_ENABLE */
> > +	return -EINVAL;
> > +}
> > -- 
> > 2.17.1
> > 
> 
> 

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-09-29 22:29 ` [PATCH v2 07/39] x86/cet: Add user control-protection fault handler Rick Edgecombe
  2022-10-03 14:01   ` Kirill A . Shutemov
  2022-10-03 18:04   ` Kees Cook
@ 2022-10-03 22:51   ` Andy Lutomirski
  2022-10-03 23:09     ` H. Peter Anvin
  2022-10-03 23:11     ` Edgecombe, Rick P
  2022-10-05  1:20   ` Andrew Cooper
  2022-10-05  9:39   ` Peter Zijlstra
  4 siblings, 2 replies; 241+ messages in thread
From: Andy Lutomirski @ 2022-10-03 22:51 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H . J . Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma
  Cc: Yu-cheng Yu, Michael Kerrisk

On 9/29/22 15:29, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 

> +static void do_user_control_protection_fault(struct pt_regs *regs,
> +					     unsigned long error_code)
>   {
> -	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
> -		pr_err("Unexpected #CP\n");
> -		BUG();
> +	struct task_struct *tsk;
> +	unsigned long ssp;
> +
> +	/* Read SSP before enabling interrupts. */
> +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +
> +	cond_local_irq_enable(regs);

I feel like I'm missing something.  Either PL3_SSP is context switched
correctly and reading it with IRQs off is useless, or it's not context
switched, and I'm very confused.

Please either improve the comment or move it after the 
cond_local_irq_enable().

--Andy

> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
> +
> +	tsk = current;
> +	tsk->thread.error_code = error_code;
> +	tsk->thread.trap_nr = X86_TRAP_CP;
> +
> +	/* Ratelimit to prevent log spamming. */
> +	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> +	    __ratelimit(&cpf_rate)) {
> +		unsigned int cpec;
> +
> +		cpec = error_code & CP_EC;
> +		if (cpec >= ARRAY_SIZE(control_protection_err))
> +			cpec = 0;
> +
> +		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
> +			 tsk->comm, task_pid_nr(tsk),
> +			 regs->ip, regs->sp, ssp, error_code,
> +			 control_protection_err[cpec],
> +			 error_code & CP_ENCL ? " in enclave" : "");
> +		print_vma_addr(KERN_CONT " in ", regs->ip);
> +		pr_cont("\n");
>   	}
>   
> -	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
> -		return;
> +	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> +	cond_local_irq_disable(regs);
> +}
> +#else
> +static void do_user_control_protection_fault(struct pt_regs *regs,
> +					     unsigned long error_code)
> +{
> +	WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
> +}
> +#endif
> +
> +#ifdef CONFIG_X86_KERNEL_IBT
> +
> +static __ro_after_init bool ibt_fatal = true;
> +
> +extern void ibt_selftest_ip(void); /* code label defined in asm below */
>   
> +static void do_kernel_control_protection_fault(struct pt_regs *regs)
> +{
>   	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
>   		regs->ax = 0;
>   		return;
> @@ -283,9 +335,29 @@ static int __init ibt_setup(char *str)
>   }
>   
>   __setup("ibt=", ibt_setup);
> -
> +#else
> +static void do_kernel_control_protection_fault(struct pt_regs *regs)
> +{
> +	WARN_ONCE(1, "Kernel-mode control protection fault with IBT disabled\n");
> +}
>   #endif /* CONFIG_X86_KERNEL_IBT */
>   
> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
> +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
> +	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pr_err("Unexpected #CP\n");
> +		BUG();
> +	}
> +
> +	if (user_mode(regs))
> +		do_user_control_protection_fault(regs, error_code);
> +	else
> +		do_kernel_control_protection_fault(regs);
> +}
> +#endif /* defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK) */
> +
>   #ifdef CONFIG_X86_F00F_BUG
>   void handle_invalid_op(struct pt_regs *regs)
>   #else
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 0ed2e487a693..57faa287163f 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -628,7 +628,7 @@ static struct trap_array_entry trap_array[] = {
>   	TRAP_ENTRY(exc_coprocessor_error,		false ),
>   	TRAP_ENTRY(exc_alignment_check,			false ),
>   	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
> -#ifdef CONFIG_X86_KERNEL_IBT
> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
>   	TRAP_ENTRY(exc_control_protection,		false ),
>   #endif
>   };
> diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
> index 6b4fdf6b9542..e45ff6300c7d 100644
> --- a/arch/x86/xen/xen-asm.S
> +++ b/arch/x86/xen/xen-asm.S
> @@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
>   xen_pv_trap asm_exc_spurious_interrupt_bug
>   xen_pv_trap asm_exc_coprocessor_error
>   xen_pv_trap asm_exc_alignment_check
> -#ifdef CONFIG_X86_KERNEL_IBT
> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
>   xen_pv_trap asm_exc_control_protection
>   #endif
>   #ifdef CONFIG_X86_MCE
> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> index ffbe4cec9f32..0f52d0ac47c5 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -242,7 +242,8 @@ typedef struct siginfo {
>   #define SEGV_ADIPERR	7	/* Precise MCD exception */
>   #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
>   #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
> -#define NSIGSEGV	9
> +#define SEGV_CPERR	10	/* Control protection fault */
> +#define NSIGSEGV	10
>   
>   /*
>    * SIGBUS si_codes


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace
  2022-10-03 22:28   ` Kees Cook
@ 2022-10-03 23:00     ` Andy Lutomirski
  2022-10-04  4:37       ` Kees Cook
  2022-10-04  8:30     ` Mike Rapoport
  1 sibling, 1 reply; 241+ messages in thread
From: Andy Lutomirski @ 2022-10-03 23:00 UTC (permalink / raw)
  To: Kees Cook, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma

On 10/3/22 15:28, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:26PM -0700, Rick Edgecombe wrote:
>> For the current shadow stack implementation, shadow stacks contents easily
>> be arbitrarily provisioned with data.
> 
> I can't parse this sentence.
> 
>> This property helps apps protect
>> themselves better, but also restricts any potential apps that may want to
>> do exotic things at the expense of a little security.
> 
> Is anything using this right now? Wouldn't things be safer without WRSS?
> (Why can't we skip this patch?)
> 

So that people don't write programs that need either (shstk off) or 
(shstk on and WRSS on) and crash or otherwise fail on kernels that 
support shstk but don't support WRSS, perhaps?

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-10-03 22:51   ` Andy Lutomirski
@ 2022-10-03 23:09     ` H. Peter Anvin
  2022-10-03 23:11     ` Edgecombe, Rick P
  1 sibling, 0 replies; 241+ messages in thread
From: H. Peter Anvin @ 2022-10-03 23:09 UTC (permalink / raw)
  To: Andy Lutomirski, Rick Edgecombe, x86, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H . J . Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma
  Cc: Yu-cheng Yu, Michael Kerrisk

On October 3, 2022 3:51:59 PM PDT, Andy Lutomirski <luto@kernel.org> wrote:
>On 9/29/22 15:29, Rick Edgecombe wrote:
>> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>> 
>
>> +static void do_user_control_protection_fault(struct pt_regs *regs,
>> +					     unsigned long error_code)
>>   {
>> -	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
>> -		pr_err("Unexpected #CP\n");
>> -		BUG();
>> +	struct task_struct *tsk;
>> +	unsigned long ssp;
>> +
>> +	/* Read SSP before enabling interrupts. */
>> +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
>> +
>> +	cond_local_irq_enable(regs);
>
>I feel like I'm missing something.  Either PL3_SSP is context switched correctly and reading it with IRQs off is useless, or it's not context switched, and I'm very confused.
>
>Please either improve the comment or move it after the cond_local_irq_enable().
>
>--Andy
>
>> +
>> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
>> +		WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
>> +
>> +	tsk = current;
>> +	tsk->thread.error_code = error_code;
>> +	tsk->thread.trap_nr = X86_TRAP_CP;
>> +
>> +	/* Ratelimit to prevent log spamming. */
>> +	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
>> +	    __ratelimit(&cpf_rate)) {
>> +		unsigned int cpec;
>> +
>> +		cpec = error_code & CP_EC;
>> +		if (cpec >= ARRAY_SIZE(control_protection_err))
>> +			cpec = 0;
>> +
>> +		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
>> +			 tsk->comm, task_pid_nr(tsk),
>> +			 regs->ip, regs->sp, ssp, error_code,
>> +			 control_protection_err[cpec],
>> +			 error_code & CP_ENCL ? " in enclave" : "");
>> +		print_vma_addr(KERN_CONT " in ", regs->ip);
>> +		pr_cont("\n");
>>   	}
>>   
>> -	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
>> -		return;
>> +	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
>> +	cond_local_irq_disable(regs);
>> +}
>> +#else
>> +static void do_user_control_protection_fault(struct pt_regs *regs,
>> +					     unsigned long error_code)
>> +{
>> +	WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
>> +}
>> +#endif
>> +
>> +#ifdef CONFIG_X86_KERNEL_IBT
>> +
>> +static __ro_after_init bool ibt_fatal = true;
>> +
>> +extern void ibt_selftest_ip(void); /* code label defined in asm below */
>>   
>> +static void do_kernel_control_protection_fault(struct pt_regs *regs)
>> +{
>>   	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
>>   		regs->ax = 0;
>>   		return;
>> @@ -283,9 +335,29 @@ static int __init ibt_setup(char *str)
>>   }
>>   
>>   __setup("ibt=", ibt_setup);
>> -
>> +#else
>> +static void do_kernel_control_protection_fault(struct pt_regs *regs)
>> +{
>> +	WARN_ONCE(1, "Kernel-mode control protection fault with IBT disabled\n");
>> +}
>>   #endif /* CONFIG_X86_KERNEL_IBT */
>>   
>> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
>> +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
>> +{
>> +	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
>> +	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
>> +		pr_err("Unexpected #CP\n");
>> +		BUG();
>> +	}
>> +
>> +	if (user_mode(regs))
>> +		do_user_control_protection_fault(regs, error_code);
>> +	else
>> +		do_kernel_control_protection_fault(regs);
>> +}
>> +#endif /* defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK) */
>> +
>>   #ifdef CONFIG_X86_F00F_BUG
>>   void handle_invalid_op(struct pt_regs *regs)
>>   #else
>> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
>> index 0ed2e487a693..57faa287163f 100644
>> --- a/arch/x86/xen/enlighten_pv.c
>> +++ b/arch/x86/xen/enlighten_pv.c
>> @@ -628,7 +628,7 @@ static struct trap_array_entry trap_array[] = {
>>   	TRAP_ENTRY(exc_coprocessor_error,		false ),
>>   	TRAP_ENTRY(exc_alignment_check,			false ),
>>   	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
>> -#ifdef CONFIG_X86_KERNEL_IBT
>> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
>>   	TRAP_ENTRY(exc_control_protection,		false ),
>>   #endif
>>   };
>> diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
>> index 6b4fdf6b9542..e45ff6300c7d 100644
>> --- a/arch/x86/xen/xen-asm.S
>> +++ b/arch/x86/xen/xen-asm.S
>> @@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
>>   xen_pv_trap asm_exc_spurious_interrupt_bug
>>   xen_pv_trap asm_exc_coprocessor_error
>>   xen_pv_trap asm_exc_alignment_check
>> -#ifdef CONFIG_X86_KERNEL_IBT
>> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
>>   xen_pv_trap asm_exc_control_protection
>>   #endif
>>   #ifdef CONFIG_X86_MCE
>> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
>> index ffbe4cec9f32..0f52d0ac47c5 100644
>> --- a/include/uapi/asm-generic/siginfo.h
>> +++ b/include/uapi/asm-generic/siginfo.h
>> @@ -242,7 +242,8 @@ typedef struct siginfo {
>>   #define SEGV_ADIPERR	7	/* Precise MCD exception */
>>   #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
>>   #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
>> -#define NSIGSEGV	9
>> +#define SEGV_CPERR	10	/* Control protection fault */
>> +#define NSIGSEGV	10
>>   
>>   /*
>>    * SIGBUS si_codes
>

Could something change the value under a switched-out thread, though?

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-10-03 22:51   ` Andy Lutomirski
  2022-10-03 23:09     ` H. Peter Anvin
@ 2022-10-03 23:11     ` Edgecombe, Rick P
  1 sibling, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 23:11 UTC (permalink / raw)
  To: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng, mtk.manpages

On Mon, 2022-10-03 at 15:51 -0700, Andy Lutomirski wrote:
> On 9/29/22 15:29, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > +static void do_user_control_protection_fault(struct pt_regs *regs,
> > +                                          unsigned long
> > error_code)
> >    {
> > -     if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
> > -             pr_err("Unexpected #CP\n");
> > -             BUG();
> > +     struct task_struct *tsk;
> > +     unsigned long ssp;
> > +
> > +     /* Read SSP before enabling interrupts. */
> > +     rdmsrl(MSR_IA32_PL3_SSP, ssp);
> > +
> > +     cond_local_irq_enable(regs);
> 
> I feel like I'm missing something.  Either PL3_SSP is context
> switched correctly and reading it with IRQs off is useless, or it's
> not context switched, and I'm very confused.
> 
> Please either improve the comment or move it after the 
> cond_local_irq_enable().

The thinking was, we were just in userspace and we took a #CP. Since we
were in userspace, the SSP is still live in the MSR. After we re-enable
interrupts we could get scheduled out, and the SSP would then be in the
xsave buffer instead. So we can grab it for free now; otherwise we would
have to force a restore and read it after re-enabling interrupts.

I can clarify the comments, unless there is something wrong with that
reasoning.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-10-03 22:28     ` Edgecombe, Rick P
@ 2022-10-03 23:17       ` Nadav Amit
  2022-10-03 23:20         ` Nadav Amit
  0 siblings, 1 reply; 241+ messages in thread
From: Nadav Amit @ 2022-10-03 23:17 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, jannh, dethoma, linux-arch, kcc, bp,
	oleg, hjl.tools, Yang, Weijiang, Andy Lutomirski, pavel, arnd,
	Moreira, Joao, Thomas Gleixner, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:

> On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
>> Did you have a look at ptep_set_access_flags() and friends and
>> checked they
>> do not need to be changed too? 
> 
> ptep_set_access_flags() doesn't actually set any additional dirty bits
> on x86, so I think it's ok.

Are you sure about that? (lost my confidence today so I am hesitant).

Looking at insert_pfn(), I see:

                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
                        if (ptep_set_access_flags(vma, addr, pte, entry, 1)) ...

This appears to set the dirty bit while potentially leaving the write-bit
clear. This is the scenario you want to avoid, no?

>> Perhaps you should at least add some
>> assertion just to ensure nothing breaks.
> 
> You mean in ptep_set_access_flags()? I think some assertions would be
> really great, I'm just not sure where.

Yes, on x86’s version of the function.


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-10-03 23:17       ` Nadav Amit
@ 2022-10-03 23:20         ` Nadav Amit
  2022-10-03 23:25           ` Nadav Amit
  0 siblings, 1 reply; 241+ messages in thread
From: Nadav Amit @ 2022-10-03 23:20 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, jannh, dethoma, linux-arch, kcc, bp,
	oleg, hjl.tools, Yang, Weijiang, Andy Lutomirski, pavel, arnd,
	Moreira, Joao, Thomas Gleixner, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Oct 3, 2022, at 4:17 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
>> On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
>>> Did you have a look at ptep_set_access_flags() and friends and
>>> checked they
>>> do not need to be changed too? 
>> 
>> ptep_set_access_flags() doesn't actually set any additional dirty bits
>> on x86, so I think it's ok.
> 
> Are you sure about that? (lost my confidence today so I am hesitant).
> 
> Looking on insert_pfn(), I see:
> 
>                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>                        if (ptep_set_access_flags(vma, addr, pte, entry, 1)) ...
> 
> This appears to set the dirty bit while potentially leaving the write-bit
> clear. This is the scenario you want to avoid, no?

No. I am not paying attention. Ignore.


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [OPTIONAL/RFC v2 39/39] x86: Add alt shadow stack support
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 39/39] x86: Add alt shadow stack support Rick Edgecombe
@ 2022-10-03 23:21   ` Andy Lutomirski
  2022-10-04 16:12     ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Andy Lutomirski @ 2022-10-03 23:21 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H . J . Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma

On 9/29/22 15:29, Rick Edgecombe wrote:
> To handle stack overflows, applications can register a separate signal alt
> stack to use for the stack to handle signals. To handle shadow stack
> overflows the kernel can similarly provide the ability to have an alt
> shadow stack.


The overall SHSTK mechanism has a concept of a shadow stack that is 
valid and not in use and a shadow stack that is in use.  This is used, 
for example, by RSTORSSP.  I would like to imagine that this serves a 
real purpose (presumably preventing two different threads from using the 
same shadow stack and thus corrupting each others' state).

So maybe altshstk should use exactly the same mechanism.  Either signal 
delivery should do the atomic verify-and-mark-busy routine or registering 
the stack as an altstack should do it.

I think your patch has this maybe 1/3 implemented, but I don't see any 
atomics, and you seem to have removed (?) the code that actually 
modifies the token on the stack.
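
As a rough sketch of what an atomic verify-and-mark-busy step could
look like, here is a userspace model using C11 atomics. The TOKEN_BUSY
bit, the token value and the function names are hypothetical and do
not reflect the real hardware token layout or any kernel helper; the
point is only that the check and the update happen in one
compare-and-exchange.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical busy flag; not the real hardware token encoding. */
#define TOKEN_BUSY ((uint64_t)1 << 1)

/*
 * Atomically verify that the token still holds the expected "free"
 * value and mark it busy. Returns false if the token is invalid or
 * has already been claimed by someone else.
 */
static bool token_verify_and_mark_busy(_Atomic uint64_t *token,
				       uint64_t expected_free)
{
	uint64_t old = expected_free;

	return atomic_compare_exchange_strong(token, &old,
					      expected_free | TOKEN_BUSY);
}

/* Release the token again; succeeds only if it is currently marked busy. */
static bool token_release(_Atomic uint64_t *token, uint64_t expected_free)
{
	uint64_t old = expected_free | TOKEN_BUSY;

	return atomic_compare_exchange_strong(token, &old, expected_free);
}
```

With this shape, two racing users of the same stack cannot both
succeed: the second claim fails until the first one releases.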

>   
> +static bool on_alt_shstk(unsigned long ssp)
> +{
> +	unsigned long alt_ss_start = current->thread.sas_shstk_sp;
> +	unsigned long alt_ss_end = alt_ss_start + current->thread.sas_shstk_size;
> +
> +	return ssp >= alt_ss_start && ssp < alt_ss_end;
> +}

We're forcing AUTODISARM behavior (right?), so I don't think this is 
needed at all.  User code is never "on the alt stack".  It's either "on 
the alt stack but the alt stack is disarmed, so it's not on the alt 
stack" or it's just straight up not on the alt stack.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-10-03 23:20         ` Nadav Amit
@ 2022-10-03 23:25           ` Nadav Amit
  2022-10-03 23:38             ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Nadav Amit @ 2022-10-03 23:25 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, jannh, dethoma, linux-arch, kcc, bp,
	oleg, hjl.tools, Yang, Weijiang, Andy Lutomirski, pavel, arnd,
	Moreira, Joao, Thomas Gleixner, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Oct 3, 2022, at 4:20 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> On Oct 3, 2022, at 4:17 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>> On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>> 
>>> On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
>>>> Did you have a look at ptep_set_access_flags() and friends and
>>>> checked they
>>>> do not need to be changed too? 
>>> 
>>> ptep_set_access_flags() doesn't actually set any additional dirty bits
>>> on x86, so I think it's ok.
>> 
>> Are you sure about that? (lost my confidence today so I am hesitant).
>> 
>> Looking on insert_pfn(), I see:
>> 
>>                       entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>                       if (ptep_set_access_flags(vma, addr, pte, entry, 1)) ...
>> 
>> This appears to set the dirty bit while potentially leaving the write-bit
>> clear. This is the scenario you want to avoid, no?
> 
> No. I am not paying attention. Ignore.

Sorry for the spam. It's just that this “dirty” argument is confusing. This
indeed seems like a flow that can set the dirty bit. I think.


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-10-03 23:25           ` Nadav Amit
@ 2022-10-03 23:38             ` Edgecombe, Rick P
  2022-10-04  0:40               ` Nadav Amit
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-03 23:38 UTC (permalink / raw)
  To: nadav.amit
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	jamorris, arnd, Moreira, Joao, tglx, bp, mike.kravetz, x86,
	linux-doc, rppt, john.allen, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 16:25 -0700, Nadav Amit wrote:
> On Oct 3, 2022, at 4:20 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> > On Oct 3, 2022, at 4:17 PM, Nadav Amit <nadav.amit@gmail.com>
> > wrote:
> > 
> > > On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <
> > > rick.p.edgecombe@intel.com> wrote:
> > > 
> > > > On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
> > > > > Did you have a look at ptep_set_access_flags() and friends
> > > > > and
> > > > > checked they
> > > > > do not need to be changed too? 
> > > > 
> > > > ptep_set_access_flags() doesn't actually set any additional
> > > > dirty bits
> > > > on x86, so I think it's ok.
> > > 
> > > Are you sure about that? (lost my confidence today so I am
> > > hesitant).
> > > 
> > > Looking on insert_pfn(), I see:
> > > 
> > >                        entry = maybe_mkwrite(pte_mkdirty(entry),
> > > vma);
> > >                        if (ptep_set_access_flags(vma, addr, pte,
> > > entry, 1)) ...
> > > 
> > > This appears to set the dirty bit while potentially leaving the
> > > write-bit
> > > clear. This is the scenario you want to avoid, no?
> > 
> > No. I am not paying attention. Ignore.
> 
> Sorry for the spam. Just this “dirty” argument is confusing. This
> indeed
> seems like a flow that can set the dirty bit. I think.

I think the HW dirty bit will not be set here. How it works is,
pte_mkdirty() will not actually set the HW dirty bit, but instead the
software COW bit. Here is the relevant snippet:

static inline pte_t pte_mkdirty(pte_t pte)
{
	pteval_t dirty = _PAGE_DIRTY;

	/* Avoid creating Dirty=1,Write=0 PTEs */
	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
		dirty = _PAGE_COW;

	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
}

So for a !VM_WRITE vma, you end up with Write=0,Cow=1 PTE passed
into ptep_set_access_flags(). Does it make sense?
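
The same logic can be poked at with a small userspace model. This is
only a sketch: the bit positions below are illustrative and may not
match the kernel's real layout, and the shstk_enabled parameter stands
in for cpu_feature_enabled(X86_FEATURE_SHSTK).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

/* Illustrative bit positions; the kernel's real layout may differ. */
#define PAGE_RW		((pteval_t)1 << 1)
#define PAGE_DIRTY	((pteval_t)1 << 6)
#define PAGE_SOFT_DIRTY	((pteval_t)1 << 11)
#define PAGE_COW	((pteval_t)1 << 58)

/*
 * Userspace model of the pte_mkdirty() snippet above. A non-writable
 * PTE gets the software COW bit instead of the HW dirty bit when
 * shadow stack is enabled.
 */
static pteval_t model_pte_mkdirty(pteval_t pte, bool shstk_enabled)
{
	pteval_t dirty = PAGE_DIRTY;

	/* Avoid creating Dirty=1,Write=0 PTEs */
	if (shstk_enabled && !(pte & PAGE_RW))
		dirty = PAGE_COW;

	return pte | dirty | PAGE_SOFT_DIRTY;
}
```

For the !VM_WRITE case maybe_mkwrite() leaves the PTE alone, so calling
model_pte_mkdirty(0, true) gives the Write=0,Cow=1 value described
above, with the HW dirty bit still clear.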

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack
  2022-09-29 22:29 ` [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack Rick Edgecombe
  2022-10-03 18:22   ` Kees Cook
@ 2022-10-03 23:53   ` Kirill A . Shutemov
  2022-10-14 15:32   ` Peter Zijlstra
  2 siblings, 0 replies; 241+ messages in thread
From: Kirill A . Shutemov @ 2022-10-03 23:53 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, joao.moreira, John Allen, kcc,
	eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:13PM -0700, Rick Edgecombe wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8cd413c5a329..fef14ab3abcb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -981,13 +981,25 @@ void free_compound_page(struct page *page);
>   * servicing faults for write access.  In the normal case, do always want
>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
> + *
> + * If a vma is shadow stack (a type of writable memory), mark the pte shadow
> + * stack.
>   */
> +#ifndef maybe_mkwrite
>  static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>  {
> -	if (likely(vma->vm_flags & VM_WRITE))
> +	if (!(vma->vm_flags & VM_WRITE))
> +		goto out;
> +
> +	if (vma->vm_flags & VM_SHADOW_STACK)
> +		pte = pte_mkwrite_shstk(pte);
> +	else
>  		pte = pte_mkwrite(pte);
> +
> +out:
>  	return pte;
>  }
> +#endif

Maybe take the opportunity to move it to <linux/pgtable.h>? <linux/mm.h> is
really not the place for this helper.
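
For reference, the selection logic in the hunk above can be modeled in
userspace roughly as follows. The flag values and the PTE_WRITABLE /
PTE_SHSTK markers are hypothetical stand-ins, not the real VM_* values
or the actual shadow stack PTE encoding; only the branching mirrors the
patch.

```c
#include <stdint.h>

typedef uint64_t pteval_t;

/* Hypothetical stand-ins for VM_WRITE and VM_SHADOW_STACK. */
#define VM_WRITE_F		(1UL << 0)
#define VM_SHADOW_STACK_F	(1UL << 1)

/* Hypothetical markers for "made writable" and "made shadow stack". */
#define PTE_WRITABLE	((pteval_t)1 << 0)
#define PTE_SHSTK	((pteval_t)1 << 1)

/*
 * Same branching as the patched maybe_mkwrite(): nothing happens
 * without VM_WRITE, otherwise the VMA type picks the helper.
 */
static pteval_t model_maybe_mkwrite(pteval_t pte, unsigned long vm_flags)
{
	if (!(vm_flags & VM_WRITE_F))
		return pte;

	if (vm_flags & VM_SHADOW_STACK_F)
		return pte | PTE_SHSTK;	/* stands in for pte_mkwrite_shstk() */

	return pte | PTE_WRITABLE;	/* stands in for pte_mkwrite() */
}
```

Note that in this model, as in the hunk, a VMA with the shadow stack
flag but without VM_WRITE is left untouched.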


-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly
  2022-09-29 22:29 ` [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
  2022-10-03 18:24   ` Kees Cook
@ 2022-10-03 23:56   ` Kirill A . Shutemov
  2022-10-04 16:15     ` Edgecombe, Rick P
  2022-10-04  1:56   ` Nadav Amit
  2022-10-14 15:52   ` Peter Zijlstra
  3 siblings, 1 reply; 241+ messages in thread
From: Kirill A . Shutemov @ 2022-10-03 23:56 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, joao.moreira, John Allen, kcc,
	eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> With the introduction of shadow stack memory there are two ways a pte can
> be writable: regular writable memory and shadow stack memory.
> 
> In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
> or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
> where a PTE is made writable. However, there are places where pte_mkwrite()
> is called directly and the logic should now also create a shadow stack PTE
> in the case of a shadow stack VMA.
> 
>  - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
>    directly and call pte_mkwrite(), which is the same as maybe_mkwrite()
>    in logic and intention. Just change them to maybe_mkwrite().

Looks like you folded the change for do_anonymous_page() into the wrong patch.
I see the relevant change in the previous patch.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 32/39] selftests/x86: Add shadow stack test
  2022-09-29 22:29 ` [PATCH v2 32/39] selftests/x86: Add shadow stack test Rick Edgecombe
@ 2022-10-03 23:56   ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-03 23:56 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:29PM -0700, Rick Edgecombe wrote:
> Add a simple selftest for exercising some shadow stack behavior:
>  - map_shadow_stack syscall and pivot
>  - Faulting in shadow stack memory
>  - Handling shadow stack violations
>  - GUP of shadow stack memory
>  - mprotect() of shadow stack memory
>  - Userfaultfd on shadow stack memory
> 
> Since this test exercises a recently added syscall manually, it needs
> to find the automatically created __NR_foo defines. Per the selftest
> documentation, KHDR_INCLUDES can be used to help the selftest Makefile's
> find the headers from the kernel source. This way the new selftest can
> be built inside the kernel source tree without installing the headers
> to the system. So also add KHDR_INCLUDES as described in the selftest
> docs, to facilitate this.
> 
> Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Yay tests! Thank you thank you! :)

> @@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
>  			test_FCMOV test_FCOMI test_FISTTP \
>  			vdso_restorer
>  TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
> -			corrupt_xstate_header amx
> +			corrupt_xstate_header amx test_shadow_stack

At present, there is still a map_shadow_stack syscall on 32-bit, so it
should be tested (that it correctly does nothing with the expected error
results), if it is kept. :P

> [...]
> +#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
> +int main(int argc, char *argv[])
> +{
> +	printf("[SKIP]\tCompiler does not support CET.\n");
> +	return 0;
> +}

I realize the other x86 selftests don't use the standard kselftest test
harness, but when an entirely new test is being written, as here, it
makes sense to use it. That would avoid bugs like the above, where a
SKIP is reported as a success rather than a skip (i.e. the wrong exit
code). See tools/testing/selftests/kselftest_harness.h

Note that each TEST is run as a separate process.

The skip here would be rewritten as:

...
#include "../kselftest_harness.h"

#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
TEST(compiler_support)
{
	SKIP(return, "Compiler does not support CET.");
}
#else
...rest of tests...
#endif

TEST_HARNESS_MAIN
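
For a test that stays outside the harness, the same bug could be avoided
by returning the kselftest skip code instead of 0. A minimal sketch,
assuming the KSFT_* values from tools/testing/selftests/kselftest.h
(copied here so the block builds on its own); cet_compiler_exit_code()
is a hypothetical helper for illustration, not something in the patch:

```c
#include <assert.h>
#include <stdio.h>

/* Exit codes as defined in tools/testing/selftests/kselftest.h. */
#define KSFT_PASS 0
#define KSFT_FAIL 1
#define KSFT_SKIP 4

/*
 * Hypothetical helper: map a GCC version to the exit code the test
 * should report. Mirrors the version check in the patch, which treats
 * anything older than GCC 8.5 as lacking CET support.
 */
static int cet_compiler_exit_code(int major, int minor)
{
	assert(major >= 0 && minor >= 0);

	if (major < 8 || (major == 8 && minor < 5)) {
		printf("[SKIP]\tCompiler does not support CET.\n");
		return KSFT_SKIP;	/* reported as a skip, not a pass */
	}
	return KSFT_PASS;
}
```

main() would then return cet_compiler_exit_code(__GNUC__, __GNUC_MINOR__)
in the #if branch, so the runner sees a skip rather than a pass.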


I'll give some other examples of replacements below...

> +#else
> +void write_shstk(unsigned long *addr, unsigned long val)
> +{
> +	asm volatile("wrssq %[val], (%[addr])\n"
> +		     : "+m" (addr)
> +		     : [addr] "r" (addr), [val] "r" (val));
> +}
> +
> +static inline unsigned long __attribute__((always_inline)) get_ssp(void)
> +{
> +	unsigned long ret = 0;
> +
> +	asm volatile("xor %0, %0; rdsspq %0" : "=r" (ret));
> +	return ret;
> +}
> +
> +/*
> + * For use in inline enablement of shadow stack.
> + *
> + * The program can't return from the point where shadow stack get's enabled
> + * because there will be no address on the shadow stack. So it can't use
> + * syscall() for enablement, since it is a function.

Hmm, this will be a problem for glibc too?

> + *
> + * Based on code from nolibc.h. Keep a copy here because this can't pull in all
> + * of nolibc.h.
> + */
> +#define ARCH_PRCTL(arg1, arg2)					\
> +({								\
> +	long _ret;						\
> +	register long _num  asm("eax") = __NR_arch_prctl;	\
> +	register long _arg1 asm("rdi") = (long)(arg1);		\
> +	register long _arg2 asm("rsi") = (long)(arg2);		\
> +								\
> +	asm volatile (						\
> +		"syscall\n"					\
> +		: "=a"(_ret)					\
> +		: "r"(_arg1), "r"(_arg2),			\
> +		  "0"(_num)					\
> +		: "rcx", "r11", "memory", "cc"			\
> +	);							\
> +	_ret;							\
> +})
> +
> +void *create_shstk(void *addr)
> +{
> +	return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
> +}

Hmm, I'd suggest adding some wider exercising of the syscall itself.
(This only ever tests SS_SIZE and SHADOW_STACK_SET_TOKEN). I'd expect to
see testing of error conditions too:

TEST(map_shadow_stack_bad_args)
{
	int ret;

	ret = ARCH_PRCTL(ARCH_CET_ENABLE, CET_SHSTK);
	ASSERT_EQ(0, ret) {
		TH_LOG("Could not enable SHSTK");
	}

	ret = syscall(__NR_map_shadow_stack, addr, SS_SIZE, 0);
	EXPECT_EQ(-1, ret);
	EXPECT_EQ(errno, EINVAL);

	ret = syscall(__NR_map_shadow_stack, addr, SS_SIZE, ~(SHADOW_STACK_SET_TOKEN));
	EXPECT_EQ(-1, ret);
	EXPECT_EQ(errno, EINVAL);

	ret = syscall(__NR_map_shadow_stack, addr, ULONG_MAX, SHADOW_STACK_SET_TOKEN);
	EXPECT_EQ(-1, ret);
	EXPECT_EQ(errno, ENOMEM);

	ret = syscall(__NR_map_shadow_stack, addr, 0, SHADOW_STACK_SET_TOKEN);
	EXPECT_EQ(-1, ret);
	EXPECT_EQ(errno, EINVAL);

	...
}

Although the last example there will probably segv, so that could be
extracted to a separate test:

TEST_SIGNAL(map_shadow_stack_tiny, SIGSEGV)
{
	int ret;

	ret = ARCH_PRCTL(ARCH_CET_ENABLE, CET_SHSTK);
	ASSERT_EQ(0, ret) {
		TH_LOG("Could not enable SHSTK");
	}

	ret = syscall(__NR_map_shadow_stack, addr, 0, SHADOW_STACK_SET_TOKEN);
	EXPECT_EQ(0, ret) {
		TH_LOG("Wasn't expecting to survive the syscall");
	}
}

Helpers are easier to plumb as statement-expression macros, so:

#define create_shstk(addr)	({				\
	void *__addr;						\
	__addr = (void *)syscall(__NR_map_shadow_stack, addr,	\
				 SS_SIZE, SHADOW_STACK_SET_TOKEN); \
	ASSERT_NE(MAP_FAILED, __addr) { \
		TH_LOG("Error creating shadow stack: %d", errno); \
	} \
	__addr;	\
})

And I expect the enable will need to be in each test, so:

#define enable_shstk()		do {	\
	int __ret;			\
					\
	__ret = ARCH_PRCTL(ARCH_CET_ENABLE, CET_SHSTK);	\
	ASSERT_EQ(0, __ret) { \
		TH_LOG("Could not enable SHSTK"); \
	} \
} while (0)

> +void *create_normal_mem(void *addr)
> +{
> +	return mmap(addr, SS_SIZE, PROT_READ | PROT_WRITE,
> +		    MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> +}
> +
> +void free_shstk(void *shstk)
> +{
> +	munmap(shstk, SS_SIZE);
> +}
> +
> +int reset_shstk(void *shstk)
> +{
> +	return madvise(shstk, SS_SIZE, MADV_DONTNEED);
> +}
> +
> +void try_shstk(unsigned long new_ssp)
> +{
> +	unsigned long ssp;
> +
> +	printf("[INFO]\tnew_ssp = %lx, *new_ssp = %lx\n",
> +		new_ssp, *((unsigned long *)new_ssp));
> +
> +	ssp = get_ssp();
> +	printf("[INFO]\tchanging ssp from %lx to %lx\n", ssp, new_ssp);
> +
> +	asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
> +	asm volatile("saveprevssp");
> +	printf("[INFO]\tssp is now %lx\n", get_ssp());
> +
> +	/* Switch back to original shadow stack */
> +	ssp -= 8;
> +	asm volatile("rstorssp (%0)\n":: "r" (ssp));
> +	asm volatile("saveprevssp");
> +}
> +
> +int test_shstk_pivot(void)
> +{
> +	void *shstk = create_shstk(0);
> +
> +	if (shstk == MAP_FAILED) {
> +		printf("[FAIL]\tError creating shadow stack: %d\n", errno);
> +		return 1;
> +	}
> +	try_shstk((unsigned long)shstk + SS_SIZE - 8);
> +	free_shstk(shstk);
> +
> +	printf("[OK]\tShadow stack pivot\n");
> +	return 0;
> +}

e.g., the above could be written as this, using the previous
create_shstk macro:

TEST(shstk_pivot)
{
	unsigned long ssp, new_ssp;
	void *shstk = create_shstk(0);

	new_ssp = (unsigned long)shstk + SS_SIZE - 8;
	TH_LOG("new_ssp = %lx, *new_ssp = %lx",
		new_ssp, *((unsigned long *)new_ssp));

	ssp = get_ssp();
	TH_LOG("changing ssp from %lx to %lx", ssp, new_ssp);
	asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
	asm volatile("saveprevssp");
	TH_LOG("ssp is now %lx", get_ssp());
	ssp -= 8;
	asm volatile("rstorssp (%0)\n":: "r" (ssp));
	asm volatile("saveprevssp");

	free_shstk(shstk);
}


> +
> +int test_shstk_faults(void)
> +{
> +	unsigned long *shstk = create_shstk(0);
> +
> +	/* Read shadow stack, test if it's zero to not get read optimized out */
> +	if (*shstk != 0)
> +		goto err;
> +
> +	/* Wrss memory that was already read. */
> +	write_shstk(shstk, 1);
> +	if (*shstk != 1)
> +		goto err;
> +
> +	/* Page out memory, so we can wrss it again. */
> +	if (reset_shstk((void *)shstk))
> +		goto err;
> +
> +	write_shstk(shstk, 1);
> +	if (*shstk != 1)
> +		goto err;
> +
> +	printf("[OK]\tShadow stack faults\n");
> +	return 0;
> +
> +err:
> +	return 1;
> +}
> +
> +unsigned long saved_ssp;
> +unsigned long saved_ssp_val;
> +volatile bool segv_triggered;
> +
> +void __attribute__((noinline)) violate_ss(void)
> +{
> +	saved_ssp = get_ssp();
> +	saved_ssp_val = *(unsigned long *)saved_ssp;
> +
> +	/* Corrupt shadow stack */
> +	printf("[INFO]\tCorrupting shadow stack\n");
> +	write_shstk((void *)saved_ssp, 0);
> +}
> +
> +void segv_handler(int signum, siginfo_t *si, void *uc)
> +{
> +	printf("[INFO]\tGenerated shadow stack violation successfully\n");
> +
> +	segv_triggered = true;
> +
> +	/* Fix shadow stack */
> +	write_shstk((void *)saved_ssp, saved_ssp_val);
> +}

To call TH_LOG(), EXPECT_*(), etc. from a signal handler, you'll need to
store the test metadata in a global and alias it locally under the name
_metadata:

struct __test_metadata *global_test_metadata;

And I'd expect a test for SEGV_CPERR (added in the example below).

void segv_handler(int signum, siginfo_t *si, void *uc)
{
	struct __test_metadata *_metadata = global_test_metadata;

	TH_LOG("Generated shadow stack violation successfully");

	EXPECT_EQ(si->si_code, SEGV_CPERR);
	segv_triggered = true;

	/* Fix shadow stack */
	write_shstk((void *)saved_ssp, saved_ssp_val);
}

> +
> +int test_shstk_violation(void)
> +{
> +	struct sigaction sa;
> +
> +	sa.sa_sigaction = segv_handler;
> +	if (sigaction(SIGSEGV, &sa, NULL))
> +		return 1;
> +	sa.sa_flags = SA_SIGINFO;
> +
> +	segv_triggered = false;
> +
> +	/* Make sure segv_triggered is set before violate_ss() */
> +	asm volatile("" : : : "memory");
> +
> +	violate_ss();
> +
> +	signal(SIGSEGV, SIG_DFL);
> +
> +	printf("[OK]\tShadow stack violation test\n");
> +
> +	return !segv_triggered;
> +}
> +

becomes:

TEST(shstk_violation)
{
	struct sigaction sa = {
		.sa_sigaction = segv_handler,
		.sa_flags = SA_SIGINFO,
	};

	global_test_metadata = _metadata;
	ASSERT_EQ(sigaction(SIGSEGV, &sa, NULL), 0);

	segv_triggered = false;

	/* Make sure segv_triggered is set before violate_ss() */
	asm volatile("" : : : "memory");
	violate_ss();
	signal(SIGSEGV, SIG_DFL);
	EXPECT_EQ(segv_triggered, 1) {
		TH_LOG("Segfault did not happen");
	}
}

Without the SEGV_CPERR test, the entire thing could just be:

TEST_SIGNAL(shstk_violation, SIGSEGV)
{
	enable_shstk();
	violate_ss();
}


> +/* Gup test state */
> +#define MAGIC_VAL 0x12345678
> +bool is_shstk_access;
> +void *shstk_ptr;
> +int fd;
> +
> +void reset_test_shstk(void *addr)
> +{
> +	if (shstk_ptr != NULL)
> +		free_shstk(shstk_ptr);
> +	shstk_ptr = create_shstk(addr);
> +}
> +
> +void test_access_fix_handler(int signum, siginfo_t *si, void *uc)
> +{
> +	printf("[INFO]\tViolation from %s\n", is_shstk_access ? "shstk access" : "normal write");
> +
> +	segv_triggered = true;
> +
> +	/* Fix shadow stack */
> +	if (is_shstk_access) {
> +		reset_test_shstk(shstk_ptr);
> +		return;
> +	}
> +
> +	free_shstk(shstk_ptr);
> +	create_normal_mem(shstk_ptr);
> +}
> +
> +bool test_shstk_access(void *ptr)
> +{
> +	is_shstk_access = true;
> +	segv_triggered = false;
> +	write_shstk(ptr, MAGIC_VAL);
> +
> +	asm volatile("" : : : "memory");
> +
> +	return segv_triggered;
> +}
> +
> +bool test_write_access(void *ptr)
> +{
> +	is_shstk_access = false;
> +	segv_triggered = false;
> +	*(unsigned long *)ptr = MAGIC_VAL;
> +
> +	asm volatile("" : : : "memory");
> +
> +	return segv_triggered;
> +}
> +
> +bool gup_write(void *ptr)
> +{
> +	unsigned long val;
> +
> +	lseek(fd, (unsigned long)ptr, SEEK_SET);
> +	if (write(fd, &val, sizeof(val)) < 0)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +bool gup_read(void *ptr)
> +{
> +	unsigned long val;
> +
> +	lseek(fd, (unsigned long)ptr, SEEK_SET);
> +	if (read(fd, &val, sizeof(val)) < 0)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +int test_gup(void)
> +{
> +	struct sigaction sa;
> +	int status;
> +	pid_t pid;
> +
> +	sa.sa_sigaction = test_access_fix_handler;
> +	if (sigaction(SIGSEGV, &sa, NULL))
> +		return 1;
> +	sa.sa_flags = SA_SIGINFO;
> +
> +	segv_triggered = false;
> +
> +	fd = open("/proc/self/mem", O_RDWR);
> +	if (fd == -1)
> +		return 1;
> +
> +	reset_test_shstk(0);
> +	if (gup_read(shstk_ptr))
> +		return 1;
> +	if (test_shstk_access(shstk_ptr))
> +		return 1;
> +	printf("[INFO]\tGup read -> shstk access success\n");
> +
> +	reset_test_shstk(0);
> +	if (gup_write(shstk_ptr))
> +		return 1;
> +	if (test_shstk_access(shstk_ptr))
> +		return 1;
> +	printf("[INFO]\tGup write -> shstk access success\n");

For multiple tests with the same setup, you can use a fixture:

FIXTURE(GUP) {
	int fd;
	void *shstk_ptr;
};

FIXTURE_SETUP(GUP)
{
	... sigaction ...

	self->fd = open("/proc/self/mem", O_RDWR);
	ASSERT_GE(self->fd, 0);
	self->shstk_ptr = create_shstk(0);
	ASSERT_NE(self->shstk_ptr, NULL);
}

/* Don't need to clean up fd nor sigaction since process will die */

TEST_F(GUP, read)
{
	gup_read ...
	test_shstk_access ...
}

TEST_F(GUP, write)
...


Anyway, I won't cry if this doesn't get swapped to kselftest_harness,
but it would be much nicer. Writing tests with it is way, way easier.

-- 
Kees Cook


* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-09-29 22:29 ` [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs Rick Edgecombe
@ 2022-10-03 23:57   ` Kees Cook
  2022-10-04  0:09     ` Dave Hansen
  0 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 23:57 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe wrote:
> Shadow stack is supported on newer AMD processors, but the kernel
> implementation has not been tested on them. Prevent basic issues from
> showing up for normal users by disabling shadow stack on all CPUs except
> Intel until it has been tested. At which point the limitation should be
> removed.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

So running the selftests on an AMD system is sufficient to drop this
patch?

-- 
Kees Cook


* Re: [OPTIONAL/RFC v2 37/39] x86/cet: Add PTRACE interface for CET
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 37/39] x86/cet: Add PTRACE interface for CET Rick Edgecombe
@ 2022-10-03 23:59   ` Kees Cook
  2022-10-04  8:44     ` Mike Rapoport
  0 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-03 23:59 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:34PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Some applications (like GDB and CRIU) would like to tweak CET state via

Eee. Does GDB really need this? Can we make this whole thing
CONFIG-depend on CRIU?

-- 
Kees Cook


* Re: [OPTIONAL/RFC v2 38/39] x86/cet/shstk: Add ARCH_CET_UNLOCK
  2022-09-29 22:29 ` [OPTIONAL/RFC v2 38/39] x86/cet/shstk: Add ARCH_CET_UNLOCK Rick Edgecombe
@ 2022-10-04  0:00   ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-04  0:00 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Mike Rapoport

On Thu, Sep 29, 2022 at 03:29:35PM -0700, Rick Edgecombe wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Userspace loaders may lock features before a CRIU restore operation has
> the chance to set them to whatever state is required by the process
> being restored. Allow a way for CRIU to unlock features. Add it as an
> arch_prctl() like the other CET operations, but restrict it being called
> by the ptrace arch_pctl() interface.

Hrm, please make this build-depend on CRIU...

-- 
Kees Cook


* Re: [PATCH v2 19/39] mm/mmap: Add shadow stack pages to memory accounting
  2022-09-29 22:29 ` [PATCH v2 19/39] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
  2022-10-03 18:31   ` Kees Cook
@ 2022-10-04  0:03   ` Kirill A . Shutemov
  2022-10-04  0:32     ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Kirill A . Shutemov @ 2022-10-04  0:03 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, joao.moreira, John Allen, kcc,
	eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:16PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Account shadow stack pages to stack memory.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> 
> ---
> 
> v2:
>  - Remove is_shadow_stack_mapping() and just change it to directly bitwise
>    and VM_SHADOW_STACK.
> 
> Yu-cheng v26:
>  - Remove redundant #ifdef CONFIG_MMU.
> 
> Yu-cheng v25:
>  - Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().
> 
>  mm/mmap.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f0d2e9143bd0..8569ef09614c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1682,6 +1682,9 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
>  	if (file && is_file_hugepages(file))
>  		return 0;
>  
> +	if (vm_flags & VM_SHADOW_STACK)
> +		return 1;
> +
>  	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;

Hm. Isn't the last check true for shadow stack too? IIUC, shadow stack has
VM_WRITE set, so accountable_mapping() should work correctly as is.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-03 23:57   ` Kees Cook
@ 2022-10-04  0:09     ` Dave Hansen
  2022-10-04  4:54       ` Kees Cook
  2022-10-04  8:36       ` Mike Rapoport
  0 siblings, 2 replies; 241+ messages in thread
From: Dave Hansen @ 2022-10-04  0:09 UTC (permalink / raw)
  To: Kees Cook, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Tom Lendacky, Moger, Babu

On 10/3/22 16:57, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe wrote:
>> Shadow stack is supported on newer AMD processors, but the kernel
>> implementation has not been tested on them. Prevent basic issues from
>> showing up for normal users by disabling shadow stack on all CPUs except
>> Intel until it has been tested. At which point the limitation should be
>> removed.
>>
>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> So running the selftests on an AMD system is sufficient to drop this
> patch?

Yes, that's enough.

I _thought_ the AMD folks provided some tested-by's at some point in the
past.  But, maybe I'm confusing this for one of the other shared
features.  Either way, I'm sure no tested-by's were dropped on purpose.

I'm sure Rick is eager to trim down his series and this would be a great
patch to drop.  Does anyone want to make that easy for Rick?

<hint> <hint>


* Re: [PATCH v2 14/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory
  2022-10-03 17:47   ` Kirill A . Shutemov
@ 2022-10-04  0:29     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04  0:29 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2022-10-03 at 20:47 +0300, Kirill A . Shutemov wrote:
> > @@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
> >   
> >   const char *arch_vma_name(struct vm_area_struct *vma)
> >   {
> > +     if (vma->vm_flags & VM_SHADOW_STACK)
> > +             return "[shadow stack]";
> >        return NULL;
> >   }
> >   
> 
> But why here?
> 
> CONFIG_ARCH_HAS_SHADOW_STACK implies that there will be more than one
> arch
> that supports shadow stack. The name has to come from generic code
> too, no?

I'm not aware of any other arch that will, so I wonder if I should just
remove ARCH_HAS_SHADOW_STACK actually.


* Re: [PATCH v2 19/39] mm/mmap: Add shadow stack pages to memory accounting
  2022-10-04  0:03   ` Kirill A . Shutemov
@ 2022-10-04  0:32     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04  0:32 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Tue, 2022-10-04 at 03:03 +0300, Kirill A . Shutemov wrote:
> > +     if (vm_flags & VM_SHADOW_STACK)
> > +             return 1;
> > +
> >        return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) ==
> > VM_WRITE;
> 
> Hm. Isn't the last check true for shadow stack too? IIUC, shadow
> stack has
> VM_WRITE set, so accountable_mapping() should work correctly as is.

They are not always VM_WRITE; they can have it removed via mprotect().
But in that case it is just specially tagged read-only memory, so it
probably isn't accountable. So, yeah, I'll remove it. Thanks.
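
To make the reasoning concrete, here is a small userspace model of the
final check in accountable_mapping(). It is only a sketch: VM_WRITE,
VM_SHARED and VM_NORESERVE use the values from include/linux/mm.h, but
the VM_SHADOW_STACK bit below is a placeholder picked for illustration,
and the hugetlb file check is left out.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long vm_flags_t;

#define VM_WRITE	0x00000002ULL
#define VM_SHARED	0x00000008ULL
#define VM_NORESERVE	0x00200000ULL
/* Placeholder bit for illustration only, not the value used by the series. */
#define VM_SHADOW_STACK	(1ULL << 37)

/* Model of the existing check: private, writable, and not VM_NORESERVE. */
static bool accountable(vm_flags_t vm_flags)
{
	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
}

static void accountable_examples(void)
{
	/* A writable shadow stack is already covered by the generic check. */
	assert(accountable(VM_SHADOW_STACK | VM_WRITE));
	/* After mprotect() drops VM_WRITE it is no longer accounted. */
	assert(!accountable(VM_SHADOW_STACK));
	/* Shared or VM_NORESERVE mappings are not accounted either. */
	assert(!accountable(VM_WRITE | VM_SHARED));
	assert(!accountable(VM_WRITE | VM_NORESERVE));
}
```

So the explicit VM_SHADOW_STACK test only changes the result for the
non-writable case, which is the one that probably should not be
accounted anyway.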


* Re: [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW
  2022-10-03 23:38             ` Edgecombe, Rick P
@ 2022-10-04  0:40               ` Nadav Amit
  0 siblings, 0 replies; 241+ messages in thread
From: Nadav Amit @ 2022-10-04  0:40 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, fweimer, jannh, dethoma, kcc, linux-arch,
	pavel, oleg, hjl.tools, Yang, Weijiang, Andy Lutomirski,
	jamorris, arnd, Moreira, Joao, Thomas Gleixner, bp, mike.kravetz,
	x86, linux-doc, rppt, john.allen, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Oct 3, 2022, at 4:38 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:

> I think the HW dirty bit will not be set here. How it works is,
> pte_mkdirty() will not actually set the HW dirty bit, but instead the
> software COW bit. Here is the relevant snippet:
> 
> static inline pte_t pte_mkdirty(pte_t pte)
> {
> 	pteval_t dirty = _PAGE_DIRTY;
> 
> 	/* Avoid creating Dirty=1,Write=0 PTEs */
> 	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
> 		dirty = _PAGE_COW;
> 
> 	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
> }
> 
> So for a !VM_WRITE vma, you end up with Write=0,Cow=1 PTE passed
> into ptep_set_access_flags(). Does it make sense?

Thanks for your patience with me. I should have read the series in order.



* Re: [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly
  2022-09-29 22:29 ` [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
  2022-10-03 18:24   ` Kees Cook
  2022-10-03 23:56   ` Kirill A . Shutemov
@ 2022-10-04  1:56   ` Nadav Amit
  2022-10-04 16:21     ` Edgecombe, Rick P
  2022-10-14 15:52   ` Peter Zijlstra
  3 siblings, 1 reply; 241+ messages in thread
From: Nadav Amit @ 2022-10-04  1:56 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: X86 ML, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	linux-doc, Linux MM, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, Mike Rapoport, jamorris,
	dethoma, Yu-cheng Yu

Hopefully I will not waste your time again… If it has been discussed in the
last 26 iterations, just tell me and ignore.

On Sep 29, 2022, at 3:29 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -606,8 +606,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> 			goto abort;
> 		}
> 		entry = mk_pte(page, vma->vm_page_prot);
> -		if (vma->vm_flags & VM_WRITE)
> -			entry = pte_mkwrite(pte_mkdirty(entry));
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> 	}

This is not exactly the same logic. Since pte_mkdirty() is now called
unconditionally, you might dirty read-only pages. Dirty read-only PTEs have
been known not to be very robust (e.g., dirty-COW and friends). Perhaps it is
not dangerous following some recent enhancements, but why take the risk?

Instead, although it might seem redundant, the compiler will hopefully
make this efficient:

		if (vma->vm_flags & VM_WRITE) {
			entry = pte_mkdirty(entry);
			entry = maybe_mkwrite(entry, vma);
		}
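
The difference between the three forms can be seen in a toy userspace
model. This is only a sketch under simplifying assumptions: a PTE is a
bag of flag bits, PTE_DIRTY stands in for whichever bit pte_mkdirty()
ends up setting (Dirty or COW), the shadow stack branch of
maybe_mkwrite() is left out, and the insert_*() names are made up for
the comparison.

```c
#include <assert.h>

typedef unsigned long long pte_t;	/* toy PTE: just flag bits */

#define VM_WRITE	0x2ULL
#define PTE_WRITE	(1ULL << 1)
/* Stands in for whichever bit pte_mkdirty() picks (Dirty or COW). */
#define PTE_DIRTY	(1ULL << 6)

static pte_t pte_mkdirty(pte_t pte) { return pte | PTE_DIRTY; }
static pte_t pte_mkwrite(pte_t pte) { return pte | PTE_WRITE; }

/* Simplified: only the VM_WRITE case, no shadow stack handling. */
static pte_t maybe_mkwrite(pte_t pte, unsigned long long vm_flags)
{
	if (vm_flags & VM_WRITE)
		pte = pte_mkwrite(pte);
	return pte;
}

/* Logic before the patch. */
static pte_t insert_before(pte_t entry, unsigned long long vm_flags)
{
	if (vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));
	return entry;
}

/* Logic as posted: pte_mkdirty() runs for every VMA. */
static pte_t insert_posted(pte_t entry, unsigned long long vm_flags)
{
	return maybe_mkwrite(pte_mkdirty(entry), vm_flags);
}

/* Suggested form: keep the VM_WRITE guard around both steps. */
static pte_t insert_suggested(pte_t entry, unsigned long long vm_flags)
{
	if (vm_flags & VM_WRITE) {
		entry = pte_mkdirty(entry);
		entry = maybe_mkwrite(entry, vm_flags);
	}
	return entry;
}

static void compare_forms(void)
{
	/* Writable VMA: all three forms agree. */
	assert(insert_before(0, VM_WRITE) == (PTE_WRITE | PTE_DIRTY));
	assert(insert_posted(0, VM_WRITE) == (PTE_WRITE | PTE_DIRTY));
	assert(insert_suggested(0, VM_WRITE) == (PTE_WRITE | PTE_DIRTY));

	/* Read-only VMA: only the posted form marks the entry dirty. */
	assert(insert_before(0, 0) == 0);
	assert(insert_posted(0, 0) == PTE_DIRTY);
	assert(insert_suggested(0, 0) == 0);
}
```

In this model the posted version differs from the old code only for
!VM_WRITE VMAs, which is exactly the case I am worried about.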



* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-03 19:35     ` John Hubbard
  2022-10-03 19:39       ` Dave Hansen
@ 2022-10-04  2:13       ` Bagas Sanjaya
  1 sibling, 0 replies; 241+ messages in thread
From: Bagas Sanjaya @ 2022-10-04  2:13 UTC (permalink / raw)
  To: John Hubbard, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On 10/4/22 02:35, John Hubbard wrote:
> It's always a judgment call, as to whether to use something like ``CALL``
> or just plain CALL. Here, I'd like to opine that the benefits of
> ``CALL`` are very small, whereas plain text in cet.rst has been made
> significantly worse. So the result is, "this is not worth it".
> 

Hmm, seems like neither CALL nor ``CALL`` is better, right?

-- 
An old man doll... just what I always wanted! - Clara


* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-03 16:56         ` Edgecombe, Rick P
@ 2022-10-04  2:16           ` Bagas Sanjaya
  2022-10-05  9:10           ` Peter Zijlstra
  1 sibling, 0 replies; 241+ messages in thread
From: Bagas Sanjaya @ 2022-10-04  2:16 UTC (permalink / raw)
  To: Edgecombe, Rick P, corbet
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, bp, Lutomirski, Andy,
	linux-doc, arnd, Moreira, Joao, tglx, mike.kravetz, x86, Yang,
	Weijiang, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	linux-kernel, linux-api, gorcunov

On 10/3/22 23:56, Edgecombe, Rick P wrote:
> On Fri, 2022-09-30 at 20:41 +0700, Bagas Sanjaya wrote:
>> On 9/30/22 20:33, Jonathan Corbet wrote:
>>>>   CET introduces Shadow Stack and Indirect Branch Tracking.
>>>> Shadow stack is
>>>>   a secondary stack allocated from memory and cannot be directly
>>>> modified by
>>>> -applications. When executing a CALL instruction, the processor
>>>> pushes the
>>>> +applications. When executing a ``CALL`` instruction, the
>>>> processor pushes the
>>>
>>> Just to be clear, not everybody is fond of sprinkling lots of
>>> ``literal
>>> text`` throughout the documentation in this way.  Heavy use of it
>>> will
>>> certainly clutter the plain-text file and can be a net negative
>>> overall.
>>>
>>
>> Actually there is a trade-off between semantic correctness and plain-
>> text
>> clarity. With regards to inline code samples (like identifiers), I
>> fall
>> into the former camp. But when I'm reviewing patches for which the
>> surrounding documentation falls into the latter camp (leave code samples alone
>> without
>> markup), I can adapt to that style as long as it causes no warnings
>> whatsoever.
> 
> Thanks. Unless anyone has any objections, I think I'll take all these
> changes, except for the literal-izing of the instructions. They are not
> really being used as code samples in this case.
> 
> Bagas, can you reply with your sign-off and I'll just apply it?

OK, here goes...

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>

-- 
An old man doll... just what I always wanted! - Clara


* Re: [PATCH v2 00/39] Shadowstacks for userspace
  2022-10-03 18:33   ` Edgecombe, Rick P
@ 2022-10-04  3:59     ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-04  3:59 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	dave.hansen, kirill.shutemov, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, Oct 03, 2022 at 06:33:52PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2022-10-03 at 10:04 -0700, Kees Cook wrote:
> > > Shadow stack signal format
> > > --------------------------
> > > So to handle alt shadow stacks we need to push some data onto a
> > > stack. To 
> > > prevent SROP we need to push something to the shadow stack that the
> > > kernel can 
> > > [...]
> > > > shadow stack return address or a shadow stack token. To make sure
> > > it can’t be 
> > > used, data is pushed with the high bit (bit 63) set. This bit is a
> > > linear 
> > > address bit in both the token format and a normal return address,
> > > so it should 
> > > not conflict with anything. It puts any return address in the
> > > kernel half of 
> > > the address space, so would never be created naturally by a
> > > userspace program. 
> > > It will not be a valid restore token either, as the kernel address
> > > will never 
> > > be pointing to the previous frame in the shadow stack.
> > > 
> > > When a signal hits, the format pushed to the stack that is handling
> > > the signal 
> > > is four 8 byte values (since we are 64 bit only):
> > > > 1...old SSP|1...alt stack size|1...alt stack base|0|
> > 
> > Do these end up being non-canonical addresses? (To avoid confusion
> > with
> > "real" kernel addresses?)
> 
> Usually, but not necessarily with LAM. LAM cannot mask bit 63 though.
> So hypothetically they could become "real" kernel addresses some day.
> To keep them in the user half but still make sure they are not usable,
> you would either have to encode the bits over a lot of entries which
> would use extra space, or shrink the available address space, which
> could cause compatibility problems.
> 
> Do you think it's an issue?

Nah; I think it's a good solution. I was just trying to make sure I
understood it correctly. Thanks!
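
For anyone following along, the frame layout quoted above can be sketched in
plain C. This is only an illustration under assumptions: the struct, the helper
names and the in-memory field order are hypothetical and are not taken from the
patches. Only the bit 63 marker and the four 8 byte values come from the cover
letter.

```c
#include <stdint.h>

/* Bit 63 marks a shadow stack entry as kernel-written signal data. */
#define SHSTK_DATA_BIT	(1ULL << 63)

/* Hypothetical view of the four 8 byte values from the cover letter. */
struct shstk_sigframe_sketch {
	uint64_t old_ssp;	/* previous SSP, bit 63 set */
	uint64_t alt_size;	/* alt shadow stack size, bit 63 set */
	uint64_t alt_base;	/* alt shadow stack base, bit 63 set */
	uint64_t terminator;	/* always 0 */
};

/* Build a frame by tagging each data value with bit 63. */
static struct shstk_sigframe_sketch
shstk_encode_frame(uint64_t old_ssp, uint64_t alt_size, uint64_t alt_base)
{
	struct shstk_sigframe_sketch f = {
		.old_ssp = old_ssp | SHSTK_DATA_BIT,
		.alt_size = alt_size | SHSTK_DATA_BIT,
		.alt_base = alt_base | SHSTK_DATA_BIT,
		.terminator = 0,
	};

	return f;
}

/* Accept a frame only if every data value carries the marker bit. */
static int shstk_frame_valid(const struct shstk_sigframe_sketch *f)
{
	return (f->old_ssp & SHSTK_DATA_BIT) &&
	       (f->alt_size & SHSTK_DATA_BIT) &&
	       (f->alt_base & SHSTK_DATA_BIT) &&
	       f->terminator == 0;
}

/* Recover the original value by clearing the marker bit. */
static uint64_t shstk_strip(uint64_t v)
{
	return v & ~SHSTK_DATA_BIT;
}
```

A normal userspace return address has bit 63 clear, so it would fail
shstk_frame_valid() in this sketch, which is the property the cover letter
relies on.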

-- 
Kees Cook

* Re: [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-10-03 20:04     ` Dave Hansen
@ 2022-10-04  4:04       ` Kees Cook
  2022-10-04 16:25         ` Edgecombe, Rick P
  2022-10-04 10:17       ` David Laight
  1 sibling, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-04  4:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Mon, Oct 03, 2022 at 01:04:37PM -0700, Dave Hansen wrote:
> On 10/3/22 12:43, Kees Cook wrote:
> >> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> >> +{
> >> +	u64 val, new_val;
> >> +
> >> +	rdmsrl(msr, val);
> >> +	new_val = (val & ~clear) | set;
> >> +
> >> +	if (new_val != val)
> >> +		wrmsrl(msr, new_val);
> >> +}
> > I always get uncomfortable when I see these kinds of generalized helper
> > functions for touching cpu bits, etc. It just begs for future attacker
> > abuse to muck with arbitrary bits -- even marked inline there is a risk
> > the compiler will ignore that in some circumstances (not as currently
> > used in the code, but I'm imagining future changes leading to such a
> > condition). Will you humor me and change this to a macro instead? That'll
> > force it always inline (even __always_inline isn't always inline):
> 
> Oh, are you thinking that this is dangerous because it's so surgical and
> non-intrusive?  It's even more powerful to an attacker than, say
> wrmsrl(), because there they actually have to know what the existing
> value is to update it.  With this helper, it's quite easy to flip an
> individual bit without disturbing the neighboring bits.
> 
> Is that it?

Yeah, it was kind of the combo: both a potential entry point to wrmsrl
for arbitrary values, but also one where all the work is done to mask
stuff out.

> I don't _like_ the #defines, but doing one here doesn't seem too onerous
> considering how critical MSRs are.

I bet there are others, but this just weirded me out. I'll live with a
macro, especially since the chance of it mutating in a non-inline is
very small, but I figured I'd mention the idea.
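
To make the macro idea concrete, here is a rough userspace sketch of the same
read-modify-write logic written as a macro. It is not the code from the patch:
rdmsrl()/wrmsrl() are replaced by mock versions backed by an array so the
snippet can be compiled outside the kernel, and the mock_* names and the write
counter are made up for illustration.

```c
#include <stdint.h>

/* Mock MSR storage standing in for real hardware registers. */
static uint64_t mock_msr[4];
static unsigned int mock_msr_writes;

#define mock_rdmsrl(msr, val)	((val) = mock_msr[(msr)])
#define mock_wrmsrl(msr, val)			\
do {						\
	mock_msr[(msr)] = (val);		\
	mock_msr_writes++;			\
} while (0)

/*
 * Macro form of the helper: clear the 'clear' bits, set the 'set' bits,
 * and skip the write when the value would not change.
 */
#define set_clr_bits_msrl(msr, set, clear)			\
do {								\
	uint64_t val_, new_val_;				\
								\
	mock_rdmsrl(msr, val_);					\
	new_val_ = (val_ & ~(uint64_t)(clear)) | (set);		\
								\
	if (new_val_ != val_)					\
		mock_wrmsrl(msr, new_val_);			\
} while (0)
```

Being a macro, the body is expanded at every call site, so there is no
out-of-line function left behind for the compiler to emit.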

-- 
Kees Cook

* Re: [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate
  2022-10-03 20:05     ` Edgecombe, Rick P
@ 2022-10-04  4:05       ` Kees Cook
  2022-10-04 14:18       ` Dave Hansen
  1 sibling, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-04  4:05 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	dave.hansen, kirill.shutemov, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, Oct 03, 2022 at 08:05:13PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2022-10-03 at 10:48 -0700, Kees Cook wrote:
> > > The easiest way to modify supervisor xfeature data is to force
> > > restore
> > > the registers and write directly to the MSRs. Often times this is
> > > just fine
> > > anyway as the registers need to be restored before returning to
> > > userspace.
> > > Do this for now, leaving buffer writing optimizations for the
> > > future.
> > 
> > Just for my own clarity, does this mean lock/load _needs_ to happen
> > before MSR access, or is it just a convenient place to do it? From
> > later
> > patches it seems it's a requirement during MSR access, which might be
> > a
> > good idea to detail here. It answers the question "when is this
> > function
> > needed?"
> 
> The CET state is xsaves managed. It gets lazily restored before
> returning to userspace with the rest of the fpu stuff. This function
> will force restore all the fpu state to the registers early and lock
> them from being automatically saved/restored. Then the task's CET state
> can be modified in the MSRs, before unlocking the fpregs. Last time I
> tried to modify the state directly in the xsave buffer when it was
> efficient, but it had issues and Thomas suggested this.

Okay, gotcha. Thanks!

-- 
Kees Cook

* Re: [PATCH v2 30/39] x86: Expose thread features status in /proc/$PID/arch_status
  2022-10-03 22:45     ` Andy Lutomirski
@ 2022-10-04  4:18       ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-04  4:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rick P Edgecombe, the arch/x86 maintainers, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Linux Kernel Mailing List,
	linux-doc, linux-mm, linux-arch, Linux API, Arnd Bergmann,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra (Intel),
	Randy Dunlap, Shankar, Ravi V, Weijiang Yang, Kirill A. Shutemov,
	Moreira, Joao, john.allen, kcc, Eranian, Stephane, Mike Rapoport,
	jamorris, dethoma

On Mon, Oct 03, 2022 at 03:45:50PM -0700, Andy Lutomirski wrote:
> 
> 
> On Mon, Oct 3, 2022, at 3:37 PM, Kees Cook wrote:
> > On Thu, Sep 29, 2022 at 03:29:27PM -0700, Rick Edgecombe wrote:
> >> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >> 
> >> Applications and loaders can have logic to decide whether to enable CET.
> >> They usually don't report whether CET has been enabled or not, so there
> >> is no way to verify whether an application actually is protected by CET
> >> features.
> >> 
> >> Add two lines in /proc/$PID/arch_status to report enabled and locked
> >> features.
> >> 
> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> [Switched to CET, added to commit log]
> >> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> >> 
> >> ---
> >> 
> >> v2:
> >>  - New patch
> >> 
> >>  arch/x86/kernel/Makefile     |  2 ++
> >>  arch/x86/kernel/fpu/xstate.c | 47 ---------------------------
> >>  arch/x86/kernel/proc.c       | 63 ++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 65 insertions(+), 47 deletions(-)
> >>  create mode 100644 arch/x86/kernel/proc.c
> >
> > This is two patches: one to create proc.c, the other to add CET support.
> >
> > I found where the "arch_status" conversation was:
> > https://lore.kernel.org/all/CALCETrUjF9PBmkzH1J86vw4ZW785DP7FtcT+gcSrx29=BUnjoQ@mail.gmail.com/
> >
> > Andy, what did you mean "make sure that everything in it is namespaced"?
> > Everything already has a field name. And arch_status doesn't exactly
> > solve having compat fields -- it still needs to be handled manually?
> > Anyway... we have arch_status, so I guess it's fine.
> 
> I think I meant that, since it's "arch_status" not "x86_status", the fields should have names like "x86.Thread_features".  Otherwise if another architecture adds a Thread_features field, then anything running under something like qemu userspace emulation could be confused.
> 
> Assuming that's what I meant, I think my comment still stands :)

Ah, but that would be needed for compat things too in "arch_status", and
could just as well live in "status".

How about moving both of these into "status", with appropriate names?

x86_64.Thread_features: ...
i386.LDT_or_something: ...

?

Does anything consume arch_status yet? Looks like probably not:
https://codesearch.debian.net/search?q=%5Cbarch_status%5Cb&literal=0&perpkg=1
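
As a sketch of what a consumer of such namespaced fields might look like, the
snippet below splits a line of the form "arch.Key: value". Everything here is
an assumption for illustration: the field syntax is only a proposal in this
thread, and the struct, the function name and the sample values are
hypothetical.

```c
#include <stdio.h>
#include <string.h>

struct status_field {
	char arch[16];
	char key[32];
	char value[64];
};

/*
 * Split "arch.Key: value" into its three parts.
 * Returns 0 on success, -1 if the line has no arch prefix or is malformed.
 */
static int parse_namespaced_field(const char *line, struct status_field *out)
{
	const char *colon = strchr(line, ':');
	const char *dot = strchr(line, '.');
	size_t arch_len, key_len;

	/* The arch prefix must end before the colon. */
	if (!colon || !dot || dot > colon)
		return -1;

	arch_len = (size_t)(dot - line);
	key_len = (size_t)(colon - dot - 1);
	if (arch_len == 0 || key_len == 0 ||
	    arch_len >= sizeof(out->arch) || key_len >= sizeof(out->key))
		return -1;

	memcpy(out->arch, line, arch_len);
	out->arch[arch_len] = '\0';
	memcpy(out->key, dot + 1, key_len);
	out->key[key_len] = '\0';

	/* Skip whitespace after the colon, then copy the value. */
	colon++;
	while (*colon == ' ' || *colon == '\t')
		colon++;
	snprintf(out->value, sizeof(out->value), "%s", colon);
	out->value[strcspn(out->value, "\n")] = '\0';

	return 0;
}
```

A parser like this can ignore fields whose arch prefix it does not recognize,
which is the confusion Andy was worried about under qemu userspace emulation.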

-- 
Kees Cook

* Re: [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory
  2022-10-03 22:49     ` Andy Lutomirski
@ 2022-10-04  4:21       ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-04  4:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H . J . Lu, Jann Horn, Jonathan Corbet,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma

On Mon, Oct 03, 2022 at 03:49:18PM -0700, Andy Lutomirski wrote:
> On 10/3/22 11:39, Kees Cook wrote:
> > On Thu, Sep 29, 2022 at 03:29:19PM -0700, Rick Edgecombe wrote:
> > > [...]
> > > Still allow FOLL_FORCE to write through shadow stack protections, as it
> > > does for read-only protections.
> > 
> > As I asked in the cover letter: why do we need to add this for shstk? It
> > was a mistake for general memory. :P
> 
> For debuggers, which use FOLL_FORCE, quite intentionally, to modify text.
> And once a debugger has ptrace write access to a target, shadow stacks
> provide exactly no protection -- ptrace can modify text and all registers.

i.e. via ptrace? Yeah, I grudgingly accept the ptrace need for
FOLL_FORCE.

> But /proc/.../mem may be a different story, and I'd be okay with having
> FOLL_PROC_MEM for legacy compatibility via /proc/.../mem and not allowing
> that to access shadow stacks.  This does seem like it may not be very
> useful, though.

I *really* don't like the /mem use of FOLL_FORCE, though. I think the
rationale has been "using PTRACE_POKE is too slow". Again, I can live
with it, I was just hoping we could avoid expanding that questionable
behavior, especially since it's a bypass of WRSS.

-- 
Kees Cook

* Re: [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace
  2022-10-03 23:00     ` Andy Lutomirski
@ 2022-10-04  4:37       ` Kees Cook
  2022-10-06  0:38         ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-04  4:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H . J . Lu, Jann Horn, Jonathan Corbet,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma

On Mon, Oct 03, 2022 at 04:00:36PM -0700, Andy Lutomirski wrote:
> On 10/3/22 15:28, Kees Cook wrote:
> > On Thu, Sep 29, 2022 at 03:29:26PM -0700, Rick Edgecombe wrote:
> > > For the current shadow stack implementation, shadow stacks contents easily
> > > be arbitrarily provisioned with data.
> > 
> > I can't parse this sentence.
> > 
> > > This property helps apps protect
> > > themselves better, but also restricts any potential apps that may want to
> > > do exotic things at the expense of a little security.
> > 
> > > Is anything using this right now? Wouldn't things be safer without WRSS?
> > (Why can't we skip this patch?)
> > 
> 
> So that people don't write programs that need either (shstk off) or (shstk
> on and WRSS on) and crash or otherwise fail on kernels that support shstk
> but don't support WRSS, perhaps?

Right, yes. I meant more "what programs currently need WRSS to operate
under shstk? (And what is it that they are doing that needs it?)"

All I see currently is compiler self-tests and emulators using it?
https://codesearch.debian.net/search?q=%5Cb%28wrss%7CWRSS%29%5Cb&literal=0&perpkg=1

-- 
Kees Cook

* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04  0:09     ` Dave Hansen
@ 2022-10-04  4:54       ` Kees Cook
  2022-10-04 15:47         ` Nathan Chancellor
  2022-10-04  8:36       ` Mike Rapoport
  1 sibling, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-04  4:54 UTC (permalink / raw)
  To: Dave Hansen, Gustavo A. R. Silva, Nathan Chancellor, Nick Desaulniers
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Tom Lendacky, Moger, Babu

On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen wrote:
> On 10/3/22 16:57, Kees Cook wrote:
> > On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe wrote:
> >> Shadow stack is supported on newer AMD processors, but the kernel
> >> implementation has not been tested on them. Prevent basic issues from
> >> showing up for normal users by disabling shadow stack on all CPUs except
> >> Intel until it has been tested. At which point the limitation should be
> >> removed.
> >>
> >> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > So running the selftests on an AMD system is sufficient to drop this
> > patch?
> 
> Yes, that's enough.
> 
> I _thought_ the AMD folks provided some tested-by's at some point in the
> past.  But, maybe I'm confusing this for one of the other shared
> features.  Either way, I'm sure no tested-by's were dropped on purpose.
> 
> I'm sure Rick is eager to trim down his series and this would be a great
> patch to drop.  Does anyone want to make that easy for Rick?
> 
> <hint> <hint>

Hey Gustavo, Nathan, or Nick! I know y'all have some fancy AMD testing
rigs. Got a moment to spin up this series and run the selftests? :)

-- 
Kees Cook

* Re: [PATCH v2 00/39] Shadowstacks for userspace
  2022-10-03 17:25   ` Jann Horn
@ 2022-10-04  5:01     ` Kees Cook
  2022-10-04  9:57       ` David Laight
  0 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-04  5:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma

On Mon, Oct 03, 2022 at 07:25:03PM +0200, Jann Horn wrote:
> On Mon, Oct 3, 2022 at 7:04 PM Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Sep 29, 2022 at 03:28:57PM -0700, Rick Edgecombe wrote:
> > > This is an overdue followup to the “Shadow stacks for userspace” CET series.
> > > Thanks for all the comments on the first version [0]. They drove a decent
> > > amount of changes for v2. Since it has been awhile, I’ll try to summarize the
> > > areas that got major changes since last time. Smaller changes are listed in
> > > each patch.
> >
> > Thanks for the write-up!
> >
> > > [...]
> > >         GUP
> > >         ---
> > >         Shadow stack memory is generally treated as writable by the kernel, but
> > >         it behaves differently than other writable memory with respect to GUP.
> > >         FOLL_WRITE will not GUP shadow stack memory unless FOLL_FORCE is also
> > >         set. Shadow stack memory is writable from the perspective of being
> > >         changeable by userspace, but it is also protected memory from
> > >         userspace’s perspective. So preventing it from being writable via
> > >         FOLL_WRITE helps make it harder for userspace to arbitrarily write to
> > >         it. However, like read-only memory, FOLL_FORCE can still write through
> > >         it. This means shadow stacks can be written to via things like
> > >         “/proc/self/mem”. Apps that want extra security will have to prevent
> > >         access to kernel features that can write with FOLL_FORCE.
> >
> > This seems like a problem to me -- the point of SS is that there cannot be
> > a way to write to them without specific instruction sequences. The fact
> > that /proc/self/mem bypasses memory protections was an old design mistake
> > that keeps leading to surprising behaviors. It would be much nicer to
> > draw the line somewhere and just say that FOLL_FORCE doesn't work on
> > VM_SHADOW_STACK. Why must FOLL_FORCE be allowed to write to SS?
> 
> But once you have FOLL_FORCE, you can also just write over stuff like
> executable code instead of writing over the stack. I don't think
> allowing FOLL_FORCE writes over shadow stacks from /proc/$pid/mem is
> making things worse in any way, and it's probably helpful for stuff
> like debuggers.
> 
> If you don't want /proc/$pid/mem to be able to do stuff like that,
> then IMO the way to go is to change when /proc/$pid/mem uses
> FOLL_FORCE, or to limit overall write access to /proc/$pid/mem.

Yeah, all reasonable. I just wish we could ditch FOLL_FORCE; it continues
to weird me out how powerful that fd's side-effects are.
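
For reference, the GUP rule described in the cover letter boils down to a small
decision function. The sketch below is a simplified userspace model, not the
kernel's actual check: the MOCK_* flag values are invented, and details such as
COW handling are left out on purpose.

```c
#include <stdbool.h>

/* Invented flag values; the real kernel definitions differ. */
#define MOCK_FOLL_WRITE		0x1u
#define MOCK_FOLL_FORCE		0x2u
#define MOCK_VM_WRITE		0x1u
#define MOCK_VM_SHADOW_STACK	0x2u

/*
 * Model of the rule: a write GUP succeeds on ordinary writable memory,
 * but on read-only or shadow stack memory only when FOLL_FORCE is set.
 */
static bool gup_write_allowed(unsigned int vm_flags, unsigned int gup_flags)
{
	/* Read requests are not modelled here. */
	if (!(gup_flags & MOCK_FOLL_WRITE))
		return true;

	/* Ordinary writable mapping: FOLL_WRITE alone is enough. */
	if ((vm_flags & MOCK_VM_WRITE) && !(vm_flags & MOCK_VM_SHADOW_STACK))
		return true;

	/* Read-only or shadow stack mapping: needs FOLL_FORCE. */
	return (gup_flags & MOCK_FOLL_FORCE) != 0;
}
```

In this model, dropping FOLL_FORCE from a caller such as /proc/$pid/mem is what
closes the write path to shadow stacks, which matches Jann's suggestion above.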

-- 
Kees Cook

* Re: [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace
  2022-10-03 22:28   ` Kees Cook
  2022-10-03 23:00     ` Andy Lutomirski
@ 2022-10-04  8:30     ` Mike Rapoport
  1 sibling, 0 replies; 241+ messages in thread
From: Mike Rapoport @ 2022-10-04  8:30 UTC (permalink / raw)
  To: Kees Cook
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, jamorris, dethoma

On Mon, Oct 03, 2022 at 03:28:47PM -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:26PM -0700, Rick Edgecombe wrote:
> > For the current shadow stack implementation, shadow stacks contents easily
> > be arbitrarily provisioned with data.
> 
> I can't parse this sentence.
> 
> > This property helps apps protect
> > themselves better, but also restricts any potential apps that may want to
> > do exotic things at the expense of a little security.
> 
> Is anything using this right now? Wouldn't things be safer without WRSS?
> (Why can't we skip this patch?)

CRIU uses WRSS to restore the shadow stack contents.
 
> -- 
> Kees Cook

-- 
Sincerely yours,
Mike.

* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04  0:09     ` Dave Hansen
  2022-10-04  4:54       ` Kees Cook
@ 2022-10-04  8:36       ` Mike Rapoport
  1 sibling, 0 replies; 241+ messages in thread
From: Mike Rapoport @ 2022-10-04  8:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, jamorris, dethoma, Tom Lendacky, Moger, Babu

On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen wrote:
> On 10/3/22 16:57, Kees Cook wrote:
> > On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe wrote:
> >> Shadow stack is supported on newer AMD processors, but the kernel
> >> implementation has not been tested on them. Prevent basic issues from
> >> showing up for normal users by disabling shadow stack on all CPUs except
> >> Intel until it has been tested. At which point the limitation should be
> >> removed.
> >>
> >> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > So running the selftests on an AMD system is sufficient to drop this
> > patch?
> 
> Yes, that's enough.
> 
> I _thought_ the AMD folks provided some tested-by's at some point in the
> past.  But, maybe I'm confusing this for one of the other shared
> features.  Either way, I'm sure no tested-by's were dropped on purpose.
> 
> I'm sure Rick is eager to trim down his series and this would be a great
> patch to drop.  Does anyone want to make that easy for Rick?

FWIW, I've run CRIU test suite with the previous version of this set on an
AMD machine.
 
> <hint> <hint>

-- 
Sincerely yours,
Mike.

* Re: [OPTIONAL/RFC v2 37/39] x86/cet: Add PTRACE interface for CET
  2022-10-03 23:59   ` Kees Cook
@ 2022-10-04  8:44     ` Mike Rapoport
  2022-10-04 19:24       ` Kees Cook
  0 siblings, 1 reply; 241+ messages in thread
From: Mike Rapoport @ 2022-10-04  8:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, jamorris, dethoma, Yu-cheng Yu

On Mon, Oct 03, 2022 at 04:59:43PM -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:34PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > Some applications (like GDB and CRIU) would like to tweak CET state via
> 
> Eee. Does GDB really need this? Can we make this whole thing
> CONFIG-depend on CRIU?

GDB, at least its Intel fork, uses this. I don't see how they can jump
between frames without an ability to modify shadow stack contents.

Last I looked they used NT_X86_CET to update SSP and ptrace(POKEDATA) to
write to the shadow stack.
 
> -- 
> Kees Cook

-- 
Sincerely yours,
Mike.

* RE: [PATCH v2 00/39] Shadowstacks for userspace
  2022-10-04  5:01     ` Kees Cook
@ 2022-10-04  9:57       ` David Laight
  2022-10-04 19:28         ` Kees Cook
  0 siblings, 1 reply; 241+ messages in thread
From: David Laight @ 2022-10-04  9:57 UTC (permalink / raw)
  To: 'Kees Cook', Jann Horn
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma

From: Kees Cook <keescook@chromium.org>
...
> >
> > If you don't want /proc/$pid/mem to be able to do stuff like that,
> > then IMO the way to go is to change when /proc/$pid/mem uses
> > FOLL_FORCE, or to limit overall write access to /proc/$pid/mem.
> 
> Yeah, all reasonable. I just wish we could ditch FOLL_FORCE; it continues
> to weird me out how powerful that fd's side-effects are.

Could you remove FOLL_FORCE from /proc/$pid/mem and add a
/proc/$pid/mem_force that enables FOLL_FORCE but requires root
(or similar) access?

Although I suspect gdb may like to have write access to
code?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

* RE: [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-10-03 20:04     ` Dave Hansen
  2022-10-04  4:04       ` Kees Cook
@ 2022-10-04 10:17       ` David Laight
  2022-10-04 19:32         ` Kees Cook
  1 sibling, 1 reply; 241+ messages in thread
From: David Laight @ 2022-10-04 10:17 UTC (permalink / raw)
  To: 'Dave Hansen', Kees Cook, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

From: Dave Hansen
> Sent: 03 October 2022 21:05
> 
> On 10/3/22 12:43, Kees Cook wrote:
> >> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> >> +{
> >> +	u64 val, new_val;
> >> +
> >> +	rdmsrl(msr, val);
> >> +	new_val = (val & ~clear) | set;
> >> +
> >> +	if (new_val != val)
> >> +		wrmsrl(msr, new_val);
> >> +}
> > I always get uncomfortable when I see these kinds of generalized helper
> > functions for touching cpu bits, etc. It just begs for future attacker
> > abuse to muck with arbitrary bits -- even marked inline there is a risk
> > the compiler will ignore that in some circumstances (not as currently
> > used in the code, but I'm imagining future changes leading to such a
> > condition). Will you humor me and change this to a macro instead? That'll
> > force it always inline (even __always_inline isn't always inline):
> 
> Oh, are you thinking that this is dangerous because it's so surgical and
> non-intrusive?  It's even more powerful to an attacker than, say
> wrmsrl(), because there they actually have to know what the existing
> value is to update it.  With this helper, it's quite easy to flip an
> individual bit without disturbing the neighboring bits.
> 
> Is that it?
> 
> I don't _like_ the #defines, but doing one here doesn't seem too onerous
> considering how critical MSRs are.

How often is the 'msr' number not a compile-time constant?
Adding rd/wrmsr variants that verify this would reduce the
attack surface as well.

	David

* Re: [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate
  2022-10-03 20:05     ` Edgecombe, Rick P
  2022-10-04  4:05       ` Kees Cook
@ 2022-10-04 14:18       ` Dave Hansen
  2022-10-04 16:13         ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2022-10-04 14:18 UTC (permalink / raw)
  To: Edgecombe, Rick P, keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	dave.hansen, kirill.shutemov, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On 10/3/22 13:05, Edgecombe, Rick P wrote:
> The CET state is xsaves managed. It gets lazily restored before
> returning to userspace with the rest of the fpu stuff. This function
> will force restore all the fpu state to the registers early and lock
> them from being automatically saved/restored. Then the task's CET state
> can be modified in the MSRs, before unlocking the fpregs. Last time I
> tried to modify the state directly in the xsave buffer when it was
> efficient, but it had issues and Thomas suggested this.

Can you get the gist of this in a comment, please?


* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04  4:54       ` Kees Cook
@ 2022-10-04 15:47         ` Nathan Chancellor
  2022-10-04 19:43           ` John Allen
  0 siblings, 1 reply; 241+ messages in thread
From: Nathan Chancellor @ 2022-10-04 15:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dave Hansen, Gustavo A. R. Silva, Nick Desaulniers,
	Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Tom Lendacky, Moger, Babu

Hi Kees,

On Mon, Oct 03, 2022 at 09:54:26PM -0700, Kees Cook wrote:
> On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen wrote:
> > On 10/3/22 16:57, Kees Cook wrote:
> > > On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe wrote:
> > >> Shadow stack is supported on newer AMD processors, but the kernel
> > >> implementation has not been tested on them. Prevent basic issues from
> > >> showing up for normal users by disabling shadow stack on all CPUs except
> > >> Intel until it has been tested. At which point the limitation should be
> > >> removed.
> > >>
> > >> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > So running the selftests on an AMD system is sufficient to drop this
> > > patch?
> > 
> > Yes, that's enough.
> > 
> > I _thought_ the AMD folks provided some tested-by's at some point in the
> > past.  But, maybe I'm confusing this for one of the other shared
> > features.  Either way, I'm sure no tested-by's were dropped on purpose.
> > 
> > I'm sure Rick is eager to trim down his series and this would be a great
> > patch to drop.  Does anyone want to make that easy for Rick?
> > 
> > <hint> <hint>
> 
> Hey Gustavo, Nathan, or Nick! I know y'all have some fancy AMD testing
> rigs. Got a moment to spin up this series and run the selftests? :)

I do have access to a system with an EPYC 7513, which does have Shadow
Stack support (I can see 'shstk' in the "Flags" section of lscpu with
this series). As far as I understand it, AMD only added Shadow Stack
with Zen 3; my regular AMD test system is Zen 2 (probably should look at
procuring a Zen 3 or Zen 4 one at some point).

I applied this series on top of 6.0 and reverted this change then booted
it on that system. After building the selftest (which did require
'make headers_install' and a small addition to make it build beyond
that, see below), I ran it and this was the result. I am not sure
whether the 'Alt shadow stack test' failure is expected, but the other
results seem promising for dropping this patch.

  $ ./test_shadow_stack_64
  [INFO]  new_ssp = 7f8a36c9fff8, *new_ssp = 7f8a36ca0001
  [INFO]  changing ssp from 7f8a374a0ff0 to 7f8a36c9fff8
  [INFO]  ssp is now 7f8a36ca0000
  [OK]    Shadow stack pivot
  [OK]    Shadow stack faults
  [INFO]  Corrupting shadow stack
  [INFO]  Generated shadow stack violation successfully
  [OK]    Shadow stack violation test
  [INFO]  Gup read -> shstk access success
  [INFO]  Gup write -> shstk access success
  [INFO]  Violation from normal write
  [INFO]  Gup read -> write access success
  [INFO]  Violation from normal write
  [INFO]  Gup write -> write access success
  [INFO]  Cow gup write -> write access success
  [OK]    Shadow gup test
  [INFO]  Violation from shstk access
  [OK]    mprotect() test
  [OK]    Userfaultfd test
  [FAIL]  Alt shadow stack test

  $ echo $?
  1

I am happy to provide any information that would be useful for exploring
this failure and test further iterations of this series as necessary.

Cheers,
Nathan

test_shadow_stack.c: In function ‘create_shstk’:
test_shadow_stack.c:86:70: error: ‘SHADOW_STACK_SET_TOKEN’ undeclared (first use in this function)
   86 |         return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
      |                                                                      ^~~~~~~~~~~~~~~~~~~~~~
test_shadow_stack.c:86:70: note: each undeclared identifier is reported only once for each function it appears in
test_shadow_stack.c:87:1: warning: control reaches end of non-void function [-Wreturn-type]
   87 | }
      | ^

diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
index 22b856de5cdd..958dbb248518 100644
--- a/tools/testing/selftests/x86/test_shadow_stack.c
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -11,6 +11,7 @@
 #define _GNU_SOURCE
 
 #include <sys/syscall.h>
+#include <asm/mman.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <sys/wait.h>

* Re: [OPTIONAL/RFC v2 39/39] x86: Add alt shadow stack support
  2022-10-03 23:21   ` Andy Lutomirski
@ 2022-10-04 16:12     ` Edgecombe, Rick P
  2022-10-04 17:46       ` Andy Lutomirski
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 16:12 UTC (permalink / raw)
  To: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 16:21 -0700, Andy Lutomirski wrote:
> On 9/29/22 15:29, Rick Edgecombe wrote:
> > To handle stack overflows, applications can register a separate signal alt
> > stack to use for the stack to handle signals. To handle shadow stack
> > overflows the kernel can similarly provide the ability to have an alt
> > shadow stack.
> 
> 
> The overall SHSTK mechanism has a concept of a shadow stack that is
> valid and not in use and a shadow stack that is in use.  This is used,
> for example, by RSTORSSP.  I would like to imagine that this serves a
> real purpose (presumably preventing two different threads from using the
> same shadow stack and thus corrupting each other's state).
> 
> So maybe altshstk should use exactly the same mechanism.  Either signal
> delivery should do the atomic verify-and-mark-busy routine or registering
> the stack as an altstack should do it.
> 
> I think your patch has this maybe 1/3 implemented

I'm not following how it breaks down into 3 parts, so hopefully I'm not
missing something. We could do a software busy bit for the token at the
end of alt shstk though. It seems like a good idea.

The busy-like bit in the RSTORSSP-type token is not called out as a
busy bit, but instead defined as reserved (must be 0) in some states.
(Note, it is different than the supervisor shadow stack format). Yea,
we could just probably use it like RSTORSSP does for this operation.
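
For readers following along, here is a rough userspace sketch of the
64-bit restore token layout being discussed. It is not taken from the
series: the macro and function names are made up for illustration, and
the bit positions (bit 0 as the 64-bit mode bit, bit 1 as the reserved
"busy-like" bit, bits 63:2 as the address) are my reading of the token
format, so treat them as assumptions. The sample values in the usage
note come from the selftest output posted elsewhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of a 64-bit shadow stack restore token. */
#define SHSTK_TOKEN_MODE64	(1ULL << 0)	/* set for a 64-bit mode token */
#define SHSTK_TOKEN_RSVD	(1ULL << 1)	/* "busy-like" bit, must be 0 here */
#define SHSTK_TOKEN_ADDR	(~3ULL)		/* bits 63:2 carry the SSP value */

/* Build the token value that would be stored at (ssp - 8). */
static inline uint64_t shstk_make_restore_token(uint64_t ssp)
{
	assert((ssp & 7) == 0);	/* 64-bit shadow stack slots are 8-byte aligned */
	return ssp | SHSTK_TOKEN_MODE64;
}

/*
 * Check a token read from token_addr: the mode bit must be set, the
 * reserved bit clear, and the stored address must point just above
 * the token itself.
 */
static inline bool shstk_token_valid(uint64_t token_addr, uint64_t token)
{
	if (!(token & SHSTK_TOKEN_MODE64))
		return false;
	if (token & SHSTK_TOKEN_RSVD)
		return false;
	return (token & SHSTK_TOKEN_ADDR) == token_addr + 8;
}
```

With the selftest's "new_ssp = 7f8a36c9fff8, *new_ssp = 7f8a36ca0001",
shstk_token_valid(0x7f8a36c9fff8, 0x7f8a36ca0001) holds under this
model, while the same token with bit 1 set would be rejected.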

Or just invent another new token format and stay away from bits marked
reserved. Then it wouldn't have to be atomic either, since userspace
couldn't use it.

> , but I don't see any 
> atomics, and you seem to have removed (?) the code that actually 
> modifies the token on the stack.

The past series didn't do any busy-bit-like operation. The token just
marked where the sigreturn should be called. There was actually a
similar problem to what you described above, in that the token marking
the sigreturn point could have been usable by RSTORSSP from another
thread. In this version (even back in the non-RFC patches), using a
made-up token format that RSTORSSP knows nothing about avoids this in a
different way than a busy bit would. Two threads couldn't use a shstk
sigframe at the same time unless they somehow were already using the
same shadow stack.

> 
> >    
> > +static bool on_alt_shstk(unsigned long ssp)
> > +{
> > +     unsigned long alt_ss_start = current->thread.sas_shstk_sp;
> > +     unsigned long alt_ss_end = alt_ss_start + current->thread.sas_shstk_size;
> > +
> > +     return ssp >= alt_ss_start && ssp < alt_ss_end;
> > +}
> 
> We're forcing AUTODISARM behavior (right?), so I don't think this is
> needed at all.  User code is never "on the alt stack".  It's either "on
> the alt stack but the alt stack is disarmed, so it's not on the alt
> stack" or it's just straight up not on the alt stack.

Err, right. This can be dropped. Thanks.

* Re: [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate
  2022-10-04 14:18       ` Dave Hansen
@ 2022-10-04 16:13         ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 16:13 UTC (permalink / raw)
  To: keescook, Hansen, Dave
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	Eranian, Stephane, kirill.shutemov, dave.hansen, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, jamorris, arnd,
	Moreira, Joao, tglx, pavel, mike.kravetz, x86, linux-doc, rppt,
	john.allen, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Tue, 2022-10-04 at 07:18 -0700, Dave Hansen wrote:
> On 10/3/22 13:05, Edgecombe, Rick P wrote:
> > The CET state is xsaves managed. It gets lazily restored before
> > returning to userspace with the rest of the fpu stuff. This function
> > will force restore all the fpu state to the registers early and lock
> > them from being automatically saved/restored. Then the task's CET state
> > can be modified in the MSRs, before unlocking the fpregs. Last time I
> > tried to modify the state directly in the xsave buffer when it was
> > efficient, but it had issues and Thomas suggested this.
> 
> Can you get the gist of this in a comment, please?

Sure.

* Re: [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly
  2022-10-03 23:56   ` Kirill A . Shutemov
@ 2022-10-04 16:15     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 16:15 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Tue, 2022-10-04 at 02:56 +0300, Kirill A . Shutemov wrote:
> On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > With the introduction of shadow stack memory there are two ways a pte can
> > be writable: regular writable memory and shadow stack memory.
> > 
> > In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
> > or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
> > where a PTE is made writable. However, there are places where pte_mkwrite()
> > is called directly and the logic should now also create a shadow stack PTE
> > in the case of a shadow stack VMA.
> > 
> >   - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
> >     directly and call pte_mkwrite(), which is the same as maybe_mkwrite()
> >     in logic and intention. Just change them to maybe_mkwrite().
> 
> Looks like you folded the change for do_anonymous_page() into the wrong
> patch. I see the relevant change in the previous patch.

Arg, yep thanks. It got moved accidentally.

* Re: [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly
  2022-10-04  1:56   ` Nadav Amit
@ 2022-10-04 16:21     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 16:21 UTC (permalink / raw)
  To: nadav.amit
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, jannh, dethoma, linux-arch, kcc, bp,
	oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2022-10-03 at 18:56 -0700, Nadav Amit wrote:
> Hopefully I will not waste your time again… If it has been discussed in the
> last 26 iterations, just tell me and ignore.
> 
> On Sep 29, 2022, at 3:29 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> 
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -606,8 +606,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> > 			goto abort;
> > 		}
> > 		entry = mk_pte(page, vma->vm_page_prot);
> > -		if (vma->vm_flags & VM_WRITE)
> > -			entry = pte_mkwrite(pte_mkdirty(entry));
> > +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > 	}
> 
> This is not exactly the same logic. You might dirty read-only pages since
> you call pte_mkdirty() unconditionally. It has been known not to be very
> robust (e.g., dirty-COW and friends). Perhaps it is not dangerous following
> some recent enhancements, but why do you want to take the risk?

Yea those changes let me drop a patch. But, it's a good point.

> 
> Instead, although it might seem redundant, the compiler will hopefully
> make it efficient:
> 
> 		if (vma->vm_flags & VM_WRITE) {
> 			entry = pte_mkdirty(entry);
> 			entry = maybe_mkwrite(entry, vma);
> 		}
> 

Thanks Nadav. I think you're right, it should have the open-coded logic
here and in the do_anonymous_page() chunk that got moved to the
previous patch by accident.
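
To make the difference concrete, below is a small userspace model of
the two variants. It is only a sketch: struct pte_model and the
*_model/insert_* helpers are hypothetical stand-ins, not kernel code,
and the real pte helpers operate on hardware page table bits rather
than two booleans.

```c
#include <stdbool.h>

#define VM_WRITE 0x2u	/* flag value used by this model */

/* Toy PTE: only the two bits relevant to the discussion. */
struct pte_model {
	bool dirty;
	bool write;
};

static struct pte_model mkdirty_model(struct pte_model p)
{
	p.dirty = true;
	return p;
}

/* Mirrors the idea of maybe_mkwrite(): writable only for VM_WRITE VMAs. */
static struct pte_model maybe_mkwrite_model(struct pte_model p,
					    unsigned int vm_flags)
{
	if (vm_flags & VM_WRITE)
		p.write = true;
	return p;
}

/* Variant from the posted hunk: the entry is dirtied unconditionally. */
static struct pte_model insert_unconditional(unsigned int vm_flags)
{
	struct pte_model entry = { false, false };

	return maybe_mkwrite_model(mkdirty_model(entry), vm_flags);
}

/* Variant suggested in review: dirty only when the VMA is writable. */
static struct pte_model insert_guarded(unsigned int vm_flags)
{
	struct pte_model entry = { false, false };

	if (vm_flags & VM_WRITE) {
		entry = mkdirty_model(entry);
		entry = maybe_mkwrite_model(entry, vm_flags);
	}
	return entry;
}
```

For a read-only VMA (vm_flags == 0) the first variant yields a dirty,
non-writable entry, while the guarded one leaves the entry clean. Both
agree when VM_WRITE is set.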

* Re: [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-10-04  4:04       ` Kees Cook
@ 2022-10-04 16:25         ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 16:25 UTC (permalink / raw)
  To: keescook, Hansen, Dave
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 21:04 -0700, Kees Cook wrote:
> > I don't _like_ the #defines, but doing one here doesn't seem too onerous
> > considering how critical MSRs are.
> 
> I bet there are others, but this just weirded me out. I'll live with a
> macro, especially since the chance of it mutating in a non-inline is
> very small, but I figured I'd mention the idea.

Makes sense. I'll change it to a define.

* Re: [OPTIONAL/RFC v2 39/39] x86: Add alt shadow stack support
  2022-10-04 16:12     ` Edgecombe, Rick P
@ 2022-10-04 17:46       ` Andy Lutomirski
  2022-10-04 18:04         ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Andy Lutomirski @ 2022-10-04 17:46 UTC (permalink / raw)
  To: Rick P Edgecombe, Balbir Singh, H. Peter Anvin,
	Eugene Syromiatnikov, Peter Zijlstra (Intel),
	Randy Dunlap, Kees Cook, Dave Hansen, Kirill A. Shutemov,
	Eranian, Stephane, linux-mm, Florian Weimer, Nadav Amit,
	Jann Horn, dethoma, linux-arch, kcc, Borislav Petkov,
	Oleg Nesterov, H.J. Lu, Weijiang Yang, Pavel Machek,
	Arnd Bergmann, Moreira, Joao, Thomas Gleixner, Mike Kravetz,
	the arch/x86 maintainers, linux-doc, jamorris, john.allen,
	Mike Rapoport, Ingo Molnar, Shankar, Ravi V, Jonathan Corbet,
	Linux Kernel Mailing List, Linux API, Cyrill Gorcunov



On Tue, Oct 4, 2022, at 9:12 AM, Edgecombe, Rick P wrote:
> On Mon, 2022-10-03 at 16:21 -0700, Andy Lutomirski wrote:
>> On 9/29/22 15:29, Rick Edgecombe wrote:
>> > To handle stack overflows, applications can register a separate signal alt
>> > stack to use for the stack to handle signals. To handle shadow stack
>> > overflows the kernel can similarly provide the ability to have an alt
>> > shadow stack.
>> 
>> 
>> The overall SHSTK mechanism has a concept of a shadow stack that is
>> valid and not in use and a shadow stack that is in use.  This is used,
>> for example, by RSTORSSP.  I would like to imagine that this serves a
>> real purpose (presumably preventing two different threads from using the
>> same shadow stack and thus corrupting each other's state).
>> 
>> So maybe altshstk should use exactly the same mechanism.  Either signal
>> delivery should do the atomic verify-and-mark-busy routine or registering
>> the stack as an altstack should do it.
>> 
>> I think your patch has this maybe 1/3 implemented
>
> I'm not following how it breaks down into 3 parts, so hopefully I'm not
> missing something. We could do a software busy bit for the token at the
> end of alt shstk though. It seems like a good idea.

I didn't mean there were three parts.  I just wild @&! guessed the amount of code written versus needed :)

>
> The busy-like bit in the RSTORSSP-type token is not called out as a
> busy bit, but instead defined as reserved (must be 0) in some states.
> (Note, it is different than the supervisor shadow stack format). Yea,
> we could just probably use it like RSTORSSP does for this operation.
>
> Or just invent another new token format and stay away from bits marked
> reserved. Then it wouldn't have to be atomic either, since userspace
> couldn't use it.

But userspace *can* use it by delivering a signal.  I consider the scenario where two user threads set up the same altshstk and take signals concurrently to be about as dangerous and about as likely (under accidental or malicious conditions) as two user threads doing RSTORSSP at the same time.  Someone at Intel thought the latter was a big deal, so maybe we should match its behavior.
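
As an illustration of what an atomic verify-and-mark-busy step could
look like, here is a userspace sketch using C11 atomics. It is not
what the patch does: ALT_TOKEN_BUSY, alt_shstk_claim() and
alt_shstk_release() are hypothetical names, and the choice of bit 1 as
a software busy bit is an assumption made only for this example. The
point is that the check and the update happen in one compare-exchange,
so two threads racing for the same alt shadow stack cannot both win.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define ALT_TOKEN_BUSY (1ULL << 1)	/* hypothetical software busy bit */

/*
 * Verify that the token still holds the expected "free" value and mark
 * it busy in a single atomic step. Returns false if another thread has
 * already claimed the stack or the token was modified.
 */
static bool alt_shstk_claim(_Atomic uint64_t *token, uint64_t free_val)
{
	uint64_t expected = free_val;

	return atomic_compare_exchange_strong(token, &expected,
					      free_val | ALT_TOKEN_BUSY);
}

/* Clear the busy bit again; fails if the token is not in the busy state. */
static bool alt_shstk_release(_Atomic uint64_t *token, uint64_t free_val)
{
	uint64_t expected = free_val | ALT_TOKEN_BUSY;

	return atomic_compare_exchange_strong(token, &expected, free_val);
}
```

In this model a second alt_shstk_claim() on the same token fails until
alt_shstk_release() has run, which is the property being asked for on
concurrent signal delivery.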

* Re: [OPTIONAL/RFC v2 39/39] x86: Add alt shadow stack support
  2022-10-04 17:46       ` Andy Lutomirski
@ 2022-10-04 18:04         ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 18:04 UTC (permalink / raw)
  To: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, jamorris,
	arnd, Moreira, Joao, tglx, pavel, mike.kravetz, x86, linux-doc,
	rppt, john.allen, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Tue, 2022-10-04 at 10:46 -0700, Andy Lutomirski wrote:
> > The busy-like bit in the RSTORSSP-type token is not called out as a
> > busy bit, but instead defined as reserved (must be 0) in some states.
> > (Note, it is different than the supervisor shadow stack format). Yea,
> > we could just probably use it like RSTORSSP does for this operation.
> > 
> > Or just invent another new token format and stay away from bits marked
> > reserved. Then it wouldn't have to be atomic either, since userspace
> > couldn't use it.
> 
> But userspace *can* use it by delivering a signal.  I consider the
> scenario where two user threads set up the same altshstk and take
> signals concurrently to be about as dangerous and about as likely
> (under accidental or malicious conditions) as two user threads doing
> RSTORSSP at the same time.  Someone at Intel thought the latter was a
> big deal, so maybe we should match its behavior.

Right, for alt shadow stack there should be some busy-like checking, or
that could happen. For regular on-thread stack signals (earlier in this
series) we don't need a busy bit.

* Re: [OPTIONAL/RFC v2 37/39] x86/cet: Add PTRACE interface for CET
  2022-10-04  8:44     ` Mike Rapoport
@ 2022-10-04 19:24       ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-04 19:24 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, jamorris, dethoma, Yu-cheng Yu

On Tue, Oct 04, 2022 at 11:44:16AM +0300, Mike Rapoport wrote:
> On Mon, Oct 03, 2022 at 04:59:43PM -0700, Kees Cook wrote:
> > On Thu, Sep 29, 2022 at 03:29:34PM -0700, Rick Edgecombe wrote:
> > > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > > 
> > > Some applications (like GDB and CRIU) would like to tweak CET state via
> > 
> > Eee. Does GDB really need this? Can we make this whole thing
> > CONFIG-depend on CRIU?
> 
> GDB, at least its Intel fork, uses this. I don't see how they can jump
> between frames without an ability to modify shadow stack contents.
> 
> Last I looked they used NT_X86_CET to update SSP and ptrace(POKEDATA) to
> write to the shadow stack.

Okay, thanks! I appreciate having specific examples. :)

-- 
Kees Cook

* Re: [PATCH v2 00/39] Shadowstacks for userspace
  2022-10-04  9:57       ` David Laight
@ 2022-10-04 19:28         ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-04 19:28 UTC (permalink / raw)
  To: David Laight
  Cc: Jann Horn, Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma

On Tue, Oct 04, 2022 at 09:57:48AM +0000, David Laight wrote:
> From: Kees Cook <keescook@chromium.org>
> ...
> > >
> > > If you don't want /proc/$pid/mem to be able to do stuff like that,
> > > then IMO the way to go is to change when /proc/$pid/mem uses
> > > FOLL_FORCE, or to limit overall write access to /proc/$pid/mem.
> > 
> > Yeah, all reasonable. I just wish we could ditch FOLL_FORCE; it continues
> > to weird me out how powerful that fd's side-effects are.
> 
> Could you remove FOLL_FORCE from /proc/$pid/mem and add a
> /proc/$pid/mem_force that enables FOLL_FORCE but requires root
> (or similar) access?
> 
> Although I suspect gdb may like to have write access to
> code?

As Jann has reminded me out of band, while FOLL_FORCE is still worrisome,
the real concern is any /proc/$pid/mem access at all without an active
ptrace attachment (other than access to self).

Here's my totally untested idea to require an established ptrace
relationship for access to /proc/$pid/mem:

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index c952c5ba8fab..0393741eeabb 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -64,6 +64,7 @@ extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
 #define PTRACE_MODE_NOAUDIT	0x04
 #define PTRACE_MODE_FSCREDS	0x08
 #define PTRACE_MODE_REALCREDS	0x10
+#define PTRACE_MODE_ATTACHED	0x20
 
 /* shorthands for READ/ATTACH and FSCREDS/REALCREDS combinations */
 #define PTRACE_MODE_READ_FSCREDS (PTRACE_MODE_READ | PTRACE_MODE_FSCREDS)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 93f7e3d971e4..fadec587d133 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -826,7 +826,7 @@ static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
 
 static int mem_open(struct inode *inode, struct file *file)
 {
-	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
+	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACHED);
 
 	/* OK to pass negative loff_t, we can catch out-of-range */
 	file->f_mode |= FMODE_UNSIGNED_OFFSET;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 1893d909e45c..c97e6d734ae5 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -304,6 +304,12 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	 * or halting the specified task is impossible.
 	 */
 
+	/*
+	 * If an existing ptrace relationship is required, not even
+	 * introspection is allowed.
+	 */
+	if ((mode & PTRACE_MODE_ATTACHED) && ptrace_parent(task) != current)
+		return -EPERM;
 	/* Don't let security modules deny introspection */
 	if (same_thread_group(task, current))
 		return 0;

-- 
Kees Cook

* Re: [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-10-04 10:17       ` David Laight
@ 2022-10-04 19:32         ` Kees Cook
  2022-10-05 13:32           ` David Laight
  0 siblings, 1 reply; 241+ messages in thread
From: Kees Cook @ 2022-10-04 19:32 UTC (permalink / raw)
  To: David Laight
  Cc: 'Dave Hansen',
	Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Tue, Oct 04, 2022 at 10:17:57AM +0000, David Laight wrote:
> From: Dave Hansen
> > Sent: 03 October 2022 21:05
> > 
> > On 10/3/22 12:43, Kees Cook wrote:
> > >> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> > >> +{
> > >> +	u64 val, new_val;
> > >> +
> > >> +	rdmsrl(msr, val);
> > >> +	new_val = (val & ~clear) | set;
> > >> +
> > >> +	if (new_val != val)
> > >> +		wrmsrl(msr, new_val);
> > >> +}
> > > I always get uncomfortable when I see these kinds of generalized helper
> > > functions for touching cpu bits, etc. It just begs for future attacker
> > > abuse to muck with arbitrary bits -- even marked inline there is a risk
> > > the compiler will ignore that in some circumstances (not as currently
> > > used in the code, but I'm imagining future changes leading to such a
> > > condition). Will you humor me and change this to a macro instead? That'll
> > > force it always inline (even __always_inline isn't always inline):
> > 
> > Oh, are you thinking that this is dangerous because it's so surgical and
> > non-intrusive?  It's even more powerful to an attacker than, say
> > wrmsrl(), because there they actually have to know what the existing
> > value is to update it.  With this helper, it's quite easy to flip an
> > individual bit without disturbing the neighboring bits.
> > 
> > Is that it?
> > 
> > I don't _like_ the #defines, but doing one here doesn't seem too onerous
> > considering how critical MSRs are.
> 
> How often is the 'msr' number not a compile-time constant?
> Adding rd/wrmsr variants that verify this would reduce the
> attack surface as well.

Oh, yes! I do this all the time with FORTIFY shenanigans. Right, so,
instead of a macro, the "cannot be un-inlined" could be enforced with
this (untested):

static __always_inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
{
	u64 val, new_val;

	BUILD_BUG_ON(!__builtin_constant_p(msr) ||
		     !__builtin_constant_p(set) ||
		     !__builtin_constant_p(clear));

	rdmsrl(msr, val);
	new_val = (val & ~clear) | set;

	if (new_val != val)
		wrmsrl(msr, new_val);
}
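
For anyone who wants to poke at the read-modify-write semantics outside
the kernel, here is a minimal userspace model. It is only a sketch:
fake_msr, fake_rdmsr(), fake_wrmsr() and set_clr_bits_model() are
made-up stand-ins for the MSR accessors, and the compile-time constant
enforcement above is not reproduced since it depends on kernel macros.

```c
#include <stdint.h>

/* A plain variable standing in for the MSR, plus a write counter. */
static uint64_t fake_msr;
static unsigned int fake_msr_writes;

static uint64_t fake_rdmsr(void)
{
	return fake_msr;
}

static void fake_wrmsr(uint64_t val)
{
	fake_msr = val;
	fake_msr_writes++;
}

/*
 * Same shape as the helper under discussion: clear bits first, then set
 * bits, and skip the write entirely when nothing would change.
 */
static void set_clr_bits_model(uint64_t set, uint64_t clear)
{
	uint64_t val = fake_rdmsr();
	uint64_t new_val = (val & ~clear) | set;

	if (new_val != val)
		fake_wrmsr(new_val);
}
```

Two things fall out of this model: a repeated call with the same
arguments performs no second write, and a bit present in both masks
ends up set because the clear is applied before the set.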

-- 
Kees Cook

* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04 15:47         ` Nathan Chancellor
@ 2022-10-04 19:43           ` John Allen
  2022-10-04 20:34             ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: John Allen @ 2022-10-04 19:43 UTC (permalink / raw)
  To: Nathan Chancellor, Kees Cook
  Cc: Dave Hansen, Gustavo A. R. Silva, Nick Desaulniers,
	Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, kcc, eranian,
	rppt, jamorris, dethoma, Tom Lendacky, Moger, Babu

On 10/4/22 10:47 AM, Nathan Chancellor wrote:
> Hi Kees,
> 
> On Mon, Oct 03, 2022 at 09:54:26PM -0700, Kees Cook wrote:
>> On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen wrote:
>>> On 10/3/22 16:57, Kees Cook wrote:
>>>> On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe wrote:
>>>>> Shadow stack is supported on newer AMD processors, but the kernel
>>>>> implementation has not been tested on them. Prevent basic issues from
>>>>> showing up for normal users by disabling shadow stack on all CPUs except
>>>>> Intel until it has been tested. At which point the limitation should be
>>>>> removed.
>>>>>
>>>>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>>> So running the selftests on an AMD system is sufficient to drop this
>>>> patch?
>>>
>>> Yes, that's enough.
>>>
>>> I _thought_ the AMD folks provided some tested-by's at some point in the
>>> past.  But, maybe I'm confusing this for one of the other shared
>>> features.  Either way, I'm sure no tested-by's were dropped on purpose.
>>>
>>> I'm sure Rick is eager to trim down his series and this would be a great
>>> patch to drop.  Does anyone want to make that easy for Rick?
>>>
>>> <hint> <hint>
>>
>> Hey Gustavo, Nathan, or Nick! I know y'all have some fancy AMD testing
>> rigs. Got a moment to spin up this series and run the selftests? :)
> 
> I do have access to a system with an EPYC 7513, which does have Shadow
> Stack support (I can see 'shstk' in the "Flags" section of lscpu with
> this series). As far as I understand it, AMD only added Shadow Stack
> with Zen 3; my regular AMD test system is Zen 2 (probably should look at
> procuring a Zen 3 or Zen 4 one at some point).
> 
> I applied this series on top of 6.0 and reverted this change then booted
> it on that system. After building the selftest (which did require
> 'make headers_install' and a small addition to make it build beyond
> that, see below), I ran it and this was the result. I am not sure if
> that is expected or not but the other results seem promising for
> dropping this patch.
> 
>   $ ./test_shadow_stack_64
>   [INFO]  new_ssp = 7f8a36c9fff8, *new_ssp = 7f8a36ca0001
>   [INFO]  changing ssp from 7f8a374a0ff0 to 7f8a36c9fff8
>   [INFO]  ssp is now 7f8a36ca0000
>   [OK]    Shadow stack pivot
>   [OK]    Shadow stack faults
>   [INFO]  Corrupting shadow stack
>   [INFO]  Generated shadow stack violation successfully
>   [OK]    Shadow stack violation test
>   [INFO]  Gup read -> shstk access success
>   [INFO]  Gup write -> shstk access success
>   [INFO]  Violation from normal write
>   [INFO]  Gup read -> write access success
>   [INFO]  Violation from normal write
>   [INFO]  Gup write -> write access success
>   [INFO]  Cow gup write -> write access success
>   [OK]    Shadow gup test
>   [INFO]  Violation from shstk access
>   [OK]    mprotect() test
>   [OK]    Userfaultfd test
>   [FAIL]  Alt shadow stack test

The selftest is looking OK on my system (Dell PowerEdge R6515 w/ EPYC
7713). I also just pulled a fresh 6.0 kernel and applied the series
including the fix Nathan mentions below.

$ tools/testing/selftests/x86/test_shadow_stack_64
[INFO]  new_ssp = 7f30cccc5ff8, *new_ssp = 7f30cccc6001
[INFO]  changing ssp from 7f30cd4c6ff0 to 7f30cccc5ff8
[INFO]  ssp is now 7f30cccc6000
[OK]    Shadow stack pivot
[OK]    Shadow stack faults
[INFO]  Corrupting shadow stack
[INFO]  Generated shadow stack violation successfully
[OK]    Shadow stack violation test
[INFO]  Gup read -> shstk access success
[INFO]  Gup write -> shstk access success
[INFO]  Violation from normal write
[INFO]  Gup read -> write access success
[INFO]  Violation from normal write
[INFO]  Gup write -> write access success
[INFO]  Cow gup write -> write access success
[OK]    Shadow gup test
[INFO]  Violation from shstk access
[OK]    mprotect() test
[OK]    Userfaultfd test
[OK]    Alt shadow stack test.

> 
>   $ echo $?
>   1
> 
> I am happy to provide any information that would be useful for exploring
> this failure and test further iterations of this series as necessary.
> 
> Cheers,
> Nathan
> 
> test_shadow_stack.c: In function ‘create_shstk’:
> test_shadow_stack.c:86:70: error: ‘SHADOW_STACK_SET_TOKEN’ undeclared (first use in this function)
>    86 |         return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
>       |                                                                      ^~~~~~~~~~~~~~~~~~~~~~
> test_shadow_stack.c:86:70: note: each undeclared identifier is reported only once for each function it appears in
> test_shadow_stack.c:87:1: warning: control reaches end of non-void function [-Wreturn-type]
>    87 | }
>       | ^
> 
> diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
> index 22b856de5cdd..958dbb248518 100644
> --- a/tools/testing/selftests/x86/test_shadow_stack.c
> +++ b/tools/testing/selftests/x86/test_shadow_stack.c
> @@ -11,6 +11,7 @@
>  #define _GNU_SOURCE
>  
>  #include <sys/syscall.h>
> +#include <asm/mman.h>
>  #include <sys/mman.h>
>  #include <sys/stat.h>
>  #include <sys/wait.h>


* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04 19:43           ` John Allen
@ 2022-10-04 20:34             ` Edgecombe, Rick P
  2022-10-04 20:50               ` Nathan Chancellor
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 20:34 UTC (permalink / raw)
  To: nathan, keescook, john.allen
  Cc: ndesaulniers, bsingharora, hpa, Syromiatnikov, Eugene,
	babu.moger, peterz, rdunlap, dave.hansen, kirill.shutemov,
	Eranian, Stephane, linux-mm, Shankar, Ravi V, fweimer,
	nadav.amit, jannh, dethoma, linux-arch, kcc, pavel, oleg,
	hjl.tools, bp, Lutomirski, Andy, thomas.lendacky, arnd, jamorris,
	Moreira, Joao, tglx, x86, mike.kravetz, linux-doc, gustavoars,
	rppt, Yang, Weijiang, mingo, Hansen, Dave, corbet, linux-kernel,
	linux-api, gorcunov

On Tue, 2022-10-04 at 14:43 -0500, John Allen wrote:
> On 10/4/22 10:47 AM, Nathan Chancellor wrote:
> > Hi Kees,
> > 
> > On Mon, Oct 03, 2022 at 09:54:26PM -0700, Kees Cook wrote:
> > > On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen wrote:
> > > > On 10/3/22 16:57, Kees Cook wrote:
> > > > > On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe
> > > > > wrote:
> > > > > > Shadow stack is supported on newer AMD processors, but the
> > > > > > kernel
> > > > > > implementation has not been tested on them. Prevent basic
> > > > > > issues from
> > > > > > showing up for normal users by disabling shadow stack on
> > > > > > all CPUs except
> > > > > > Intel until it has been tested. At which point the
> > > > > > limitation should be
> > > > > > removed.
> > > > > > 
> > > > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > 
> > > > > So running the selftests on an AMD system is sufficient to
> > > > > drop this
> > > > > patch?
> > > > 
> > > > Yes, that's enough.
> > > > 
> > > > I _thought_ the AMD folks provided some tested-by's at some
> > > > point in the
> > > > past.  But, maybe I'm confusing this for one of the other
> > > > shared
> > > > features.  Either way, I'm sure no tested-by's were dropped on
> > > > purpose.
> > > > 
> > > > I'm sure Rick is eager to trim down his series and this would
> > > > be a great
> > > > patch to drop.  Does anyone want to make that easy for Rick?
> > > > 
> > > > <hint> <hint>
> > > 
> > > Hey Gustavo, Nathan, or Nick! I know y'all have some fancy AMD
> > > testing
> > > rigs. Got a moment to spin up this series and run the selftests?
> > > :)
> > 
> > I do have access to a system with an EPYC 7513, which does have
> > Shadow
> > Stack support (I can see 'shstk' in the "Flags" section of lscpu
> > with
> > this series). As far as I understand it, AMD only added Shadow
> > Stack
> > with Zen 3; my regular AMD test system is Zen 2 (probably should
> > look at
> > procuring a Zen 3 or Zen 4 one at some point).
> > 
> > I applied this series on top of 6.0 and reverted this change then
> > booted
> > it on that system. After building the selftest (which did require
> > 'make headers_install' and a small addition to make it build beyond
> > that, see below), I ran it and this was the result. I am not sure
> > if
> > that is expected or not but the other results seem promising for
> > dropping this patch.
> > 
> >    $ ./test_shadow_stack_64
> >    [INFO]  new_ssp = 7f8a36c9fff8, *new_ssp = 7f8a36ca0001
> >    [INFO]  changing ssp from 7f8a374a0ff0 to 7f8a36c9fff8
> >    [INFO]  ssp is now 7f8a36ca0000
> >    [OK]    Shadow stack pivot
> >    [OK]    Shadow stack faults
> >    [INFO]  Corrupting shadow stack
> >    [INFO]  Generated shadow stack violation successfully
> >    [OK]    Shadow stack violation test
> >    [INFO]  Gup read -> shstk access success
> >    [INFO]  Gup write -> shstk access success
> >    [INFO]  Violation from normal write
> >    [INFO]  Gup read -> write access success
> >    [INFO]  Violation from normal write
> >    [INFO]  Gup write -> write access success
> >    [INFO]  Cow gup write -> write access success
> >    [OK]    Shadow gup test
> >    [INFO]  Violation from shstk access
> >    [OK]    mprotect() test
> >    [OK]    Userfaultfd test
> >    [FAIL]  Alt shadow stack test
> 
> The selftest is looking OK on my system (Dell PowerEdge R6515 w/ EPYC
> 7713). I also just pulled a fresh 6.0 kernel and applied the series
> including the fix Nathan mentions below.
> 
> $ tools/testing/selftests/x86/test_shadow_stack_64
> [INFO]  new_ssp = 7f30cccc5ff8, *new_ssp = 7f30cccc6001
> [INFO]  changing ssp from 7f30cd4c6ff0 to 7f30cccc5ff8
> [INFO]  ssp is now 7f30cccc6000
> [OK]    Shadow stack pivot
> [OK]    Shadow stack faults
> [INFO]  Corrupting shadow stack
> [INFO]  Generated shadow stack violation successfully
> [OK]    Shadow stack violation test
> [INFO]  Gup read -> shstk access success
> [INFO]  Gup write -> shstk access success
> [INFO]  Violation from normal write
> [INFO]  Gup read -> write access success
> [INFO]  Violation from normal write
> [INFO]  Gup write -> write access success
> [INFO]  Cow gup write -> write access success
> [OK]    Shadow gup test
> [INFO]  Violation from shstk access
> [OK]    mprotect() test
> [OK]    Userfaultfd test
> [OK]    Alt shadow stack test.

Thanks for the testing. Based on the test, I wonder if this could be a
SW bug. Nathan, could I send you a tweaked test with some more debug
information?

John, are we sure AMD and Intel systems behave the same with respect to
CPUs not creating Dirty=1,Write=0 PTEs in rare situations? Are there any
other CET-related differences we should hash out? Otherwise I'll drop
the patch for the next version (assuming the issue Nathan hit doesn't
turn up anything HW-related).

* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04 20:34             ` Edgecombe, Rick P
@ 2022-10-04 20:50               ` Nathan Chancellor
  2022-10-04 21:17                 ` H. Peter Anvin
  2022-10-20 21:22                 ` Edgecombe, Rick P
  0 siblings, 2 replies; 241+ messages in thread
From: Nathan Chancellor @ 2022-10-04 20:50 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: keescook, john.allen, ndesaulniers, bsingharora, hpa,
	Syromiatnikov, Eugene, babu.moger, peterz, rdunlap, dave.hansen,
	kirill.shutemov, Eranian, Stephane, linux-mm, Shankar, Ravi V,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, pavel,
	oleg, hjl.tools, bp, Lutomirski, Andy, thomas.lendacky, arnd,
	jamorris, Moreira, Joao, tglx, x86, mike.kravetz, linux-doc,
	gustavoars, rppt, Yang, Weijiang, mingo, Hansen, Dave, corbet,
	linux-kernel, linux-api, gorcunov

On Tue, Oct 04, 2022 at 08:34:54PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2022-10-04 at 14:43 -0500, John Allen wrote:
> > On 10/4/22 10:47 AM, Nathan Chancellor wrote:
> > > Hi Kees,
> > > 
> > > On Mon, Oct 03, 2022 at 09:54:26PM -0700, Kees Cook wrote:
> > > > On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen wrote:
> > > > > On 10/3/22 16:57, Kees Cook wrote:
> > > > > > On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe
> > > > > > wrote:
> > > > > > > Shadow stack is supported on newer AMD processors, but the
> > > > > > > kernel
> > > > > > > implementation has not been tested on them. Prevent basic
> > > > > > > issues from
> > > > > > > showing up for normal users by disabling shadow stack on
> > > > > > > all CPUs except
> > > > > > > Intel until it has been tested. At which point the
> > > > > > > limitation should be
> > > > > > > removed.
> > > > > > > 
> > > > > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > 
> > > > > > So running the selftests on an AMD system is sufficient to
> > > > > > drop this
> > > > > > patch?
> > > > > 
> > > > > Yes, that's enough.
> > > > > 
> > > > > I _thought_ the AMD folks provided some tested-by's at some
> > > > > point in the
> > > > > past.  But, maybe I'm confusing this for one of the other
> > > > > shared
> > > > > features.  Either way, I'm sure no tested-by's were dropped on
> > > > > purpose.
> > > > > 
> > > > > I'm sure Rick is eager to trim down his series and this would
> > > > > be a great
> > > > > patch to drop.  Does anyone want to make that easy for Rick?
> > > > > 
> > > > > <hint> <hint>
> > > > 
> > > > Hey Gustavo, Nathan, or Nick! I know y'all have some fancy AMD
> > > > testing
> > > > rigs. Got a moment to spin up this series and run the selftests?
> > > > :)
> > > 
> > > I do have access to a system with an EPYC 7513, which does have
> > > Shadow
> > > Stack support (I can see 'shstk' in the "Flags" section of lscpu
> > > with
> > > this series). As far as I understand it, AMD only added Shadow
> > > Stack
> > > with Zen 3; my regular AMD test system is Zen 2 (probably should
> > > look at
> > > procuring a Zen 3 or Zen 4 one at some point).
> > > 
> > > I applied this series on top of 6.0 and reverted this change then
> > > booted
> > > it on that system. After building the selftest (which did require
> > > 'make headers_install' and a small addition to make it build beyond
> > > that, see below), I ran it and this was the result. I am not sure
> > > if
> > > that is expected or not but the other results seem promising for
> > > dropping this patch.
> > > 
> > >    $ ./test_shadow_stack_64
> > >    [INFO]  new_ssp = 7f8a36c9fff8, *new_ssp = 7f8a36ca0001
> > >    [INFO]  changing ssp from 7f8a374a0ff0 to 7f8a36c9fff8
> > >    [INFO]  ssp is now 7f8a36ca0000
> > >    [OK]    Shadow stack pivot
> > >    [OK]    Shadow stack faults
> > >    [INFO]  Corrupting shadow stack
> > >    [INFO]  Generated shadow stack violation successfully
> > >    [OK]    Shadow stack violation test
> > >    [INFO]  Gup read -> shstk access success
> > >    [INFO]  Gup write -> shstk access success
> > >    [INFO]  Violation from normal write
> > >    [INFO]  Gup read -> write access success
> > >    [INFO]  Violation from normal write
> > >    [INFO]  Gup write -> write access success
> > >    [INFO]  Cow gup write -> write access success
> > >    [OK]    Shadow gup test
> > >    [INFO]  Violation from shstk access
> > >    [OK]    mprotect() test
> > >    [OK]    Userfaultfd test
> > >    [FAIL]  Alt shadow stack test
> > 
> > The selftest is looking OK on my system (Dell PowerEdge R6515 w/ EPYC
> > 7713). I also just pulled a fresh 6.0 kernel and applied the series
> > including the fix Nathan mentions below.
> > 
> > $ tools/testing/selftests/x86/test_shadow_stack_64
> > [INFO]  new_ssp = 7f30cccc5ff8, *new_ssp = 7f30cccc6001
> > [INFO]  changing ssp from 7f30cd4c6ff0 to 7f30cccc5ff8
> > [INFO]  ssp is now 7f30cccc6000
> > [OK]    Shadow stack pivot
> > [OK]    Shadow stack faults
> > [INFO]  Corrupting shadow stack
> > [INFO]  Generated shadow stack violation successfully
> > [OK]    Shadow stack violation test
> > [INFO]  Gup read -> shstk access success
> > [INFO]  Gup write -> shstk access success
> > [INFO]  Violation from normal write
> > [INFO]  Gup read -> write access success
> > [INFO]  Violation from normal write
> > [INFO]  Gup write -> write access success
> > [INFO]  Cow gup write -> write access success
> > [OK]    Shadow gup test
> > [INFO]  Violation from shstk access
> > [OK]    mprotect() test
> > [OK]    Userfaultfd test
> > [OK]    Alt shadow stack test.
> 
> Thanks for the testing. Based on the test, I wonder if this could be a
> SW bug. Nathan, could I send you a tweaked test with some more debug
> information?

Yes, more than happy to help you look into this further!

> John, are we sure AMD and Intel systems behave the same with respect to
> CPUs not creating Dirty=1,Write=0 PTEs in rare situations? Or any other
> CET related differences we should hash out? Otherwise I'll drop the
> patch for the next version. (and assuming the issue Nathan hit doesn't
> turn up anything HW related).

* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04 20:50               ` Nathan Chancellor
@ 2022-10-04 21:17                 ` H. Peter Anvin
  2022-10-04 23:24                   ` Edgecombe, Rick P
  2022-10-20 21:22                 ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: H. Peter Anvin @ 2022-10-04 21:17 UTC (permalink / raw)
  To: Nathan Chancellor, Edgecombe, Rick P
  Cc: keescook, john.allen, ndesaulniers, bsingharora, Syromiatnikov,
	Eugene, babu.moger, peterz, rdunlap, dave.hansen,
	kirill.shutemov, Eranian, Stephane, linux-mm, Shankar, Ravi V,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, pavel,
	oleg, hjl.tools, bp, Lutomirski, Andy, thomas.lendacky, arnd,
	jamorris, Moreira, Joao, tglx, x86, mike.kravetz, linux-doc,
	gustavoars, rppt, Yang, Weijiang, mingo, Hansen, Dave, corbet,
	linux-kernel, linux-api, gorcunov

On October 4, 2022 1:50:20 PM PDT, Nathan Chancellor <nathan@kernel.org> wrote:
>On Tue, Oct 04, 2022 at 08:34:54PM +0000, Edgecombe, Rick P wrote:
>> On Tue, 2022-10-04 at 14:43 -0500, John Allen wrote:
>> > On 10/4/22 10:47 AM, Nathan Chancellor wrote:
>> > > Hi Kees,
>> > > 
>> > > On Mon, Oct 03, 2022 at 09:54:26PM -0700, Kees Cook wrote:
>> > > > On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen wrote:
>> > > > > On 10/3/22 16:57, Kees Cook wrote:
>> > > > > > On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe
>> > > > > > wrote:
>> > > > > > > Shadow stack is supported on newer AMD processors, but the
>> > > > > > > kernel
>> > > > > > > implementation has not been tested on them. Prevent basic
>> > > > > > > issues from
>> > > > > > > showing up for normal users by disabling shadow stack on
>> > > > > > > all CPUs except
>> > > > > > > Intel until it has been tested. At which point the
>> > > > > > > limitation should be
>> > > > > > > removed.
>> > > > > > > 
>> > > > > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>> > > > > > 
>> > > > > > So running the selftests on an AMD system is sufficient to
>> > > > > > drop this
>> > > > > > patch?
>> > > > > 
>> > > > > Yes, that's enough.
>> > > > > 
>> > > > > I _thought_ the AMD folks provided some tested-by's at some
>> > > > > point in the
>> > > > > past.  But, maybe I'm confusing this for one of the other
>> > > > > shared
>> > > > > features.  Either way, I'm sure no tested-by's were dropped on
>> > > > > purpose.
>> > > > > 
>> > > > > I'm sure Rick is eager to trim down his series and this would
>> > > > > be a great
>> > > > > patch to drop.  Does anyone want to make that easy for Rick?
>> > > > > 
>> > > > > <hint> <hint>
>> > > > 
>> > > > Hey Gustavo, Nathan, or Nick! I know y'all have some fancy AMD
>> > > > testing
>> > > > rigs. Got a moment to spin up this series and run the selftests?
>> > > > :)
>> > > 
>> > > I do have access to a system with an EPYC 7513, which does have
>> > > Shadow
>> > > Stack support (I can see 'shstk' in the "Flags" section of lscpu
>> > > with
>> > > this series). As far as I understand it, AMD only added Shadow
>> > > Stack
>> > > with Zen 3; my regular AMD test system is Zen 2 (probably should
>> > > look at
>> > > procuring a Zen 3 or Zen 4 one at some point).
>> > > 
>> > > I applied this series on top of 6.0 and reverted this change then
>> > > booted
>> > > it on that system. After building the selftest (which did require
>> > > 'make headers_install' and a small addition to make it build beyond
>> > > that, see below), I ran it and this was the result. I am not sure
>> > > if
>> > > that is expected or not but the other results seem promising for
>> > > dropping this patch.
>> > > 
>> > >    $ ./test_shadow_stack_64
>> > >    [INFO]  new_ssp = 7f8a36c9fff8, *new_ssp = 7f8a36ca0001
>> > >    [INFO]  changing ssp from 7f8a374a0ff0 to 7f8a36c9fff8
>> > >    [INFO]  ssp is now 7f8a36ca0000
>> > >    [OK]    Shadow stack pivot
>> > >    [OK]    Shadow stack faults
>> > >    [INFO]  Corrupting shadow stack
>> > >    [INFO]  Generated shadow stack violation successfully
>> > >    [OK]    Shadow stack violation test
>> > >    [INFO]  Gup read -> shstk access success
>> > >    [INFO]  Gup write -> shstk access success
>> > >    [INFO]  Violation from normal write
>> > >    [INFO]  Gup read -> write access success
>> > >    [INFO]  Violation from normal write
>> > >    [INFO]  Gup write -> write access success
>> > >    [INFO]  Cow gup write -> write access success
>> > >    [OK]    Shadow gup test
>> > >    [INFO]  Violation from shstk access
>> > >    [OK]    mprotect() test
>> > >    [OK]    Userfaultfd test
>> > >    [FAIL]  Alt shadow stack test
>> > 
>> > The selftest is looking OK on my system (Dell PowerEdge R6515 w/ EPYC
>> > 7713). I also just pulled a fresh 6.0 kernel and applied the series
>> > including the fix Nathan mentions below.
>> > 
>> > $ tools/testing/selftests/x86/test_shadow_stack_64
>> > [INFO]  new_ssp = 7f30cccc5ff8, *new_ssp = 7f30cccc6001
>> > [INFO]  changing ssp from 7f30cd4c6ff0 to 7f30cccc5ff8
>> > [INFO]  ssp is now 7f30cccc6000
>> > [OK]    Shadow stack pivot
>> > [OK]    Shadow stack faults
>> > [INFO]  Corrupting shadow stack
>> > [INFO]  Generated shadow stack violation successfully
>> > [OK]    Shadow stack violation test
>> > [INFO]  Gup read -> shstk access success
>> > [INFO]  Gup write -> shstk access success
>> > [INFO]  Violation from normal write
>> > [INFO]  Gup read -> write access success
>> > [INFO]  Violation from normal write
>> > [INFO]  Gup write -> write access success
>> > [INFO]  Cow gup write -> write access success
>> > [OK]    Shadow gup test
>> > [INFO]  Violation from shstk access
>> > [OK]    mprotect() test
>> > [OK]    Userfaultfd test
>> > [OK]    Alt shadow stack test.
>> 
>> Thanks for the testing. Based on the test, I wonder if this could be a
>> SW bug. Nathan, could I send you a tweaked test with some more debug
>> information?
>
>Yes, more than happy to help you look into this further!
>
>> John, are we sure AMD and Intel systems behave the same with respect to
>> CPUs not creating Dirty=1,Write=0 PTEs in rare situations? Or any other
>> CET related differences we should hash out? Otherwise I'll drop the
>> patch for the next version. (and assuming the issue Nathan hit doesn't
>> turn up anything HW related).

I have to admit to being a bit confused here... In general, we trust
CPUID bits unless they are *known* to be wrong, in which case we
blacklist them.

If AMD advertises the feature but it doesn't work or they didn't
validate it, that would be a (serious!) bug on their part. We can
address it by blacklisting, but they should also fix it with a
microcode/BIOS patch.

What am I missing?

* Re: [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack
  2022-10-03 20:29   ` Kees Cook
@ 2022-10-04 22:09     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 22:09 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 13:29 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:22PM -0700, Rick Edgecombe wrote:
> > [...]
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long
> > ssp)
> > +{
> > +	struct cet_user_state *xstate;
> > +
> > +	/* If ssp update is not needed. */
> > +	if (!ssp)
> > +		return 0;
> 
> My brain will work to undo the collision of Shadow Stack Pointer with
> Stack Smashing Protection. ;)
> 
> > [...]
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index a0b8d4adb2bf..db4e53f9fdaf 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -118,6 +118,46 @@ void reset_thread_shstk(void)
> >  	current->thread.features_locked = 0;
> >  }
> >  
> > +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned
> > long clone_flags,
> > +			     unsigned long stack_size, unsigned long
> > *shstk_addr)
> 
> Er, arg 3 is "stack_size". From later:
> 
> > +     ret = shstk_alloc_thread_stack(p, clone_flags, args->flags,
> > &shstk_addr);
> 
>                                                        ^^^^^^^^^^^
> 
> clone_flags and args->flags are identical ... this must be
> accidentally
> working. I was expecting 0 there.

Oh wow. A stack_size used to be passed into copy_thread(), but I messed
up the rebase badly. Thanks for catching it.

> 
> > +{
> > +	struct thread_shstk *shstk = &tsk->thread.shstk;
> > +	unsigned long addr;
> > +
> > +	/*
> > +	 * If shadow stack is not enabled on the new thread, skip any
> > +	 * switch to a new shadow stack.
> > +	 */
> > +	if (!feature_enabled(CET_SHSTK))
> > +		return 0;
> > +
> > +	/*
> > +	 * clone() does not pass stack_size, which was added to
> > clone3().
> > +	 * Use RLIMIT_STACK and cap to 4 GB.
> > +	 */
> > +	if (!stack_size)
> > +		stack_size = min_t(unsigned long long,
> > rlimit(RLIMIT_STACK), SZ_4G);
> 
> Again, perhaps the clamp should happen in alloc_shstk()?

map_shadow_stack() is kind of like mmap(), so I don't think it should
get the rlimit restriction. But I can pull the shared logic into a
helper for the other two cases.
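
For illustration, a minimal user-space sketch of what such a shared
sizing helper might compute. The name sketch_thread_shstk_size(), the
SKETCH_* constants and the 4 KiB page size are assumptions made for this
example; this is not the code from the series. It mirrors the quoted
logic: fall back to RLIMIT_STACK capped at 4 GB only when no size was
passed, then page-align.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL      /* assumption: 4 KiB pages */
#define SKETCH_SZ_4G     (1ULL << 32)

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two");

/*
 * Hypothetical helper: choose the shadow stack size for a new thread.
 * stack_size is 0 when the caller (e.g. clone()) did not supply one.
 */
static uint64_t sketch_thread_shstk_size(uint64_t stack_size,
					 uint64_t rlimit_stack)
{
	/* No explicit size: use RLIMIT_STACK, capped at 4 GB. */
	if (!stack_size)
		stack_size = rlimit_stack < SKETCH_SZ_4G ? rlimit_stack
							 : SKETCH_SZ_4G;

	/* Round up to a whole number of pages, like PAGE_ALIGN(). */
	return (stack_size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}
```

As in the quoted hunk, an explicitly passed size is only page-aligned,
not clamped; whether it should be is a separate question.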

> 
> > +
> > +	/*
> > +	 * For CLONE_VM, except vfork, the child needs a separate
> > shadow
> > +	 * stack.
> > +	 */
> > +	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
> > +		return 0;
> > +
> > +
> > +	stack_size = PAGE_ALIGN(stack_size);
> 
> Uhm, I think a line went missing here. :P
> 
> "x86/cet/shstk: Introduce map_shadow_stack syscall" adds the missing:
> 
> +	addr = alloc_shstk(0, stack_size, 0, false);
> 
> Please add back the original. :)

Yes, more rebase mangling. Thanks.

> 
> > +	if (IS_ERR_VALUE(addr))
> > +		return PTR_ERR((void *)addr);
> > +
> > +	shstk->base = addr;
> > +	shstk->size = stack_size;
> > +
> > +	*shstk_addr = addr + stack_size;
> > +
> > +	return 0;
> > +}
> > +
> >  void shstk_free(struct task_struct *tsk)
> >  {
> >  	struct thread_shstk *shstk = &tsk->thread.shstk;
> > @@ -126,7 +166,13 @@ void shstk_free(struct task_struct *tsk)
> >  	    !feature_enabled(CET_SHSTK))
> >  		return;
> >  
> > -	if (!tsk->mm)
> > +	/*
> > +	 * When fork() with CLONE_VM fails, the child (tsk) already has
> > a
> > +	 * shadow stack allocated, and exit_thread() calls this
> > function to
> > +	 * free it.  In this case the parent (current) and the child
> > share
> > +	 * the same mm struct.
> > +	 */
> > +	if (!tsk->mm || tsk->mm != current->mm)
> >  		return;
> >  
> >  	unmap_shadow_stack(shstk->base, shstk->size);
> > -- 
> > 2.17.1
> > 
> 
> 

* Re: [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk
  2022-10-03 20:44   ` Kees Cook
@ 2022-10-04 22:13     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 22:13 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 13:44 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:23PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > Shadow stacks are normally written to via CALL/RET or specific CET
> > instructions like RSTORSSP/SAVEPREVSSP. However, during some Linux
> > operations the kernel will need to write to them directly using the
> > ring-0 only
> > WRUSS instruction.
> > 
> > A shadow stack restore token marks a restore point of the shadow
> > stack, and
> > the address in a token must point directly above the token, which
> > is within
> > the same shadow stack. This is distinctively different from other
> > pointers
> > on the shadow stack, since those pointers point to executable code
> > area.
> > 
> > Introduce token setup and verify routines. Also introduce WRUSS,
> > which is
> > a kernel-mode instruction but writes directly to user shadow stack.
> > 
> > In future patches that enable shadow stack to work with signals,
> > the kernel
> > will need something to denote the point in the stack where
> > sigreturn may be
> > called. This will prevent attackers calling sigreturn at arbitrary
> > places
> > in the stack, in order to help prevent SROP attacks.
> > 
> > To do this, something that can only be written by the kernel needs
> > to be
> > placed on the shadow stack. This can be accomplished by setting bit
> > 63 in
> > the frame written to the shadow stack. Userspace return addresses
> > can't
> > have this bit set as it is in the kernel range. It also can't be
> > a
> > valid restore token.
> > 
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > 
> > ---
> > 
> > v2:
> >  - Add data helpers for writing to shadow stack.
> > 
> > v1:
> >  - Use xsave helpers.
> > 
> > Yu-cheng v30:
> >  - Update commit log, remove description about signals.
> >  - Update various comments.
> >  - Remove variable 'ssp' init and adjust return value accordingly.
> >  - Check get_user_shstk_addr() return value.
> >  - Replace 'ia32' with 'proc32'.
> > 
> > Yu-cheng v29:
> >  - Update comments for the use of get_xsave_addr().
> > 
> >  arch/x86/include/asm/special_insns.h |  13 ++++
> >  arch/x86/kernel/shstk.c              | 108
> > +++++++++++++++++++++++++++
> >  2 files changed, 121 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/special_insns.h
> > b/arch/x86/include/asm/special_insns.h
> > index 35f709f619fb..f096f52bd059 100644
> > --- a/arch/x86/include/asm/special_insns.h
> > +++ b/arch/x86/include/asm/special_insns.h
> > @@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
> >  		: [pax] "a" (p));
> >  }
> >  
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> > +{
> > +	asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> > +			  _ASM_EXTABLE(1b, %l[fail])
> > +			  :: [addr] "r" (addr), [val] "r" (val)
> > +			  :: fail);
> > +	return 0;
> > +fail:
> > +	return -EFAULT;
> > +}
> > +#endif /* CONFIG_X86_SHADOW_STACK */
> > +
> >  #define nop() asm volatile ("nop")
> >  
> >  static inline void serialize(void)
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index db4e53f9fdaf..8904aef487bf 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -25,6 +25,8 @@
> >  #include <asm/fpu/api.h>
> >  #include <asm/prctl.h>
> >  
> > +#define SS_FRAME_SIZE 8
> > +
> >  static bool feature_enabled(unsigned long features)
> >  {
> >  	return current->thread.features & features;
> > @@ -40,6 +42,31 @@ static void feature_clr(unsigned long features)
> >  	current->thread.features &= ~features;
> >  }
> >  
> > +/*
> > + * Create a restore token on the shadow stack.  A token is always
> > 8-byte
> > + * and aligned to 8.
> > + */
> > +static int create_rstor_token(unsigned long ssp, unsigned long
> > *token_addr)
> > +{
> > +	unsigned long addr;
> > +
> > +	/* Token must be aligned */
> > +	if (!IS_ALIGNED(ssp, 8))
> > +		return -EINVAL;
> > +
> > +	addr = ssp - SS_FRAME_SIZE;
> > +
> > +	/* Mark the token 64-bit */
> > +	ssp |= BIT(0);
> 
> Wow, that confused me for a moment. :) SDE says:
> 
> - Bit 63:2 – Value of shadow stack pointer when this restore point
> was created.
> - Bit 1 – Reserved. Must be zero.
> - Bit 0 – Mode bit. If 0, the token is a compatibility/legacy mode
>           “shadow stack restore” token. If 1, then this shadow stack
> restore
>           token can be used with a RSTORSSP instruction in 64-bit
> mode.
> 
> So shouldn't this actually be:
> 
> 	ssp &= ~BIT(1);	/* Reserved */
> 	ssp |=  BIT(0); /* RSTORSSP instruction in 64-bit mode */

Since SSP was already checked to be 8-byte aligned just above, bits 0 to
2 cannot be set. I'll add a comment.
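
To make the bit layout concrete, here is a small user-space sketch of
the token encoding, assuming an 8-byte frame as in the patch. The
function name and the SKETCH_* macro are made up for this example, and
it only computes values instead of writing with WRUSS.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_SS_FRAME_SIZE 8

static_assert(sizeof(uint64_t) == SKETCH_SS_FRAME_SIZE,
	      "a shadow stack frame holds one 64-bit value");

/*
 * Hypothetical model of create_rstor_token(): compute where the token
 * goes and what value it holds, without touching any memory.
 */
static int sketch_make_restore_token(uint64_t ssp, uint64_t *token_addr,
				     uint64_t *token)
{
	/* 8-byte alignment means bits 0..2 of ssp are already clear. */
	if (ssp & 7)
		return -EINVAL;

	/* The token sits one frame below the SSP it restores to. */
	*token_addr = ssp - SKETCH_SS_FRAME_SIZE;

	/* Bit 0 set marks a 64-bit mode token; bit 1 stays zero. */
	*token = ssp | 1;
	return 0;
}
```

With ssp = 0x7f8a36ca0000 this yields a token of 0x7f8a36ca0001 at
0x7f8a36c9fff8, which lines up with the new_ssp / *new_ssp values in the
selftest output earlier in the thread.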

> 
> > +
> > +	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
> > +		return -EFAULT;
> > +
> > +	*token_addr = addr;
> > +
> > +	return 0;
> > +}
> > +
> >  static unsigned long alloc_shstk(unsigned long size)
> >  {
> >  	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
> > @@ -158,6 +185,87 @@ int shstk_alloc_thread_stack(struct
> > task_struct *tsk, unsigned long clone_flags,
> >  	return 0;
> >  }
> >  
> > +static unsigned long get_user_shstk_addr(void)
> > +{
> > +	unsigned long long ssp;
> > +
> > +	fpu_lock_and_load();
> > +
> > +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
> > +
> > +	fpregs_unlock();
> > +
> > +	return ssp;
> > +}
> > +
> > +static int put_shstk_data(u64 __user *addr, u64 data)
> > +{
> > +	WARN_ON(data & BIT(63));
> 
> Let's make this a bit more defensive:
> 
> 	if (WARN_ON_ONCE(data & BIT(63)))
> 		return -EFAULT;

Hmm, sure. I'm thinking EINVAL since the failure has nothing to do with
faulting.
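
A rough user-space model of the bit-63 tagging, with the defensive check
returning -EINVAL as suggested. The sketch_* names are hypothetical and
a plain uint64_t stands in for the shadow stack slot; the real code goes
through write_user_shstk_64() and get_user().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_BIT63 (1ULL << 63)

/* Store data in a simulated shadow stack slot, tagged with bit 63. */
static int sketch_put_shstk_data(uint64_t *slot, uint64_t data)
{
	/* Data that already has bit 63 set is a caller bug, not a fault. */
	if (data & SKETCH_BIT63)
		return -EINVAL;

	*slot = data | SKETCH_BIT63;
	return 0;
}

/* Read a slot back, accepting it only if the tag bit is present. */
static int sketch_get_shstk_data(uint64_t *data, const uint64_t *slot)
{
	/* An untagged value could be a return address; reject it. */
	if (!(*slot & SKETCH_BIT63))
		return -EINVAL;

	*data = *slot & ~SKETCH_BIT63;
	return 0;
}

/* Usage: a tagged value round-trips, an untagged slot is rejected. */
static void sketch_roundtrip_example(void)
{
	uint64_t slot = 0, out = 0;

	assert(sketch_put_shstk_data(&slot, 0x7f30cccc6000ULL) == 0);
	assert(sketch_get_shstk_data(&out, &slot) == 0);
	assert(out == 0x7f30cccc6000ULL);

	slot = 0x7f30cccc6000ULL;	/* looks like a plain user address */
	assert(sketch_get_shstk_data(&out, &slot) == -EINVAL);
}
```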

> 
> > +
> > +	/*
> > +	 * Mark the high bit so that the sigframe can't be processed as
> > a
> > +	 * return address.
> > +	 */
> > +	if (write_user_shstk_64(addr, data | BIT(63)))
> > +		return -EFAULT;
> > +	return 0;
> > +}
> > +
> > +static int get_shstk_data(unsigned long *data, unsigned long
> > __user *addr)
> > +{
> > +	unsigned long ldata;
> > +
> > +	if (unlikely(get_user(ldata, addr)))
> > +		return -EFAULT;
> > +
> > +	if (!(ldata & BIT(63)))
> > +		return -EINVAL;
> > +
> > +	*data = ldata & ~BIT(63);
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Verify the user shadow stack has a valid token on it, and then
> > set
> > + * *new_ssp according to the token.
> > + */
> > +static int shstk_check_rstor_token(unsigned long *new_ssp)
> > +{
> > +	unsigned long token_addr;
> > +	unsigned long token;
> > +
> > +	token_addr = get_user_shstk_addr();
> > +	if (!token_addr)
> > +		return -EINVAL;
> > +
> > +	if (get_user(token, (unsigned long __user *)token_addr))
> > +		return -EFAULT;
> > +
> > +	/* Is mode flag correct? */
> > +	if (!(token & BIT(0)))
> > +		return -EINVAL;
> > +
> > +	/* Is busy flag set? */
> 
> "Busy"? Not "Reserved"?

Yes, reserved is more accurate; I'll change it. In a previous-ssp token
it is 1, so kind of busy-like. That is probably where the comment came
from.
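
For reference, the checks in shstk_check_rstor_token() can be modelled
in userspace roughly as below. This is a sketch under assumptions: the
token value is passed in instead of being read with get_user(), the
TASK_SIZE_MAX bound is left out, and the model_* names and MODEL_TOKEN_*
macros are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_TOKEN_MODE     (UINT64_C(1) << 0)	/* mode flag, must be set */
#define MODEL_TOKEN_RESERVED (UINT64_C(1) << 1)	/* reserved, must be clear */

/*
 * Validate a restore token found at token_addr and, on success, report
 * the shadow stack pointer it describes through *new_ssp.
 */
static int model_check_rstor_token(uint64_t token_addr, uint64_t token,
				   uint64_t *new_ssp)
{
	if (!token_addr)
		return -EINVAL;

	if (!(token & MODEL_TOKEN_MODE))
		return -EINVAL;

	if (token & MODEL_TOKEN_RESERVED)
		return -EINVAL;

	/* Mask out the two flag bits. */
	token &= ~UINT64_C(3);

	/* The restore address must be 8-byte aligned. */
	if (token & 7)
		return -EINVAL;

	/* The token must sit in the slot right below the address it names. */
	if (token - 8 != token_addr)
		return -EINVAL;

	*new_ssp = token;
	return 0;
}

/* Values taken from the selftest output quoted elsewhere in this thread. */
void model_rstor_token_demo(void)
{
	uint64_t ssp = 0;

	assert(model_check_rstor_token(UINT64_C(0x7f8a36c9fff8),
				       UINT64_C(0x7f8a36ca0001), &ssp) == 0);
	assert(ssp == UINT64_C(0x7f8a36ca0000));
}
```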

> 
> > +	if (token & BIT(1))
> > +		return -EINVAL;
> > +
> > +	/* Mask out flags */
> > +	token &= ~3UL;
> > +
> > +	/* Restore address aligned? */
> > +	if (!IS_ALIGNED(token, 8))
> > +		return -EINVAL;
> > +
> > +	/* Token placed properly? */
> > +	if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >=
> > TASK_SIZE_MAX)
> > +		return -EINVAL;
> > +
> > +	*new_ssp = token;
> > +
> > +	return 0;
> > +}
> > +
> >  void shstk_free(struct task_struct *tsk)
> >  {
> >  	struct thread_shstk *shstk = &tsk->thread.shstk;
> > -- 
> > 2.17.1
> > 
> 
> 

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall
  2022-10-03 22:23   ` Kees Cook
@ 2022-10-04 22:56     ` Edgecombe, Rick P
  2022-10-04 23:16       ` H.J. Lu
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 22:56 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	dave.hansen, kirill.shutemov, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2022-10-03 at 15:23 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:25PM -0700, Rick Edgecombe wrote:
> > [...]
> > The following example demonstrates how to create a new shadow stack
> > with
> > map_shadow_stack:
> > void *shstk = map_shadow_stack(adrr, stack_size,
> > SHADOW_STACK_SET_TOKEN);
> 
> typo: addr

Yep, thanks.


> 
> > [...]
> > +451	common	map_shadow_stack	sys_map_shadow_stac
> > k
> 
> Isn't this "64", not "common"?

Yes, this should have been changed after dropping 32 bit.

> 
> > [...]
> > +#define SHADOW_STACK_SET_TOKEN	0x1	/* Set up a restore token
> > in the shadow stack */
> 
> I think this should get an intro comment, like:
> 
> /* Flags for map_shadow_stack(2) */
> 
> Also, as with the other UAPI fields, please use "(1ULL << 0)" here.

Ok.

> 
> > @@ -62,24 +63,34 @@ static int create_rstor_token(unsigned long
> > ssp, unsigned long *token_addr)
> >  	if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
> >  		return -EFAULT;
> >  
> > -	*token_addr = addr;
> > +	if (token_addr)
> > +		*token_addr = addr;
> >  
> >  	return 0;
> >  }
> >  
> 
> Can this just be collapsed into the patch that introduces
> create_rstor_token()?

I mean, yea, that would be simpler. Breaking the changes apart was left
over from when the signals placed a token, but didn't need this extra
bit of functionality.

> 
> > -static unsigned long alloc_shstk(unsigned long size)
> > +static unsigned long alloc_shstk(unsigned long addr, unsigned long
> > size,
> > +				 unsigned long token_offset, bool
> > set_res_tok)
> >  {
> >  	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
> >  	struct mm_struct *mm = current->mm;
> > -	unsigned long addr, unused;
> > +	unsigned long mapped_addr, unused;
> >  
> >  	mmap_write_lock(mm);
> > -	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
> 
> Oops, I missed in the other patch that "addr" was being passed here.
> (uninitialized?)

Argh, yes. I'll initialize in that patch and remove it here.

> 
> > -		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
> > -
> > +	mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
> > +			      VM_SHADOW_STACK | VM_WRITE, 0, &unused,
> > NULL);
> 
> I don't see do_mmap() doing anything here to avoid remapping a prior
> vma
> as shstk. Is the intention to allow userspace to convert existing
> VMAs?
> This has caused pain in the past, perhaps force MAP_FIXED_NOREPLACE ?

No, that is not the intention. It should fail and MAP_FIXED_NOREPLACE
looks like it will fit the bill. Thanks!

> 
> > [...]
> > @@ -174,6 +185,7 @@ int shstk_alloc_thread_stack(struct task_struct
> > *tsk, unsigned long clone_flags,
> >  
> >  
> >  	stack_size = PAGE_ALIGN(stack_size);
> > +	addr = alloc_shstk(0, stack_size, 0, false);
> >  	if (IS_ERR_VALUE(addr))
> >  		return PTR_ERR((void *)addr);
> >  
> 
> As mentioned earlier, I was expecting this patch to replace a
> (missing)
> call to alloc_shstk. i.e. expecting:
> 
> -	addr = alloc_shstk(stack_size);
> 
> > @@ -395,6 +407,26 @@ int shstk_disable(void)
> >  	return 0;
> >  }
> >  
> > +
> > +SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned
> > long, size, unsigned int, flags)
> 
> Please add kern-doc for this, with some notes. E.g. at least one
> thing isn't immediately
> obvious, maybe more: "addr" must be a multiple of 8.

Ok.

> 
> > +{
> > +	unsigned long aligned_size;
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > +		return -ENOSYS;
> 
> This needs to explicitly reject unknown flags[1], or expanding them
> in the
> future becomes very painful:
> 
> 	if (flags & ~(SHADOW_STACK_SET_TOKEN))
> 		return -EINVAL;
> 
> 
> [1] 
> https://docs.kernel.org/process/adding-syscalls.html#designing-the-api-planning-for-extension
> 

Ok, good idea.
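
A minimal sketch of the check, as a userspace model (the helper name is
made up, and the flag is written in the "(1ULL << 0)" form requested
earlier in this review):

```c
#include <assert.h>
#include <errno.h>

/* Flags for map_shadow_stack(2) */
#define SHADOW_STACK_SET_TOKEN (1ULL << 0)

/* Reject any flag bit the syscall does not know about yet. */
static int model_check_map_flags(unsigned int flags)
{
	if (flags & ~SHADOW_STACK_SET_TOKEN)
		return -EINVAL;

	return 0;
}

/* Known flags pass, anything else is refused. */
void model_map_flags_demo(void)
{
	assert(model_check_map_flags(0) == 0);
	assert(model_check_map_flags(SHADOW_STACK_SET_TOKEN) == 0);
	assert(model_check_map_flags(0x2) == -EINVAL);
}
```

Failing early means a bit that is currently unused can be given a meaning
later without old binaries having silently passed it.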

> > +
> > +	/*
> > +	 * An overflow would result in attempting to write the restore
> > token
> > +	 * to the wrong location. Not catastrophic, but just return the
> > right
> > +	 * error code and block it.
> > +	 */
> > +	aligned_size = PAGE_ALIGN(size);
> > +	if (aligned_size < size)
> > +		return -EOVERFLOW;
> 
> The intention here is to allow userspace to ask for _less_ than a
> page
> size multiple, and to put the restore token there?
> 
> Is it worth adding a check for size >= 8 here? Or, I guess it would
> just
> immediately crash on the next call?

Funny you should ask... The glibc changes were doing this and then
looking for the token at the end of the length that it passed (not the
page aligned length). I had changed the kernel at one point to be page
aligned and then had the fun of debugging the results. I thought, glibc
 is just wasting shadow stack. It should ask for page aligned shadow
stacks. But HJ argued that the kernel shouldn't second guess what
userspace is asking for based on HW page size details that don't have
to do with the software interface. I was convinced by that argument,
even though glibc is still wasting space.

I could still be convinced the other way though. Glibc still has time
to (and should) change. But yea, that was actually the intention.
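
To spell out the arithmetic, here is a rough userspace model of the size
handling. It rests on assumptions: a 4096 byte page, the token slot being
the last 8 bytes of the requested (not page aligned) size, and a size >= 8
check that is only the suggestion from this review and is not in the
posted patch. The model_* names are made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE UINT64_C(4096)	/* assumed page size */

/*
 * Compute how much would be mapped and where the restore token would go
 * for a request of 'size' bytes at 'addr'.
 */
static int model_shstk_layout(uint64_t addr, uint64_t size,
			      uint64_t *map_len, uint64_t *token_addr)
{
	uint64_t aligned = (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);

	/* Hypothetical check: there must be room for one 8-byte token. */
	if (size < 8)
		return -EINVAL;

	/* Rounding up wrapped around, so the token offset would be bogus. */
	if (aligned < size)
		return -EOVERFLOW;

	*map_len = aligned;		/* the mapping is page aligned */
	*token_addr = addr + size - 8;	/* the token follows the request */
	return 0;
}

/* A request of one page plus 8 bytes maps two pages. */
void model_shstk_layout_demo(void)
{
	uint64_t len = 0, tok = 0;

	assert(model_shstk_layout(UINT64_C(0x7f0000000000), UINT64_C(0x1008),
				  &len, &tok) == 0);
	assert(len == UINT64_C(0x2000));
	assert(tok == UINT64_C(0x7f0000001000));
}
```

In the demo the bytes between the token and the end of the second page
are the "wasted" shadow stack mentioned above.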

> 
> > +
> > +	return alloc_shstk(addr, aligned_size, size, flags &
> > SHADOW_STACK_SET_TOKEN);
> > +}
> 
> 

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [OPTIONAL/RFC v2 36/39] x86/fpu: Add helper for initing features
  2022-10-03 19:07   ` Chang S. Bae
@ 2022-10-04 23:05     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 23:05 UTC (permalink / raw)
  To: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Bae, Chang Seok, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, bp, Lutomirski, Andy,
	linux-doc, arnd, Moreira, Joao, tglx, x86, mike.kravetz, Yang,
	Weijiang, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	corbet, linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 12:07 -0700, Chang S. Bae wrote:
> > +/*
> > + * Given the xsave area and a state inside, this function
> > + * initializes an xfeature in the buffer.
> 
> But, this function sets XSTATE_BV bits in the buffer. That does not 
> *initialize* the state, right?

No, it doesn't actually write out the init state to the buffer.

> 
> > + *
> > + * get_xsave_addr() will return NULL if the feature bit is
> > + * not present in the header. This function will make it so
> > + * the xfeature buffer address is ready to be retrieved by
> > + * get_xsave_addr().
> 
> Looks like this is used in the next patch to help ptracer().
> 
> We have the state copy function -- copy_uabi_to_xstate() that
> retrieves 
> the address using __raw_xsave_addr() instead of get_xsave_addr(),
> copies 
> the state, and then updates XSTATE_BV.
> 
> __raw_xsave_addr() also ensures whether the state is in the
> compacted 
> format or not. I think you can use it.
> 
> Also, I'm curious about the reason why you want to update XSTATE_BV 
> first with this new helper.
> 
> Overall, I'm not sure these new helpers are necessary.

Thomas had experimented with this optimization where init state
features weren't saved:
https://lore.kernel.org/lkml/20220404103741.809025935@linutronix.de/

It made me think non-fpu code should not assume things about the state
of the buffer, as FPU code might have to move things when initing them.
So the operation is worth centralizing in a helper. I think you are
right, today it is not doing much and could be open coded. I guess the
question is, should it be open coded or centralized? I'm fine either
way.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall
  2022-10-04 22:56     ` Edgecombe, Rick P
@ 2022-10-04 23:16       ` H.J. Lu
  0 siblings, 0 replies; 241+ messages in thread
From: H.J. Lu @ 2022-10-04 23:16 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: keescook, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, Yang, Weijiang, Lutomirski, Andy, pavel, arnd, Moreira,
	Joao, tglx, mike.kravetz, x86, linux-doc, jamorris, john.allen,
	rppt, mingo, Shankar, Ravi V, corbet, linux-kernel, linux-api,
	gorcunov

On Tue, Oct 4, 2022 at 3:56 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-10-03 at 15:23 -0700, Kees Cook wrote:
> > On Thu, Sep 29, 2022 at 03:29:25PM -0700, Rick Edgecombe wrote:
> > > [...]
> > > The following example demonstrates how to create a new shadow stack
> > > with
> > > map_shadow_stack:
> > > void *shstk = map_shadow_stack(adrr, stack_size,
> > > SHADOW_STACK_SET_TOKEN);
> >
> > typo: addr
>
> Yep, thanks.
>
>
> >
> > > [...]
> > > +451        common  map_shadow_stack        sys_map_shadow_stac
> > > k
> >
> > Isn't this "64", not "common"?
>
> Yes, this should have been changed after dropping 32 bit.

We don't support ia32.  But this is used for x32 which is supported.

> >
> > > [...]
> > > +#define SHADOW_STACK_SET_TOKEN     0x1     /* Set up a restore token
> > > in the shadow stack */
> >
> > I think this should get an intro comment, like:
> >
> > /* Flags for map_shadow_stack(2) */
> >
> > Also, as with the other UAPI fields, please use "(1ULL << 0)" here.
>
> Ok.
>
> >
> > > @@ -62,24 +63,34 @@ static int create_rstor_token(unsigned long
> > > ssp, unsigned long *token_addr)
> > >     if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
> > >             return -EFAULT;
> > >
> > > -   *token_addr = addr;
> > > +   if (token_addr)
> > > +           *token_addr = addr;
> > >
> > >     return 0;
> > >  }
> > >
> >
> > Can this just be collapsed into the patch that introduces
> > create_rstor_token()?
>
> I mean, yea, that would be simpler. Breaking the changes apart was left
> over from when the signals placed a token, but didn't need this extra
> bit of functionality.
>
> >
> > > -static unsigned long alloc_shstk(unsigned long size)
> > > +static unsigned long alloc_shstk(unsigned long addr, unsigned long
> > > size,
> > > +                            unsigned long token_offset, bool
> > > set_res_tok)
> > >  {
> > >     int flags = MAP_ANONYMOUS | MAP_PRIVATE;
> > >     struct mm_struct *mm = current->mm;
> > > -   unsigned long addr, unused;
> > > +   unsigned long mapped_addr, unused;
> > >
> > >     mmap_write_lock(mm);
> > > -   addr = do_mmap(NULL, addr, size, PROT_READ, flags,
> >
> > Oops, I missed in the other patch that "addr" was being passed here.
> > (uninitialized?)
>
> Argh, yes. I'll initialize in that patch and remove it here.
>
> >
> > > -                  VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
> > > -
> > > +   mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
> > > +                         VM_SHADOW_STACK | VM_WRITE, 0, &unused,
> > > NULL);
> >
> > I don't see do_mmap() doing anything here to avoid remapping a prior
> > vma
> > as shstk. Is the intention to allow userspace to convert existing
> > VMAs?
> > This has caused pain in the past, perhaps force MAP_FIXED_NOREPLACE ?
>
> No that is not the intention. It should fail and MAP_FIXED_NOREPLACE
> looks like it will fit the bill. Thanks!
>
> >
> > > [...]
> > > @@ -174,6 +185,7 @@ int shstk_alloc_thread_stack(struct task_struct
> > > *tsk, unsigned long clone_flags,
> > >
> > >
> > >     stack_size = PAGE_ALIGN(stack_size);
> > > +   addr = alloc_shstk(0, stack_size, 0, false);
> > >     if (IS_ERR_VALUE(addr))
> > >             return PTR_ERR((void *)addr);
> > >
> >
> > As mentioned earlier, I was expecting this patch to replace a
> > (missing)
> > call to alloc_shstk. i.e. expecting:
> >
> > -     addr = alloc_shstk(stack_size);
> >
> > > @@ -395,6 +407,26 @@ int shstk_disable(void)
> > >     return 0;
> > >  }
> > >
> > > +
> > > +SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned
> > > long, size, unsigned int, flags)
> >
> > Please add kern-doc for this, with some notes. E.g. at least one
> > thing isn't immediately
> > obvious, maybe more: "addr" must be a multiple of 8.
>
> Ok.
>
> >
> > > +{
> > > +   unsigned long aligned_size;
> > > +
> > > +   if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > > +           return -ENOSYS;
> >
> > This needs to explicitly reject unknown flags[1], or expanding them
> > in the
> > future becomes very painful:
> >
> >       if (flags & ~(SHADOW_STACK_SET_TOKEN))
> >               return -EINVAL;
> >
> >
> > [1]
> > https://docs.kernel.org/process/adding-syscalls.html#designing-the-api-planning-for-extension
> >
>
> Ok, good idea.
>
> > > +
> > > +   /*
> > > +    * An overflow would result in attempting to write the restore
> > > token
> > > +    * to the wrong location. Not catastrophic, but just return the
> > > right
> > > +    * error code and block it.
> > > +    */
> > > +   aligned_size = PAGE_ALIGN(size);
> > > +   if (aligned_size < size)
> > > +           return -EOVERFLOW;
> >
> > The intention here is to allow userspace to ask for _less_ than a
> > page
> > size multiple, and to put the restore token there?
> >
> > Is it worth adding a check for size >= 8 here? Or, I guess it would
> > just
> > immediately crash on the next call?
>
> Funny you should ask... The glibc changes were doing this and then
> looking for the token at the end of the length that it passed (not the
> page aligned length). I had changed the kernel at one point to be page
> aligned and then had the fun of debugging the results. I thought, glibc
>  is just wasting shadow stack. It should ask for page aligned shadow
> stacks. But HJ argued that the kernel shouldn't second guess what
> userspace is asking for based on HW page size details that don't have
> to do with the software interface. I was convinced by that argument,
> even though glibc is still wasting space.
>
> I could still be convinced the other way though. Glibc still has time
> to (and should) change. But yea, that was actually the intention.

Glibc requests a shadow stack of a given size and expects the restore
token at the specific location.  This is how glibc uses the restore token
to switch to the new shadow stack.

> >
> > > +
> > > +   return alloc_shstk(addr, aligned_size, size, flags &
> > > SHADOW_STACK_SET_TOKEN);
> > > +}
> >
> >



-- 
H.J.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04 21:17                 ` H. Peter Anvin
@ 2022-10-04 23:24                   ` Edgecombe, Rick P
  2022-11-03 17:39                     ` John Allen
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-04 23:24 UTC (permalink / raw)
  To: hpa, nathan
  Cc: bsingharora, Syromiatnikov, Eugene, babu.moger, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	bp, oleg, hjl.tools, pavel, Lutomirski, Andy, thomas.lendacky,
	jamorris, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, gustavoars, john.allen, rppt, Shankar, Ravi V,
	ndesaulniers, Hansen, Dave, mingo, corbet, linux-api,
	linux-kernel, Yang, Weijiang, gorcunov

On Tue, 2022-10-04 at 14:17 -0700, H. Peter Anvin wrote:
> On October 4, 2022 1:50:20 PM PDT, Nathan Chancellor <
> nathan@kernel.org> wrote:
> > On Tue, Oct 04, 2022 at 08:34:54PM +0000, Edgecombe, Rick P wrote:
> > > On Tue, 2022-10-04 at 14:43 -0500, John Allen wrote:
> > > > On 10/4/22 10:47 AM, Nathan Chancellor wrote:
> > > > > Hi Kees,
> > > > > 
> > > > > On Mon, Oct 03, 2022 at 09:54:26PM -0700, Kees Cook wrote:
> > > > > > On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen
> > > > > > wrote:
> > > > > > > On 10/3/22 16:57, Kees Cook wrote:
> > > > > > > > On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick
> > > > > > > > Edgecombe
> > > > > > > > wrote:
> > > > > > > > > Shadow stack is supported on newer AMD processors,
> > > > > > > > > but the
> > > > > > > > > kernel
> > > > > > > > > implementation has not been tested on them. Prevent
> > > > > > > > > basic
> > > > > > > > > issues from
> > > > > > > > > showing up for normal users by disabling shadow stack
> > > > > > > > > on
> > > > > > > > > all CPUs except
> > > > > > > > > Intel until it has been tested. At which point the
> > > > > > > > > limitation should be
> > > > > > > > > removed.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Rick Edgecombe <
> > > > > > > > > rick.p.edgecombe@intel.com>
> > > > > > > > 
> > > > > > > > So running the selftests on an AMD system is sufficient
> > > > > > > > to
> > > > > > > > drop this
> > > > > > > > patch?
> > > > > > > 
> > > > > > > Yes, that's enough.
> > > > > > > 
> > > > > > > I _thought_ the AMD folks provided some tested-by's at
> > > > > > > some
> > > > > > > point in the
> > > > > > > past.  But, maybe I'm confusing this for one of the other
> > > > > > > shared
> > > > > > > features.  Either way, I'm sure no tested-by's were
> > > > > > > dropped on
> > > > > > > purpose.
> > > > > > > 
> > > > > > > I'm sure Rick is eager to trim down his series and this
> > > > > > > would
> > > > > > > be a great
> > > > > > > patch to drop.  Does anyone want to make that easy for
> > > > > > > Rick?
> > > > > > > 
> > > > > > > <hint> <hint>
> > > > > > 
> > > > > > Hey Gustavo, Nathan, or Nick! I know y'all have some fancy
> > > > > > AMD
> > > > > > testing
> > > > > > rigs. Got a moment to spin up this series and run the
> > > > > > selftests?
> > > > > > :)
> > > > > 
> > > > > I do have access to a system with an EPYC 7513, which does
> > > > > have
> > > > > Shadow
> > > > > Stack support (I can see 'shstk' in the "Flags" section of
> > > > > lscpu
> > > > > with
> > > > > this series). As far as I understand it, AMD only added
> > > > > Shadow
> > > > > Stack
> > > > > with Zen 3; my regular AMD test system is Zen 2 (probably
> > > > > should
> > > > > look at
> > > > > procurring a Zen 3 or Zen 4 one at some point).
> > > > > 
> > > > > I applied this series on top of 6.0 and reverted this change
> > > > > then
> > > > > booted
> > > > > it on that system. After building the selftest (which did
> > > > > require
> > > > > 'make headers_install' and a small addition to make it build
> > > > > beyond
> > > > > that, see below), I ran it and this was the result. I am not
> > > > > sure
> > > > > if
> > > > > that is expected or not but the other results seem promising
> > > > > for
> > > > > dropping this patch.
> > > > > 
> > > > >    $ ./test_shadow_stack_64
> > > > >    [INFO]  new_ssp = 7f8a36c9fff8, *new_ssp = 7f8a36ca0001
> > > > >    [INFO]  changing ssp from 7f8a374a0ff0 to 7f8a36c9fff8
> > > > >    [INFO]  ssp is now 7f8a36ca0000
> > > > >    [OK]    Shadow stack pivot
> > > > >    [OK]    Shadow stack faults
> > > > >    [INFO]  Corrupting shadow stack
> > > > >    [INFO]  Generated shadow stack violation successfully
> > > > >    [OK]    Shadow stack violation test
> > > > >    [INFO]  Gup read -> shstk access success
> > > > >    [INFO]  Gup write -> shstk access success
> > > > >    [INFO]  Violation from normal write
> > > > >    [INFO]  Gup read -> write access success
> > > > >    [INFO]  Violation from normal write
> > > > >    [INFO]  Gup write -> write access success
> > > > >    [INFO]  Cow gup write -> write access success
> > > > >    [OK]    Shadow gup test
> > > > >    [INFO]  Violation from shstk access
> > > > >    [OK]    mprotect() test
> > > > >    [OK]    Userfaultfd test
> > > > >    [FAIL]  Alt shadow stack test
> > > > 
> > > > The selftest is looking OK on my system (Dell PowerEdge R6515
> > > > w/ EPYC
> > > > 7713). I also just pulled a fresh 6.0 kernel and applied the
> > > > series
> > > > including the fix Nathan mentions below.
> > > > 
> > > > $ tools/testing/selftests/x86/test_shadow_stack_64
> > > > [INFO]  new_ssp = 7f30cccc5ff8, *new_ssp = 7f30cccc6001
> > > > [INFO]  changing ssp from 7f30cd4c6ff0 to 7f30cccc5ff8
> > > > [INFO]  ssp is now 7f30cccc6000
> > > > [OK]    Shadow stack pivot
> > > > [OK]    Shadow stack faults
> > > > [INFO]  Corrupting shadow stack
> > > > [INFO]  Generated shadow stack violation successfully
> > > > [OK]    Shadow stack violation test
> > > > [INFO]  Gup read -> shstk access success
> > > > [INFO]  Gup write -> shstk access success
> > > > [INFO]  Violation from normal write
> > > > [INFO]  Gup read -> write access success
> > > > [INFO]  Violation from normal write
> > > > [INFO]  Gup write -> write access success
> > > > [INFO]  Cow gup write -> write access success
> > > > [OK]    Shadow gup test
> > > > [INFO]  Violation from shstk access
> > > > [OK]    mprotect() test
> > > > [OK]    Userfaultfd test
> > > > [OK]    Alt shadow stack test.
> > > 
> > > Thanks for the testing. Based on the test, I wonder if this could
> > > be a
> > > SW bug. Nathan, could I send you a tweaked test with some more
> > > debug
> > > information?
> > 
> > Yes, more than happy to help you look into this further!
> > 
> > > John, are we sure AMD and Intel systems behave the same with
> > > respect to
> > > CPUs not creating Dirty=1,Write=0 PTEs in rare situations? Or any
> > > other
> > > CET related differences we should hash out? Otherwise I'll drop
> > > the
> > > patch for the next version. (and assuming the issue Nathan hit
> > > doesn't
> > > turn up anything HW related).
> 
> I have to admit to being a bit confused here... in general, we trust
> CPUID bits unless they are *known* to be wrong, in which case we
> blacklist them.
> 
> If AMD advertises the feature but it doesn't work or they didn't
> validate it, that would be a (serious!) bug on their part that we can
> address by blacklisting, but they should also fix with a
> microcode/BIOS patch.
> 
> What am I missing?

I have not read anything about the AMD implementation except hearing
that it is supported. But there are some microarchitectural-like aspects
to this CET Linux implementation, around requiring CPUs to not create
Dirty=1,Write=0 PTEs in some cases, where they did in the past. In
another thread Jann asked how the IOMMU works with respect to this edge
case and I'm currently trying to chase down that answer for even Intel
HW. So I just wanted to double check that we expect that everything
should be the same. In either case we still have time to iron things
out before anything gets upstream.
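
For anyone not steeped in the paging details, the encoding in question
can be sketched as a predicate over the two hardware PTE bits. This is
an illustration only: it looks at just the R/W bit (bit 1) and the Dirty
bit (bit 6) of an x86 PTE, ignores every other bit including Present and
any software bits the series uses, and the model_* names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PTE_RW    (UINT64_C(1) << 1)	/* x86 PTE R/W bit */
#define MODEL_PTE_DIRTY (UINT64_C(1) << 6)	/* x86 PTE Dirty bit */

/* True when the PTE has the Write=0,Dirty=1 combination. */
static bool model_pte_is_write0_dirty1(uint64_t pte)
{
	return (pte & (MODEL_PTE_RW | MODEL_PTE_DIRTY)) == MODEL_PTE_DIRTY;
}

/* Only the read-only plus dirty combination matches. */
void model_pte_demo(void)
{
	assert(model_pte_is_write0_dirty1(MODEL_PTE_DIRTY));
	assert(!model_pte_is_write0_dirty1(MODEL_PTE_RW | MODEL_PTE_DIRTY));
	assert(!model_pte_is_write0_dirty1(MODEL_PTE_RW));
	assert(!model_pte_is_write0_dirty1(0));
}
```

The concern is any agent (CPU or IOMMU) that could leave an ordinary
read-only PTE in that state by setting Dirty on its own.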

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-09-29 22:28 ` [PATCH v2 01/39] Documentation/x86: Add CET description Rick Edgecombe
  2022-09-30  3:41   ` Bagas Sanjaya
  2022-10-03 17:18   ` Kees Cook
@ 2022-10-05  0:02   ` Andrew Cooper
  2022-10-10 12:19   ` Florian Weimer
  3 siblings, 0 replies; 241+ messages in thread
From: Andrew Cooper @ 2022-10-05  0:02 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Andrew Cooper
  Cc: Yu-cheng Yu

On 29/09/2022 23:28, Rick Edgecombe wrote:
> diff --git a/Documentation/x86/cet.rst b/Documentation/x86/cet.rst
> new file mode 100644
> index 000000000000..4a0dfb6830f9
> --- /dev/null
> +++ b/Documentation/x86/cet.rst
> @@ -0,0 +1,140 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========================================
> +Control-flow Enforcement Technology (CET)
> +=========================================
> +
> +Overview
> +========
> +
> +Control-flow Enforcement Technology (CET) is term referring to several
> +related x86 processor features that provides protection against control
> +flow hijacking attacks. The HW feature itself can be set up to protect
> +both applications and the kernel. Only user-mode protection is implemented
> +in the 64-bit kernel.
> +
> +CET introduces Shadow Stack and Indirect Branch Tracking. Shadow stack is
> +a secondary stack allocated from memory and cannot be directly modified by
> +applications. When executing a CALL instruction, the processor pushes the
> +return address to both the normal stack and the shadow stack. Upon
> +function return, the processor pops the shadow stack copy and compares it
> +to the normal stack copy. If the two differ, the processor raises a
> +control-protection fault. Indirect branch tracking verifies indirect
> +CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
> +opcodes. Not all CPU's have both Shadow Stack and Indirect Branch Tracking
> +and only Shadow Stack is currently supported in the kernel.

This paragraph is stale, isn't it?

AIUI, by the end of this series, what is supported is in-kernel
self-protection using CET-IBT, and userspace shadow stacks.

It is probably worth keeping the implementation-agnostic bits separate
from the "what is currently supported" matrix.  I'm not certain if it's
worth splitting into cet.rst, cet-kernel.rst and cet-user.rst at this
point, but it's something to consider.

> +The Kconfig options is X86_SHADOW_STACK, and it can be disabled with
> +the kernel parameter clearcpuid, like this: "clearcpuid=shstk".

What about namespacing?  For the CPUID features themselves, yes they're
shstk and ibt.

But for the Kconfig options, the user and kernel implementations are
wildly different for both shstk and ibt.  Are they going to want to
share the same Kconfig option from the get-go?

Independent of the Kconfig symbol, user and kernel have separate
enablement criteria.  e.g. kernel shstk is likely going to be dependent
on the FRED feature, and simply looking at `shstk` in /proc/cpuinfo
doesn't necessarily tell you all you want to know.

> +
> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or LLVM v10.0.1

What are the other dependencies here?

In principle shstk only needs assembler support for the new
instructions, and that's Binutils 2.29 / LLVM 6 from my notes.

It's IBT which needs compiler support (and even then, only kernel IBT),
and that work is already done.

> +or later are required. To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.
> +
> +At run time, /proc/cpuinfo shows CET features if the processor supports
> +CET.

Probably helpful to state what these are.

> +
> +Application Enabling
> +====================
> +
> +An application's CET capability is marked in its ELF header and can be

Technically it's in an ELF note, not the ELF header.

~Andrew

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2022-09-29 22:29 ` [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
  2022-10-03 17:31   ` Kees Cook
@ 2022-10-05  0:55   ` Andrew Cooper
  2022-10-14 17:12   ` Borislav Petkov
  2 siblings, 0 replies; 241+ messages in thread
From: Andrew Cooper @ 2022-10-05  0:55 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Andrew Cooper
  Cc: Yu-cheng Yu

On 29/09/2022 23:29, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> Utilizing CET features requires a CR4 bit to be enabled as well as bits
> to be set in CET MSRs. Setting the CR4 bit does two things:
>  1. Enables the usage of WRUSS instruction, which the kernel can use to
>     write to userspace shadow stacks.
>  2. Allows those individual aspects of CET to be enabled later via the MSR.
>  3. Allows CET to be enabled in guests

Point 1, yes, but the others, not really.  Guests aren't interesting
because host CR4 != guest CR4.

CET is a tangled mess of control bits.  The MSRs can be configured and
context switched independently of CR4.

The 4 main sub-feature enablement conditions are CR4.CET &&
MSR_{U,S}_CET.{SHSTK,ENDBR}_EN.

The WRUSS instruction is keyed on CR4.CET alone.  This is because
CR4.CET is the paging control which changes the interpretation of
R/O+Dirty, and is a prerequisite for any shstk memory accesses.  Most
other shstk instructions have finer-grained enablement conditions.
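
As a sketch of the conditions above, written as plain boolean logic (the
struct and function names are made up for illustration, and real
hardware state lives in CR4 and the two CET MSRs, not in a struct):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the CET control bits discussed above. */
struct model_cet_state {
	bool cr4_cet;		/* CR4.CET */
	bool u_shstk_en;	/* MSR_U_CET.SHSTK_EN */
	bool u_endbr_en;	/* MSR_U_CET.ENDBR_EN */
	bool s_shstk_en;	/* MSR_S_CET.SHSTK_EN */
	bool s_endbr_en;	/* MSR_S_CET.ENDBR_EN */
};

/* Each sub-feature needs CR4.CET plus its own MSR enable bit. */
static bool model_user_shstk(const struct model_cet_state *s)
{
	return s->cr4_cet && s->u_shstk_en;
}

static bool model_user_ibt(const struct model_cet_state *s)
{
	return s->cr4_cet && s->u_endbr_en;
}

static bool model_kernel_shstk(const struct model_cet_state *s)
{
	return s->cr4_cet && s->s_shstk_en;
}

static bool model_kernel_ibt(const struct model_cet_state *s)
{
	return s->cr4_cet && s->s_endbr_en;
}

/* WRUSS is keyed on CR4.CET alone. */
static bool model_wruss_usable(const struct model_cet_state *s)
{
	return s->cr4_cet;
}

/* User shadow stacks only: what this series is aiming for. */
void model_cet_demo(void)
{
	struct model_cet_state s = { .cr4_cet = true, .u_shstk_en = true };

	assert(model_user_shstk(&s));
	assert(model_wruss_usable(&s));
	assert(!model_user_ibt(&s));
	assert(!model_kernel_shstk(&s));
	assert(!model_kernel_ibt(&s));
}
```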

I'd suggest simplifying the commit message massively, to say that
CR4.CET is a prerequisite for all CET operation, so extend setup_cet()
to enable it for user shadow stacks.

It hopefully goes without saying that you cannot do an equivalent piece
of code for supervisor shadow stacks.  If you try, you'll discover that
everything works fine until you try returning from the function which
activated the second of CR4.CET and MSR_S_CET.SHSTK_EN, and the valid
content on the shadow stack underflows.

~Andrew

P.S. There's a fun infoleak.

Userspace can probe for kernel shstk enablement using fault analysis on
the SETSSBUSY instruction.  It takes #UD for !CR4.CET ||
!MSR_S_CET.SHSTK_EN, and then #GP for CPL !=0.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-09-29 22:29 ` [PATCH v2 07/39] x86/cet: Add user control-protection fault handler Rick Edgecombe
                     ` (2 preceding siblings ...)
  2022-10-03 22:51   ` Andy Lutomirski
@ 2022-10-05  1:20   ` Andrew Cooper
  2022-10-05 22:44     ` Edgecombe, Rick P
  2022-10-05  9:39   ` Peter Zijlstra
  4 siblings, 1 reply; 241+ messages in thread
From: Andrew Cooper @ 2022-10-05  1:20 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Andrew Cooper
  Cc: Yu-cheng Yu, Michael Kerrisk

On 29/09/2022 23:29, Rick Edgecombe wrote:
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index d62b2cb85cea..b7dde8730236 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -229,16 +223,74 @@ enum cp_error_code {
>  	CP_ENCL	     = 1 << 15,
>  };
>  
> -DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +#ifdef CONFIG_X86_SHADOW_STACK
> +static const char * const control_protection_err[] = {
> +	"unknown",
> +	"near-ret",
> +	"far-ret/iret",
> +	"endbranch",
> +	"rstorssp",
> +	"setssbsy",
> +};

These are a mix of SHSTK and IBT errors.  They should be inside
CONFIG_X86_CET using Kees' suggestion.

Also, if you express this as

static const char errors[][10] = {
    [0] = "unknown",
    [1] = "near ret",
    [2] = "far/iret",
    [3] = "endbranch",
    [4] = "rstorssp",
    [5] = "setssbsy",
};

then you can encode all the strings in roughly the space it takes to lay
out the pointers above.
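
To make the size argument concrete, here is a userspace sketch that puts the two layouts side by side. The table contents come from the patch and the suggestion above; the variable and function names are hypothetical.

```c
#include <stddef.h>

/* Pointer table: one pointer per entry, plus the string bytes elsewhere. */
static const char * const err_ptrs[] = {
	"unknown", "near-ret", "far-ret/iret",
	"endbranch", "rstorssp", "setssbsy",
};

/* Flat table: every string stored inline in a fixed 10-byte slot. */
static const char err_flat[][10] = {
	[0] = "unknown",
	[1] = "near ret",
	[2] = "far/iret",
	[3] = "endbranch",   /* 9 characters + NUL fills the slot exactly */
	[4] = "rstorssp",
	[5] = "setssbsy",
};

static size_t ptr_table_bytes(void)
{
	return sizeof(err_ptrs);   /* pointers only, strings not counted */
}

static size_t flat_table_bytes(void)
{
	return sizeof(err_flat);   /* 6 entries * 10 bytes, strings included */
}
```

On a 64-bit build the pointer table alone is 48 bytes before counting any string data, while the flat table is 60 bytes in total and needs no relocations.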

> +
> +static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
> +			      DEFAULT_RATELIMIT_BURST);
> +
> +static void do_user_control_protection_fault(struct pt_regs *regs,
> +					     unsigned long error_code)
>  {
> -	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
> -		pr_err("Unexpected #CP\n");
> -		BUG();
> +	struct task_struct *tsk;
> +	unsigned long ssp;
> +
> +	/* Read SSP before enabling interrupts. */
> +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +
> +	cond_local_irq_enable(regs);
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");

So it's ok to get an unexpected #CP on CET-capable hardware, but not on
CET-incapable hardware?

The conditions for this WARN() (and others) probably want adjusting to
what the kernel has enabled, not what hardware is capable of.

> @@ -283,9 +335,29 @@ static int __init ibt_setup(char *str)
>  }
>  
>  __setup("ibt=", ibt_setup);
> -
> +#else
> +static void do_kernel_control_protection_fault(struct pt_regs *regs)
> +{
> +	WARN_ONCE(1, "Kernel-mode control protection fault with IBT disabled\n");
> +}
>  #endif /* CONFIG_X86_KERNEL_IBT */
>  
> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
> +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
> +	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pr_err("Unexpected #CP\n");

Do some future poor soul a favour and render the numeric error code
too.  Without it, the error is ambiguous between SHSTK and IBT when %rip
points at a call/ret instruction.
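
A sketch of what rendering the error code could look like, written as plain userspace C. It assumes the low 15 bits carry the #CP error code (CP_EC) and that codes 1, 2, 4 and 5 are shadow stack errors while 3 is the IBT endbranch error, matching the string table in the patch; the function names and message format are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

#define CP_EC  ((1ul << 15) - 1)   /* assumed mask for the error code field */

/* Classify a #CP error code as coming from SHSTK or IBT. */
static const char *cp_source(unsigned long error_code)
{
	switch (error_code & CP_EC) {
	case 1:  /* near-ret */
	case 2:  /* far-ret/iret */
	case 4:  /* rstorssp */
	case 5:  /* setssbsy */
		return "SHSTK";
	case 3:  /* endbranch */
		return "IBT";
	default:
		return "unknown";
	}
}

/* Format the message with the numeric error code included. */
static int format_unexpected_cp(char *buf, size_t len, unsigned long error_code)
{
	return snprintf(buf, len, "Unexpected #CP, error_code: 0x%lx (%s)",
			error_code, cp_source(error_code));
}
```

With the number in the message, a fault at a call/ret instruction can be attributed to the right feature without guessing.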

~Andrew

* Re: [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2022-09-29 22:29 ` [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
  2022-10-03 14:17   ` Kirill A . Shutemov
@ 2022-10-05  1:31   ` Andrew Cooper
  2022-10-05 11:16     ` Peter Zijlstra
  1 sibling, 1 reply; 241+ messages in thread
From: Andrew Cooper @ 2022-10-05  1:31 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Andrew Cooper
  Cc: Yu-cheng Yu, Christoph Hellwig

On 29/09/2022 23:29, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> Processors sometimes directly create Write=0,Dirty=1 PTEs.

Do they? (Rhetorical)

Yes, this is a relevant anecdote for why CET isn't available on pre-TGL
parts, but it is one of the more wrong things to have as the first sentence
of this commit message.

The point you want to express is that under the CET-SS spec, R/O+Dirty
has a new meaning as type=shstk, so stop using this bit combination for
existing mappings.

I'm not even sure it's relevant to note that CET capable processors can
set D on a R/O mapping, because that depends on !CR0.WP which in turn
prohibits CR4.CET being enabled.

~Andrew

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-09-29 22:29 ` [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
  2022-09-30 15:16   ` Jann Horn
  2022-10-03 16:26   ` Kirill A . Shutemov
@ 2022-10-05  2:17   ` Andrew Cooper
  2022-10-05 14:08     ` Dave Hansen
  2022-10-05 23:01     ` Edgecombe, Rick P
  2022-10-05 11:33   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 241+ messages in thread
From: Andrew Cooper @ 2022-10-05  2:17 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Andrew Cooper
  Cc: Yu-cheng Yu

On 29/09/2022 23:29, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> There is essentially no room left in the x86 hardware PTEs on some OSes
> (not Linux). That left the hardware architects looking for a way to
> represent a new memory type (shadow stack) within the existing bits.
> They chose to repurpose a lightly-used state: Write=0,Dirty=1.

How does "Some OSes have a greater dependence on software available bits
in PTEs than Linux" sound?

> The reason it's lightly used is that Dirty=1 is normally set _before_ a
> write. A write with a Write=0 PTE would typically only generate a fault,
> not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and* generate the
> fault, resulting in a Write=0,Dirty=1 PTE. Hardware which supports shadow
> stacks will no longer exhibit this oddity.

Again, an interesting anecdote but not salient information here.

> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 6496ec84b953..ad201dae7316 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -134,9 +142,17 @@ static inline int pte_young(pte_t pte)
>  	return pte_flags(pte) & _PAGE_ACCESSED;
>  }
>  
> -static inline int pmd_dirty(pmd_t pmd)
> +static inline bool pmd_dirty(pmd_t pmd)
>  {
> -	return pmd_flags(pmd) & _PAGE_DIRTY;
> +	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
> +}
> +
> +static inline bool pmd_shstk(pmd_t pmd)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return false;
> +
> +	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;

(flags & (PSE|RW|D)) == (PSE|D);

R/O+D can exist higher in the paging structures and does not convey
type=shstk-ness to later stages of the walk.


However, there is a further complication which is bound to rear its head
sooner or later, and warrants discussing.

type=shstk isn't actually only R/O+D on the leaf PTE; it's also R/W on
the accumulated access rights on non-leaf PTEs.

Specifically, if you clear the RW bit on any higher level in the
pagetable, then everything mapped by that PTE ceases to be of type
shstk, even if the leaf has the R/O+D bit combination.

This is allegedly a feature for the database folks, where they can
create R/O and R/W aliases of the same memory, sharing intermediate
pagetables, where the R/W alias will set D bits per usual and the R/O
alias need not transmogrify itself into a shadow stack.
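
Both rules can be sketched in userspace C. The bit positions are the architectural ones (RW bit 1, Dirty bit 6, PSE bit 7); the `PG_*` names and the two helper functions are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PG_RW     (1ull << 1)
#define PG_DIRTY  (1ull << 6)
#define PG_PSE    (1ull << 7)

/* A PMD is a shadow stack mapping only if it is a leaf (PSE) with R/O+D. */
static bool pmd_flags_shstk(uint64_t flags)
{
	const uint64_t mask = PG_PSE | PG_RW | PG_DIRTY;

	return (flags & mask) == (PG_PSE | PG_DIRTY);
}

/*
 * Full-walk view: every non-leaf entry must have RW set, and the leaf
 * must be R/O+D.  upper[] holds the flags of the non-leaf entries.
 */
static bool walk_is_shstk(const uint64_t *upper, size_t n, uint64_t leaf_flags)
{
	for (size_t i = 0; i < n; i++) {
		if (!(upper[i] & PG_RW))
			return false;   /* R/O higher up: not type=shstk */
	}

	return (leaf_flags & (PG_RW | PG_DIRTY)) == PG_DIRTY;
}
```

The second helper captures the aliasing case: the same R/O+D leaf is a shadow stack under the R/W alias and ordinary read-only memory under the R/O alias.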

~Andrew

* Re: [PATCH v2 18/39] mm: Add guard pages around a shadow stack.
  2022-10-03 18:30   ` Kees Cook
@ 2022-10-05  2:30     ` Andrew Cooper
  2022-10-10 12:33       ` Florian Weimer
  0 siblings, 1 reply; 241+ messages in thread
From: Andrew Cooper @ 2022-10-05  2:30 UTC (permalink / raw)
  To: Kees Cook, Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu, Andrew Cooper

On 03/10/2022 19:30, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:15PM -0700, Rick Edgecombe wrote:
>> [...]
>> +unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
>> +{
>> +	if (vma->vm_flags & VM_GROWSDOWN)
>> +		return stack_guard_gap;
>> +
>> +	/*
>> +	 * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).
>> +	 * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
>> +	 * (~1KB for INCSSPD) and touches the first and the last element
>> +	 * in the range, which triggers a page fault if the range is not
>> +	 * in a shadow stack. Because of this, creating 4-KB guard pages
>> +	 * around a shadow stack prevents these instructions from going
>> +	 * beyond.
>> +	 *
>> +	 * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
>> +	 * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
>> +	 */
> Thank you for the details on how the size choice is made here! :)

(In case anyone is hankering for some premature optimisation...)

You don't actually need a hole to create a guard.  Any mapping of type
!= shstk will do.

If you've got a load of threads, you can tightly pack stack / shstk /
stack / shstk with no holes, and they each act as each other's guard pages.
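
The arithmetic behind this, using the figures from the quoted comment (INCSSP moves SSP by at most 255 elements of 8 or 4 bytes), can be checked with a short sketch. The helper names are hypothetical, and a 4 KiB page size is assumed.

```c
#include <stdbool.h>

#define GUARD_PAGE_BYTES 4096u   /* assumed 4 KiB page */

/* Largest SSP adjustment of one INCSSP: 255 elements of elem_size bytes. */
static unsigned int incssp_max_bytes(unsigned int elem_size)
{
	return 255u * elem_size;
}

/*
 * One adjacent page of any type other than shstk is enough if a single
 * INCSSP cannot step across it.
 */
static bool one_page_guards(unsigned int elem_size)
{
	return incssp_max_bytes(elem_size) < GUARD_PAGE_BYTES;
}
```

Since the largest step is 2040 bytes, a neighbouring thread's ordinary stack page serves the same purpose as an unmapped hole.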

~Andrew

* Re: [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk
  2022-09-29 22:29 ` [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk Rick Edgecombe
  2022-10-03 20:44   ` Kees Cook
@ 2022-10-05  2:43   ` Andrew Cooper
  2022-10-05 22:47     ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Andrew Cooper @ 2022-10-05  2:43 UTC (permalink / raw)
  To: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Andrew Cooper
  Cc: Yu-cheng Yu

On 29/09/2022 23:29, Rick Edgecombe wrote:
> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index 35f709f619fb..f096f52bd059 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
>  		: [pax] "a" (p));
>  }
>  
> +#ifdef CONFIG_X86_SHADOW_STACK
> +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> +{
> +	asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> +			  _ASM_EXTABLE(1b, %l[fail])
> +			  :: [addr] "r" (addr), [val] "r" (val)
> +			  :: fail);

"1: wrussq %[val], %[addr]\n"
_ASM_EXTABLE(1b, %l[fail])
: [addr] "+m" (*addr)
: [val] "r" (val)
:: fail

Otherwise you've failed to tell the compiler that you wrote to *addr.

With that fixed, it's not volatile because there are no unexpressed side
effects.
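
The effect of the "+m" operand can be tried in userspace with an ordinary store, since WRUSS itself is not usable there. This is only an illustration of the constraint, assuming GCC or Clang; `store_u64` is a hypothetical helper, and the non-x86-64 branch is a plain C fallback.

```c
#include <stdint.h>

static inline void store_u64(uint64_t *addr, uint64_t val)
{
#if defined(__x86_64__)
	/*
	 * "+m" (*addr) tells the compiler that the asm reads and writes
	 * *addr, so no "memory" clobber and no volatile are needed for
	 * the store to be seen by later loads from *addr.
	 */
	__asm__("movq %[val], %[addr]"
		: [addr] "+m" (*addr)
		: [val] "r" (val));
#else
	*addr = val;   /* portable fallback for other architectures */
#endif
}
```

Because the write is expressed as an output operand, the compiler may drop the asm when the stored value is provably never read, which is the behaviour one gets from any other store.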

~Andrew

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-03 16:56         ` Edgecombe, Rick P
  2022-10-04  2:16           ` Bagas Sanjaya
@ 2022-10-05  9:10           ` Peter Zijlstra
  2022-10-05  9:25             ` Bagas Sanjaya
  1 sibling, 1 reply; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-05  9:10 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: corbet, bagasdotme, bsingharora, hpa, Syromiatnikov, Eugene,
	rdunlap, keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov,
	Eranian, Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, bp, Lutomirski, Andy,
	linux-doc, arnd, Moreira, Joao, tglx, mike.kravetz, x86, Yang,
	Weijiang, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	linux-kernel, linux-api, gorcunov

On Mon, Oct 03, 2022 at 04:56:10PM +0000, Edgecombe, Rick P wrote:
> Thanks. Unless anyone has any objections

Well, I'll object. I still feel rst should burn in hell. Plain text FTW.



* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-05  9:10           ` Peter Zijlstra
@ 2022-10-05  9:25             ` Bagas Sanjaya
  2022-10-05  9:46               ` Peter Zijlstra
  0 siblings, 1 reply; 241+ messages in thread
From: Bagas Sanjaya @ 2022-10-05  9:25 UTC (permalink / raw)
  To: Peter Zijlstra, Edgecombe, Rick P
  Cc: corbet, bsingharora, hpa, Syromiatnikov, Eugene, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, bp, Lutomirski, Andy,
	linux-doc, arnd, Moreira, Joao, tglx, mike.kravetz, x86, Yang,
	Weijiang, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	linux-kernel, linux-api, gorcunov

On 10/5/22 16:10, Peter Zijlstra wrote:
> On Mon, Oct 03, 2022 at 04:56:10PM +0000, Edgecombe, Rick P wrote:
>> Thanks. Unless anyone has any objections
> 
> Well, I'll object. I still feel rst should burn in hell. Plain text FTW.
> 
> 

.txt maybe?

-- 
An old man doll... just what I always wanted! - Clara

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-09-29 22:29 ` [PATCH v2 07/39] x86/cet: Add user control-protection fault handler Rick Edgecombe
                     ` (3 preceding siblings ...)
  2022-10-05  1:20   ` Andrew Cooper
@ 2022-10-05  9:39   ` Peter Zijlstra
  2022-10-05 22:45     ` Edgecombe, Rick P
  4 siblings, 1 reply; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-05  9:39 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu,
	Michael Kerrisk

On Thu, Sep 29, 2022 at 03:29:04PM -0700, Rick Edgecombe wrote:

> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index d62b2cb85cea..b7dde8730236 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c

> @@ -229,16 +223,74 @@ enum cp_error_code {
>  	CP_ENCL	     = 1 << 15,
>  };
>  
> -DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +#ifdef CONFIG_X86_SHADOW_STACK
> +static const char * const control_protection_err[] = {
> +	"unknown",
> +	"near-ret",
> +	"far-ret/iret",
> +	"endbranch",
> +	"rstorssp",
> +	"setssbsy",
> +};
> +
> +static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
> +			      DEFAULT_RATELIMIT_BURST);
> +
> +static void do_user_control_protection_fault(struct pt_regs *regs,
> +					     unsigned long error_code)
>  {
> -	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
> -		pr_err("Unexpected #CP\n");
> -		BUG();
> +	struct task_struct *tsk;
> +	unsigned long ssp;
> +
> +	/* Read SSP before enabling interrupts. */
> +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +
> +	cond_local_irq_enable(regs);
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
> +
> +	tsk = current;
> +	tsk->thread.error_code = error_code;
> +	tsk->thread.trap_nr = X86_TRAP_CP;
> +
> +	/* Ratelimit to prevent log spamming. */
> +	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> +	    __ratelimit(&cpf_rate)) {
> +		unsigned int cpec;
> +
> +		cpec = error_code & CP_EC;
> +		if (cpec >= ARRAY_SIZE(control_protection_err))
> +			cpec = 0;
> +
> +		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
> +			 tsk->comm, task_pid_nr(tsk),
> +			 regs->ip, regs->sp, ssp, error_code,
> +			 control_protection_err[cpec],
> +			 error_code & CP_ENCL ? " in enclave" : "");
> +		print_vma_addr(KERN_CONT " in ", regs->ip);
> +		pr_cont("\n");
>  	}
>  
> -	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
> -		return;

Why are you removing the (error_code & CP_EC) != CP_ENDBR check from the
kernel handler?

> +	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> +	cond_local_irq_disable(regs);
> +}
> +#else
> +static void do_user_control_protection_fault(struct pt_regs *regs,
> +					     unsigned long error_code)
> +{
> +	WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
> +}
> +#endif
> +
> +#ifdef CONFIG_X86_KERNEL_IBT
> +
> +static __ro_after_init bool ibt_fatal = true;
> +
> +extern void ibt_selftest_ip(void); /* code label defined in asm below */
>  
> +static void do_kernel_control_protection_fault(struct pt_regs *regs)
> +{
>  	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
>  		regs->ax = 0;
>  		return;
> @@ -283,9 +335,29 @@ static int __init ibt_setup(char *str)
>  }
>  
>  __setup("ibt=", ibt_setup);
> -
> +#else
> +static void do_kernel_control_protection_fault(struct pt_regs *regs)
> +{
> +	WARN_ONCE(1, "Kernel-mode control protection fault with IBT disabled\n");
> +}
>  #endif /* CONFIG_X86_KERNEL_IBT */
>  
> +#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
> +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
> +	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pr_err("Unexpected #CP\n");
> +		BUG();
> +	}
> +
> +	if (user_mode(regs))
> +		do_user_control_protection_fault(regs, error_code);
> +	else
> +		do_kernel_control_protection_fault(regs);

These function names are weirdly long, surely they can do without the
_fault part at the very least. And as stated above, I would really like
the kernel thing to retain the error_code argument.
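
A userspace model of the dispatch with the error_code check kept on the kernel side. The CP_EC mask and CP_ENDBR value of 3 are assumed to match the kernel's enum cp_error_code; the enum and function names below are hypothetical.

```c
#include <stdbool.h>

#define CP_EC     ((1ul << 15) - 1)   /* assumed error code mask */
#define CP_ENDBR  3ul                 /* assumed endbranch error code */

enum cp_action {
	CP_USER_SIGNAL,        /* user-mode #CP: deliver a signal */
	CP_KERNEL_IBT,         /* kernel-mode missing-ENDBR fault */
	CP_KERNEL_UNEXPECTED   /* kernel-mode #CP that is not an IBT fault */
};

static enum cp_action cp_dispatch(bool user_mode, unsigned long error_code)
{
	if (user_mode)
		return CP_USER_SIGNAL;

	/* Retained check: only endbranch errors are expected in the kernel. */
	if ((error_code & CP_EC) != CP_ENDBR)
		return CP_KERNEL_UNEXPECTED;

	return CP_KERNEL_IBT;
}
```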

> +}
> +#endif /* defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK) */



* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-05  9:25             ` Bagas Sanjaya
@ 2022-10-05  9:46               ` Peter Zijlstra
  0 siblings, 0 replies; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-05  9:46 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Edgecombe, Rick P, corbet, bsingharora, hpa, Syromiatnikov,
	Eugene, rdunlap, keescook, Yu, Yu-cheng, dave.hansen,
	kirill.shutemov, Eranian, Stephane, linux-mm, fweimer,
	nadav.amit, jannh, dethoma, linux-arch, kcc, pavel, oleg,
	hjl.tools, bp, Lutomirski, Andy, linux-doc, arnd, Moreira, Joao,
	tglx, mike.kravetz, x86, Yang, Weijiang, jamorris, john.allen,
	rppt, mingo, Shankar, Ravi V, linux-kernel, linux-api, gorcunov

On Wed, Oct 05, 2022 at 04:25:39PM +0700, Bagas Sanjaya wrote:
> On 10/5/22 16:10, Peter Zijlstra wrote:
> > On Mon, Oct 03, 2022 at 04:56:10PM +0000, Edgecombe, Rick P wrote:
> >> Thanks. Unless anyone has any objections
> > 
> > Well, I'll object. I still feel rst should burn in hell. Plain text FTW.
> > 
> > 
> 
> .txt maybe?

We had that, but some idiots went and converted the lot to .rst :-(

* Re: [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2022-10-05  1:31   ` Andrew Cooper
@ 2022-10-05 11:16     ` Peter Zijlstra
  2022-10-05 12:34       ` Andrew Cooper
  0 siblings, 1 reply; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-05 11:16 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu,
	Christoph Hellwig

On Wed, Oct 05, 2022 at 01:31:28AM +0000, Andrew Cooper wrote:
> On 29/09/2022 23:29, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> >
> > Processors sometimes directly create Write=0,Dirty=1 PTEs.
> 
> Do they? (Rhetorical)
> 
> Yes, this is a relevant anecdote for why CET isn't available on pre-TGL
> parts, but it is one of the more wrong things to have as the first sentence
> of this commit message.
> 
> The point you want to express is that under the CET-SS spec, R/O+Dirty
> has a new meaning as type=shstk, so stop using this bit combination for
> existing mappings.
> 
> I'm not even sure it's relevant to note that CET capable processors can
> set D on a R/O mapping, because that depends on !CR0.WP which in turn
> prohibits CR4.CET being enabled.

Whilst I agree that the Changelog is 'suboptimal' -- I do think it might
be good to mention how we ended up at the current state where we
explicitly set this non-sensical W=0,D=1 state.

Looking at the git history this seems to be a bit of a hysterical
accident, not something done on purpose to 'optimize' for these weird
CPUs.

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-09-29 22:29 ` [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
                     ` (2 preceding siblings ...)
  2022-10-05  2:17   ` Andrew Cooper
@ 2022-10-05 11:33   ` Peter Zijlstra
  2022-10-14  9:41   ` Peter Zijlstra
  2022-10-14  9:42   ` Peter Zijlstra
  5 siblings, 0 replies; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-05 11:33 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:

Mucho confusion here:

> (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
> (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
> (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack

are all identical cases;

> (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
> (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without

as are these.


* Re: [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  2022-10-05 11:16     ` Peter Zijlstra
@ 2022-10-05 12:34       ` Andrew Cooper
  0 siblings, 0 replies; 241+ messages in thread
From: Andrew Cooper @ 2022-10-05 12:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu,
	Christoph Hellwig, Andrew Cooper

On 05/10/2022 12:16, Peter Zijlstra wrote:
> On Wed, Oct 05, 2022 at 01:31:28AM +0000, Andrew Cooper wrote:
>> On 29/09/2022 23:29, Rick Edgecombe wrote:
>>> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>>>
>>> Processors sometimes directly create Write=0,Dirty=1 PTEs.
>> Do they? (Rhetorical)
>>
>> Yes, this is a relevant anecdote for why CET isn't available on pre-TGL
>> parts, but it is one of the more wrong things to have as the first sentence
>> of this commit message.
>>
>> The point you want to express is that under the CET-SS spec, R/O+Dirty
>> has a new meaning as type=shstk, so stop using this bit combination for
>> existing mappings.
>>
>> I'm not even sure it's relevant to note that CET capable processors can
>> set D on a R/O mapping, because that depends on !CR0.WP which in turn
>> prohibits CR4.CET being enabled.
> Whilst I agree that the Changelog is 'suboptimal' -- I do think it might
> be good to mention how we ended up at the current state where we
> explicitly set this non-sensical W=0,D=1 state.

Sure, but that's got nothing to do with hardware errata.

Having hardware set A/D bits is expensive.  Being a locked operation,
it's roughly a smp_mb() behind the scenes.

Therefore, when A/D tracking doesn't matter, traditional wisdom says set
both of them when creating the PTE.

It's only now that R/O+Dirty has a meaning (other than being a slightly
weird but safe bit combination), and we've got to be more careful about
using it.
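
A sketch of how pre-setting A/D bits might look once R/O+Dirty has to be avoided. The bit positions are architectural (RW bit 1, Accessed bit 5, Dirty bit 6); the `PG_*` names and `preset_ad_bits` are hypothetical and not taken from the patch.

```c
#include <stdint.h>

#define PG_RW        (1ull << 1)
#define PG_ACCESSED  (1ull << 5)
#define PG_DIRTY     (1ull << 6)

/*
 * Pre-set A/D so hardware never needs a locked update, but keep Dirty
 * off read-only mappings so they are not interpreted as type=shstk.
 */
static uint64_t preset_ad_bits(uint64_t flags)
{
	flags |= PG_ACCESSED;

	if (flags & PG_RW)
		flags |= PG_DIRTY;
	else
		flags &= ~PG_DIRTY;

	return flags;
}
```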

~Andrew

* RE: [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-10-04 19:32         ` Kees Cook
@ 2022-10-05 13:32           ` David Laight
  0 siblings, 0 replies; 241+ messages in thread
From: David Laight @ 2022-10-05 13:32 UTC (permalink / raw)
  To: 'Kees Cook'
  Cc: 'Dave Hansen',
	Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Mike Kravetz, Nadav Amit, Oleg Nesterov,
	Pavel Machek, Peter Zijlstra, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

From: Kees Cook
> Sent: 04 October 2022 20:32
...
> Oh, yes! I do this all the time with FORTIFY shenanigans. Right, so,
> instead of a macro, the "cannot be un-inlined" could be enforced with
> this (untested):
> 
> static __always_inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> {
> 	u64 val, new_val;
> 
> 	BUILD_BUG_ON(!__builtin_constant_p(msr) ||
> 		     !__builtin_constant_p(set) ||
> 		     !__builtin_constant_p(clear));

You can reduce the amount of text the brain has to parse
by using:

	BUILD_BUG_ON(!__builtin_constant_p(msr + set + clear));

Just requires the brain to do a quick 'oh yes'...
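
The combined check can be tried outside the kernel with the GCC/Clang builtin directly. In this sketch `ALL_CONSTANT` and the two functions are hypothetical names, 0x6a0 stands in for an MSR number, and a volatile local is used so that the non-constant case does not depend on the optimisation level.

```c
/* One test covers all three arguments: the sum is constant only if each is. */
#define ALL_CONSTANT(msr, set, clear) \
	__builtin_constant_p((msr) + (set) + (clear))

/* All three arguments are literals, so the sum is a compile-time constant. */
static int all_literals(void)
{
	return ALL_CONSTANT(0x6a0u, 1ull, 0ull);
}

/* One argument is only known at run time, so the sum is not constant. */
static int one_runtime_value(void)
{
	volatile unsigned long long set = 1;

	return ALL_CONSTANT(0x6a0u, set, 0ull);
}
```

The builtin does not evaluate its argument, so the addition costs nothing and cannot overflow at run time.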

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-10-05  2:17   ` Andrew Cooper
@ 2022-10-05 14:08     ` Dave Hansen
  2022-10-05 23:06       ` Edgecombe, Rick P
  2022-10-05 23:01     ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2022-10-05 14:08 UTC (permalink / raw)
  To: Andrew Cooper, Rick Edgecombe, x86, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, linux-kernel, linux-doc, linux-mm,
	linux-arch, linux-api, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: Yu-cheng Yu

On 10/4/22 19:17, Andrew Cooper wrote:
> On 29/09/2022 23:29, Rick Edgecombe wrote:
>> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>>
>> There is essentially no room left in the x86 hardware PTEs on some OSes
>> (not Linux). That left the hardware architects looking for a way to
>> represent a new memory type (shadow stack) within the existing bits.
>> They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> How does "Some OSes have a greater dependence on software available bits
> in PTEs than Linux" sound?
> 
>> The reason it's lightly used is that Dirty=1 is normally set _before_ a
>> write. A write with a Write=0 PTE would typically only generate a fault,
>> not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and* generate the
>> fault, resulting in a Write=0,Dirty=1 PTE. Hardware which supports shadow
>> stacks will no longer exhibit this oddity.
> Again, an interesting anecdote but not salient information here.

As much as I like the sound of my own voice (and anecdotes), I agree
that this is a bit oblique for the patch.  Maybe this anecdote should
get banished elsewhere.

The changelog here could definitely get to the point faster.

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-10-05  1:20   ` Andrew Cooper
@ 2022-10-05 22:44     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-05 22:44 UTC (permalink / raw)
  To: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, andrew.cooper3, hjl.tools, Yang, Weijiang, oleg, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng, mtk.manpages

On Wed, 2022-10-05 at 01:20 +0000, Andrew Cooper wrote:
> On 29/09/2022 23:29, Rick Edgecombe wrote:
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index d62b2cb85cea..b7dde8730236 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -229,16 +223,74 @@ enum cp_error_code {
> >  	CP_ENCL	     = 1 << 15,
> >  };
> >  
> > -DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static const char * const control_protection_err[] = {
> > +	"unknown",
> > +	"near-ret",
> > +	"far-ret/iret",
> > +	"endbranch",
> > +	"rstorssp",
> > +	"setssbsy",
> > +};
> 
> These are a mix of SHSTK and IBT errors.  They should be inside
> CONFIG_X86_CET using Kees' suggestion.
> 
> Also, if you express this as
> 
> static const char errors[][10] = {
>     [0] = "unknown",
>     [1] = "near ret",
>     [2] = "far/iret",
>     [3] = "endbranch",
>     [4] = "rstorssp",
>     [5] = "setssbsy",
> };
> 
> then you can encode all the strings in roughly the space it takes to
> lay
> out the pointers above.

It is only used in the user shadow stack side of the handler. I guess
the kernel IBT side of the handler could print these out too.

Can you explain more about why this array is better than the other one?
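
To illustrate the size argument, here is a minimal user-space sketch (not
taken from the patch). It places the two table layouts side by side and
assumes a 64-bit build with 8-byte pointers; the helper names are invented
for the example.

```c
#include <stddef.h>

/* Layout 1: six pointers; the string bytes live elsewhere in .rodata. */
static const char * const err_ptrs[] = {
	"unknown", "near-ret", "far-ret/iret",
	"endbranch", "rstorssp", "setssbsy",
};

/* Layout 2: strings stored inline in fixed 10-byte rows, no pointers. */
static const char err_rows[][10] = {
	[0] = "unknown",
	[1] = "near ret",
	[2] = "far/iret",
	[3] = "endbranch",	/* 9 characters + NUL fills the row exactly */
	[4] = "rstorssp",
	[5] = "setssbsy",
};

/* Size of the pointer table alone, excluding the strings it points to. */
size_t ptr_table_bytes(void)
{
	return sizeof(err_ptrs);
}

/* Size of the inline table, which already includes all string data. */
size_t inline_table_bytes(void)
{
	return sizeof(err_rows);
}

/* Bounds-checked lookup; out-of-range indexes map to "unknown". */
const char *cp_err_name(size_t idx)
{
	if (idx >= sizeof(err_rows) / sizeof(err_rows[0]))
		idx = 0;
	return err_rows[idx];
}
```

With 8-byte pointers the pointer table takes 48 bytes before any string
data is counted, while the inline table is 60 bytes in total. The inline
form also needs no relocations, at the cost of shortening any string that
does not fit the row width.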

> 
> > +
> > +static DEFINE_RATELIMIT_STATE(cpf_rate,
> > DEFAULT_RATELIMIT_INTERVAL,
> > +			      DEFAULT_RATELIMIT_BURST);
> > +
> > +static void do_user_control_protection_fault(struct pt_regs *regs,
> > +					     unsigned long error_code)
> >  {
> > -	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
> > -		pr_err("Unexpected #CP\n");
> > -		BUG();
> > +	struct task_struct *tsk;
> > +	unsigned long ssp;
> > +
> > +	/* Read SSP before enabling interrupts. */
> > +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
> > +
> > +	cond_local_irq_enable(regs);
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > +		WARN_ONCE(1, "User-mode control protection fault with
> > shadow support disabled\n");
> 
> So it's ok to get an unexpected #CP on CET-capable hardware, but not
> on
> CET-incapable hardware?
> 
> The conditions for this WARN() (and others) probably want adjusting
> to
> what the kernel has enabled, not what hardware is capable of.

Sorry, I don't follow. This code is only compiled in if the kernel has
been compiled for userspace shadow stacks. If the HW supports it and
the kernel is configured for it, it should be enabled. If you clear it
with the clearcpuid command line it should be as if the HW doesn't
support it. So I think it should not be too unexpected, in situations
where it gets past this check.

> 
> > @@ -283,9 +335,29 @@ static int __init ibt_setup(char *str)
> >  }
> >  
> >  __setup("ibt=", ibt_setup);
> > -
> > +#else
> > +static void do_kernel_control_protection_fault(struct pt_regs
> > *regs)
> > +{
> > +	WARN_ONCE(1, "Kernel-mode control protection fault with IBT
> > disabled\n");
> > +}
> >  #endif /* CONFIG_X86_KERNEL_IBT */
> >  
> > +#if defined(CONFIG_X86_KERNEL_IBT) ||
> > defined(CONFIG_X86_SHADOW_STACK)
> > +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
> > +	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> > +		pr_err("Unexpected #CP\n");
> 
> Do some future poor soul a favour and render the numeric error code
> too.  Without it, the error is ambiguous between SHSTK and IBT when
> %rip
> points at a call/ret instruction.
> 

This was from the original kernel IBT handler. Yes, all these messages
should probably be unified too. Thanks.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  2022-10-05  9:39   ` Peter Zijlstra
@ 2022-10-05 22:45     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-05 22:45 UTC (permalink / raw)
  To: peterz
  Cc: mtk.manpages, bsingharora, hpa, Syromiatnikov, Eugene, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	corbet, linux-kernel, linux-api, gorcunov

On Wed, 2022-10-05 at 11:39 +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:29:04PM -0700, Rick Edgecombe wrote:
> 
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index d62b2cb85cea..b7dde8730236 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -229,16 +223,74 @@ enum cp_error_code {
> >  	CP_ENCL	     = 1 << 15,
> >  };
> >  
> > -DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static const char * const control_protection_err[] = {
> > +	"unknown",
> > +	"near-ret",
> > +	"far-ret/iret",
> > +	"endbranch",
> > +	"rstorssp",
> > +	"setssbsy",
> > +};
> > +
> > +static DEFINE_RATELIMIT_STATE(cpf_rate,
> > DEFAULT_RATELIMIT_INTERVAL,
> > +			      DEFAULT_RATELIMIT_BURST);
> > +
> > +static void do_user_control_protection_fault(struct pt_regs *regs,
> > +					     unsigned long error_code)
> >  {
> > -	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
> > -		pr_err("Unexpected #CP\n");
> > -		BUG();
> > +	struct task_struct *tsk;
> > +	unsigned long ssp;
> > +
> > +	/* Read SSP before enabling interrupts. */
> > +	rdmsrl(MSR_IA32_PL3_SSP, ssp);
> > +
> > +	cond_local_irq_enable(regs);
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > +		WARN_ONCE(1, "User-mode control protection fault with
> > shadow support disabled\n");
> > +
> > +	tsk = current;
> > +	tsk->thread.error_code = error_code;
> > +	tsk->thread.trap_nr = X86_TRAP_CP;
> > +
> > +	/* Ratelimit to prevent log spamming. */
> > +	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> > +	    __ratelimit(&cpf_rate)) {
> > +		unsigned int cpec;
> > +
> > +		cpec = error_code & CP_EC;
> > +		if (cpec >= ARRAY_SIZE(control_protection_err))
> > +			cpec = 0;
> > +
> > +		pr_emerg("%s[%d] control protection ip:%lx sp:%lx
> > ssp:%lx error:%lx(%s)%s",
> > +			 tsk->comm, task_pid_nr(tsk),
> > +			 regs->ip, regs->sp, ssp, error_code,
> > +			 control_protection_err[cpec],
> > +			 error_code & CP_ENCL ? " in enclave" : "");
> > +		print_vma_addr(KERN_CONT " in ", regs->ip);
> > +		pr_cont("\n");
> >  	}
> >  
> > -	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) !=
> > CP_ENDBR))
> > -		return;
> 
> Why are you removing the (error_code & CP_EC) != CP_ENDBR check from
> the
> kernel handler?

Argh. It was accidentally removed with the user_mode() check. I'll fix
it.

> 
> > +	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> > +	cond_local_irq_disable(regs);
> > +}
> > +#else
> > +static void do_user_control_protection_fault(struct pt_regs *regs,
> > +					     unsigned long error_code)
> > +{
> > +	WARN_ONCE(1, "User-mode control protection fault with shadow
> > support disabled\n");
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86_KERNEL_IBT
> > +
> > +static __ro_after_init bool ibt_fatal = true;
> > +
> > +extern void ibt_selftest_ip(void); /* code label defined in asm
> > below */
> >  
> > +static void do_kernel_control_protection_fault(struct pt_regs
> > *regs)
> > +{
> >  	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
> >  		regs->ax = 0;
> >  		return;
> > @@ -283,9 +335,29 @@ static int __init ibt_setup(char *str)
> >  }
> >  
> >  __setup("ibt=", ibt_setup);
> > -
> > +#else
> > +static void do_kernel_control_protection_fault(struct pt_regs
> > *regs)
> > +{
> > +	WARN_ONCE(1, "Kernel-mode control protection fault with IBT
> > disabled\n");
> > +}
> >  #endif /* CONFIG_X86_KERNEL_IBT */
> >  
> > +#if defined(CONFIG_X86_KERNEL_IBT) ||
> > defined(CONFIG_X86_SHADOW_STACK)
> > +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
> > +	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> > +		pr_err("Unexpected #CP\n");
> > +		BUG();
> > +	}
> > +
> > +	if (user_mode(regs))
> > +		do_user_control_protection_fault(regs, error_code);
> > +	else
> > +		do_kernel_control_protection_fault(regs);
> 
> These function names are weirdly long, surely they can do without the
> _fault part at the very least. And as stated above, I would really
> like
> the kernel thing to retain the error_code argument.
> 

I can shorten them. Thanks.

> > +}
> > +#endif /* defined(CONFIG_X86_KERNEL_IBT) ||
> > defined(CONFIG_X86_SHADOW_STACK) */
> 
> 

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk
  2022-10-05  2:43   ` Andrew Cooper
@ 2022-10-05 22:47     ` Edgecombe, Rick P
  2022-10-05 22:58       ` Andrew Cooper
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-05 22:47 UTC (permalink / raw)
  To: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, andrew.cooper3, hjl.tools, Yang, Weijiang, oleg, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Wed, 2022-10-05 at 02:43 +0000, Andrew Cooper wrote:
> On 29/09/2022 23:29, Rick Edgecombe wrote:
> > diff --git a/arch/x86/include/asm/special_insns.h
> > b/arch/x86/include/asm/special_insns.h
> > index 35f709f619fb..f096f52bd059 100644
> > --- a/arch/x86/include/asm/special_insns.h
> > +++ b/arch/x86/include/asm/special_insns.h
> > @@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
> >                : [pax] "a" (p));
> >   }
> >   
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> > +{
> > +     asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> > +                       _ASM_EXTABLE(1b, %l[fail])
> > +                       :: [addr] "r" (addr), [val] "r" (val)
> > +                       :: fail);
> 
> "1: wrssq %[val], %[addr]\n"
> _ASM_EXTABLE(1b, %l[fail])
> : [addr] "+m" (*addr)
> : [val] "r" (val)
> :: fail
> 
> Otherwise you've failed to tell the compiler that you wrote to *addr.
> 
> With that fixed, it's not volatile because there are no unexpressed
> side
> effects.

Ok, thanks!

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk
  2022-10-05 22:47     ` Edgecombe, Rick P
@ 2022-10-05 22:58       ` Andrew Cooper
  2022-10-20 21:51         ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Andrew Cooper @ 2022-10-05 22:58 UTC (permalink / raw)
  To: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, hjl.tools, Yang, Weijiang, oleg, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	corbet, linux-kernel, linux-api, gorcunov, Andrew Cooper
  Cc: Yu, Yu-cheng

On 05/10/2022 23:47, Edgecombe, Rick P wrote:
> On Wed, 2022-10-05 at 02:43 +0000, Andrew Cooper wrote:
>> On 29/09/2022 23:29, Rick Edgecombe wrote:
>>> diff --git a/arch/x86/include/asm/special_insns.h
>>> b/arch/x86/include/asm/special_insns.h
>>> index 35f709f619fb..f096f52bd059 100644
>>> --- a/arch/x86/include/asm/special_insns.h
>>> +++ b/arch/x86/include/asm/special_insns.h
>>> @@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
>>>                : [pax] "a" (p));
>>>   }
>>>   
>>> +#ifdef CONFIG_X86_SHADOW_STACK
>>> +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
>>> +{
>>> +     asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
>>> +                       _ASM_EXTABLE(1b, %l[fail])
>>> +                       :: [addr] "r" (addr), [val] "r" (val)
>>> +                       :: fail);
>> "1: wrssq %[val], %[addr]\n"
>> _ASM_EXTABLE(1b, %l[fail])
>> : [addr] "+m" (*addr)
>> : [val] "r" (val)
>> :: fail
>>
>> Otherwise you've failed to tell the compiler that you wrote to *addr.
>>
>> With that fixed, it's not volatile because there are no unexpressed
>> side
>> effects.
> Ok, thanks!

On further consideration, it should be "=m" not "+m", which is even less
constrained, and even easier for an enterprising optimiser to try and do
something useful with.

~Andrew
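
For readers less familiar with the constraint syntax, here is a small
user-space sketch of the same idea. A plain mov stands in for wrussq,
which is privileged and cannot run in user space, and the function names
are made up for the example. On non-x86-64 builds it falls back to an
ordinary C store.

```c
#include <stdint.h>

/*
 * Store val to *addr. The "=m" output tells the compiler that the asm
 * writes *addr, so it may not cache or reorder accesses to that object
 * across the statement. Nothing else is hidden from the compiler, which
 * is why the asm does not need to be volatile.
 */
static inline void store_u64(uint64_t *addr, uint64_t val)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	__asm__("movq %[val], %[addr]"
		: [addr] "=m" (*addr)
		: [val] "r" (val));
#else
	*addr = val;	/* portable fallback, same observable effect */
#endif
}

/* Reads back the slot after the store; the result depends on the asm. */
uint64_t store_then_load(void)
{
	uint64_t slot = 0;

	store_u64(&slot, 42);
	return slot;
}
```

Passing only the pointer value as an "r" input, as in the original hunk,
gives the compiler no indication that the pointed-to memory changes.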

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-10-05  2:17   ` Andrew Cooper
  2022-10-05 14:08     ` Dave Hansen
@ 2022-10-05 23:01     ` Edgecombe, Rick P
  1 sibling, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-05 23:01 UTC (permalink / raw)
  To: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, andrew.cooper3, hjl.tools, Yang, Weijiang, oleg, Lutomirski,
	Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Wed, 2022-10-05 at 02:17 +0000, Andrew Cooper wrote:
> (flags & (PSE|RW|D)) == (PSE|D);
> 
> R/O+D can exist higher in the paging structures and does not convey
> type=shstk-ness to later stages of the walk.

Hmm, yes. I guess it would be more correct to check if it's a leaf as
well.

> 
> 
> However, there is a further complication which is bound to rear its head
> sooner or later, and warrants discussing.
> 
> type=shstk isn't actually only R/O+D on the leaf PTE; it's also R/W on
> the accumulated access rights on non-leaf PTEs.
> 
> Specifically, if you clear the RW bit on any higher level in the
> pagetable, then everything mapped by that PTE ceases to be of type
> shstk, even if the leaf has the R/O+D bit combination.
> 
> This is allegedly a feature for the database folks, where they can
> create R/O and R/W aliases of the same memory, sharing intermediate
> pagetables, where the R/W alias will set D bits per usual and the R/O
> alias needs not to transmogrify itself into a shadow stack.

Thanks, I somehow missed this corner of the architecture. It looks like
this is not an issue for Linux at the moment because non-leaf PTEs
should have Write=1. I guess we need to keep this in mind if we ever
have Write=0 upper level PTEs though. Maybe a comment around
_PAGE_TABLE would be useful.
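
The rule described above can be modelled in a few lines. This is only a
sketch of the architectural condition, not kernel code: the function names
are hypothetical, the Present bit and other permission checks are ignored,
and the non-leaf entries are passed in as a plain array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* x86 page-table entry bits used in this model. */
#define PTE_RW		(UINT64_C(1) << 1)	/* Write */
#define PTE_DIRTY	(UINT64_C(1) << 6)	/* Dirty */

/* A leaf carries the shadow stack encoding when Write=0 and Dirty=1. */
bool leaf_is_shstk_encoded(uint64_t leaf)
{
	return (leaf & (PTE_RW | PTE_DIRTY)) == PTE_DIRTY;
}

/*
 * upper[] holds the non-leaf entries of the walk. The mapping is only of
 * type shadow stack if every one of them has Write=1 and the leaf has the
 * Write=0,Dirty=1 encoding.
 */
bool mapping_is_shstk(const uint64_t *upper, size_t n_upper, uint64_t leaf)
{
	for (size_t i = 0; i < n_upper; i++)
		if (!(upper[i] & PTE_RW))
			return false;
	return leaf_is_shstk_encoded(leaf);
}
```

Clearing Write in any entry of upper[] makes the function return false for
the same leaf value, which is the aliasing case mentioned above.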



^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-10-05 14:08     ` Dave Hansen
@ 2022-10-05 23:06       ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-05 23:06 UTC (permalink / raw)
  To: Shankar, Ravi V, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, keescook, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, andrew.cooper3, hjl.tools, Yang, Weijiang,
	oleg, Lutomirski, Andy, pavel, arnd, Moreira, Joao, tglx,
	mike.kravetz, x86, linux-doc, jamorris, john.allen, rppt, mingo,
	Hansen, Dave, corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Wed, 2022-10-05 at 07:08 -0700, Dave Hansen wrote:
> On 10/4/22 19:17, Andrew Cooper wrote:
> > On 29/09/2022 23:29, Rick Edgecombe wrote:
> > > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > > 
> > > There is essentially no room left in the x86 hardware PTEs on
> > > some OSes
> > > (not Linux). That left the hardware architects looking for a way
> > > to
> > > represent a new memory type (shadow stack) within the existing
> > > bits.
> > > They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> > 
> > How does "Some OSes have a greater dependence on software available
> > bits
> > in PTEs than Linux" sound?
> > 
> > > The reason it's lightly used is that Dirty=1 is normally set
> > > _before_ a
> > > write. A write with a Write=0 PTE would typically only generate a
> > > fault,
> > > not set Dirty=1. Hardware can (rarely) both set Write=1 *and*
> > > generate the
> > > fault, resulting in a Dirty=0,Write=1 PTE. Hardware which
> > > supports shadow
> > > stacks will no longer exhibit this oddity.
> > 
> > Again, an interesting anecdote but not salient information here.
> 
> As much as I like the sound of my own voice (and anecdotes), I agree
> that this is a bit oblique for the patch.  Maybe this anecdote should
> get banished elsewhere.
> 
> The changelog here could definitely get to the point faster.

Although this text was inherited, I thought it was useful to disperse
any "huh, I wonder why" thoughts that may be lingering in the reader's
head as they try to grok the rest of the text. I'll shorten it as
suggested. Thanks all.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace
  2022-10-04  4:37       ` Kees Cook
@ 2022-10-06  0:38         ` Edgecombe, Rick P
  2022-10-06  3:11           ` Kees Cook
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-06  0:38 UTC (permalink / raw)
  To: keescook, Lutomirski, Andy
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	dave.hansen, kirill.shutemov, Eranian, Stephane, linux-mm,
	fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch, bp, oleg,
	hjl.tools, Yang, Weijiang, linux-doc, pavel, arnd, Moreira, Joao,
	tglx, mike.kravetz, x86, jamorris, john.allen, rppt, mingo,
	Shankar, Ravi V, corbet, linux-kernel, linux-api, gorcunov

On Mon, 2022-10-03 at 21:37 -0700, Kees Cook wrote:
> On Mon, Oct 03, 2022 at 04:00:36PM -0700, Andy Lutomirski wrote:
> > On 10/3/22 15:28, Kees Cook wrote:
> > > On Thu, Sep 29, 2022 at 03:29:26PM -0700, Rick Edgecombe wrote:
> > > > For the current shadow stack implementation, shadow stacks
> > > > contents easily
> > > > be arbitrarily provisioned with data.
> > > 
> > > I can't parse this sentence.
> > > 
> > > > This property helps apps protect
> > > > themselves better, but also restricts any potential apps that
> > > > may want to
> > > > do exotic things at the expense of a little security.
> > > 
> > > > Is anything using this right now? Wouldn't things be safer without
> > > WRSS?
> > > (Why can't we skip this patch?)
> > > 
> > 
> > So that people don't write programs that need either (shstk off) or
> > (shstk
> > on and WRSS on) and crash or otherwise fail on kernels that support
> > shstk
> > but don't support WRSS, perhaps?
> 
> Right, yes. I meant more "what programs currently need WRSS to
> operate
> under shstk? (And what is it that they are doing that needs it?)"
> 
> All I see currently is compiler self-tests and emulators using it?
> 
https://codesearch.debian.net/search?q=%5Cb%28wrss%7CWRSS%29%5Cb&literal=0&perpkg=1

Most apps that weren't just automatically compiled haven't had
implementation effort yet. (of course glibc has had a bunch) I hope we
would see more of that when we finally get it upstream. So I think a
better question is, how many apps will need WRSS when they go to enable
shadow stack. I'm thinking the answer must be some and it could be nice
to catch them when they first investigate enabling it.

But yes, except for Mike's CRIU branch, there aren't any programs that
use it today, and we could drop it for a first implementation. I don't
see it as something that would only make things less safe though. It
just lets apps that can't easily work within the stricter shadow stack
environment, at least get access to a weaker but still beneficial one.

Kees, did you catch that it can be locked off while enabling shadow
stack?

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace
  2022-10-06  0:38         ` Edgecombe, Rick P
@ 2022-10-06  3:11           ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-06  3:11 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Lutomirski, Andy, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	bp, oleg, hjl.tools, Yang, Weijiang, linux-doc, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, jamorris, john.allen,
	rppt, mingo, Shankar, Ravi V, corbet, linux-kernel, linux-api,
	gorcunov

On Thu, Oct 06, 2022 at 12:38:06AM +0000, Edgecombe, Rick P wrote:
> On Mon, 2022-10-03 at 21:37 -0700, Kees Cook wrote:
> > On Mon, Oct 03, 2022 at 04:00:36PM -0700, Andy Lutomirski wrote:
> > > On 10/3/22 15:28, Kees Cook wrote:
> > > > On Thu, Sep 29, 2022 at 03:29:26PM -0700, Rick Edgecombe wrote:
> > > > > For the current shadow stack implementation, shadow stacks
> > > > > contents easily
> > > > > be arbitrarily provisioned with data.
> > > > 
> > > > I can't parse this sentence.
> > > > 
> > > > > This property helps apps protect
> > > > > themselves better, but also restricts any potential apps that
> > > > > may want to
> > > > > do exotic things at the expense of a little security.
> > > > 
> > > > > Is anything using this right now? Wouldn't things be safer without
> > > > WRSS?
> > > > (Why can't we skip this patch?)
> > > > 
> > > 
> > > So that people don't write programs that need either (shstk off) or
> > > (shstk
> > > on and WRSS on) and crash or otherwise fail on kernels that support
> > > shstk
> > > but don't support WRSS, perhaps?
> > 
> > Right, yes. I meant more "what programs currently need WRSS to
> > operate
> > under shstk? (And what is it that they are doing that needs it?)"
> > 
> > All I see currently is compiler self-tests and emulators using it?
> > 
> https://codesearch.debian.net/search?q=%5Cb%28wrss%7CWRSS%29%5Cb&literal=0&perpkg=1
> 
> Most apps that weren't just automatically compiled haven't had
> implementation effort yet. (of course glibc has had a bunch) I hope we
> would see more of that when we finally get it upstream. So I think a
> better question is, how many apps will need WRSS when they go to enable
> shadow stack. I'm thinking the answer must be some and it could be nice
> to catch them when they first investigate enabling it.
> 
> But yes, except for Mike's CRIU branch, there aren't any programs that
> use it today, and we could drop it for a first implementation. I don't
> see it as something that would only make things less safe though. It
> just lets apps that can't easily work within the stricter shadow stack
> environment, at least get access to a weaker but still beneficial one.
> 
> Kees, did you catch that it can be locked off while enabling shadow
> stack?

Yup, saw that! Looks good. Thanks. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-09-30 15:16   ` Jann Horn
@ 2022-10-06 16:10     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-06 16:10 UTC (permalink / raw)
  To: jannh
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, dethoma, linux-arch,
	kcc, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy,
	pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	Raj, Ashok, jamorris, john.allen, rppt, mingo, Shankar, Ravi V,
	corbet, linux-kernel, linux-api, gorcunov

On Fri, 2022-09-30 at 17:16 +0200, Jann Horn wrote:
> On Fri, Sep 30, 2022 at 12:30 AM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> > The reason it's lightly used is that Dirty=1 is normally set
> > _before_ a
> > write. A write with a Write=0 PTE would typically only generate a
> > fault,
> > not set Dirty=1. Hardware can (rarely) both set Write=1 *and*
> > generate the
> > fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports
> > shadow
> > stacks will no longer exhibit this oddity.
> 
> Stupid question, since I just recently learned that IOMMUv2 is a
> thing: I assume this also holds for IOMMUs that implement
> IOMMUv2/SVA,
> where the IOMMU directly walks the userspace page tables, and not
> just
> for the CPU core?

Sorry for the delay, I had to go find out. IOMMU behaves similarly to the
CET CPUs in this regard. Thanks for the question.




^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 23/39] x86: Introduce userspace API for CET enabling
  2022-10-03 22:51     ` Edgecombe, Rick P
@ 2022-10-06 18:50       ` Mike Rapoport
  0 siblings, 0 replies; 241+ messages in thread
From: Mike Rapoport @ 2022-10-06 18:50 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: keescook, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, Oct 03, 2022 at 10:51:02PM +0000, Edgecombe, Rick P wrote:
> CC Mike about ptrace/CRIU question.
> 
> On Mon, 2022-10-03 at 12:01 -0700, Kees Cook wrote:
> > On Thu, Sep 29, 2022 at 03:29:20PM -0700, Rick Edgecombe wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > > Add three new arch_prctl() handles:
> > > 
> > >  - ARCH_CET_ENABLE/DISABLE enables or disables the specified
> > >    feature. Returns 0 on success or an error.
> > > 
> > >  - ARCH_CET_LOCK prevents future disabling or enabling of the
> > >    specified feature. Returns 0 on success or an error
> > > 
> > > The features are handled per-thread and inherited over
> > > fork(2)/clone(2),
> > > but reset on exec().

...

> > > +#include <linux/sched.h>
> > > +#include <linux/bitops.h>
> > > +#include <asm/prctl.h>
> > > +
> > > +long cet_prctl(struct task_struct *task, int option, unsigned long
> > > features)
> > > +{
> > > +	if (option == ARCH_CET_LOCK) {
> > > +		task->thread.features_locked |= features;
> > > +		return 0;
> > > +	}
> > > +
> > > +	/* Don't allow via ptrace */
> > > +	if (task != current)
> > > +		return -EINVAL;
> > 
> > ... but locking _is_ allowed via ptrace? If that's intended, it should
> > be
> > explicitly mentioned in the commit log and in a comment here.
> 
> I believe CRIU needs to lock via ptrace as well. Maybe Mike can
> confirm.

Actually, I didn't use ptrace for locking, I did it with "plain"
arch_prctl().

I still can't say for sure CRIU won't need this, I didn't have time yet to
have a closer look at this set.
 
-- 
Sincerely yours,
Mike.
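
For reference, the control flow under discussion can be modelled in user
space as below. This is a sketch, not the patch: the option constants and
struct are hypothetical stand-ins, and refusing changes to locked bits
with -EPERM is an assumption, since that part of the function is elided in
the quote above.

```c
#include <errno.h>

/* Hypothetical stand-ins for the ARCH_CET_* options. */
enum { OPT_ENABLE, OPT_DISABLE, OPT_LOCK };

struct thread_model {
	unsigned long features;		/* currently enabled features */
	unsigned long features_locked;	/* bits that may no longer change */
};

long cet_prctl_model(struct thread_model *t, int is_current,
		     int option, unsigned long features)
{
	/* Locking is handled first, so it also works on a traced task. */
	if (option == OPT_LOCK) {
		t->features_locked |= features;
		return 0;
	}

	/* Enabling or disabling is only allowed on the calling thread. */
	if (!is_current)
		return -EINVAL;

	/* Assumption: a locked feature can be neither enabled nor disabled. */
	if (features & t->features_locked)
		return -EPERM;

	if (option == OPT_ENABLE)
		t->features |= features;
	else
		t->features &= ~features;
	return 0;
}
```

The ordering of the first two checks is what makes locking reachable
through ptrace while enable and disable are not.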

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 23/39] x86: Introduce userspace API for CET enabling
  2022-09-29 22:29 ` [PATCH v2 23/39] x86: Introduce userspace API for CET enabling Rick Edgecombe
  2022-10-03 19:01   ` Kees Cook
@ 2022-10-10 10:56   ` Florian Weimer
  2022-10-10 16:28     ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Florian Weimer @ 2022-10-10 10:56 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

* Rick Edgecombe:

> +	/* Only support enabling/disabling one feature at a time. */
> +	if (hweight_long(features) > 1)
> +		return -EINVAL;

This means we'll soon need three extra system calls for x86-64 process
start: SHSTK, IBT, and switching off vsyscall emulation.  (The latter
does not need any special CPU support.)

Maybe we can do something else instead to make the strace output a
little bit cleaner?

Thanks,
Florian
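
A small user-space model of the quoted check shows why each feature needs
its own call. The popcount helper stands in for the kernel's
hweight_long(), and the function names are hypothetical.

```c
#include <errno.h>

/* Portable population count, standing in for hweight_long(). */
unsigned int popcount_ul(unsigned long v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)	/* clear the lowest set bit each round */
		n++;
	return n;
}

/* Mirrors the quoted check: more than one bit set is rejected. */
long check_single_feature(unsigned long features)
{
	if (popcount_ul(features) > 1)
		return -EINVAL;
	return 0;
}
```

A mask with two bits set, such as 0x3, is rejected with -EINVAL, so a
process that wants two features must issue two separate calls.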


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall
  2022-09-29 22:29 ` [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
  2022-10-03 22:23   ` Kees Cook
@ 2022-10-10 11:13   ` Florian Weimer
  2022-10-10 14:19     ` Jason A. Donenfeld
  1 sibling, 1 reply; 241+ messages in thread
From: Florian Weimer @ 2022-10-10 11:13 UTC (permalink / raw)
  To: Rick Edgecombe, Jason A. Donenfeld
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma

* Rick Edgecombe:

> When operating with shadow stacks enabled, the kernel will automatically
> allocate shadow stacks for new threads, however in some cases userspace
> will need additional shadow stacks. The main example of this is the
> ucontext family of functions, which require userspace allocating and
> pivoting to userspace managed stacks.
>
> Unlike most other user memory permissions, shadow stacks need to be
> provisioned with special data in order to be useful. They need to be setup
> with a restore token so that userspace can pivot to them via the RSTORSSP
> instruction. But, the security design of shadow stacks is that they
> should not be written to except in limited circumstances. This presents a
> problem for userspace, as to how userspace can provision this special
> data, without allowing for the shadow stack to be generally writable.
>
> Previously, a new PROT_SHADOW_STACK was attempted, which could be
> mprotect()ed from RW permissions after the data was provisioned. This was
> found to not be secure enough, as other threads could write to the
> shadow stack during the writable window.
>
> The kernel can use a special instruction, WRUSS, to write directly to
> userspace shadow stacks. So the solution can be that memory can be mapped
> as shadow stack permissions from the beginning (never generally writable
> in userspace), and the kernel itself can write the restore token.
>
> First, a new madvise() flag was explored, which could operate on the
> PROT_SHADOW_STACK memory. This had a couple downsides:
> 1. Extra checks were needed in mprotect() to prevent writable memory from
>    ever becoming PROT_SHADOW_STACK.
> 2. Extra checks/vma state were needed in the new madvise() to prevent
>    restore tokens being written into the middle of pre-used shadow stacks.
>    It is ideal to prevent restore tokens being added at arbitrary
>    locations, so the check was to make sure the shadow stack had never been
>    written to.
> 3. It stood out from the rest of the madvise flags, as more of direct
>    action than a hint at future desired behavior.
>
> So rather than repurpose two existing syscalls (mmap, madvise) that don't
> quite fit, just implement a new map_shadow_stack syscall to allow
> userspace to map and setup new shadow stacks in one step. While ucontext
> is the primary motivator, userspace may have other unforeseen reasons to
> set up its own shadow stacks using the WRSS instruction. Towards this
> provide a flag so that stacks can be optionally setup securely for the
> common case of ucontext without enabling WRSS. Or potentially have the
> kernel set up the shadow stack in some new way.
>
> The following example demonstrates how to create a new shadow stack with
> map_shadow_stack:
> void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);

Jason has recently been working on vDSO-based getrandom acceleration.
It needs a way for a userspace thread to allocate userspace memory in a
specific way.  Jason proposed to use a vDSO call as the interface, not a
system call.

Maybe this approach is applicable here as well?  Or we can come up with
a more general interface for such per-thread allocations?

Thanks,
Florian
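
As background on the "special data" mentioned in the quoted changelog,
here is a sketch of where a restore token would sit and what it would
contain. It rests on two assumptions that are not stated in this message:
the token occupies the last 8 bytes of the mapping, and its value is the
address just above the token with bit 0 set to mark 64-bit mode. The
function names are hypothetical.

```c
#include <stdint.h>

/* Assumed: the token is stored in the top 8 bytes of the shadow stack. */
uint64_t rstor_token_addr(uint64_t base, uint64_t size)
{
	return base + size - 8;
}

/* Assumed: the token holds the address just above itself, plus bit 0. */
uint64_t rstor_token_value(uint64_t base, uint64_t size)
{
	return (base + size) | UINT64_C(1);
}
```

Under these assumptions, userspace would pivot by pointing RSTORSSP at the
address returned by rstor_token_addr().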


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-09-29 22:28 ` [PATCH v2 01/39] Documentation/x86: Add CET description Rick Edgecombe
                     ` (2 preceding siblings ...)
  2022-10-05  0:02   ` Andrew Cooper
@ 2022-10-10 12:19   ` Florian Weimer
  2022-10-10 16:44     ` Edgecombe, Rick P
  3 siblings, 1 reply; 241+ messages in thread
From: Florian Weimer @ 2022-10-10 12:19 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

* Rick Edgecombe:

> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or LLVM v10.0.1
> +or later are required. To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.

Uhm, I think we are using binutils 2.30 with extra fixes.  I hope that
these binaries are still valid.

More importantly, glibc needs to be configured with --enable-cet
explicitly (unless the compiler defaults to CET).  The default glibc
build with a default GCC will produce dynamically-linked executables
that disable CET (when running on later/differently configured glibc
builds).  The statically linked object files are not marked up for CET
in that case.

I think the goal is to support the new kernel interface for actually
switching on SHSTK in glibc 2.37.  But at that point, hopefully all
those existing binaries can start enjoying the SHSTK benefits.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 18/39] mm: Add guard pages around a shadow stack.
  2022-10-05  2:30     ` Andrew Cooper
@ 2022-10-10 12:33       ` Florian Weimer
  2022-10-10 13:32         ` Andrew Cooper
  0 siblings, 1 reply; 241+ messages in thread
From: Florian Weimer @ 2022-10-10 12:33 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kees Cook, Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H . J . Lu, Jann Horn, Jonathan Corbet,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, Yu-cheng Yu

* Andrew Cooper:

> You don't actually need a hole to create a guard.  Any mapping of type
> != shstk will do.
>
> If you've got a load of threads, you can tightly pack stack / shstk /
> stack / shstk with no holes, and they each act as each other's guard pages.

Can userspace read the shadow stack directly?  Writing is obviously
blocked, but reading?

GCC's stack-clash probing uses OR instructions, so it would be fine with
a readable mapping.  POSIX does not appear to require PROT_NONE mappings
for the stack guard region, either.  However, the
pthread_attr_setguardsize manual page pretty clearly says that it's got
to be unreadable and unwriteable.  Hence my question.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 18/39] mm: Add guard pages around a shadow stack.
  2022-10-10 12:33       ` Florian Weimer
@ 2022-10-10 13:32         ` Andrew Cooper
  2022-10-10 13:40           ` Florian Weimer
  0 siblings, 1 reply; 241+ messages in thread
From: Andrew Cooper @ 2022-10-10 13:32 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Kees Cook, Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H . J . Lu, Jann Horn, Jonathan Corbet,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, Yu-cheng Yu, Andrew Cooper

On 10/10/2022 13:33, Florian Weimer wrote:
> * Andrew Cooper:
>
>> You don't actually need a hole to create a guard.  Any mapping of type
>> != shstk will do.
>>
>> If you've got a load of threads, you can tightly pack stack / shstk /
>> stack / shstk with no holes, and they each act as each other's guard pages.
> Can userspace read the shadow stack directly?  Writing is obviously
> blocked, but reading?

Yes - regular reads are permitted to shstk memory.

It's actually a great way to get backtraces with no extra metadata needed.
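
To illustrate why no extra metadata is needed: a shadow stack holds the
return addresses back to back, so a backtrace is a linear scan upward from
the current shadow stack pointer. The sketch below is only a simulation
over an ordinary array. The function name shstk_backtrace() and the
ssp/base parameters are hypothetical; a real unwinder would first have to
obtain the live shadow stack pointer (for example with RDSSP) and would
also need to cope with restore tokens and signal frames, none of which is
shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy up to 'max' return addresses from a
 * (simulated) shadow stack into 'out', innermost frame first.
 *
 * 'ssp' points at the most recently pushed entry and 'base' points one
 * past the oldest entry. The stack grows down, so older entries sit at
 * higher addresses.
 */
static size_t shstk_backtrace(const uint64_t *ssp, const uint64_t *base,
			      uint64_t *out, size_t max)
{
	size_t n = 0;

	assert(ssp != NULL && base != NULL && out != NULL);

	/* Only plain loads are used, which shadow stack memory permits. */
	while (ssp < base && n < max)
		out[n++] = *ssp++;

	return n;
}
```

The walk never writes to the scanned memory, which matches the point above
that regular reads of shstk memory are allowed while regular writes are not.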

> GCC's stack-clash probing uses OR instructions, so it would be fine with
> a readable mapping.

It's `or $0, (%rsp)` which is a read/modify/write and will fault when
hitting a shstk mapping.
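
The difference between a plain read and a read/modify/write probe can be
observed with an ordinary read-only page, which is used here as a stand-in
because it needs no shadow stack support. This is a minimal sketch, not
GCC's actual probing sequence: the helper names map_readonly_page() and
access_faults() are hypothetical, and the volatile `|= 0` is an assumed
portable approximation of `or $0, (%rsp)`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;

/* SIGSEGV handler: jump back to the probe site and report the fault. */
static void on_segv(int sig)
{
	(void)sig;
	siglongjmp(probe_env, 1);
}

/* Map one anonymous page readable but not writable; NULL on failure. */
static unsigned char *map_readonly_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *p;

	if (page_size <= 0)
		return NULL;
	p = mmap(NULL, (size_t)page_size, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/*
 * Returns 1 if the access faulted, 0 if it completed, -1 on setup error.
 * do_rmw == 0: plain load. do_rmw != 0: load, OR with 0, store back.
 */
static int access_faults(volatile unsigned char *p, int do_rmw)
{
	struct sigaction sa, old;
	int faulted;

	assert(p != NULL);

	sa.sa_handler = on_segv;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGSEGV, &sa, &old) != 0)
		return -1;

	if (sigsetjmp(probe_env, 1) == 0) {
		if (do_rmw) {
			*p |= 0;		/* value unchanged, but still a store */
		} else {
			unsigned char v = *p;	/* load only */
			(void)v;
		}
		faulted = 0;
	} else {
		faulted = 1;
	}

	sigaction(SIGSEGV, &old, NULL);
	return faulted;
}
```

On such a page the plain load completes while the OR-style probe faults,
even though it would not change the stored value. That is the property that
lets a non-writable mapping next to a stack catch a probe.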

> POSIX does not appear to require PROT_NONE mappings
> for the stack guard region, either.  However, the
> pthread_attr_setguardsize manual page pretty clearly says that it's got
> to be unreadable and unwriteable.  Hence my question.

Hmm.  If that's what the manuals say, then fine.

But honestly, you don't get very far at all without faulting on a
read-only stack.

~Andrew

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 18/39] mm: Add guard pages around a shadow stack.
  2022-10-10 13:32         ` Andrew Cooper
@ 2022-10-10 13:40           ` Florian Weimer
  2022-10-10 13:56             ` Andrew Cooper
  0 siblings, 1 reply; 241+ messages in thread
From: Florian Weimer @ 2022-10-10 13:40 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kees Cook, Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H . J . Lu, Jann Horn, Jonathan Corbet,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, Yu-cheng Yu

* Andrew Cooper:

> On 10/10/2022 13:33, Florian Weimer wrote:
>> * Andrew Cooper:
>>
>>> You don't actually need a hole to create a guard.  Any mapping of type
>>> != shstk will do.
>>>
>>> If you've got a load of threads, you can tightly pack stack / shstk /
>>> stack / shstk with no holes, and they each act as each other's guard pages.
>> Can userspace read the shadow stack directly?  Writing is obviously
>> blocked, but reading?
>
> Yes - regular reads are permitted to shstk memory.
>
> It's actually a great way to get backtraces with no extra metadata
> needed.

Indeed, I hope shadow stacks can be used to put the discussion around
frame pointers to a rest, at least when it comes to profiling. 8-)

>> POSIX does not appear to require PROT_NONE mappings
>> for the stack guard region, either.  However, the
>> pthread_attr_setguardsize manual page pretty clearly says that it's got
>> to be unreadable and unwriteable.  Hence my question.
>
> Hmm.  If that's what the manuals say, then fine.
>
> But honestly, you don't get very far at all without faulting on a
> read-only stack.

I guess we can update the manual page proactively.  It does look like a
tempting optimization.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 18/39] mm: Add guard pages around a shadow stack.
  2022-10-10 13:40           ` Florian Weimer
@ 2022-10-10 13:56             ` Andrew Cooper
  0 siblings, 0 replies; 241+ messages in thread
From: Andrew Cooper @ 2022-10-10 13:56 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Kees Cook, Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H . J . Lu, Jann Horn, Jonathan Corbet,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, Yu-cheng Yu, Andrew Cooper

On 10/10/2022 14:40, Florian Weimer wrote:
> * Andrew Cooper:
>
>>> POSIX does not appear to require PROT_NONE mappings
>>> for the stack guard region, either.  However, the
>>> pthread_attr_setguardsize manual page pretty clearly says that it's got
>>> to be unreadable and unwriteable.  Hence my question.
>> Hmm.  If that's what the manuals say, then fine.
>>
>> But honestly, you don't get very far at all without faulting on a
>> read-only stack.
> I guess we can update the manual page proactively.  It does look like a
> tempting optimization.

Here's one I prepared earlier, discussing getting supervisor shadow
stacks working in Xen.

http://xenbits.xen.org/people/andrewcoop/Xen-CET-SS.pdf

This optimisation turned out to be very helpful: we were able to put the
shadow stacks in what were previously the guard holes, meaning we didn't
actually need to allocate any more memory for the stacks.

~Andrew

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall
  2022-10-10 11:13   ` Florian Weimer
@ 2022-10-10 14:19     ` Jason A. Donenfeld
  0 siblings, 0 replies; 241+ messages in thread
From: Jason A. Donenfeld @ 2022-10-10 14:19 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Rick Edgecombe, x86, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H . J . Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V . Shankar, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian,
	rppt, jamorris, dethoma

On Mon, Oct 10, 2022 at 01:13:05PM +0200, Florian Weimer wrote:
> * Rick Edgecombe:
> 
> > When operating with shadow stacks enabled, the kernel will automatically
> > allocate shadow stacks for new threads, however in some cases userspace
> > will need additional shadow stacks. The main example of this is the
> > ucontext family of functions, which require userspace allocating and
> > pivoting to userspace managed stacks.
> >
> > Unlike most other user memory permissions, shadow stacks need to be
> > provisioned with special data in order to be useful. They need to be set up
> > with a restore token so that userspace can pivot to them via the RSTORSSP
> > instruction. But, the security design of shadow stacks is that they
> > should not be written to except in limited circumstances. This presents a
> > problem for userspace, as to how userspace can provision this special
> > data, without allowing for the shadow stack to be generally writable.
> >
> > Previously, a new PROT_SHADOW_STACK was attempted, which could be
> > mprotect()ed from RW permissions after the data was provisioned. This was
> > found to not be secure enough, as other threads could write to the
> > shadow stack during the writable window.
> >
> > The kernel can use a special instruction, WRUSS, to write directly to
> > userspace shadow stacks. So the solution can be that memory can be mapped
> > as shadow stack permissions from the beginning (never generally writable
> > in userspace), and the kernel itself can write the restore token.
> >
> > First, a new madvise() flag was explored, which could operate on the
> > PROT_SHADOW_STACK memory. This had a couple downsides:
> > 1. Extra checks were needed in mprotect() to prevent writable memory from
> >    ever becoming PROT_SHADOW_STACK.
> > 2. Extra checks/vma state were needed in the new madvise() to prevent
> >    restore tokens being written into the middle of pre-used shadow stacks.
> >    It is ideal to prevent restore tokens being added at arbitrary
> >    locations, so the check was to make sure the shadow stack had never been
> >    written to.
> > 3. It stood out from the rest of the madvise flags, as more of direct
> >    action than a hint at future desired behavior.
> >
> > So rather than repurpose two existing syscalls (mmap, madvise) that don't
> > quite fit, just implement a new map_shadow_stack syscall to allow
> > userspace to map and set up new shadow stacks in one step. While ucontext
> > is the primary motivator, userspace may have other unforeseen reasons to
> > set up its own shadow stacks using the WRSS instruction. Towards this,
> > provide a flag so that stacks can be optionally set up securely for the
> > common case of ucontext without enabling WRSS. Or potentially have the
> > kernel set up the shadow stack in some new way.
> >
> > The following example demonstrates how to create a new shadow stack with
> > map_shadow_stack:
> > void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
> 
> Jason has recently been working on vDSO-based getrandom acceleration.
> It needs a way for a userspace thread to allocate userspace memory in a
> specific way.  Jason proposed to use a vDSO call as the interface, not a
> system call.

Not quite so in the latest revision of that patch:
https://lore.kernel.org/lkml/20220916125916.652546-1-Jason@zx2c4.com/

Jason

> 
> Maybe this approach is applicable here as well?  Or we can come up with
> a more general interface for such per-thread allocations?
> 
> Thanks,
> Florian
> 

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 23/39] x86: Introduce userspace API for CET enabling
  2022-10-10 10:56   ` Florian Weimer
@ 2022-10-10 16:28     ` Edgecombe, Rick P
  2022-10-12 12:18       ` Florian Weimer
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 16:28 UTC (permalink / raw)
  To: fweimer
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2022-10-10 at 12:56 +0200, Florian Weimer wrote:
> > +     /* Only support enabling/disabling one feature at a time. */
> > +     if (hweight_long(features) > 1)
> > +             return -EINVAL;
> 
> This means we'll soon need three extra system calls for x86-64
> process
> start: SHSTK, IBT, and switching off vsyscall emulation.  (The latter
> does not need any special CPU support.)
> 
> Maybe we can do something else instead to make the strace output a
> little bit cleaner?

In previous versions it supported enabling multiple features in a
single syscall. Thomas Gleixner pointed out that (this was on the LAM
patchset that shared the interface at the time) it makes the behavior
of what to do when one feature fails to enable complicated:

https://lore.kernel.org/lkml/87zgjjqico.ffs@tglx/
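
As a rough userspace illustration of the one-feature-per-call rule, the
sketch below mirrors the quoted hweight_long() check with a popcount and
then enables requested features one bit at a time, so the caller always
knows which individual feature failed. Everything named here is
hypothetical: the FEAT_* bit values, check_single_feature(), enable_each(),
and the fake_enable() stub that stands in for the real enabling call. None
of it is taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_SHSTK	(1UL << 0)
#define FEAT_WRSS	(1UL << 1)

/* Reject any request that names more than one feature at a time. */
static int check_single_feature(unsigned long features)
{
	if (__builtin_popcountl(features) > 1)
		return -EINVAL;
	return 0;
}

/*
 * Enable each requested feature with a separate call and return the
 * mask of features that were actually enabled.
 */
static unsigned long enable_each(unsigned long wanted,
				 int (*enable_one)(unsigned long))
{
	unsigned long enabled = 0;

	assert(enable_one != 0);

	while (wanted) {
		unsigned long bit = wanted & -wanted;	/* lowest set bit */

		if (check_single_feature(bit) == 0 && enable_one(bit) == 0)
			enabled |= bit;
		wanted &= ~bit;
	}
	return enabled;
}

/* Stub standing in for the real call: pretends only FEAT_SHSTK works. */
static int fake_enable(unsigned long feature)
{
	return feature == FEAT_SHSTK ? 0 : -EOPNOTSUPP;
}
```

With one call per feature, the set of successfully enabled features falls
out of the loop on the userspace side, at the cost of one extra call per
feature at process start.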

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-10 12:19   ` Florian Weimer
@ 2022-10-10 16:44     ` Edgecombe, Rick P
  2022-10-10 16:51       ` H.J. Lu
  2022-10-12 12:29       ` Florian Weimer
  0 siblings, 2 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 16:44 UTC (permalink / raw)
  To: fweimer
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2022-10-10 at 14:19 +0200, Florian Weimer wrote:
> Uhm, I think we are using binutils 2.30 with extra fixes.  I hope
> that
> these binaries are still valid.

Yea, you're right. Andrew Cooper pointed out it has been supported
since 2.29, so 2.30 should be fine.

> 
> More importantly, glibc needs to be configured with --enable-cet
> explicitly (unless the compiler defaults to CET).  The default glibc
> build with a default GCC will produce dynamically-linked executables
> that disable CET (when running on later/differently configured glibc
> builds).  The statically linked object files are not marked up for
> CET
> in that case.

Thanks, that's a good point. I'll add a blurb noting that glibc needs to
be compiled with CET support.

> 
> I think the goal is to support the new kernel interface for actually
> switching on SHSTK in glibc 2.37.  But at that point, hopefully all
> those existing binaries can start enjoying the SHSTK benefits.

Can you share more about this plan? HJ was previously planning to wait
until the kernel support was upstream before making any more glibc
changes. Hopefully this will be in time for that, but I'd really rather
not repeat what happened last time where we had to design the kernel
interface around not breaking old glibc's with mismatched CET
enablement.

What did you think of the proposal to disable existing binaries and
start from scratch? Elaborated in the coverletter in the section
"Compatibility of Existing Binaries/Enabling Interface".

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-10 16:44     ` Edgecombe, Rick P
@ 2022-10-10 16:51       ` H.J. Lu
  2022-10-12 12:29       ` Florian Weimer
  1 sibling, 0 replies; 241+ messages in thread
From: H.J. Lu @ 2022-10-10 16:51 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: fweimer, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov,
	Eranian, Stephane, linux-mm, nadav.amit, jannh, dethoma,
	linux-arch, kcc, bp, oleg, Yang, Weijiang, Lutomirski, Andy,
	pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, Oct 10, 2022 at 9:44 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-10-10 at 14:19 +0200, Florian Weimer wrote:
> > Uhm, I think we are using binutils 2.30 with extra fixes.  I hope
> > that
> > these binaries are still valid.
>
> Yea, you're right. Andrew Cooper pointed out it has been supported
> since 2.29, so 2.30 should be fine.
>
> >
> > More importantly, glibc needs to be configured with --enable-cet
> > explicitly (unless the compiler defaults to CET).  The default glibc
> > build with a default GCC will produce dynamically-linked executables
> > that disable CET (when running on later/differently configured glibc
> > builds).  The statically linked object files are not marked up for
> > CET
> > in that case.
>
> Thanks, that's a good point. I'll add a blurb noting that glibc needs to
> be compiled with CET support.
>
> >
> > I think the goal is to support the new kernel interface for actually
> > switching on SHSTK in glibc 2.37.  But at that point, hopefully all
> > those existing binaries can start enjoying the SHSTK benefits.
>
> Can you share more about this plan? HJ was previously planning to wait
> until the kernel support was upstream before making any more glibc
> changes. Hopefully this will be in time for that, but I'd really rather
> not repeat what happened last time where we had to design the kernel
> interface around not breaking old glibc's with mismatched CET
> enablement.
>
> What did you think of the proposal to disable existing binaries and
> start from scratch? Elaborated in the coverletter in the section
> "Compatibility of Existing Binaries/Enabling Interface".

My current glibc plan is that kernel won't enable CET automatically
and glibc will issue syscall to enable CET at early startup time.   All
existing CET enabled dynamic executables will have CET enabled
under the CET kernel and the updated CET glibc.

-- 
H.J.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 23/39] x86: Introduce userspace API for CET enabling
  2022-10-10 16:28     ` Edgecombe, Rick P
@ 2022-10-12 12:18       ` Florian Weimer
  2022-10-12 17:30         ` Edgecombe, Rick P
  0 siblings, 1 reply; 241+ messages in thread
From: Florian Weimer @ 2022-10-12 12:18 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, nadav.amit, jannh, dethoma, linux-arch, kcc, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, linux-doc, jamorris,
	john.allen, rppt, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

* Rick P. Edgecombe:

> On Mon, 2022-10-10 at 12:56 +0200, Florian Weimer wrote:
>> > +     /* Only support enabling/disabling one feature at a time. */
>> > +     if (hweight_long(features) > 1)
>> > +             return -EINVAL;
>> 
>> This means we'll soon need three extra system calls for x86-64
>> process
>> start: SHSTK, IBT, and switching off vsyscall emulation.  (The latter
>> does not need any special CPU support.)
>> 
>> Maybe we can do something else instead to make the strace output a
>> little bit cleaner?
>
> In previous versions it supported enabling multiple features in a
> single syscall. Thomas Gleixner pointed out that (this was on the LAM
> patchset that shared the interface at the time) it makes the behavior
> of what to do when one feature fails to enable complicated:
>
> https://lore.kernel.org/lkml/87zgjjqico.ffs@tglx/

Can we return the bits for the features that were actually enabled?
Those three don't have cross-dependencies in the sense that you would
only use X & Y together, but not X or Y alone.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-10 16:44     ` Edgecombe, Rick P
  2022-10-10 16:51       ` H.J. Lu
@ 2022-10-12 12:29       ` Florian Weimer
  2022-10-12 15:59         ` Dave Hansen
  2022-10-13 21:28         ` Edgecombe, Rick P
  1 sibling, 2 replies; 241+ messages in thread
From: Florian Weimer @ 2022-10-12 12:29 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

* Rick P. Edgecombe:

>> I think the goal is to support the new kernel interface for actually
>> switching on SHSTK in glibc 2.37.  But at that point, hopefully all
>> those existing binaries can start enjoying the SHSTK benefits.
>
> Can you share more about this plan? HJ was previously planning to wait
> until the kernel support was upstream before making any more glibc
> changes. Hopefully this will be in time for that, but I'd really rather
> not repeat what happened last time where we had to design the kernel
> interface around not breaking old glibc's with mismatched CET
> enablement.

You're still doing that (keeping that gap in this constant), and this is
appreciated and very much necessary.

> What did you think of the proposal to disable existing binaries and
> start from scratch? Elaborated in the coverletter in the section
> "Compatibility of Existing Binaries/Enabling Interface".

The ABI was finalized around four years ago, and we have shipped several
Fedora and Red Hat Enterprise Linux versions with it.  Other
distributions did as well.  It's a bit late to make changes now, and
certainly not for such trivialities.  In the case of the IBT ABI, it may
be tempting to start over in a less trivial way, to radically reduce the
amount of ENDBR instructions.  But that doesn't concern SHSTK, and
there's no actual implementation anyway.

But as H.J. implied, you would have to do rather nasty things in the
kernel to prevent us from achieving ABI compatibility in userspace, like
parsing property notes on the main executable and disabling the new
arch_prctl calls if you see something there that you don't like. 8-)
Of course no one is going to implement that.

(We are fine with swapping out glibc and its dynamic loader to enable
CET with the appropriate kernel mechanism, but we wouldn't want to
change the way all other binaries are marked up.)

Thanks,
Florian


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-12 12:29       ` Florian Weimer
@ 2022-10-12 15:59         ` Dave Hansen
  2022-10-12 16:54           ` Florian Weimer
  2022-10-13 21:28         ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2022-10-12 15:59 UTC (permalink / raw)
  To: Florian Weimer, Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On 10/12/22 05:29, Florian Weimer wrote:
>> What did you think of the proposal to disable existing binaries and
>> start from scratch? Elaborated in the coverletter in the section
>> "Compatibility of Existing Binaries/Enabling Interface".
> The ABI was finalized around four years ago, and we have shipped several
> Fedora and Red Hat Enterprise Linux versions with it.  Other
> distributions did as well.  It's a bit late to make changes now, and
> certainly not for such trivialities. 

Just to be clear: You're saying that a user/kernel ABI was "finalized"
by glibc shipping the user side of it, before there being an upstream
kernel implementation?

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-12 15:59         ` Dave Hansen
@ 2022-10-12 16:54           ` Florian Weimer
  0 siblings, 0 replies; 241+ messages in thread
From: Florian Weimer @ 2022-10-12 16:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Edgecombe, Rick P, bsingharora, hpa, Syromiatnikov, Eugene,
	peterz, rdunlap, keescook, Yu, Yu-cheng, dave.hansen,
	kirill.shutemov, Eranian, Stephane, linux-mm, nadav.amit, jannh,
	dethoma, linux-arch, kcc, bp, oleg, hjl.tools, Yang, Weijiang,
	Lutomirski, Andy, pavel, arnd, Moreira, Joao, tglx, mike.kravetz,
	x86, linux-doc, jamorris, john.allen, rppt, mingo, Shankar,
	Ravi V, corbet, linux-kernel, linux-api, gorcunov

* Dave Hansen:

> On 10/12/22 05:29, Florian Weimer wrote:
>>> What did you think of the proposal to disable existing binaries and
>>> start from scratch? Elaborated in the coverletter in the section
>>> "Compatibility of Existing Binaries/Enabling Interface".
>> The ABI was finalized around four years ago, and we have shipped several
>> Fedora and Red Hat Enterprise Linux versions with it.  Other
>> distributions did as well.  It's a bit late to make changes now, and
>> certainly not for such trivialities. 
>
> Just to be clear: You're saying that a user/kernel ABI was "finalized"
> by glibc shipping the user side of it, before there being an upstream
> kernel implementation?

Sorry for being unclear.  I was referring to the x86-64 ELF psABI
supplement for CET, not the kernel/userspace interface, which still does
not exist in its final form as of today, as far as I understand it.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 23/39] x86: Introduce userspace API for CET enabling
  2022-10-12 12:18       ` Florian Weimer
@ 2022-10-12 17:30         ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-12 17:30 UTC (permalink / raw)
  To: fweimer
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, nadav.amit, jannh, dethoma, kcc, linux-arch, bp, oleg,
	hjl.tools, Yang, Weijiang, Lutomirski, Andy, jamorris, arnd,
	Moreira, Joao, tglx, pavel, mike.kravetz, x86, linux-doc, rppt,
	john.allen, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Wed, 2022-10-12 at 14:18 +0200, Florian Weimer wrote:
> * Rick P. Edgecombe:
> 
> > On Mon, 2022-10-10 at 12:56 +0200, Florian Weimer wrote:
> > > > +     /* Only support enabling/disabling one feature at a time.
> > > > */
> > > > +     if (hweight_long(features) > 1)
> > > > +             return -EINVAL;
> > > 
> > > This means we'll soon need three extra system calls for x86-64
> > > process
> > > start: SHSTK, IBT, and switching off vsyscall emulation.  (The
> > > latter
> > > does not need any special CPU support.)
> > > 
> > > Maybe we can do something else instead to make the strace output
> > > a
> > > little bit cleaner?
> > 
> > In previous versions it supported enabling multiple features in a
> > single syscall. Thomas Gleixner pointed out that (this was on the
> > LAM
> > patchset that shared the interface at the time) it makes the
> > behavior
> > of what to do when one feature fails to enable complicated:
> > 
> > https://lore.kernel.org/lkml/87zgjjqico.ffs@tglx/
> 
> Can we return the bits for the features that were actually enabled?

Actually that specific option is covered in that thread as well. I was
thinking we would need to pass a struct in and out to do a batch
operation. Thomas suggested it could be added later and to start with a
simpler option. Is an extra syscall or two at startup really a big
problem?

> Those three don't have cross-dependencies in the sense that you would
> only use X & Y together, but not X or Y alone.

I don't fully follow this, but WRSS does actually depend on SHSTK.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-09-29 22:28 ` [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack Rick Edgecombe
                     ` (2 preceding siblings ...)
  2022-10-03 19:42   ` Dave Hansen
@ 2022-10-12 20:04   ` Borislav Petkov
  2022-10-13  0:31     ` Edgecombe, Rick P
  3 siblings, 1 reply; 241+ messages in thread
From: Borislav Petkov @ 2022-10-12 20:04 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:28:59PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>

> Subject: Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack

Please remove all "CET", "cet", etc strings from the text as that is
confusing. We should use either shadow stack or IBT and not CET.

> +config ARCH_HAS_SHADOW_STACK

Do I see it correctly that this thing is needed only once in
show_smap_vma_flags()?

If so, can we do an arch_show_smap_vma_flags(), call it at the end of
former function and avoid adding yet another Kconfig symbol?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-10-12 20:04   ` Borislav Petkov
@ 2022-10-13  0:31     ` Edgecombe, Rick P
  2022-10-13  9:21       ` Borislav Petkov
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-13  0:31 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, linux-doc, Lutomirski,
	Andy, arnd, jamorris, Moreira, Joao, tglx, mike.kravetz, x86,
	Yang, Weijiang, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Wed, 2022-10-12 at 22:04 +0200, Borislav Petkov wrote:
> On Thu, Sep 29, 2022 at 03:28:59PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > Subject: Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for
> > Shadow Stack
> 
> Please remove all "CET", "cet", etc strings from the text as that is
> confusing. We should use either shadow stack or IBT and not CET.

Good point, I'll remove it. Thanks.

> 
> > +config ARCH_HAS_SHADOW_STACK
> 
> Do I see it correctly that this thing is needed only once in
> show_smap_vma_flags()?
> 
> If so, can we do an arch_show_smap_vma_flags(), call it at the end of
> the former function and avoid adding yet another Kconfig symbol?

Yea, I was thinking to maybe just change it to
CONFIG_X86_USER_SHADOW_STACK in show_smap_vma_flags(). In that function
there is already CONFIG_ARM64_BTI and CONFIG_ARM64_MTE.

I'm not sure if there is any aversion to having arch CONFIGs in core
code, but it's kind of nice to have all of the potentially conflicting
strings in one place.
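
To make the idea above concrete, here is a minimal userspace sketch of a
mnemonic table in which an arch config symbol guards one entry, in the
same spirit as the existing CONFIG_ARM64_BTI and CONFIG_ARM64_MTE guards
mentioned for show_smap_vma_flags(). It is not the kernel code: the bit
positions, the table size, the "ss" mnemonic and the format_vma_flags()
helper are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bit positions, chosen only for this sketch. */
#define SKETCH_VM_READ_BIT          0
#define SKETCH_VM_WRITE_BIT         1
#define SKETCH_VM_EXEC_BIT          2
#define SKETCH_VM_SHADOW_STACK_BIT  5
#define SKETCH_NR_FLAG_BITS         8

/* Defined here to model a build with the arch option enabled. */
#define CONFIG_X86_USER_SHADOW_STACK 1

/*
 * One table holds every two-letter mnemonic, so potentially conflicting
 * strings are visible in one place. The "ss" entry is an assumed name.
 */
static const char sketch_mnemonics[SKETCH_NR_FLAG_BITS][3] = {
	[SKETCH_VM_READ_BIT]  = "rd",
	[SKETCH_VM_WRITE_BIT] = "wr",
	[SKETCH_VM_EXEC_BIT]  = "ex",
#ifdef CONFIG_X86_USER_SHADOW_STACK
	[SKETCH_VM_SHADOW_STACK_BIT] = "ss",
#endif
};

/* Append "xx " for every set bit that has a mnemonic; return bytes written. */
static size_t format_vma_flags(unsigned long flags, char *buf, size_t len)
{
	size_t used = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (int i = 0; i < SKETCH_NR_FLAG_BITS; i++) {
		int n;

		if (!(flags & (1UL << i)) || sketch_mnemonics[i][0] == '\0')
			continue;

		n = snprintf(buf + used, len - used, "%s ", sketch_mnemonics[i]);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* out of space: stop rather than truncate */
		used += (size_t)n;
	}
	return used;
}
```

With the guard removed, the shadow stack bit simply has no mnemonic and
is skipped, so no separate ARCH_HAS_* symbol is needed to keep the table
building on other configurations.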

* Re: [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack
  2022-10-13  0:31     ` Edgecombe, Rick P
@ 2022-10-13  9:21       ` Borislav Petkov
  0 siblings, 0 replies; 241+ messages in thread
From: Borislav Petkov @ 2022-10-13  9:21 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, linux-doc, Lutomirski,
	Andy, arnd, jamorris, Moreira, Joao, tglx, mike.kravetz, x86,
	Yang, Weijiang, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, Oct 13, 2022 at 12:31:38AM +0000, Edgecombe, Rick P wrote:
> Yea, I was thinking to maybe just change it to
> CONFIG_X86_USER_SHADOW_STACK in show_smap_vma_flags(). In that function
> there is already CONFIG_ARM64_BTI and CONFIG_ARM64_MTE.

I was thinking exactly the same thing. :-)

> I'm not sure if there is any aversion to having arch CONFIGs in core
> code, but it's kind of nice to have all of the potentially conflicting
> strings in one place.

Yeah, ok.

I guess you can do the CONFIG_X86_USER_SHADOW_STACK thing for the sake
of simplicity. We have *waaay* too many Kconfig symbols and we should
introduce only the least amount of new ones.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-12 12:29       ` Florian Weimer
  2022-10-12 15:59         ` Dave Hansen
@ 2022-10-13 21:28         ` Edgecombe, Rick P
  2022-10-13 22:15           ` H.J. Lu
  2022-10-26 21:59           ` Edgecombe, Rick P
  1 sibling, 2 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-13 21:28 UTC (permalink / raw)
  To: fweimer
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, Moreira, Joao, tglx, pavel, mike.kravetz,
	x86, linux-doc, rppt, john.allen, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Wed, 2022-10-12 at 14:29 +0200, Florian Weimer wrote:
> The ABI was finalized around four years ago, and we have shipped
> several
> Fedora and Red Hat Enterprise Linux versions with it.  Other
> distributions did as well.  It's a bit late to make changes now, and
> certainly not for such trivialities.  In the case of the IBT ABI, it
> may
> be tempting to start over in a less trivial way, to radically reduce
> the
> amount of ENDBR instructions.  But that doesn't concern SHSTK, and
> there's no actual implementation anyway.
> 
> But as H.J. implied, you would have to do rather nasty things in the
> kernel to prevent us from achieving ABI compatibility in userspace,
> like
> parsing property notes on the main executable and disabling the new
> arch_prctl calls if you see something there that you don't like. 8-)
> Of course no one is going to implement that.
> 
> (We are fine with swapping out glibc and its dynamic loader to enable
> CET with the appropriate kernel mechanism, but we wouldn't want to
> change the way all other binaries are marked up.)

So we have these compatibility issues with existing binaries. We know
some apps are totally broken. It sounds like you are proposing to
ignore this and let people who hit the issues work through it
themselves. This was also proposed by other glibc developers as a
solution for past CET compatibility issues that broke boot on kernel
upgrade. I have to say, as the person pushing these patches, I’m
uncomfortable with this approach. I don’t think users will like the
results. Basically, do they want to upgrade and run a bunch of untested
integration with known failures? I also don’t want to get this feature
reverted and I’m not exactly sure how this scenario would be taken.

But I hear the point about it not being ideal to abandon the existing
CET userspace. I think there is also a point about how userspace chose
to do this optimistic and early wide enabling, even if it was a bad
idea, and so how much should the kernel try to save userspace from
itself. So what do you think about this instead:

The current psABI spec talks about the binary being compatible with
shadow stack. It doesn’t say much about what should happen after the
loader. Since the greater ecosystem has used this bit with a more
cavalier attitude, glibc could treat it as a request for a warn and
continue mode. In the meantime we could have a new bit shstk_strict,
that requests behavior like these patches implement, and kills the
process on violation. Glibc/tools could add support for this strict bit
and anyone that wants to more carefully compile with it could finally
get shadow stack today. Then the implementation of the warn and
continue mode could follow that, and glibc could map the original shstk
bit to that kernel mode. So the old binaries would get there
eventually, which is better than the continuing nothing they have
today.

And speaking of having nothing today, there are people that really want
to use shadow stack and do not care at all about having CET support for
existing binaries. Neither glibc nor elf bits are required to use kernel
shadow stack support. So if it comes to it, I don’t want to hold
support back for other users because the elf note bit enabling path
grew some issues.

Please let me know about what you think of that plan.

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-13 21:28         ` Edgecombe, Rick P
@ 2022-10-13 22:15           ` H.J. Lu
  2022-10-26 21:59           ` Edgecombe, Rick P
  1 sibling, 0 replies; 241+ messages in thread
From: H.J. Lu @ 2022-10-13 22:15 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: fweimer, bsingharora, hpa, Syromiatnikov, Eugene, peterz,
	rdunlap, keescook, Yu, Yu-cheng, Eranian, Stephane,
	kirill.shutemov, dave.hansen, linux-mm, nadav.amit, jannh,
	dethoma, kcc, linux-arch, bp, oleg, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, Moreira, Joao, tglx, pavel, mike.kravetz,
	x86, linux-doc, rppt, john.allen, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, Oct 13, 2022 at 2:28 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2022-10-12 at 14:29 +0200, Florian Weimer wrote:
> > The ABI was finalized around four years ago, and we have shipped
> > several
> > Fedora and Red Hat Enterprise Linux versions with it.  Other
> > distributions did as well.  It's a bit late to make changes now, and
> > certainly not for such trivialities.  In the case of the IBT ABI, it
> > may
> > be tempting to start over in a less trivial way, to radically reduce
> > the
> > amount of ENDBR instructions.  But that doesn't concern SHSTK, and
> > there's no actual implementation anyway.
> >
> > But as H.J. implied, you would have to do rather nasty things in the
> > kernel to prevent us from achieving ABI compatibility in userspace,
> > like
> > parsing property notes on the main executable and disabling the new
> > arch_prctl calls if you see something there that you don't like. 8-)
> > Of course no one is going to implement that.
> >
> > (We are fine with swapping out glibc and its dynamic loader to enable
> > CET with the appropriate kernel mechanism, but we wouldn't want to
> > change the way all other binaries are marked up.)
>
> So we have these compatibility issues with existing binaries. We know
> some apps are totally broken. It sounds like you are proposing to
> ignore this and let people who hit the issues work through it
> themselves. This was also proposed by other glibc developers as a
> solution for past CET compatibility issues that broke boot on kernel
> upgrade. I have to say, as the person pushing these patches, I’m
> uncomfortable with this approach. I don’t think users will like the
> results. Basically, do they want to upgrade and run a bunch of untested
> integration with known failures? I also don’t want to get this feature
> reverted and I’m not exactly sure how this scenario would be taken.
>
> But I hear the point about it not being ideal to abandon the existing
> CET userspace. I think there is also a point about how userspace chose
> to do this optimistic and early wide enabling, even if it was a bad
> idea, and so how much should the kernel try to save userspace from
> itself. So what do you think about this instead:
>
> The current psABI spec talks about the binary being compatible with
> shadow stack. It doesn’t say much about what should happen after the
> loader. Since the greater ecosystem has used this bit with a more
> cavalier attitude, glibc could treat it as a request for a warn and
> continue mode. In the meantime we could have a new bit shstk_strict,
> that requests behavior like these patches implement, and kills the
> process on violation. Glibc/tools could add support for this strict bit
> and anyone that wants to more carefully compile with it could finally
> get shadow stack today. Then the implementation of the warn and
> continue mode could follow that, and glibc could map the original shstk
> bit to that kernel mode. So the old binaries would get there
> eventually, which is better than the continuing nothing they have
> today.
>
> And speaking of having nothing today, there are people that really want
> to use shadow stack and do not care at all about having CET support for
> existing binaries. Neither glibc nor elf bits are required to use kernel
> shadow stack support. So if it comes to it, I don’t want to hold
> support back for other users because the elf note bit enabling path
> grew some issues.
>
> Please let me know about what you think of that plan.

The kernel CET description

+The kernel does not process these applications directly. Applications must
+enable them using the interface described in section 4. Typically this
+would be done in dynamic loader or static runtime objects, as is the case
+in glibc.

may leave an impression that each application needs to use the kernel
interface to enable CET itself.  This is an option.  But the updated glibc
will enable CET automatically on behalf of the CET enabled application.
If the glibc isn't updated to use the new CET kernel interface, the existing
CET enabled binaries will run correctly under the new CET enabled
kernel without CET enabled.

-- 
H.J.

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-09-29 22:29 ` [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
                     ` (3 preceding siblings ...)
  2022-10-05 11:33   ` Peter Zijlstra
@ 2022-10-14  9:41   ` Peter Zijlstra
  2022-10-14 15:52     ` Edgecombe, Rick P
  2022-10-14  9:42   ` Peter Zijlstra
  5 siblings, 1 reply; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-14  9:41 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> There is essentially no room left in the x86 hardware PTEs on some OSes
> (not Linux). That left the hardware architects looking for a way to
> represent a new memory type (shadow stack) within the existing bits.
> They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> 
> The reason it's lightly used is that Dirty=1 is normally set _before_ a
> write. A write with a Write=0 PTE would typically only generate a fault,
> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the

s/Write/Dirty/

> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow

s/Dirty=0,Write=1/Write=0,Dirty=1/

> stacks will no longer exhibit this oddity.
> 
> The kernel should avoid inadvertently creating shadow stack memory because
> it is security sensitive. So given the above, all it needs to do is avoid
> manually creating Write=0,Dirty=1 PTEs in software.

Whichever way around you choose, please be consistent.

> In places where Linux normally creates Write=0,Dirty=1, it can use the
> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
> words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
> Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1. This
> clearly separates shadow stack from other data, and results in the
> following:
> 
> (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
>     Previously when a typical anonymous writable mapping was made COW via
>     fork(), the kernel would mark it Write=0,Dirty=1. Now it will instead
>     use the Cow bit.
> (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
>     is in a R/O VMA, and get_user_pages() needs a writable copy. The page
>     fault handler creates a copy of the page and sets the new copy's PTE
>     as Write=0 and Cow=1.
> (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
> (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack
>     page is being shared among processes (this happens at fork()), its PTE
>     is made Dirty=0, so the next shadow stack access causes a fault, and
>     the page is duplicated and Dirty=1 is set again. This is the COW
>     equivalent for shadow stack pages, even though it's copy-on-access
>     rather than copy-on-write.
> (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without
>     shadow stack support set Dirty=1.

Please restructure this (and the comment) something like:


  (Write=0,Dirty=0,Cow=1):

	- copy_present_pte(): A modified copy-on-write page.
	- ...


  (Write=0,Dirty=1,Cow=0):

	- FEATURE_CET:  Shadow Stack entry
	- !FEATURE_CET: see the above Cow=1 cases
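
The two-row layout above can be read as a small decision function. The
following is a userspace sketch only: the bit positions, the pte_kind
enum and classify_pte() are hypothetical names for illustration and do
not correspond to the kernel's _PAGE_* definitions or helpers.

```c
#include <stdbool.h>

/* Illustrative bit positions; not the kernel's _PAGE_* values. */
#define SKETCH_PTE_WRITE (1u << 0)
#define SKETCH_PTE_DIRTY (1u << 1)
#define SKETCH_PTE_COW   (1u << 2)

enum pte_kind {
	PTE_KIND_OTHER,		/* any state not listed in the table */
	PTE_KIND_COW,		/* modified page, copied on the next write */
	PTE_KIND_SHADOW_STACK,	/* shadow stack entry */
};

/*
 * Classify a PTE by its (Write, Dirty, Cow) bits.
 *
 *   (Write=0,Dirty=0,Cow=1): copy-on-write page.
 *   (Write=0,Dirty=1,Cow=0): shadow stack if the CPU supports it,
 *                            otherwise the same as the Cow=1 cases.
 */
static enum pte_kind classify_pte(unsigned int pte, bool cpu_has_shstk)
{
	switch (pte & (SKETCH_PTE_WRITE | SKETCH_PTE_DIRTY | SKETCH_PTE_COW)) {
	case SKETCH_PTE_COW:
		return PTE_KIND_COW;
	case SKETCH_PTE_DIRTY:
		return cpu_has_shstk ? PTE_KIND_SHADOW_STACK : PTE_KIND_COW;
	default:
		return PTE_KIND_OTHER;
	}
}
```

Keying on the exact bit combination, rather than testing bits one at a
time, keeps the sketch aligned with the table: states the table does not
list fall through to PTE_KIND_OTHER.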



* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-09-29 22:29 ` [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
                     ` (4 preceding siblings ...)
  2022-10-14  9:41   ` Peter Zijlstra
@ 2022-10-14  9:42   ` Peter Zijlstra
  2022-10-14 18:06     ` Edgecombe, Rick P
  5 siblings, 1 reply; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-14  9:42 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> @@ -300,6 +324,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>  	return native_make_pte(v & ~clear);
>  }
>  
> +/*
> + * Normally the Dirty bit is used to denote COW memory on x86. But

This is misleading; this isn't an x86 specific thing. The core-mm code
does this.

> + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> + * since the Dirty=1,Write=0 will result in the memory being treated
> + * as shadow stack by the HW. So when creating COW memory, a software
> + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() and
> + * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
> + * transition it to the shadow stack compatible version of COW (Cow=1).
> + */

* Re: [PATCH v2 15/39] x86/mm: Check Shadow Stack page fault errors
  2022-09-29 22:29 ` [PATCH v2 15/39] x86/mm: Check Shadow Stack page fault errors Rick Edgecombe
  2022-10-03 18:20   ` Kees Cook
@ 2022-10-14 10:07   ` Peter Zijlstra
  2022-10-14 15:51     ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-14 10:07 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:12PM -0700, Rick Edgecombe wrote:

> The architecture has concepts of both shadow stack reads and shadow stack
> writes. Any shadow stack access to non-shadow stack memory will generate
> a fault with the shadow stack error code bit set.
> 
> This means that, unlike normal write protection, the fault handler needs
> to create a type of memory that can be written to (with instructions that
> generate shadow stack writes), even to fulfill a read access. So in the
> case of COW memory, the COW needs to take place even with a shadow stack
> read. Otherwise the page will be left (shadow stack) writable in
> userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
> for shadow stack accesses, even if the access was a shadow stack read.

That ^ should be moved into the comment below

>  - Clarify reasoning for FAULT_FLAG_WRITE for all shadow stack accesses

> @@ -1300,6 +1314,13 @@ void do_user_addr_fault(struct pt_regs *regs,
>  
>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  
> +	/*
> +	 * In order to fulfill a shadow stack access, the page needs
> +	 * to be made (shadow stack) writable. So treat all shadow stack
> +	 * accesses as writes.
> +	 */

Because that's impenetrable.

> +	if (error_code & X86_PF_SHSTK)
> +		flags |= FAULT_FLAG_WRITE;
>  	if (error_code & X86_PF_WRITE)
>  		flags |= FAULT_FLAG_WRITE;
>  	if (error_code & X86_PF_INSTR)
> -- 
> 2.17.1
> 
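
For readers following the hunk above, the mapping from page fault error
code to fault flags can be modelled in isolation. This is a hedged
userspace sketch: the SKETCH_* constants and fault_flags_from_error_code()
are hypothetical, and their numeric values are picked for illustration
rather than copied from the kernel headers.

```c
/* Illustrative error-code bits; values are assumptions for this sketch. */
#define SKETCH_PF_WRITE (1u << 1)
#define SKETCH_PF_INSTR (1u << 4)
#define SKETCH_PF_SHSTK (1u << 6)

/* Illustrative fault flags; values are assumptions for this sketch. */
#define SKETCH_FAULT_FLAG_WRITE       (1u << 0)
#define SKETCH_FAULT_FLAG_INSTRUCTION (1u << 1)

/*
 * Translate a fault error code into fault flags.
 *
 * A shadow stack access is treated as a write even when it was a read:
 * resolving it leaves the page (shadow stack) writable, so any pending
 * COW has to be broken first, exactly as for an ordinary write fault.
 */
static unsigned int fault_flags_from_error_code(unsigned int error_code)
{
	unsigned int flags = 0;

	if (error_code & (SKETCH_PF_WRITE | SKETCH_PF_SHSTK))
		flags |= SKETCH_FAULT_FLAG_WRITE;
	if (error_code & SKETCH_PF_INSTR)
		flags |= SKETCH_FAULT_FLAG_INSTRUCTION;

	return flags;
}
```

The comment in the sketch carries the reasoning from the commit message,
which is the placement the review asks for.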

* Re: [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack
  2022-09-29 22:29 ` [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack Rick Edgecombe
  2022-10-03 18:22   ` Kees Cook
  2022-10-03 23:53   ` Kirill A . Shutemov
@ 2022-10-14 15:32   ` Peter Zijlstra
  2022-10-14 15:45     ` Edgecombe, Rick P
  2 siblings, 1 reply; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-14 15:32 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:13PM -0700, Rick Edgecombe wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8cd413c5a329..fef14ab3abcb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -981,13 +981,25 @@ void free_compound_page(struct page *page);
>   * servicing faults for write access.  In the normal case, do always want
>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
> + *
> + * If a vma is shadow stack (a type of writable memory), mark the pte shadow
> + * stack.
>   */
> +#ifndef maybe_mkwrite
>  static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>  {
> -	if (likely(vma->vm_flags & VM_WRITE))
> +	if (!(vma->vm_flags & VM_WRITE))
> +		goto out;
> +
> +	if (vma->vm_flags & VM_SHADOW_STACK)
> +		pte = pte_mkwrite_shstk(pte);
> +	else
>  		pte = pte_mkwrite(pte);
> +
> +out:
>  	return pte;
>  }
> +#endif

Why the #ifndef guard? There is no other implementation, nor does this
patch introduce one.

Also, wouldn't it be simpler to write it like:

static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_WRITE))
		return pte;

	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk(pte);

	return pte_mkwrite(pte);
}

? (idem for the pmd version etc..)

* Re: [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack
  2022-10-14 15:32   ` Peter Zijlstra
@ 2022-10-14 15:45     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 15:45 UTC (permalink / raw)
  To: peterz
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, rdunlap, keescook, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2022-10-14 at 17:32 +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:29:13PM -0700, Rick Edgecombe wrote:
> 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8cd413c5a329..fef14ab3abcb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -981,13 +981,25 @@ void free_compound_page(struct page *page);
> >    * servicing faults for write access.  In the normal case, do
> > always want
> >    * pte_mkwrite.  But get_user_pages can cause write faults for
> > mappings
> >    * that do not have writing enabled, when used by
> > access_process_vm.
> > + *
> > + * If a vma is shadow stack (a type of writable memory), mark the
> > pte shadow
> > + * stack.
> >    */
> > +#ifndef maybe_mkwrite
> >   static inline pte_t maybe_mkwrite(pte_t pte, struct
> > vm_area_struct *vma)
> >   {
> > -     if (likely(vma->vm_flags & VM_WRITE))
> > +     if (!(vma->vm_flags & VM_WRITE))
> > +             goto out;
> > +
> > +     if (vma->vm_flags & VM_SHADOW_STACK)
> > +             pte = pte_mkwrite_shstk(pte);
> > +     else
> >                pte = pte_mkwrite(pte);
> > +
> > +out:
> >        return pte;
> >   }
> > +#endif
> 
> Why the #ifndef guard? There is no other implementation, nor does
> this
> patch introduce one.

Oh yea, this series used to add another one, but I forgot to remove the
guards. Thanks.

> 
> Also, wouldn't it be simpler to write it like:
> 
> static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct
> *vma)
> {
>         if (!(vma->vm_flags & VM_WRITE))
>                 return pte;
> 
>         if (vma->vm_flags & VM_SHADOW_STACK)
>                 return pte_mkwrite_shstk(pte);
> 
>         return pte_mkwrite(pte);
> }

Yep, that looks better. Thanks.

* Re: [PATCH v2 15/39] x86/mm: Check Shadow Stack page fault errors
  2022-10-14 10:07   ` Peter Zijlstra
@ 2022-10-14 15:51     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 15:51 UTC (permalink / raw)
  To: peterz
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, rdunlap, keescook, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2022-10-14 at 12:07 +0200, Peter Zijlstra wrote:
> That ^ should be moved into the comment below

Ok, good idea. Thanks.

* Re: [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly
  2022-09-29 22:29 ` [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
                     ` (2 preceding siblings ...)
  2022-10-04  1:56   ` Nadav Amit
@ 2022-10-14 15:52   ` Peter Zijlstra
  2022-10-14 15:56     ` Edgecombe, Rick P
  3 siblings, 1 reply; 241+ messages in thread
From: Peter Zijlstra @ 2022-10-14 15:52 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Randy Dunlap, Ravi V . Shankar,
	Weijiang Yang, Kirill A . Shutemov, joao.moreira, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 7327b2573f7c..b49372c7de41 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  	int ret;
>  	pte_t _dst_pte, *dst_pte;
>  	bool writable = dst_vma->vm_flags & VM_WRITE;
> +	bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
>  	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
>  	bool page_in_cache = page->mapping;
>  	spinlock_t *ptl;
> @@ -83,9 +84,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  		writable = false;
>  	}
>  
> -	if (writable)
> -		_dst_pte = pte_mkwrite(_dst_pte);
> -	else
> +	if (writable) {
> +		if (shstk)
> +			_dst_pte = pte_mkwrite_shstk(_dst_pte);
> +		else
> +			_dst_pte = pte_mkwrite(_dst_pte);
> +	} else
>  		/*
>  		 * We need this to make sure write bit removed; as mk_pte()
>  		 * could return a pte with write bit set.

Urgh.. that's unfortunate. But yeah, I don't see a way to make that
pretty either.

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-10-14  9:41   ` Peter Zijlstra
@ 2022-10-14 15:52     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 15:52 UTC (permalink / raw)
  To: peterz
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, rdunlap, keescook, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2022-10-14 at 11:41 +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > There is essentially no room left in the x86 hardware PTEs on some
> > OSes
> > (not Linux). That left the hardware architects looking for a way to
> > represent a new memory type (shadow stack) within the existing
> > bits.
> > They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> > 
> > The reason it's lightly used is that Dirty=1 is normally set
> > _before_ a
> > write. A write with a Write=0 PTE would typically only generate a
> > fault,
> > not set Dirty=1. Hardware can (rarely) both set Write=1 *and*
> > generate the
> 
> s/Write/Dirty/

Oops, yes.

> 
> > fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports
> > shadow
> 
> s/Dirty=0,Write=1/Write=0,Dirty=1/

Ok, I'll scrub the series for the order.

> 
> > stacks will no longer exhibit this oddity.
> > 
> > The kernel should avoid inadvertently creating shadow stack memory
> > because
> > it is security sensitive. So given the above, all it needs to do is
> > avoid
> > manually creating Write=0,Dirty=1 PTEs in software.
> 
> Whichever way around you choose, please be consistent.
> 
> > In places where Linux normally creates Write=0,Dirty=1, it can use
> > the
> > software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In
> > other
> > words, whenever Linux needs to create Write=0,Dirty=1, it instead
> > creates
> > Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1.
> > This
> > clearly separates shadow stack from other data, and results in the
> > following:
> > 
> > (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
> >      Previously when a typical anonymous writable mapping was made
> > COW via
> >      fork(), the kernel would mark it Write=0,Dirty=1. Now it will
> > instead
> >      use the Cow bit.
> > (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The
> > user page
> >      is in a R/O VMA, and get_user_pages() needs a writable copy.
> > The page
> >      fault handler creates a copy of the page and sets the new
> > copy's PTE
> >      as Write=0 and Cow=1.
> > (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
> > (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a
> > shadow stack
> >      page is being shared among processes (this happens at fork()),
> > its PTE
> >      is made Dirty=0, so the next shadow stack access causes a
> > fault, and
> >      the page is duplicated and Dirty=1 is set again. This is the
> > COW
> >      equivalent for shadow stack pages, even though it's copy-on-
> > access
> >      rather than copy-on-write.
> > (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor
> > without
> >      shadow stack support set Dirty=1.
> 
> Please restructure this (and the comment) something like:
> 
> 
>   (Write=0,Dirty=0,Cow=1):
> 
>         - copy_present_pte(): A modified copy-on-write page.
>         - ...
> 
> 
>   (Write=0,Dirty=1,Cow=0):
> 
>         - FEATURE_CET:  Shadow Stack entry
>         - !FEATURE_CET: see the above Cow=1 cases

Yes, I incorporated the feedback from your earlier comment. Sorry for
the bad communication.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly
  2022-10-14 15:52   ` Peter Zijlstra
@ 2022-10-14 15:56     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 15:56 UTC (permalink / raw)
  To: peterz
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, rdunlap, keescook, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2022-10-14 at 17:52 +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 7327b2573f7c..b49372c7de41 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct
> > *dst_mm, pmd_t *dst_pmd,
> >        int ret;
> >        pte_t _dst_pte, *dst_pte;
> >        bool writable = dst_vma->vm_flags & VM_WRITE;
> > +     bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
> >        bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> >        bool page_in_cache = page->mapping;
> >        spinlock_t *ptl;
> > @@ -83,9 +84,12 @@ int mfill_atomic_install_pte(struct mm_struct
> > *dst_mm, pmd_t *dst_pmd,
> >                writable = false;
> >        }
> >   
> > -     if (writable)
> > -             _dst_pte = pte_mkwrite(_dst_pte);
> > -     else
> > +     if (writable) {
> > +             if (shstk)
> > +                     _dst_pte = pte_mkwrite_shstk(_dst_pte);
> > +             else
> > +                     _dst_pte = pte_mkwrite(_dst_pte);
> > +     } else
> >                /*
> >                 * We need this to make sure write bit removed; as
> > mk_pte()
> >                 * could return a pte with write bit set.
> 
> Urgh.. that's unfortunate. But yeah, I don't see a way to make that
> pretty either.

Nadav pointed out that:
entry = maybe_mkwrite(pte_mkdirty(entry), vma);

and:
if (vma->vm_flags & VM_WRITE)
	entry = pte_mkwrite(pte_mkdirty(entry));

are not actually the same: in the former, the PTE gets marked dirty
even when the VMA is not writable. So I was actually going to add two
more cases like the ugly one.
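
To make the difference concrete, here is a small userspace sketch of the
two forms. It is only a toy model: the toy_* names, the bit positions and
the flag values are made up for illustration and are not the kernel's
definitions.

```c
#include <stdint.h>

/* Toy model: bit positions and names are illustrative, not the real x86 ones. */
#define TOY_PTE_WRITE	(1u << 0)
#define TOY_PTE_DIRTY	(1u << 1)
#define TOY_VM_WRITE	(1u << 0)

typedef uint32_t toy_pte_t;

static toy_pte_t toy_mkdirty(toy_pte_t pte) { return pte | TOY_PTE_DIRTY; }
static toy_pte_t toy_mkwrite(toy_pte_t pte) { return pte | TOY_PTE_WRITE; }

/* Same shape as maybe_mkwrite(): only the write bit depends on the VMA. */
static toy_pte_t toy_maybe_mkwrite(toy_pte_t pte, uint32_t vm_flags)
{
	if (vm_flags & TOY_VM_WRITE)
		pte = toy_mkwrite(pte);
	return pte;
}

/* Form 1: entry = maybe_mkwrite(pte_mkdirty(entry), vma); */
static toy_pte_t form_unconditional_dirty(toy_pte_t pte, uint32_t vm_flags)
{
	return toy_maybe_mkwrite(toy_mkdirty(pte), vm_flags);
}

/* Form 2: if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry)); */
static toy_pte_t form_conditional_dirty(toy_pte_t pte, uint32_t vm_flags)
{
	if (vm_flags & TOY_VM_WRITE)
		pte = toy_mkwrite(toy_mkdirty(pte));
	return pte;
}
```

For a writable VMA both forms give Write=1,Dirty=1. For a non-writable
VMA, form 1 leaves Write=0,Dirty=1 while form 2 leaves the PTE untouched,
and Write=0,Dirty=1 is exactly the combination that matters for shadow
stack.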

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks
  2022-09-29 22:29 ` [PATCH v2 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
  2022-10-03 17:26   ` Kees Cook
@ 2022-10-14 16:20   ` Borislav Petkov
  2022-10-14 19:35     ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Borislav Petkov @ 2022-10-14 16:20 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:00PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> The Control-Flow Enforcement Technology contains two related features,
> one of which is Shadow Stacks. Future patches will utilize this feature
> for shadow stack support in KVM, so add a CPU feature flags for Shadow
> Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
> 
> To protect shadow stack state from malicious modification, the registers
> are only accessible in supervisor mode. This implementation
> context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
> on XSAVES.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kees Cook <keescook@chromium.org>

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2022-09-29 22:29 ` [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
  2022-10-03 17:31   ` Kees Cook
  2022-10-05  0:55   ` Andrew Cooper
@ 2022-10-14 17:12   ` Borislav Petkov
  2022-10-14 18:15     ` Edgecombe, Rick P
  2 siblings, 1 reply; 241+ messages in thread
From: Borislav Petkov @ 2022-10-14 17:12 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:01PM -0700, Rick Edgecombe wrote:
>  static __always_inline void setup_cet(struct cpuinfo_x86 *c)
>  {
> -	u64 msr = CET_ENDBR_EN;
> +	bool kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);

So I'd love it if we can get rid of that HAS_KERNEL_IBT thing and use
the usual ifdeffery with Kconfig symbols. I wouldn't like for yet
another HAS_XXX feature checking method to proliferate as this is the
only one:

$ git grep -E "\WHAS_" arch/x86/
arch/x86/include/asm/ibt.h:18: * When all the above are satisfied, HAS_KERNEL_IBT will be 1, otherwise 0.
arch/x86/include/asm/ibt.h:22:#define HAS_KERNEL_IBT    1
arch/x86/include/asm/ibt.h:92:#define HAS_KERNEL_IBT    0
arch/x86/include/asm/ibt.h:114:#define ENDBR_INSN_SIZE          (4*HAS_KERNEL_IBT)
arch/x86/include/asm/idtentry.h:8:#define IDT_ALIGN     (8 * (1 + HAS_KERNEL_IBT))
arch/x86/kernel/cpu/common.c:601:       bool kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
arch/x86/kernel/cpu/common.c:1942:      if (HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT))

>  __noendbr void cet_disable(void)
>  {
> -	if (cpu_feature_enabled(X86_FEATURE_IBT))
> -		wrmsrl(MSR_IA32_S_CET, 0);
> +	if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
> +	      cpu_feature_enabled(X86_FEATURE_SHSTK)))
> +		return;
> +
> +	wrmsrl(MSR_IA32_S_CET, 0);
> +	wrmsrl(MSR_IA32_U_CET, 0);
>  }
>  
> +

Stray newline.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW
  2022-10-14  9:42   ` Peter Zijlstra
@ 2022-10-14 18:06     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 18:06 UTC (permalink / raw)
  To: peterz
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, rdunlap, keescook, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2022-10-14 at 11:42 +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> > @@ -300,6 +324,44 @@ static inline pte_t pte_clear_flags(pte_t pte,
> > pteval_t clear)
> >        return native_make_pte(v & ~clear);
> >   }
> >   
> > +/*
> > + * Normally the Dirty bit is used to denote COW memory on x86. But
> 
> This is misleading; this isn't an x86 specific thing. The core-mm
> code
> does this.

Well pte_mkdirty() does map to other HW bits on different
architectures. But yea, it's confusing.

Hmm, is this comment a bit stale either way now though? In the past it
was probably more accurate to say core MM code used it to "detect"
cowed memory. But the GUP pte_dirty() check was changed recently:


https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5535be3099717646781ce1540cf725965d680e7b

I don't think any code is looking specifically for COWed memory using
the PTE dirty bit anymore, it just happens to coincide with it. Double
checking my understanding...

Maybe this would be more accurate?

/*
 * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in the
 * case of X86_FEATURE_SHSTK, the software COW bit is used, since
 * Dirty=1,Write=0 will result in the memory being treated as shadow
 * stack by the HW. So when creating COW memory, a software bit,
 * _PAGE_BIT_COW, is used. The following functions pte_mkcow() and
 * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
 * transition it to the shadow stack compatible version of COW (Cow=1).
 */
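
As a rough illustration of the transition that comment describes, here is
a userspace toy model. The toy_* helpers, the bit positions and the
"shstk" parameter (standing in for the X86_FEATURE_SHSTK check) are
assumptions made for the sketch. In particular, having the clear helper
move Cow back to Dirty is my reading of the intent, not a copy of the
patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: bit positions are illustrative, not the real x86 layout. */
#define TOY_PTE_WRITE	(1u << 0)
#define TOY_PTE_DIRTY	(1u << 1)
#define TOY_PTE_COW	(1u << 2)	/* software bit */

typedef uint32_t toy_pte_t;

/* Move Dirty=1 to Cow=1 so the HW does not see Write=0,Dirty=1. */
static toy_pte_t toy_pte_mkcow(toy_pte_t pte, bool shstk)
{
	if (!shstk)
		return pte;	/* without shadow stack, Dirty stays as is */
	if (pte & TOY_PTE_DIRTY) {
		pte &= ~TOY_PTE_DIRTY;
		pte |= TOY_PTE_COW;
	}
	return pte;
}

/* Assumed inverse: move Cow=1 back to Dirty=1. */
static toy_pte_t toy_pte_clear_cow(toy_pte_t pte, bool shstk)
{
	if (!shstk)
		return pte;
	if (pte & TOY_PTE_COW) {
		pte &= ~TOY_PTE_COW;
		pte |= TOY_PTE_DIRTY;
	}
	return pte;
}

/* Write=0,Dirty=1 is what the HW would treat as shadow stack. */
static bool toy_pte_looks_like_shstk(toy_pte_t pte)
{
	return (pte & (TOY_PTE_WRITE | TOY_PTE_DIRTY)) == TOY_PTE_DIRTY;
}
```

The point of the sketch is only that after toy_pte_mkcow() a read-only
COW PTE no longer matches the Write=0,Dirty=1 pattern.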

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2022-10-14 17:12   ` Borislav Petkov
@ 2022-10-14 18:15     ` Edgecombe, Rick P
  2022-10-14 19:44       ` Borislav Petkov
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 18:15 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, linux-doc, Lutomirski,
	Andy, arnd, jamorris, Moreira, Joao, tglx, mike.kravetz, x86,
	Yang, Weijiang, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2022-10-14 at 19:12 +0200, Borislav Petkov wrote:
> On Thu, Sep 29, 2022 at 03:29:01PM -0700, Rick Edgecombe wrote:
> >   static __always_inline void setup_cet(struct cpuinfo_x86 *c)
> >   {
> > -     u64 msr = CET_ENDBR_EN;
> > +     bool kernel_ibt = HAS_KERNEL_IBT &&
> > cpu_feature_enabled(X86_FEATURE_IBT);
> 
> So I'd love it if we can get rid of that HAS_KERNEL_IBT thing and use
> the usual ifdeffery with Kconfig symbols. I wouldn't like for yet
> another HAS_XXX feature checking method to proliferate as this is the
> only one:

Andrew Cooper suggested creating some software CPU features to
differentiate user/supervisor CET feature use. It could replace
HAS_KERNEL_IBT. Any objections to that versus Kconfig symbols?

[snip]

> cpu_feature_enabled(X86_FEATURE_IBT))
> 
> >   __noendbr void cet_disable(void)
> >   {
> > -     if (cpu_feature_enabled(X86_FEATURE_IBT))
> > -             wrmsrl(MSR_IA32_S_CET, 0);
> > +     if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
> > +           cpu_feature_enabled(X86_FEATURE_SHSTK)))
> > +             return;
> > +
> > +     wrmsrl(MSR_IA32_S_CET, 0);
> > +     wrmsrl(MSR_IA32_U_CET, 0);
> >   }
> >   
> > +
> 
> Stray newline.

Oops, will clean that up. Thanks.

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks
  2022-10-14 16:20   ` Borislav Petkov
@ 2022-10-14 19:35     ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 19:35 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, linux-doc, Lutomirski,
	Andy, arnd, jamorris, Moreira, Joao, tglx, mike.kravetz, x86,
	Yang, Weijiang, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, 2022-10-14 at 18:20 +0200, Borislav Petkov wrote:
> On Thu, Sep 29, 2022 at 03:29:00PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > The Control-Flow Enforcement Technology contains two related
> > features,
> > one of which is Shadow Stacks. Future patches will utilize this
> > feature
> > for shadow stack support in KVM, so add a CPU feature flags for
> > Shadow
> > Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
> > 
> > To protect shadow stack state from malicious modification, the
> > registers
> > are only accessible in supervisor mode. This implementation
> > context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK
> > depend
> > on XSAVES.
> > 
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Cc: Kees Cook <keescook@chromium.org>
> 
> Reviewed-by: Borislav Petkov <bp@suse.de>

Thanks!

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack
  2022-10-14 18:15     ` Edgecombe, Rick P
@ 2022-10-14 19:44       ` Borislav Petkov
  0 siblings, 0 replies; 241+ messages in thread
From: Borislav Petkov @ 2022-10-14 19:44 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, linux-doc, Lutomirski,
	Andy, arnd, jamorris, Moreira, Joao, tglx, mike.kravetz, x86,
	Yang, Weijiang, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Fri, Oct 14, 2022 at 06:15:30PM +0000, Edgecombe, Rick P wrote:
> Andrew Cooper has suggested to create some software cpu features to
> differentiate user/supervisor CET feature use. It could replace
> HAS_KERNEL_IBT. Any objections to that versus Kconfig symbols?

Sure, except you can't use them in

arch/x86/include/asm/idtentry.h:8:#define IDT_ALIGN     (8 * (1 + HAS_KERNEL_IBT))

as that gets used in asm code.
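
A small userspace sketch of the distinction, with made-up TOY_* names:
a macro that expands to an integer literal can feed other constants, #if
and static assertions (and so could be seen by the assembler), while a
value that is only known at runtime cannot.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for HAS_KERNEL_IBT and IDT_ALIGN. */
#define TOY_HAS_KERNEL_IBT	1
#define TOY_IDT_ALIGN		(8 * (1 + TOY_HAS_KERNEL_IBT))

/* Usable by the preprocessor, hence also from preprocessed asm. */
#if TOY_IDT_ALIGN != 8 && TOY_IDT_ALIGN != 16
#error "unexpected IDT alignment"
#endif

/* Usable wherever the compiler needs an integer constant expression. */
_Static_assert(TOY_IDT_ALIGN == 16, "alignment is known at build time");

/* A runtime feature flag: fine in C code, but not a build-time constant. */
static bool toy_runtime_ibt = true;

static unsigned int toy_runtime_align(void)
{
	return 8 * (1 + (unsigned int)toy_runtime_ibt);
}
```

Both give 16 here, but only the first one is available when the
alignment has to be baked into asm.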

But you don't have to do this in this patchset - it is huge already
anyway. I have this thing on my todo so I'll get to it eventually.
Unless you're itching to remove it yourself - then I won't stand in the
way.

:-)

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2022-09-29 22:29 ` [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
  2022-10-03 17:40   ` Kees Cook
@ 2022-10-15  9:46   ` Borislav Petkov
  2022-10-17 18:57     ` Edgecombe, Rick P
  1 sibling, 1 reply; 241+ messages in thread
From: Borislav Petkov @ 2022-10-15  9:46 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H . J . Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma,
	Yu-cheng Yu

On Thu, Sep 29, 2022 at 03:29:02PM -0700, Rick Edgecombe wrote:
> Both XSAVE state components are supervisor states, even the state
> controlling user-mode operation. This is a departure from earlier features
> like protection keys where the PKRU state a normal user (non-supervisor)
^^^^^

A verb is missing in that sentence.

> +	"x87 floating point registers"			,
> +	"SSE registers"					,
> +	"AVX registers"					,
> +	"MPX bounds registers"				,
> +	"MPX CSR"					,
> +	"AVX-512 opmask"				,
> +	"AVX-512 Hi256"					,
> +	"AVX-512 ZMM_Hi256"				,
> +	"Processor Trace (unused)"			,
> +	"Protection Keys User registers"		,
> +	"PASID state"					,
> +	"Control-flow User registers"			,
> +	"Control-flow Kernel registers (unused)"	,
> +	"unknown xstate feature"			,
> +	"unknown xstate feature"			,
> +	"unknown xstate feature"			,
> +	"unknown xstate feature"			,
> +	"AMX Tile config"				,
> +	"AMX Tile data"					,
> +	"unknown xstate feature"			,

What Kees said. :)

> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_YMM,       struct ymmh_struct);
> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_BNDREGS,   struct mpx_bndreg_state);
> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_BNDCSR,    struct mpx_bndcsr_state);
> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_PKRU,      struct pkru_state);
> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
> +	XCHECK_SZ(&chked, sz, nr, XFEATURE_CET_USER,  struct cet_user_state);

That looks silly. I wonder if you could do:

	switch (nr) {
	case XFEATURE_YMM:	XCHECK_SZ(sz, XFEATURE_YMM, struct ymmh_struct);	  return;
	case XFEATURE_BNDREGS:	XCHECK_SZ(sz, XFEATURE_BNDREGS, struct mpx_bndreg_state); return;
	case ...
	...
	default:
		/* that falls into the WARN etc */

and then you get rid of the if check in the macro itself and leave the
macro be a dumb, unconditional one.

Hmmm.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2022-10-15  9:46   ` Borislav Petkov
@ 2022-10-17 18:57     ` Edgecombe, Rick P
  2022-10-17 19:33       ` Borislav Petkov
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-17 18:57 UTC (permalink / raw)
  To: bp
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, linux-doc, Lutomirski,
	Andy, arnd, jamorris, Moreira, Joao, tglx, mike.kravetz, x86,
	Yang, Weijiang, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Sat, 2022-10-15 at 11:46 +0200, Borislav Petkov wrote:
> On Thu, Sep 29, 2022 at 03:29:02PM -0700, Rick Edgecombe wrote:
> > Both XSAVE state components are supervisor states, even the state
> > controlling user-mode operation. This is a departure from earlier
> > features
> > like protection keys where the PKRU state a normal user (non-
> > supervisor)
> 
> ^^^^^
> 
> A verb is missing in that sentence.

Oops yes.

> 
> > +	"x87 floating point registers"			,
> > +	"SSE registers"					,
> > +	"AVX registers"					,
> > +	"MPX bounds registers"				,
> > +	"MPX CSR"					,
> > +	"AVX-512 opmask"				,
> > +	"AVX-512 Hi256"					,
> > +	"AVX-512 ZMM_Hi256"				,
> > +	"Processor Trace (unused)"			,
> > +	"Protection Keys User registers"		,
> > +	"PASID state"					,
> > +	"Control-flow User registers"			,
> > +	"Control-flow Kernel registers (unused)"	,
> > +	"unknown xstate feature"			,
> > +	"unknown xstate feature"			,
> > +	"unknown xstate feature"			,
> > +	"unknown xstate feature"			,
> > +	"AMX Tile config"				,
> > +	"AMX Tile data"					,
> > +	"unknown xstate feature"			,
> 
> What Kees said. :)

Sure, I'll adjust the comma.

> 
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_YMM,       struct
> > ymmh_struct);
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_BNDREGS,   struct
> > mpx_bndreg_state);
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_BNDCSR,    struct
> > mpx_bndcsr_state);
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_OPMASK,    struct
> > avx_512_opmask_state);
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_ZMM_Hi256, struct
> > avx_512_zmm_uppers_state);
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_Hi16_ZMM,  struct
> > avx_512_hi16_state);
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_PKRU,      struct
> > pkru_state);
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_PASID,     struct
> > ia32_pasid_state);
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_XTILE_CFG, struct
> > xtile_cfg);
> > +	XCHECK_SZ(&chked, sz, nr, XFEATURE_CET_USER,  struct
> > cet_user_state);
> 
> That looks silly. I wonder if you could do:
> 
> 	switch (nr) {
> 	case XFEATURE_YMM:	XCHECK_SZ(sz, XFEATURE_YMM, struct
> ymmh_struct);	  return;
> 	case XFEATURE_BNDREGS:	XCHECK_SZ(sz, XFEATURE_BNDREGS,
> struct mpx_bndreg_state); return;
> 	case ...
> 	...
> 	default:
> 		/* that falls into the WARN etc */
> 
> and then you get rid of the if check in the macro itself and leave
> the
> macro be a dumb, unconditional one.
> 
> Hmmm.
> 

Hmm yea. Another reason the actual define is passed in is that the
macro wants to stringify the XFEATURE define in order to generate the
message like this:
XFEATURE_YMM: struct is 123 bytes, cpu state is 456 bytes

The exact format of the message is probably not too critical though. If
instead it used xfeature_names[], it could be:
[AVX registers]: struct is 123 bytes, cpu state is 456 bytes

The full block would then look like this (along the lines of what you had):
switch (nr) {
case XFEATURE_YMM:	  return XCHECK_SZ(sz, nr, struct ymmh_struct);
case XFEATURE_BNDREGS:	  return XCHECK_SZ(sz, nr, struct mpx_bndreg_state);
case XFEATURE_BNDCSR:	  return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state);
case XFEATURE_OPMASK:	  return XCHECK_SZ(sz, nr, struct avx_512_opmask_state);
case XFEATURE_ZMM_Hi256:  return XCHECK_SZ(sz, nr, struct avx_512_zmm_uppers_state);
case XFEATURE_Hi16_ZMM:	  return XCHECK_SZ(sz, nr, struct avx_512_hi16_state);
case XFEATURE_PKRU:	  return XCHECK_SZ(sz, nr, struct pkru_state);
case XFEATURE_PASID:	  return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
case XFEATURE_XTILE_CFG:  return XCHECK_SZ(sz, nr, struct xtile_cfg);
case XFEATURE_CET_USER:	  return XCHECK_SZ(sz, nr, struct cet_user_state);
case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
default:
	WARN_ONCE(1, "no structure for xstate: %d\n", nr);
	XSTATE_WARN_ON(1);
	return false;
}

I like how it fits the XFEATURE_XTILE_DATA check in with the rest.
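
For reference, a userspace sketch of what the "dumb" macro plus a names
table could look like. Everything prefixed toy_/TOY_ is hypothetical
(feature numbers, structs, names table and the helper); it only shows the
pattern of doing the dispatch in the switch and keeping the macro
unconditional.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical feature numbers and state structs. */
enum toy_xfeature { TOY_XFEATURE_A = 2, TOY_XFEATURE_B = 3, TOY_XFEATURE_MAX = 4 };

struct toy_state_a { unsigned char regs[256]; };
struct toy_state_b { unsigned char regs[64]; };

static const char * const toy_xfeature_names[TOY_XFEATURE_MAX] = {
	[TOY_XFEATURE_A] = "A",
	[TOY_XFEATURE_B] = "B",
};

/* Unconditional check: the caller has already picked the right struct. */
static bool toy_check_size(size_t cpu_sz, int nr, size_t struct_sz)
{
	if (cpu_sz == struct_sz)
		return true;
	fprintf(stderr, "[%s]: struct is %zu bytes, cpu state is %zu bytes\n",
		toy_xfeature_names[nr], struct_sz, cpu_sz);
	return false;
}

#define TOY_XCHECK_SZ(sz, nr, type) toy_check_size((sz), (nr), sizeof(type))

/* The switch does the dispatch, so the macro needs no "if (nr == ...)". */
static bool toy_check_against_struct(int nr, size_t sz)
{
	switch (nr) {
	case TOY_XFEATURE_A: return TOY_XCHECK_SZ(sz, nr, struct toy_state_a);
	case TOY_XFEATURE_B: return TOY_XCHECK_SZ(sz, nr, struct toy_state_b);
	default:
		fprintf(stderr, "no structure for xstate: %d\n", nr);
		return false;
	}
}
```

Since the name comes from the table indexed by nr, the macro no longer
needs the literal define in order to stringify it.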

Thanks!

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  2022-10-17 18:57     ` Edgecombe, Rick P
@ 2022-10-17 19:33       ` Borislav Petkov
  0 siblings, 0 replies; 241+ messages in thread
From: Borislav Petkov @ 2022-10-17 19:33 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma,
	linux-arch, kcc, pavel, oleg, hjl.tools, linux-doc, Lutomirski,
	Andy, arnd, jamorris, Moreira, Joao, tglx, mike.kravetz, x86,
	Yang, Weijiang, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, Oct 17, 2022 at 06:57:13PM +0000, Edgecombe, Rick P wrote:
> Hmm yea. Another reason the actual define is passed in is that the
> macro wants to stringify the XFEATURE define in order to generate the
> message like this:
> XFEATURE_YMM: struct is 123 bytes, cpu state is 456 bytes
> 
> The exact format of the message is probably not too critical though. If
> instead it used xfeature_names[], it could be:
> [AVX registers]: struct is 123 bytes, cpu state is 456 bytes

Bah, "registers", that made me look at the thing. Yeah, not sure if all
those "registers" strings even matter there.

[AVX]: struct is 123 bytes, cpu state is 456 bytes

looks good enough to me too. But WTH do I know.

> The full block looks like (like you had):
> switch (nr) {
> case XFEATURE_YMM:	  return XCHECK_SZ(sz, nr, struct ymmh_struct);
> case XFEATURE_BNDREGS:	  return XCHECK_SZ(sz, nr, struct
> mpx_bndreg_state);
> case XFEATURE_BNDCSR:	  return XCHECK_SZ(sz, nr, struct
> mpx_bndcsr_state);
> case XFEATURE_OPMASK:	  return XCHECK_SZ(sz, nr, struct
> avx_512_opmask_state);
> case XFEATURE_ZMM_Hi256:  return XCHECK_SZ(sz, nr, struct
> avx_512_zmm_uppers_state);
> case XFEATURE_Hi16_ZMM:	  return XCHECK_SZ(sz, nr, struct
> avx_512_hi16_state);
> case XFEATURE_PKRU: 	  return XCHECK_SZ(sz, nr, struct pkru_state);
> case XFEATURE_PASID: 	  return XCHECK_SZ(sz, nr, struct
> ia32_pasid_state);
> case XFEATURE_XTILE_CFG:  return XCHECK_SZ(sz, nr, struct xtile_cfg);
> case XFEATURE_CET_USER:	  return XCHECK_SZ(sz, nr, struct
> cet_user_state);
> case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return
> true;
> default:
> 	WARN_ONCE(1, "no structure for xstate: %d\n", nr);
> 	XSTATE_WARN_ON(1);
> 	return false;
> }
> 
> I like how it fits the XFEATURE_XTILE_DATA check in with the rest.

Yap, nice and straight-forward pattern. :)

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04 20:50               ` Nathan Chancellor
  2022-10-04 21:17                 ` H. Peter Anvin
@ 2022-10-20 21:22                 ` Edgecombe, Rick P
  1 sibling, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-20 21:22 UTC (permalink / raw)
  To: nathan
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, babu.moger, peterz,
	rdunlap, keescook, dave.hansen, kirill.shutemov, Eranian,
	Stephane, linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, pavel, Lutomirski, Andy,
	thomas.lendacky, jamorris, arnd, Moreira, Joao, tglx, x86,
	mike.kravetz, linux-doc, gustavoars, john.allen, rppt, Shankar,
	Ravi V, ndesaulniers, Hansen, Dave, mingo, corbet, linux-api,
	linux-kernel, Yang, Weijiang, gorcunov

On Tue, 2022-10-04 at 13:50 -0700, Nathan Chancellor wrote:
> On Tue, Oct 04, 2022 at 08:34:54PM +0000, Edgecombe, Rick P wrote:
> > On Tue, 2022-10-04 at 14:43 -0500, John Allen wrote:
> > > On 10/4/22 10:47 AM, Nathan Chancellor wrote:
> > > > Hi Kees,
> > > > 
> > > > On Mon, Oct 03, 2022 at 09:54:26PM -0700, Kees Cook wrote:
> > > > > On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen wrote:
> > > > > > On 10/3/22 16:57, Kees Cook wrote:
> > > > > > > On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick Edgecombe
> > > > > > > wrote:
> > > > > > > > Shadow stack is supported on newer AMD processors, but
> > > > > > > > the
> > > > > > > > kernel
> > > > > > > > implementation has not been tested on them. Prevent
> > > > > > > > basic
> > > > > > > > issues from
> > > > > > > > showing up for normal users by disabling shadow stack
> > > > > > > > on
> > > > > > > > all CPUs except
> > > > > > > > Intel until it has been tested. At which point the
> > > > > > > > limitation should be
> > > > > > > > removed.
> > > > > > > > 
> > > > > > > > Signed-off-by: Rick Edgecombe <
> > > > > > > > rick.p.edgecombe@intel.com>
> > > > > > > 
> > > > > > > So running the selftests on an AMD system is sufficient
> > > > > > > to
> > > > > > > drop this
> > > > > > > patch?
> > > > > > 
> > > > > > Yes, that's enough.
> > > > > > 
> > > > > > I _thought_ the AMD folks provided some tested-by's at some
> > > > > > point in the
> > > > > > past.  But, maybe I'm confusing this for one of the other
> > > > > > shared
> > > > > > features.  Either way, I'm sure no tested-by's were dropped
> > > > > > on
> > > > > > purpose.
> > > > > > 
> > > > > > I'm sure Rick is eager to trim down his series and this
> > > > > > would
> > > > > > be a great
> > > > > > patch to drop.  Does anyone want to make that easy for
> > > > > > Rick?
> > > > > > 
> > > > > > <hint> <hint>
> > > > > 
> > > > > Hey Gustavo, Nathan, or Nick! I know y'all have some fancy
> > > > > AMD
> > > > > testing
> > > > > rigs. Got a moment to spin up this series and run the
> > > > > selftests?
> > > > > :)
> > > > 
> > > > I do have access to a system with an EPYC 7513, which does have
> > > > Shadow
> > > > Stack support (I can see 'shstk' in the "Flags" section of
> > > > lscpu
> > > > with
> > > > this series). As far as I understand it, AMD only added Shadow
> > > > Stack
> > > > with Zen 3; my regular AMD test system is Zen 2 (probably
> > > > should
> > > > look at
> > > > procuring a Zen 3 or Zen 4 one at some point).
> > > > 
> > > > I applied this series on top of 6.0 and reverted this change
> > > > then
> > > > booted
> > > > it on that system. After building the selftest (which did
> > > > require
> > > > 'make headers_install' and a small addition to make it build
> > > > beyond
> > > > that, see below), I ran it and this was the result. I am not
> > > > sure
> > > > if
> > > > that is expected or not but the other results seem promising
> > > > for
> > > > dropping this patch.
> > > > 
> > > >     $ ./test_shadow_stack_64
> > > >     [INFO]  new_ssp = 7f8a36c9fff8, *new_ssp = 7f8a36ca0001
> > > >     [INFO]  changing ssp from 7f8a374a0ff0 to 7f8a36c9fff8
> > > >     [INFO]  ssp is now 7f8a36ca0000
> > > >     [OK]    Shadow stack pivot
> > > >     [OK]    Shadow stack faults
> > > >     [INFO]  Corrupting shadow stack
> > > >     [INFO]  Generated shadow stack violation successfully
> > > >     [OK]    Shadow stack violation test
> > > >     [INFO]  Gup read -> shstk access success
> > > >     [INFO]  Gup write -> shstk access success
> > > >     [INFO]  Violation from normal write
> > > >     [INFO]  Gup read -> write access success
> > > >     [INFO]  Violation from normal write
> > > >     [INFO]  Gup write -> write access success
> > > >     [INFO]  Cow gup write -> write access success
> > > >     [OK]    Shadow gup test
> > > >     [INFO]  Violation from shstk access
> > > >     [OK]    mprotect() test
> > > >     [OK]    Userfaultfd test
> > > >     [FAIL]  Alt shadow stack test
> > > 
> > > The selftest is looking OK on my system (Dell PowerEdge R6515 w/
> > > EPYC
> > > 7713). I also just pulled a fresh 6.0 kernel and applied the
> > > series
> > > including the fix Nathan mentions below.
> > > 
> > > $ tools/testing/selftests/x86/test_shadow_stack_64
> > > [INFO]  new_ssp = 7f30cccc5ff8, *new_ssp = 7f30cccc6001
> > > [INFO]  changing ssp from 7f30cd4c6ff0 to 7f30cccc5ff8
> > > [INFO]  ssp is now 7f30cccc6000
> > > [OK]    Shadow stack pivot
> > > [OK]    Shadow stack faults
> > > [INFO]  Corrupting shadow stack
> > > [INFO]  Generated shadow stack violation successfully
> > > [OK]    Shadow stack violation test
> > > [INFO]  Gup read -> shstk access success
> > > [INFO]  Gup write -> shstk access success
> > > [INFO]  Violation from normal write
> > > [INFO]  Gup read -> write access success
> > > [INFO]  Violation from normal write
> > > [INFO]  Gup write -> write access success
> > > [INFO]  Cow gup write -> write access success
> > > [OK]    Shadow gup test
> > > [INFO]  Violation from shstk access
> > > [OK]    mprotect() test
> > > [OK]    Userfaultfd test
> > > [OK]    Alt shadow stack test.
> > 
> > Thanks for the testing. Based on the test, I wonder if this could
> > be a
> > SW bug. Nathan, could I send you a tweaked test with some more
> > debug
> > information?
> 
> Yes, more than happy to help you look into this further!

Indeed this was a SW bug and had nothing to do with the CPU model. The
altshstk selftest was not fully initializing the stack_t struct, and
was only getting lucky with some compilers. Thanks to Nathan for helping
me debug it.
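
For anyone hitting something similar, the general pattern is to zero the
whole stack_t before filling it in, so no field (ss_flags in particular)
is left holding stack garbage. The sketch below is a generic userspace
example, not the actual selftest change; install_alt_stack() is a made-up
helper name.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/*
 * Install buf as the alternate signal stack.
 * Returns 0 on success, -1 on failure (as sigaltstack() does).
 */
static int install_alt_stack(void *buf, size_t size)
{
	stack_t ss;

	/* Zero every field first; an uninitialized ss_flags can fail randomly. */
	memset(&ss, 0, sizeof(ss));
	ss.ss_sp = buf;
	ss.ss_size = size;
	ss.ss_flags = 0;

	return sigaltstack(&ss, NULL);
}
```

Without the memset() (or an equivalent "= { 0 }" initializer), whether
the call works depends on what happened to be on the stack, which would
match the "lucky on some compilers" behavior.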



^ permalink raw reply	[flat|nested] 241+ messages in thread

* Re: [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-10-03 19:43   ` Kees Cook
  2022-10-03 20:04     ` Dave Hansen
@ 2022-10-20 21:29     ` Edgecombe, Rick P
  2022-10-20 22:54       ` Kees Cook
  1 sibling, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-20 21:29 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

I just realized that I never responded to most of this feedback, because
the later conversation focused on one area. It all seems good (thanks!),
except that I'm not sure about the items below:

On Mon, 2022-10-03 at 12:43 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:21PM -0700, Rick Edgecombe wrote:
> > +
> > +	mmap_write_lock(mm);
> > +	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
> > +		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
> 
> This will use the mmap base address offset randomization, I guess?

Yes.

> 
> > +
> > +	mmap_write_unlock(mm);
> > +
> > +	return addr;
> > +}
> > +
> > +static void unmap_shadow_stack(u64 base, u64 size)
> > +{
> > +	while (1) {
> > +		int r;
> > +
> > +		r = vm_munmap(base, size);
> > +
> > +		/*
> > +		 * vm_munmap() returns -EINTR when mmap_lock is held by
> > +		 * something else, and that lock should not be held for
> > a
> > +		 * long time.  Retry it for the case.
> > +		 */
> > +		if (r == -EINTR) {
> > +			cond_resched();
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * For all other types of vm_munmap() failure, either
> > the
> > +		 * system is out of memory or there is bug.
> > +		 */
> > +		WARN_ON_ONCE(r);
> > +		break;
> > +	}
> > +}
> > +
> > +int shstk_setup(void)
> 
> Only called locally. Make static?
> 
> > +{
> > +	struct thread_shstk *shstk = &current->thread.shstk;
> > +	unsigned long addr, size;
> > +
> > +	/* Already enabled */
> > +	if (feature_enabled(CET_SHSTK))
> > +		return 0;
> > +
> > +	/* Also not supported for 32 bit */
> > +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> > in_ia32_syscall())
> > +		return -EOPNOTSUPP;
> > +
> > +	size = PAGE_ALIGN(min_t(unsigned long long,
> > rlimit(RLIMIT_STACK), SZ_4G));
> > +	addr = alloc_shstk(size);
> > +	if (IS_ERR_VALUE(addr))
> > +		return PTR_ERR((void *)addr);
> > +
> > +	fpu_lock_and_load();
> > +	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
> > +	wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
> > +	fpregs_unlock();
> > +
> > +	shstk->base = addr;
> > +	shstk->size = size;
> > +	feature_set(CET_SHSTK);
> > +
> > +	return 0;
> > +}
> > +
> > +void reset_thread_shstk(void)
> > +{
> > +	memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
> > +	current->thread.features = 0;
> > +	current->thread.features_locked = 0;
> > +}
> 
> If features is always going to be tied to shstk, why not put them in
> the
> shstk struct?

CET and LAM used to share an enabling interface and also kernel side
enablement state tracking. But in the end LAM got its own enabling
interface. Thomas had suggested that they could share a state field on
the kernel side. But then LAM already had enough state tracking for
its needs.

Shadow stack used to track enabling with the fields in the shstk struct
that keep track of the thread's shadow stack. But then we added WRSS,
which needs another field to keep track of its status. So I decided to
keep the 'features' field and use it to track all the CET features.
I left it outside of the shstk struct so it looks usable for any other
features that might need a status bit. I can definitely compile it out
when there is no user shadow stack.
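
As a rough sketch of the idea above, the following userspace model keeps one bit per feature in a shared 'features' word, separate from the per-thread shadow stack bookkeeping. The bit values, the CET_WRSS name, the struct name and the explicit thread argument are all assumptions made for illustration; the helpers in the patch operate on current and may differ in detail.

```c
#include <stdbool.h>

/* Hypothetical bit assignments, chosen only for this sketch. */
#define CET_SHSTK	(1UL << 0)
#define CET_WRSS	(1UL << 1)

/* Stand-in for the two tracking words kept in the thread struct. */
struct thread_model {
	unsigned long features;		/* features currently enabled */
	unsigned long features_locked;	/* features that may no longer change */
};

static inline bool feature_enabled(const struct thread_model *t,
				   unsigned long feature)
{
	return (t->features & feature) != 0;
}

static inline void feature_set(struct thread_model *t, unsigned long feature)
{
	t->features |= feature;
}

static inline void feature_clr(struct thread_model *t, unsigned long feature)
{
	t->features &= ~feature;
}

/* Mirrors the reset in the quoted patch: clear both tracking words. */
static inline void reset_features(struct thread_model *t)
{
	t->features = 0;
	t->features_locked = 0;
}
```

With this layout a second feature such as WRSS only needs a new bit, not a new field, and the shstk struct is left to describe the stack itself (base and size).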

snip


> > +
> > +void shstk_free(struct task_struct *tsk)
> > +{
> > +	struct thread_shstk *shstk = &tsk->thread.shstk;
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> > +	    !feature_enabled(CET_SHSTK))
> > +		return;
> > +
> > +	if (!tsk->mm)
> > +		return;
> > +
> > +	unmap_shadow_stack(shstk->base, shstk->size);
> 
> I feel like base and size should be zeroed here?
> 

The code used to use shstk->base and shstk->size to track whether
shadow stack was enabled. I'm not sure why they should be zeroed now.
Is it just to be defensive, or did you see a concrete issue?

* Re: [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk
  2022-10-05 22:58       ` Andrew Cooper
@ 2022-10-20 21:51         ` Edgecombe, Rick P
  0 siblings, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-20 21:51 UTC (permalink / raw)
  To: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Eranian, Stephane, kirill.shutemov, dave.hansen,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	bp, andrew.cooper3, oleg, Yang, Weijiang, Lutomirski, Andy,
	hjl.tools, jamorris, arnd, Moreira, Joao, tglx, pavel,
	mike.kravetz, x86, linux-doc, john.allen, rppt, mingo, Shankar,
	Ravi V, corbet, linux-kernel, linux-api, gorcunov
  Cc: Yu, Yu-cheng

On Wed, 2022-10-05 at 22:58 +0000, Andrew Cooper wrote:
> On 05/10/2022 23:47, Edgecombe, Rick P wrote:
> > On Wed, 2022-10-05 at 02:43 +0000, Andrew Cooper wrote:
> > > On 29/09/2022 23:29, Rick Edgecombe wrote:
> > > > diff --git a/arch/x86/include/asm/special_insns.h
> > > > b/arch/x86/include/asm/special_insns.h
> > > > index 35f709f619fb..f096f52bd059 100644
> > > > --- a/arch/x86/include/asm/special_insns.h
> > > > +++ b/arch/x86/include/asm/special_insns.h
> > > > @@ -223,6 +223,19 @@ static inline void clwb(volatile void
> > > > *__p)
> > > >                 : [pax] "a" (p));
> > > >    }
> > > >    
> > > > +#ifdef CONFIG_X86_SHADOW_STACK
> > > > +static inline int write_user_shstk_64(u64 __user *addr, u64
> > > > val)
> > > > +{
> > > > +     asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> > > > +                       _ASM_EXTABLE(1b, %l[fail])
> > > > +                       :: [addr] "r" (addr), [val] "r" (val)
> > > > +                       :: fail);
> > > 
> > > "1: wrssq %[val], %[addr]\n"
> > > _ASM_EXTABLE(1b, %l[fail])
> > > : [addr] "+m" (*addr)
> > > : [val] "r" (val)
> > > :: fail
> > > 
> > > Otherwise you've failed to tell the compiler that you wrote to
> > > *addr.
> > > 
> > > With that fixed, it's not volatile because there are no
> > > unexpressed
> > > side
> > > effects.
> > 
> > Ok, thanks!
> 
> On further consideration, it should be "=m" not "+m", which is even
> less
> constrained, and even easier for an enterprising optimiser to try and
> do
> something useful with.

AFAICT this won't work on all GCC versions. Asm gotos did not always
support outputs. They are also implicitly volatile anyway. So I think
I'll have to leave it as is. But I can change the wrss one in the
selftest to "=m".

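To make the constraint discussion concrete, here is a small sketch of a plain (non-goto) asm statement that uses an "=m" output, which is the form suggested for the selftest. An ordinary MOV stands in for WRSS, since WRSS only works with shadow stack enabled; the function name store_u64() and the non-x86 fallback are assumptions added so the example builds anywhere.

```c
#include <stdint.h>

/*
 * Store val to *addr. The "=m" output tells the compiler that *addr is
 * written, so the statement does not need to be marked volatile. MOV is
 * used here in place of WRSS purely for illustration.
 */
static inline void store_u64(uint64_t *addr, uint64_t val)
{
#if defined(__x86_64__)
	__asm__("movq %[val], %[addr]"
		: [addr] "=m" (*addr)
		: [val] "r" (val));
#else
	/* Fallback so the sketch still compiles on other architectures. */
	*addr = val;
#endif
}
```

The asm goto variant in the kernel helper cannot be written this way on compilers that reject outputs on asm goto, which is the limitation mentioned above.
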
* Re: [PATCH v2 27/39] x86/cet/shstk: Handle signals for shadow stack
  2022-10-03 20:52   ` Kees Cook
@ 2022-10-20 22:08     ` Edgecombe, Rick P
  2022-10-20 22:57       ` Kees Cook
  0 siblings, 1 reply; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-20 22:08 UTC (permalink / raw)
  To: keescook
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

Kees, sorry for the delayed response. There was so much feedback, I
missed responding to some.

On Mon, 2022-10-03 at 13:52 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:24PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > When a signal is handled normally the context is pushed to the
> > stack
> > before handling it. For shadow stacks, since the shadow stack only
> > > tracks
> > return addresses, there isn't any state that needs to be pushed.
> > However,
> > > there are still a few things that need to be done. These things are
> > > userspace visible and will be kernel ABI for shadow stacks.
> > 
> > One is to make sure the restorer address is written to shadow
> > stack, since
> > the signal handler (if not changing ucontext) returns to the
> > restorer, and
> > the restorer calls sigreturn. So add the restorer on the shadow
> > stack
> > before handling the signal, so there is not a conflict when the
> > signal
> > handler returns to the restorer.
> > 
> > The other thing to do is to place some type of checkable token on
> > the
> > thread's shadow stack before handling the signal and check it
> > during
> > sigreturn. This is an extra layer of protection to hamper attackers
> > calling sigreturn manually as in SROP-like attacks.
> > 
> > For this token we can use the shadow stack data format defined
> > earlier.
> > Have the data pushed be the previous SSP. In the future the
> > sigreturn
> > might want to return back to a different stack. Storing the SSP
> > (instead
> > of a restore offset or something) allows for future functionality
> > that
> > may want to restore to a different stack.
> > 
> > So, when handling a signal push
> > >  - the previous SSP, stored in the shadow stack data format
> >  - the restorer address below the restore token.
> > 
> > In sigreturn, verify SSP is stored in the data format and pop the
> > shadow
> > stack.
> > 
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> > Cc: Florian Weimer <fweimer@redhat.com>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > 
> > ---
> > 
> > v2:
> >  - Switch to new shstk signal format
> > 
> > v1:
> >  - Use xsave helpers.
> >  - Expand commit log.
> > 
> > Yu-cheng v27:
> >  - Eliminate saving shadow stack pointer to signal context.
> > 
> > Yu-cheng v25:
> >  - Update commit log/comments for the sc_ext struct.
> >  - Use restorer address already calculated.
> >  - Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
> >  - Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
> >  - Eliminate writing to MSR_IA32_U_CET for shadow stack.
> >  - Change wrmsrl() to wrmsrl_safe() and handle error.
> > 
> >  arch/x86/ia32/ia32_signal.c |   1 +
> >  arch/x86/include/asm/cet.h  |   5 ++
> >  arch/x86/kernel/shstk.c     | 126 ++++++++++++++++++++++++++++++
> > ------
> >  arch/x86/kernel/signal.c    |  10 +++
> >  4 files changed, 123 insertions(+), 19 deletions(-)
> > 
> > diff --git a/arch/x86/ia32/ia32_signal.c
> > b/arch/x86/ia32/ia32_signal.c
> > index c9c3859322fa..88d71b9de616 100644
> > --- a/arch/x86/ia32/ia32_signal.c
> > +++ b/arch/x86/ia32/ia32_signal.c
> > @@ -34,6 +34,7 @@
> >  #include <asm/sigframe.h>
> >  #include <asm/sighandling.h>
> >  #include <asm/smap.h>
> > +#include <asm/cet.h>
> >  
> >  static inline void reload_segments(struct sigcontext_32 *sc)
> >  {
> > diff --git a/arch/x86/include/asm/cet.h
> > b/arch/x86/include/asm/cet.h
> > index 924de99e0c61..8c6fab9f402a 100644
> > --- a/arch/x86/include/asm/cet.h
> > +++ b/arch/x86/include/asm/cet.h
> > @@ -6,6 +6,7 @@
> >  #include <linux/types.h>
> >  
> >  struct task_struct;
> > +struct ksignal;
> >  
> >  struct thread_shstk {
> >  	u64	base;
> > @@ -22,6 +23,8 @@ int shstk_alloc_thread_stack(struct task_struct
> > *p, unsigned long clone_flags,
> >  void shstk_free(struct task_struct *p);
> >  int shstk_disable(void);
> >  void reset_thread_shstk(void);
> > +int setup_signal_shadow_stack(struct ksignal *ksig);
> > +int restore_signal_shadow_stack(void);
> >  #else
> >  static inline long cet_prctl(struct task_struct *task, int option,
> >  		      unsigned long features) { return -EINVAL; }
> > @@ -33,6 +36,8 @@ static inline int shstk_alloc_thread_stack(struct
> > task_struct *p,
> >  static inline void shstk_free(struct task_struct *p) {}
> >  static inline int shstk_disable(void) { return -EOPNOTSUPP; }
> >  static inline void reset_thread_shstk(void) {}
> > +static inline int setup_signal_shadow_stack(struct ksignal *ksig)
> > { return 0; }
> > +static inline int restore_signal_shadow_stack(void) { return 0; }
> >  #endif /* CONFIG_X86_SHADOW_STACK */
> >  
> >  #endif /* __ASSEMBLY__ */
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index 8904aef487bf..04442134aadd 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -227,41 +227,129 @@ static int get_shstk_data(unsigned long
> > *data, unsigned long __user *addr)
> >  }
> >  
> >  /*
> > - * Verify the user shadow stack has a valid token on it, and then
> > set
> > - * *new_ssp according to the token.
> > + * Create a restore token on shadow stack, and then push the user-
> > mode
> > + * function return address.
> >   */
> > -static int shstk_check_rstor_token(unsigned long *new_ssp)
> > +static int shstk_setup_rstor_token(unsigned long ret_addr,
> > unsigned long *new_ssp)
> 
> Oh, hrm. Prior patch defines shstk_check_rstor_token() and
> doesn't call it. This patch removes it. :P Can you please remove
> shstk_check_rstor_token() from the prior patch?

Yes, this function is not needed until the alt shadow stack stuff. It
got mangled across the earlier patches. I have removed it altogether
now. Thanks.

> 
> >  {
> > -	unsigned long token_addr;
> > -	unsigned long token;
> > +	unsigned long ssp, token_addr;
> > +	int err;
> > +
> > +	if (!ret_addr)
> > +		return -EINVAL;
> > +
> > +	ssp = get_user_shstk_addr();
> > +	if (!ssp)
> > +		return -EINVAL;
> > +
> > +	err = create_rstor_token(ssp, &token_addr);
> > +	if (err)
> > +		return err;
> > +
> > +	ssp = token_addr - sizeof(u64);
> > +	err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
> > +
> > +	if (!err)
> > +		*new_ssp = ssp;
> > +
> > +	return err;
> > +}
> > +
> > +static int shstk_push_sigframe(unsigned long *ssp)
> > +{
> > +	unsigned long target_ssp = *ssp;
> > +
> > +	/* Token must be aligned */
> > +	if (!IS_ALIGNED(*ssp, 8))
> > +		return -EINVAL;
> >  
> > -	token_addr = get_user_shstk_addr();
> > -	if (!token_addr)
> > +	if (!IS_ALIGNED(target_ssp, 8))
> >  		return -EINVAL;
> >  
> > -	if (get_user(token, (unsigned long __user *)token_addr))
> > +	*ssp -= SS_FRAME_SIZE;
> > +	if (put_shstk_data((void *__user)*ssp, target_ssp))
> >  		return -EFAULT;
> >  
> > -	/* Is mode flag correct? */
> > -	if (!(token & BIT(0)))
> > +	return 0;
> > +}
> > +
> > +
> > +static int shstk_pop_sigframe(unsigned long *ssp)
> > +{
> > +	unsigned long token_addr;
> > +	int err;
> > +
> > +	err = get_shstk_data(&token_addr, (unsigned long __user
> > *)*ssp);
> > +	if (unlikely(err))
> > +		return err;
> > +
> > +	/* Restore SSP aligned? */
> > +	if (unlikely(!IS_ALIGNED(token_addr, 8)))
> >  		return -EINVAL;
> 
> Why doesn't this always fail, given BIT(0) being set? I don't see it
> getting cleared until the end of this function.

Because it isn't a normal token. It was an address in the "data format",
which has bit 63 set. get_shstk_data() then cleared bit 63, making it a
normal address.
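
A small userspace model may make the distinction clearer. It assumes only what is stated in this thread: a data-format entry is an address with bit 63 set, and reading it back clears bit 63. The model_* names and the SHSTK_DATA_BIT macro are hypothetical, and the real put_shstk_data()/get_shstk_data() operate on user shadow stack memory rather than plain pointers.

```c
#include <errno.h>
#include <stdint.h>

/* Marker bit for the "data format"; the macro name is an assumption. */
#define SHSTK_DATA_BIT	(1ULL << 63)

/* Model of writing a data-format entry: tag the value with bit 63. */
static inline int model_put_shstk_data(uint64_t *slot, uint64_t data)
{
	/* Bit 63 must be free in the value, or it could not be recovered. */
	if (data & SHSTK_DATA_BIT)
		return -EINVAL;

	*slot = data | SHSTK_DATA_BIT;
	return 0;
}

/* Model of reading it back: require bit 63, then strip it. */
static inline int model_get_shstk_data(uint64_t *data, const uint64_t *slot)
{
	uint64_t raw = *slot;

	/* Return addresses and restore tokens have bit 63 clear. */
	if (!(raw & SHSTK_DATA_BIT))
		return -EINVAL;

	*data = raw & ~SHSTK_DATA_BIT;
	return 0;
}
```

In this model a restore token such as 7f30cccc6001 (bit 0 set, bit 63 clear) is rejected by the read side, while a stored SSP comes back as a plain 8-byte-aligned address, so the later IS_ALIGNED() check passes.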

> 
> >  
> > -	/* Is busy flag set? */
> > -	if (token & BIT(1))
> > +	/* SSP in userspace? */
> > +	if (unlikely(token_addr >= TASK_SIZE_MAX))
> >  		return -EINVAL;
> 
> BIT(63) already got cleared by here (in get_shstk_data()), but yes,
> this is still a reasonable check.

Good point. I guess I can leave it. Thanks.

> 
> >  
> > -	/* Mask out flags */
> > -	token &= ~3UL;
> > +	*ssp = token_addr;
> > +
> > +	return 0;
> > +}
> 
> 

* Re: [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support
  2022-10-20 21:29     ` Edgecombe, Rick P
@ 2022-10-20 22:54       ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-20 22:54 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, Oct 20, 2022 at 09:29:38PM +0000, Edgecombe, Rick P wrote:
> The code used to use shstk->base and shstk->size to keep track of if
> shadow stack was enabled. I'm not sure why to zero it now. Just
> defensively or did you see a concrete issue?

Just to be defensive. It's not a fast path by any means, so it's better
to have values that make a bit of sense there. *shrug* It just stood out
as feeling "missing" while I was reading the code.

-- 
Kees Cook

* Re: [PATCH v2 27/39] x86/cet/shstk: Handle signals for shadow stack
  2022-10-20 22:08     ` Edgecombe, Rick P
@ 2022-10-20 22:57       ` Kees Cook
  0 siblings, 0 replies; 241+ messages in thread
From: Kees Cook @ 2022-10-20 22:57 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap, Yu,
	Yu-cheng, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, linux-arch, kcc,
	bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski, Andy, pavel,
	arnd, Moreira, Joao, tglx, mike.kravetz, x86, linux-doc,
	jamorris, john.allen, rppt, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, Oct 20, 2022 at 10:08:17PM +0000, Edgecombe, Rick P wrote:
> Kees, sorry for the delayed response. There was so much feedback, I
> missed responding to some.

No worries! Most of my feedback was just to get help filling gaps in my
understanding. :) I appreciate the replies -- I'm looking forward to v3!

-- 
Kees Cook

* Re: [PATCH v2 01/39] Documentation/x86: Add CET description
  2022-10-13 21:28         ` Edgecombe, Rick P
  2022-10-13 22:15           ` H.J. Lu
@ 2022-10-26 21:59           ` Edgecombe, Rick P
  1 sibling, 0 replies; 241+ messages in thread
From: Edgecombe, Rick P @ 2022-10-26 21:59 UTC (permalink / raw)
  To: fweimer
  Cc: bsingharora, hpa, Syromiatnikov, Eugene, peterz, rdunlap,
	keescook, Yu, Yu-cheng, Eranian, Stephane, kirill.shutemov,
	dave.hansen, linux-mm, nadav.amit, jannh, dethoma, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, Lutomirski,
	Andy, jamorris, arnd, Moreira, Joao, tglx, pavel, mike.kravetz,
	x86, linux-doc, rppt, john.allen, mingo, Shankar, Ravi V, corbet,
	linux-kernel, linux-api, gorcunov

On Thu, 2022-10-13 at 14:28 -0700, Rick Edgecombe wrote:
> In the meantime we could have a new bit shstk_strict,
> that requests behavior like these patches implement, and kills the
> process on violation. Glibc/tools could add support for this strict
> bit
> and anyone that wants to more carefully compile with it could finally
> get shadow stack today. Then the implementation of the warn and
> continue mode could follow that, and glibc could map the original
> shstk
> bit to that kernel mode. So the old binaries would get there
> eventually, which is better than the continuing nothing they have
> today.

Hi,

Any thoughts on this proposal?

Thanks,

Rick

* Re: [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs
  2022-10-04 23:24                   ` Edgecombe, Rick P
@ 2022-11-03 17:39                     ` John Allen
  0 siblings, 0 replies; 241+ messages in thread
From: John Allen @ 2022-11-03 17:39 UTC (permalink / raw)
  To: Edgecombe, Rick P, hpa, nathan
  Cc: bsingharora, Syromiatnikov, Eugene, babu.moger, peterz, rdunlap,
	keescook, dave.hansen, kirill.shutemov, Eranian, Stephane,
	linux-mm, fweimer, nadav.amit, jannh, dethoma, kcc, linux-arch,
	bp, oleg, hjl.tools, pavel, Lutomirski, Andy, thomas.lendacky,
	jamorris, arnd, Moreira, Joao, tglx, mike.kravetz, x86,
	linux-doc, gustavoars, rppt, Shankar, Ravi V, ndesaulniers,
	Hansen, Dave, mingo, corbet, linux-api, linux-kernel, Yang,
	Weijiang, gorcunov

On 10/4/22 6:24 PM, Edgecombe, Rick P wrote:
> On Tue, 2022-10-04 at 14:17 -0700, H. Peter Anvin wrote:
>> On October 4, 2022 1:50:20 PM PDT, Nathan Chancellor <
>> nathan@kernel.org> wrote:
>>> On Tue, Oct 04, 2022 at 08:34:54PM +0000, Edgecombe, Rick P wrote:
>>>> On Tue, 2022-10-04 at 14:43 -0500, John Allen wrote:
>>>>> On 10/4/22 10:47 AM, Nathan Chancellor wrote:
>>>>>> Hi Kees,
>>>>>>
>>>>>> On Mon, Oct 03, 2022 at 09:54:26PM -0700, Kees Cook wrote:
>>>>>>> On Mon, Oct 03, 2022 at 05:09:04PM -0700, Dave Hansen
>>>>>>> wrote:
>>>>>>>> On 10/3/22 16:57, Kees Cook wrote:
>>>>>>>>> On Thu, Sep 29, 2022 at 03:29:30PM -0700, Rick
>>>>>>>>> Edgecombe
>>>>>>>>> wrote:
>>>>>>>>>> Shadow stack is supported on newer AMD processors,
>>>>>>>>>> but the
>>>>>>>>>> kernel
>>>>>>>>>> implementation has not been tested on them. Prevent
>>>>>>>>>> basic
>>>>>>>>>> issues from
>>>>>>>>>> showing up for normal users by disabling shadow stack
>>>>>>>>>> on
>>>>>>>>>> all CPUs except
>>>>>>>>>> Intel until it has been tested. At which point the
>>>>>>>>>> limitation should be
>>>>>>>>>> removed.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Rick Edgecombe <
>>>>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>>>>
>>>>>>>>> So running the selftests on an AMD system is sufficient
>>>>>>>>> to
>>>>>>>>> drop this
>>>>>>>>> patch?
>>>>>>>>
>>>>>>>> Yes, that's enough.
>>>>>>>>
>>>>>>>> I _thought_ the AMD folks provided some tested-by's at
>>>>>>>> some
>>>>>>>> point in the
>>>>>>>> past.  But, maybe I'm confusing this for one of the other
>>>>>>>> shared
>>>>>>>> features.  Either way, I'm sure no tested-by's were
>>>>>>>> dropped on
>>>>>>>> purpose.
>>>>>>>>
>>>>>>>> I'm sure Rick is eager to trim down his series and this
>>>>>>>> would
>>>>>>>> be a great
>>>>>>>> patch to drop.  Does anyone want to make that easy for
>>>>>>>> Rick?
>>>>>>>>
>>>>>>>> <hint> <hint>
>>>>>>>
>>>>>>> Hey Gustavo, Nathan, or Nick! I know y'all have some fancy
>>>>>>> AMD
>>>>>>> testing
>>>>>>> rigs. Got a moment to spin up this series and run the
>>>>>>> selftests?
>>>>>>> :)
>>>>>>
>>>>>> I do have access to a system with an EPYC 7513, which does
>>>>>> have
>>>>>> Shadow
>>>>>> Stack support (I can see 'shstk' in the "Flags" section of
>>>>>> lscpu
>>>>>> with
>>>>>> this series). As far as I understand it, AMD only added
>>>>>> Shadow
>>>>>> Stack
>>>>>> with Zen 3; my regular AMD test system is Zen 2 (probably
>>>>>> should
>>>>>> look at
>>>>>> procuring a Zen 3 or Zen 4 one at some point).
>>>>>>
>>>>>> I applied this series on top of 6.0 and reverted this change
>>>>>> then
>>>>>> booted
>>>>>> it on that system. After building the selftest (which did
>>>>>> require
>>>>>> 'make headers_install' and a small addition to make it build
>>>>>> beyond
>>>>>> that, see below), I ran it and this was the result. I am not
>>>>>> sure
>>>>>> if
>>>>>> that is expected or not but the other results seem promising
>>>>>> for
>>>>>> dropping this patch.
>>>>>>
>>>>>>    $ ./test_shadow_stack_64
>>>>>>    [INFO]  new_ssp = 7f8a36c9fff8, *new_ssp = 7f8a36ca0001
>>>>>>    [INFO]  changing ssp from 7f8a374a0ff0 to 7f8a36c9fff8
>>>>>>    [INFO]  ssp is now 7f8a36ca0000
>>>>>>    [OK]    Shadow stack pivot
>>>>>>    [OK]    Shadow stack faults
>>>>>>    [INFO]  Corrupting shadow stack
>>>>>>    [INFO]  Generated shadow stack violation successfully
>>>>>>    [OK]    Shadow stack violation test
>>>>>>    [INFO]  Gup read -> shstk access success
>>>>>>    [INFO]  Gup write -> shstk access success
>>>>>>    [INFO]  Violation from normal write
>>>>>>    [INFO]  Gup read -> write access success
>>>>>>    [INFO]  Violation from normal write
>>>>>>    [INFO]  Gup write -> write access success
>>>>>>    [INFO]  Cow gup write -> write access success
>>>>>>    [OK]    Shadow gup test
>>>>>>    [INFO]  Violation from shstk access
>>>>>>    [OK]    mprotect() test
>>>>>>    [OK]    Userfaultfd test
>>>>>>    [FAIL]  Alt shadow stack test
>>>>>
>>>>> The selftest is looking OK on my system (Dell PowerEdge R6515
>>>>> w/ EPYC
>>>>> 7713). I also just pulled a fresh 6.0 kernel and applied the
>>>>> series
>>>>> including the fix Nathan mentions below.
>>>>>
>>>>> $ tools/testing/selftests/x86/test_shadow_stack_64
>>>>> [INFO]  new_ssp = 7f30cccc5ff8, *new_ssp = 7f30cccc6001
>>>>> [INFO]  changing ssp from 7f30cd4c6ff0 to 7f30cccc5ff8
>>>>> [INFO]  ssp is now 7f30cccc6000
>>>>> [OK]    Shadow stack pivot
>>>>> [OK]    Shadow stack faults
>>>>> [INFO]  Corrupting shadow stack
>>>>> [INFO]  Generated shadow stack violation successfully
>>>>> [OK]    Shadow stack violation test
>>>>> [INFO]  Gup read -> shstk access success
>>>>> [INFO]  Gup write -> shstk access success
>>>>> [INFO]  Violation from normal write
>>>>> [INFO]  Gup read -> write access success
>>>>> [INFO]  Violation from normal write
>>>>> [INFO]  Gup write -> write access success
>>>>> [INFO]  Cow gup write -> write access success
>>>>> [OK]    Shadow gup test
>>>>> [INFO]  Violation from shstk access
>>>>> [OK]    mprotect() test
>>>>> [OK]    Userfaultfd test
>>>>> [OK]    Alt shadow stack test.
>>>>
>>>> Thanks for the testing. Based on the test, I wonder if this could
>>>> be a
>>>> SW bug. Nathan, could I send you a tweaked test with some more
>>>> debug
>>>> information?
>>>
>>> Yes, more than happy to help you look into this further!
>>>
>>>> John, are we sure AMD and Intel systems behave the same with
>>>> respect to
>>>> CPUs not creating Dirty=1,Write=0 PTEs in rare situations? Or any
>>>> other
>>>> CET related differences we should hash out? Otherwise I'll drop
>>>> the
>>>> patch for the next version. (and assuming the issue Nathan hit
>>>> doesn't
>>>> turn up anything HW related).
>>
>> I have to admit to being a bit confused here... in general, we trust
>> CPUID bits unless they are *known* to be wrong, in which case we
>> blacklist them.
>>
>> If AMD advertises the feature but it doesn't work or they didn't
>> validate it, that would be a (serious!) bug on their part that we can
>> address by blacklisting, but they should also fix with a
>> microcode/BIOS patch.
>>
>> What am I missing?
> 
> I have not read anything about the AMD implementation except hearing
> that it is supported. But there are some microarchitectural-like aspects
> to this CET Linux implementation, around requiring CPUs to not create
> Dirty=1,Write=0 PTEs in some cases, where they did in the past. In
> another thread Jann asked how the IOMMU works with respect to this edge
> case and I'm currently trying to chase down that answer for even Intel
> HW. So I just wanted to double check that we expect that everything
> should be the same. In either case we still have time to iron things
> out before anything gets upstream.

Hi Rick,

Sorry for the delayed reply. After asking around, I think you can safely
assume that AMD will not create Dirty=1,Write=0 PTEs in rare
circumstances and shadow stack should behave the same as Intel in that
regard.

Thanks,
John


end of thread, other threads:[~2022-11-03 17:39 UTC | newest]

Thread overview: 241+ messages
2022-09-29 22:28 [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
2022-09-29 22:28 ` [PATCH v2 01/39] Documentation/x86: Add CET description Rick Edgecombe
2022-09-30  3:41   ` Bagas Sanjaya
2022-09-30 13:33     ` Jonathan Corbet
2022-09-30 13:41       ` Bagas Sanjaya
2022-10-03 16:56         ` Edgecombe, Rick P
2022-10-04  2:16           ` Bagas Sanjaya
2022-10-05  9:10           ` Peter Zijlstra
2022-10-05  9:25             ` Bagas Sanjaya
2022-10-05  9:46               ` Peter Zijlstra
2022-10-03 19:35     ` John Hubbard
2022-10-03 19:39       ` Dave Hansen
2022-10-04  2:13       ` Bagas Sanjaya
2022-10-03 17:18   ` Kees Cook
2022-10-03 19:46     ` Edgecombe, Rick P
2022-10-05  0:02   ` Andrew Cooper
2022-10-10 12:19   ` Florian Weimer
2022-10-10 16:44     ` Edgecombe, Rick P
2022-10-10 16:51       ` H.J. Lu
2022-10-12 12:29       ` Florian Weimer
2022-10-12 15:59         ` Dave Hansen
2022-10-12 16:54           ` Florian Weimer
2022-10-13 21:28         ` Edgecombe, Rick P
2022-10-13 22:15           ` H.J. Lu
2022-10-26 21:59           ` Edgecombe, Rick P
2022-09-29 22:28 ` [PATCH v2 02/39] x86/cet/shstk: Add Kconfig option for Shadow Stack Rick Edgecombe
2022-10-03 13:40   ` Kirill A . Shutemov
2022-10-03 19:53     ` Edgecombe, Rick P
2022-10-03 17:25   ` Kees Cook
2022-10-03 19:52     ` Edgecombe, Rick P
2022-10-03 19:42   ` Dave Hansen
2022-10-03 19:50     ` Edgecombe, Rick P
2022-10-12 20:04   ` Borislav Petkov
2022-10-13  0:31     ` Edgecombe, Rick P
2022-10-13  9:21       ` Borislav Petkov
2022-09-29 22:29 ` [PATCH v2 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks Rick Edgecombe
2022-10-03 17:26   ` Kees Cook
2022-10-14 16:20   ` Borislav Petkov
2022-10-14 19:35     ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack Rick Edgecombe
2022-10-03 17:31   ` Kees Cook
2022-10-05  0:55   ` Andrew Cooper
2022-10-14 17:12   ` Borislav Petkov
2022-10-14 18:15     ` Edgecombe, Rick P
2022-10-14 19:44       ` Borislav Petkov
2022-09-29 22:29 ` [PATCH v2 05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Rick Edgecombe
2022-10-03 17:40   ` Kees Cook
2022-10-15  9:46   ` Borislav Petkov
2022-10-17 18:57     ` Edgecombe, Rick P
2022-10-17 19:33       ` Borislav Petkov
2022-09-29 22:29 ` [PATCH v2 06/39] x86/fpu: Add helper for modifying xstate Rick Edgecombe
2022-10-03 17:48   ` Kees Cook
2022-10-03 20:05     ` Edgecombe, Rick P
2022-10-04  4:05       ` Kees Cook
2022-10-04 14:18       ` Dave Hansen
2022-10-04 16:13         ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 07/39] x86/cet: Add user control-protection fault handler Rick Edgecombe
2022-10-03 14:01   ` Kirill A . Shutemov
2022-10-03 18:12     ` Edgecombe, Rick P
2022-10-03 18:04   ` Kees Cook
2022-10-03 20:33     ` Edgecombe, Rick P
2022-10-03 22:51   ` Andy Lutomirski
2022-10-03 23:09     ` H. Peter Anvin
2022-10-03 23:11     ` Edgecombe, Rick P
2022-10-05  1:20   ` Andrew Cooper
2022-10-05 22:44     ` Edgecombe, Rick P
2022-10-05  9:39   ` Peter Zijlstra
2022-10-05 22:45     ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages Rick Edgecombe
2022-10-03 14:17   ` Kirill A . Shutemov
2022-10-05  1:31   ` Andrew Cooper
2022-10-05 11:16     ` Peter Zijlstra
2022-10-05 12:34       ` Andrew Cooper
2022-09-29 22:29 ` [PATCH v2 09/39] x86/mm: Move pmd_write(), pud_write() up in the file Rick Edgecombe
2022-10-03 18:06   ` Kees Cook
2022-09-29 22:29 ` [PATCH v2 10/39] x86/mm: Introduce _PAGE_COW Rick Edgecombe
2022-09-30 15:16   ` Jann Horn
2022-10-06 16:10     ` Edgecombe, Rick P
2022-10-03 16:26   ` Kirill A . Shutemov
2022-10-03 21:36     ` Edgecombe, Rick P
2022-10-03 21:54       ` Jann Horn
2022-10-03 22:20         ` Edgecombe, Rick P
2022-10-03 22:14       ` Dave Hansen
2022-10-05  2:17   ` Andrew Cooper
2022-10-05 14:08     ` Dave Hansen
2022-10-05 23:06       ` Edgecombe, Rick P
2022-10-05 23:01     ` Edgecombe, Rick P
2022-10-05 11:33   ` Peter Zijlstra
2022-10-14  9:41   ` Peter Zijlstra
2022-10-14 15:52     ` Edgecombe, Rick P
2022-10-14  9:42   ` Peter Zijlstra
2022-10-14 18:06     ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 11/39] x86/mm: Update pte_modify for _PAGE_COW Rick Edgecombe
2022-09-29 22:29 ` [PATCH v2 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW Rick Edgecombe
2022-10-03 17:43   ` Kirill A . Shutemov
2022-10-03 18:11   ` Nadav Amit
2022-10-03 18:51     ` Dave Hansen
2022-10-03 22:28     ` Edgecombe, Rick P
2022-10-03 23:17       ` Nadav Amit
2022-10-03 23:20         ` Nadav Amit
2022-10-03 23:25           ` Nadav Amit
2022-10-03 23:38             ` Edgecombe, Rick P
2022-10-04  0:40               ` Nadav Amit
2022-09-29 22:29 ` [PATCH v2 13/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 Rick Edgecombe
2022-10-03 18:11   ` Kees Cook
2022-10-03 18:24   ` Peter Xu
2022-09-29 22:29 ` [PATCH v2 14/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory Rick Edgecombe
2022-10-03 17:47   ` Kirill A . Shutemov
2022-10-04  0:29     ` Edgecombe, Rick P
2022-10-03 18:17   ` Kees Cook
2022-09-29 22:29 ` [PATCH v2 15/39] x86/mm: Check Shadow Stack page fault errors Rick Edgecombe
2022-10-03 18:20   ` Kees Cook
2022-10-14 10:07   ` Peter Zijlstra
2022-10-14 15:51     ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 16/39] x86/mm: Update maybe_mkwrite() for shadow stack Rick Edgecombe
2022-10-03 18:22   ` Kees Cook
2022-10-03 23:53   ` Kirill A . Shutemov
2022-10-14 15:32   ` Peter Zijlstra
2022-10-14 15:45     ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 17/39] mm: Fixup places that call pte_mkwrite() directly Rick Edgecombe
2022-10-03 18:24   ` Kees Cook
2022-10-03 23:56   ` Kirill A . Shutemov
2022-10-04 16:15     ` Edgecombe, Rick P
2022-10-04  1:56   ` Nadav Amit
2022-10-04 16:21     ` Edgecombe, Rick P
2022-10-14 15:52   ` Peter Zijlstra
2022-10-14 15:56     ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 18/39] mm: Add guard pages around a shadow stack Rick Edgecombe
2022-10-03 18:30   ` Kees Cook
2022-10-05  2:30     ` Andrew Cooper
2022-10-10 12:33       ` Florian Weimer
2022-10-10 13:32         ` Andrew Cooper
2022-10-10 13:40           ` Florian Weimer
2022-10-10 13:56             ` Andrew Cooper
2022-09-29 22:29 ` [PATCH v2 19/39] mm/mmap: Add shadow stack pages to memory accounting Rick Edgecombe
2022-10-03 18:31   ` Kees Cook
2022-10-04  0:03   ` Kirill A . Shutemov
2022-10-04  0:32     ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 20/39] mm/mprotect: Exclude shadow stack from preserve_write Rick Edgecombe
2022-09-29 22:29 ` [PATCH v2 21/39] mm: Re-introduce vm_flags to do_mmap() Rick Edgecombe
2022-09-29 22:29 ` [PATCH v2 22/39] mm: Don't allow write GUPs to shadow stack memory Rick Edgecombe
2022-09-30 19:16   ` Dave Hansen
2022-09-30 20:30     ` Edgecombe, Rick P
2022-09-30 20:37       ` Dave Hansen
2022-09-30 23:00     ` Jann Horn
2022-09-30 23:02       ` Jann Horn
2022-09-30 23:04       ` Edgecombe, Rick P
2022-10-03 18:39   ` Kees Cook
2022-10-03 22:49     ` Andy Lutomirski
2022-10-04  4:21       ` Kees Cook
2022-09-29 22:29 ` [PATCH v2 23/39] x86: Introduce userspace API for CET enabling Rick Edgecombe
2022-10-03 19:01   ` Kees Cook
2022-10-03 22:51     ` Edgecombe, Rick P
2022-10-06 18:50       ` Mike Rapoport
2022-10-10 10:56   ` Florian Weimer
2022-10-10 16:28     ` Edgecombe, Rick P
2022-10-12 12:18       ` Florian Weimer
2022-10-12 17:30         ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 24/39] x86/cet/shstk: Add user-mode shadow stack support Rick Edgecombe
2022-10-03 19:43   ` Kees Cook
2022-10-03 20:04     ` Dave Hansen
2022-10-04  4:04       ` Kees Cook
2022-10-04 16:25         ` Edgecombe, Rick P
2022-10-04 10:17       ` David Laight
2022-10-04 19:32         ` Kees Cook
2022-10-05 13:32           ` David Laight
2022-10-20 21:29     ` Edgecombe, Rick P
2022-10-20 22:54       ` Kees Cook
2022-09-29 22:29 ` [PATCH v2 25/39] x86/cet/shstk: Handle thread shadow stack Rick Edgecombe
2022-10-03 10:36   ` Mike Rapoport
2022-10-03 16:57     ` Edgecombe, Rick P
2022-10-03 20:29   ` Kees Cook
2022-10-04 22:09     ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 26/39] x86/cet/shstk: Introduce routines modifying shstk Rick Edgecombe
2022-10-03 20:44   ` Kees Cook
2022-10-04 22:13     ` Edgecombe, Rick P
2022-10-05  2:43   ` Andrew Cooper
2022-10-05 22:47     ` Edgecombe, Rick P
2022-10-05 22:58       ` Andrew Cooper
2022-10-20 21:51         ` Edgecombe, Rick P
2022-09-29 22:29 ` [PATCH v2 27/39] x86/cet/shstk: Handle signals for shadow stack Rick Edgecombe
2022-10-03 20:52   ` Kees Cook
2022-10-20 22:08     ` Edgecombe, Rick P
2022-10-20 22:57       ` Kees Cook
2022-09-29 22:29 ` [PATCH v2 28/39] x86/cet/shstk: Introduce map_shadow_stack syscall Rick Edgecombe
2022-10-03 22:23   ` Kees Cook
2022-10-04 22:56     ` Edgecombe, Rick P
2022-10-04 23:16       ` H.J. Lu
2022-10-10 11:13   ` Florian Weimer
2022-10-10 14:19     ` Jason A. Donenfeld
2022-09-29 22:29 ` [PATCH v2 29/39] x86/cet/shstk: Support wrss for userspace Rick Edgecombe
2022-10-03 22:28   ` Kees Cook
2022-10-03 23:00     ` Andy Lutomirski
2022-10-04  4:37       ` Kees Cook
2022-10-06  0:38         ` Edgecombe, Rick P
2022-10-06  3:11           ` Kees Cook
2022-10-04  8:30     ` Mike Rapoport
2022-09-29 22:29 ` [PATCH v2 30/39] x86: Expose thread features status in /proc/$PID/arch_status Rick Edgecombe
2022-10-03 22:37   ` Kees Cook
2022-10-03 22:45     ` Andy Lutomirski
2022-10-04  4:18       ` Kees Cook
2022-09-29 22:29 ` [PATCH v2 31/39] x86/cet/shstk: Wire in CET interface Rick Edgecombe
2022-10-03 22:41   ` Kees Cook
2022-09-29 22:29 ` [PATCH v2 32/39] selftests/x86: Add shadow stack test Rick Edgecombe
2022-10-03 23:56   ` Kees Cook
2022-09-29 22:29 ` [PATCH v2 33/39] x86/cpufeatures: Limit shadow stack to Intel CPUs Rick Edgecombe
2022-10-03 23:57   ` Kees Cook
2022-10-04  0:09     ` Dave Hansen
2022-10-04  4:54       ` Kees Cook
2022-10-04 15:47         ` Nathan Chancellor
2022-10-04 19:43           ` John Allen
2022-10-04 20:34             ` Edgecombe, Rick P
2022-10-04 20:50               ` Nathan Chancellor
2022-10-04 21:17                 ` H. Peter Anvin
2022-10-04 23:24                   ` Edgecombe, Rick P
2022-11-03 17:39                     ` John Allen
2022-10-20 21:22                 ` Edgecombe, Rick P
2022-10-04  8:36       ` Mike Rapoport
2022-09-29 22:29 ` [OPTIONAL/CLEANUP v2 34/39] x86: Separate out x86_regset for 32 and 64 bit Rick Edgecombe
2022-09-29 22:29 ` [OPTIONAL/CLEANUP v2 35/39] x86: Improve formatting of user_regset arrays Rick Edgecombe
2022-09-29 22:29 ` [OPTIONAL/RFC v2 36/39] x86/fpu: Add helper for initing features Rick Edgecombe
2022-10-03 19:07   ` Chang S. Bae
2022-10-04 23:05     ` Edgecombe, Rick P
2022-09-29 22:29 ` [OPTIONAL/RFC v2 37/39] x86/cet: Add PTRACE interface for CET Rick Edgecombe
2022-10-03 23:59   ` Kees Cook
2022-10-04  8:44     ` Mike Rapoport
2022-10-04 19:24       ` Kees Cook
2022-09-29 22:29 ` [OPTIONAL/RFC v2 38/39] x86/cet/shstk: Add ARCH_CET_UNLOCK Rick Edgecombe
2022-10-04  0:00   ` Kees Cook
2022-09-29 22:29 ` [OPTIONAL/RFC v2 39/39] x86: Add alt shadow stack support Rick Edgecombe
2022-10-03 23:21   ` Andy Lutomirski
2022-10-04 16:12     ` Edgecombe, Rick P
2022-10-04 17:46       ` Andy Lutomirski
2022-10-04 18:04         ` Edgecombe, Rick P
2022-10-03 17:04 ` [PATCH v2 00/39] Shadowstacks for userspace Kees Cook
2022-10-03 17:25   ` Jann Horn
2022-10-04  5:01     ` Kees Cook
2022-10-04  9:57       ` David Laight
2022-10-04 19:28         ` Kees Cook
2022-10-03 18:33   ` Edgecombe, Rick P
2022-10-04  3:59     ` Kees Cook
