linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack
@ 2020-08-25  0:25 Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 01/25] Documentation/x86: Add CET description Yu-cheng Yu
                   ` (24 more replies)
  0 siblings, 25 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

Control-flow Enforcement (CET) is a new Intel processor feature that blocks
return/jump-oriented programming attacks.  Details are in "Intel 64 and
IA-32 Architectures Software Developer's Manual" [1].

CET can protect applications and the kernel.  This series enables only
application-level protection, and has three parts:

  - shadow stack [2],
  - indirect branch tracking, ptrace [3], and
  - selftests [4].

I have run tests on these patches for quite some time, and they have been
very stable.  Linux distributions with CET are available now, and Intel
processors with CET are becoming available.  It would be nice if CET
support can be accepted into the kernel.  I will be working to address any
issues should they come up.

Changes in v11:

- Rebase to v5.9-rc1.
- There was no more caller passing vm_flags to do_mmap() and the input
  parameter was removed.  Shadow stack allocation is a new user passing
  VM_SHSTK.  Thus, reintroduce the parameter and do_mmap_pgoff().
- Selftests/x86/sigreturn does a sigreturn from 64-bit to a 32-bit context,
  and needs a shadow stack in the 32-bit address range.  For all similar
  purposes, change shadow stack allocation arch_prctl() to take MAP_32BIT
  and MAP_POPULATE flags from the user.
- Update arch_prctl() for checking invalid inputs and using proper return
  codes.
- Other smaller changes are noted in each patch's log.

[1] Intel 64 and IA-32 Architectures Software Developer's Manual:

    https://software.intel.com/en-us/download/intel-64-and-ia-32-
    architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4

[2] CET Shadow Stack patches v10:

    https://lkml.kernel.org/r/20200429220732.31602-1-yu-cheng.yu@intel.com/

[3] There is no Indirect Branch Tracking patches v10.  There have been no
    major changes since v9:

    https://lkml.kernel.org/r/20200205182308.4028-1-yu-cheng.yu@intel.com/

[4] I am holding off the selftests changes and working to get Acked-by's.
    The earlier version of the selftests patches:

    https://lkml.kernel.org/r/20200521211720.20236-1-yu-cheng.yu@intel.com/

Yu-cheng Yu (25):
  Documentation/x86: Add CET description
  x86/cpufeatures: Add CET CPU feature flags for Control-flow
    Enforcement Technology (CET)
  x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states
  x86/cet: Add control-protection fault handler
  x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack
  x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW
  x86/mm: Remove _PAGE_DIRTY_HW from kernel RO pages
  x86/mm: Introduce _PAGE_COW
  drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  x86/mm: Update pte_modify for _PAGE_COW
  x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
    transition from _PAGE_DIRTY_HW to _PAGE_COW
  mm: Introduce VM_SHSTK for shadow stack memory
  x86/mm: Shadow Stack page fault error checking
  x86/mm: Update maybe_mkwrite() for shadow stack
  mm: Fixup places that call pte_mkwrite() directly
  mm: Add guard pages around a shadow stack.
  mm/mmap: Add shadow stack pages to memory accounting
  mm: Update can_follow_write_pte() for shadow stack
  mm: Re-introduce do_mmap_pgoff()
  x86/cet/shstk: User-mode shadow stack support
  x86/cet/shstk: Handle signals for shadow stack
  binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND properties
  ELF: Introduce arch_setup_elf_property()
  x86/cet/shstk: Handle thread shadow stack
  x86/cet/shstk: Add arch_prctl functions for shadow stack

 .../admin-guide/kernel-parameters.txt         |   6 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/intel_cet.rst               | 143 +++++++
 arch/arm64/include/asm/elf.h                  |   5 +
 arch/x86/Kconfig                              |  36 ++
 arch/x86/ia32/ia32_signal.c                   |  17 +
 arch/x86/include/asm/cet.h                    |  40 ++
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/elf.h                    |  13 +
 arch/x86/include/asm/fpu/internal.h           |  10 +
 arch/x86/include/asm/fpu/types.h              |  23 +-
 arch/x86/include/asm/fpu/xstate.h             |   5 +-
 arch/x86/include/asm/idtentry.h               |   4 +
 arch/x86/include/asm/mmu_context.h            |   3 +
 arch/x86/include/asm/msr-index.h              |  17 +
 arch/x86/include/asm/pgtable.h                | 209 +++++++++-
 arch/x86/include/asm/pgtable_types.h          |  58 ++-
 arch/x86/include/asm/processor.h              |  15 +
 arch/x86/include/asm/special_insns.h          |  32 ++
 arch/x86/include/asm/traps.h                  |   2 +
 arch/x86/include/uapi/asm/prctl.h             |   5 +
 arch/x86/include/uapi/asm/processor-flags.h   |   2 +
 arch/x86/include/uapi/asm/sigcontext.h        |   9 +
 arch/x86/kernel/Makefile                      |   2 +
 arch/x86/kernel/cet.c                         | 357 ++++++++++++++++++
 arch/x86/kernel/cet_prctl.c                   |  98 +++++
 arch/x86/kernel/cpu/common.c                  |  28 ++
 arch/x86/kernel/cpu/cpuid-deps.c              |   2 +
 arch/x86/kernel/fpu/signal.c                  | 100 +++++
 arch/x86/kernel/fpu/xstate.c                  |  28 +-
 arch/x86/kernel/idt.c                         |   4 +
 arch/x86/kernel/process.c                     |  14 +-
 arch/x86/kernel/process_64.c                  |  32 ++
 arch/x86/kernel/relocate_kernel_64.S          |   2 +-
 arch/x86/kernel/signal.c                      |  10 +
 arch/x86/kernel/signal_compat.c               |   2 +-
 arch/x86/kernel/traps.c                       |  59 +++
 arch/x86/kvm/vmx/vmx.c                        |   2 +-
 arch/x86/mm/fault.c                           |  19 +
 arch/x86/mm/mmap.c                            |   2 +
 arch/x86/mm/pat/set_memory.c                  |   2 +-
 arch/x86/mm/pgtable.c                         |  25 ++
 drivers/gpu/drm/i915/gvt/gtt.c                |   2 +-
 fs/aio.c                                      |   6 +-
 fs/binfmt_elf.c                               |   4 +
 fs/hugetlbfs/inode.c                          |   2 +-
 fs/proc/task_mmu.c                            |   3 +
 include/linux/elf.h                           |   6 +
 include/linux/fs.h                            |   2 +-
 include/linux/mm.h                            |  46 ++-
 include/linux/pgtable.h                       |  35 ++
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/linux/elf.h                      |   9 +
 ipc/shm.c                                     |   2 +-
 mm/gup.c                                      |   8 +-
 mm/huge_memory.c                              |  10 +-
 mm/memory.c                                   |   5 +-
 mm/migrate.c                                  |   3 +-
 mm/mmap.c                                     |  21 +-
 mm/mprotect.c                                 |   2 +-
 mm/nommu.c                                    |   6 +-
 mm/shmem.c                                    |   2 +-
 mm/util.c                                     |   4 +-
 scripts/as-x86_64-has-shadow-stack.sh         |   4 +
 .../arch/x86/include/asm/disabled-features.h  |   8 +-
 tools/arch/x86/include/uapi/asm/prctl.h       |   5 +
 67 files changed, 1574 insertions(+), 77 deletions(-)
 create mode 100644 Documentation/x86/intel_cet.rst
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/kernel/cet.c
 create mode 100644 arch/x86/kernel/cet_prctl.c
 create mode 100755 scripts/as-x86_64-has-shadow-stack.sh

-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 01/25] Documentation/x86: Add CET description
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 02/25] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) Yu-cheng Yu
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

Explain no_user_shstk/no_user_ibt kernel parameters, and introduce a new
document on Control-flow Enforcement Technology (CET).

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
v11:
- Add back GLIBC tunables information.
- Add ARCH_X86_CET_MMAP_SHSTK information.

v10:
- Change no_cet_shstk and no_cet_ibt to no_user_shstk and no_user_ibt.
- Remove the opcode section, as it is already in the Intel SDM.
- Remove sections related to GLIBC implementation.
- Remove shadow stack memory management section, as it is already in the
  code comments.
- Remove legacy bitmap related information, as it is not supported now.
- Fix arch_ioctl() related text.
- Change SHSTK, IBT to plain English.

 .../admin-guide/kernel-parameters.txt         |   6 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/intel_cet.rst               | 143 ++++++++++++++++++
 3 files changed, 150 insertions(+)
 create mode 100644 Documentation/x86/intel_cet.rst

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdc1f33fd3d1..c85373c120a3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3167,6 +3167,12 @@
 			noexec=on: enable non-executable mappings (default)
 			noexec=off: disable non-executable mappings
 
+	no_user_shstk	[X86-64] Disable Shadow Stack for user-mode
+			applications
+
+	no_user_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
+			applications
+
 	nosmap		[X86,PPC]
 			Disable SMAP (Supervisor Mode Access Prevention)
 			even if it is supported by processor.
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index 265d9e9a093b..2aef972a868d 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -19,6 +19,7 @@ x86-specific Documentation
    tlb
    mtrr
    pat
+   intel_cet
    intel-iommu
    intel_txt
    amd-memory-encryption
diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
new file mode 100644
index 000000000000..2deda249bc2c
--- /dev/null
+++ b/Documentation/x86/intel_cet.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Control-flow Enforcement Technology (CET)
+=========================================
+
+[1] Overview
+============
+
+Control-flow Enforcement Technology (CET) is an Intel processor feature
+that provides protection against return/jump-oriented programming (ROP)
+attacks.  It can be set up to protect both applications and the kernel.
+Only user-mode protection is implemented in the 64-bit kernel, including
+support for running legacy 32-bit applications.
+
+CET introduces Shadow Stack and Indirect Branch Tracking.  Shadow stack is
+a secondary stack allocated from memory and cannot be directly modified by
+applications.  When executing a CALL, the processor pushes the return
+address to both the normal stack and the shadow stack.  Upon function
+return, the processor pops the shadow stack copy and compares it to the
+normal stack copy.  If the two differ, the processor raises a control-
+protection fault.  Indirect branch tracking verifies indirect CALL/JMP
+targets are intended as marked by the compiler with 'ENDBR' opcodes.
+
+There are two kernel configuration options:
+
+    X86_INTEL_SHADOW_STACK_USER, and
+    X86_INTEL_BRANCH_TRACKING_USER.
+
+These need to be enabled to build a CET-enabled kernel, and Binutils v2.31
+and GCC v8.1 or later are required to build a CET kernel.  To build a CET-
+enabled application, GLIBC v2.28 or later is also required.
+
+There are two command-line options for disabling CET features::
+
+    no_user_shstk - disables user shadow stack, and
+    no_user_ibt   - disables user indirect branch tracking.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET.
+
+[2] Application Enabling
+========================
+
+An application's CET capability is marked in its ELF header and can be
+verified from the following command output, in the NT_GNU_PROPERTY_TYPE_0
+field:
+
+    readelf -n <application>
+
+If an application supports CET and is statically linked, it will run with
+CET protection.  If the application needs any shared libraries, the loader
+checks all dependencies and enables CET when all requirements are met.
+
+[3] Backward Compatibility
+==========================
+
+GLIBC provides a few tunables for backward compatibility.
+
+GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
+    Turn off SHSTK/IBT for the current shell.
+
+GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
+    This controls how dlopen() handles SHSTK legacy libraries::
+
+        on         - continue with SHSTK enabled;
+        permissive - continue with SHSTK off.
+
+[4] CET arch_prctl()'s
+======================
+
+Several arch_prctl()'s have been added for CET:
+
+arch_prctl(ARCH_X86_CET_STATUS, u64 *addr)
+    Return CET feature status.
+
+    The parameter 'addr' is a pointer to a user buffer.
+    On returning to the caller, the kernel fills the following
+    information::
+
+        *addr       = shadow stack/indirect branch tracking status
+        *(addr + 1) = shadow stack base address
+        *(addr + 2) = shadow stack size
+
+arch_prctl(ARCH_X86_CET_DISABLE, u64 features)
+    Disable shadow stack and/or indirect branch tracking as specified in
+    'features'.  Return -EPERM if CET is locked.
+
+arch_prctl(ARCH_X86_CET_LOCK)
+    Lock in all CET features.  They cannot be turned off afterwards.
+
+arch_prctl(ARCH_X86_CET_MMAP_SHSTK, u64 *args)
+    Allocate a new shadow stack and put a restore token at top.
+
+    The parameter 'args' is a pointer to a user buffer::
+
+        *args = desired size
+        *(args + 1) = MAP_32BIT or MAP_POPULATE
+
+    On returning, *args is the allocated shadow stack address.
+
+Note:
+  There is no CET-enabling arch_prctl function.  By design, CET is enabled
+  automatically if the binary and the system can support it.
+
+[5] The implementation of the Shadow Stack
+==========================================
+
+Shadow Stack size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB).  In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB.  However,
+a compat-mode application's address space is smaller, each of its thread's
+shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).
+
+Signal
+------
+
+The main program and its signal handlers use the same shadow stack.
+Because the shadow stack stores only return addresses, a large shadow
+stack covers the condition that both the program stack and the signal
+alternate stack run out.
+
+The kernel creates a restore token for the shadow stack restoring address
+and verifies that token when restoring from the signal handler.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHSTK flag set; its PTEs are required to be
+read-only and dirty.  When a shadow stack PTE is not RO and dirty, a
+shadow access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread.
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 02/25] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 01/25] Documentation/x86: Add CET description Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 03/25] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states Yu-cheng Yu
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu, Borislav Petkov

Add CPU feature flags for Control-flow Enforcement Technology (CET).

CPUID.(EAX=7,ECX=0):ECX[bit 7] Shadow stack
CPUID.(EAX=7,ECX=0):EDX[bit 20] Indirect Branch Tracking

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/cpufeatures.h | 2 ++
 arch/x86/kernel/cpu/cpuid-deps.c   | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2901d5df4366..c794e18e8a14 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -341,6 +341,7 @@
 #define X86_FEATURE_OSPKE		(16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_WAITPKG		(16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
 #define X86_FEATURE_AVX512_VBMI2	(16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK		(16*32+ 7) /* Shadow Stack */
 #define X86_FEATURE_GFNI		(16*32+ 8) /* Galois Field New Instructions */
 #define X86_FEATURE_VAES		(16*32+ 9) /* Vector AES */
 #define X86_FEATURE_VPCLMULQDQ		(16*32+10) /* Carry-Less Multiplication Double Quadword */
@@ -370,6 +371,7 @@
 #define X86_FEATURE_SERIALIZE		(18*32+14) /* SERIALIZE instruction */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_ARCH_LBR		(18*32+19) /* Intel ARCH LBR */
+#define X86_FEATURE_IBT			(18*32+20) /* Indirect Branch Tracking */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 3cbe24ca80ab..fec83cc74b9e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -69,6 +69,8 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_CQM_MBM_TOTAL,		X86_FEATURE_CQM_LLC   },
 	{ X86_FEATURE_CQM_MBM_LOCAL,		X86_FEATURE_CQM_LLC   },
 	{ X86_FEATURE_AVX512_BF16,		X86_FEATURE_AVX512VL  },
+	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
+	{ X86_FEATURE_IBT,			X86_FEATURE_XSAVES    },
 	{}
 };
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 03/25] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 01/25] Documentation/x86: Add CET description Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 02/25] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 04/25] x86/cet: Add control-protection fault handler Yu-cheng Yu
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

Control-flow Enforcement Technology (CET) adds five MSRs.  Introduce them
and their XSAVES supervisor states:

    MSR_IA32_U_CET (user-mode CET settings),
    MSR_IA32_PL3_SSP (user-mode Shadow Stack pointer),
    MSR_IA32_PL0_SSP (kernel-mode Shadow Stack pointer),
    MSR_IA32_PL1_SSP (Privilege Level 1 Shadow Stack pointer),
    MSR_IA32_PL2_SSP (Privilege Level 2 Shadow Stack pointer).

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
v11:
- Drop MSR_IA32 prefix for individual bits, and use BIT_ULL().
- Drop MSR_IA32_CET_BITMAP_MASK.

v6:
- Remove __packed from struct cet_user_state, struct cet_kernel_state.

 arch/x86/include/asm/fpu/types.h            | 23 +++++++++++++++--
 arch/x86/include/asm/fpu/xstate.h           |  5 ++--
 arch/x86/include/asm/msr-index.h            | 17 +++++++++++++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/fpu/xstate.c                | 28 ++++++++++++++++++---
 5 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c87364ea6446..2a7037a6f960 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -115,8 +115,8 @@ enum xfeature {
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 	XFEATURE_PKRU,
 	XFEATURE_RSRVD_COMP_10,
-	XFEATURE_RSRVD_COMP_11,
-	XFEATURE_RSRVD_COMP_12,
+	XFEATURE_CET_USER,
+	XFEATURE_CET_KERNEL,
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
@@ -134,6 +134,8 @@ enum xfeature {
 #define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
+#define XFEATURE_MASK_CET_USER		(1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
@@ -236,6 +238,23 @@ struct pkru_state {
 	u32				pad;
 } __packed;
 
+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+	u64 user_cet;			/* user control-flow settings */
+	u64 user_ssp;			/* user shadow stack pointer */
+};
+
+/*
+ * State component 12 is Control-flow Enforcement kernel states
+ */
+struct cet_kernel_state {
+	u64 kernel_ssp;			/* kernel shadow stack */
+	u64 pl1_ssp;			/* privilege level 1 shadow stack */
+	u64 pl2_ssp;			/* privilege level 2 shadow stack */
+};
+
 /*
  * State component 15: Architectural LBR configuration state.
  * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 14ab815132d4..e4408db88bca 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -35,7 +35,7 @@
 				      XFEATURE_MASK_BNDCSR)
 
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (0)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_CET_USER)
 
 /*
  * A supervisor state component may not always contain valuable information,
@@ -62,7 +62,8 @@
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+					      XFEATURE_MASK_CET_KERNEL)
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 2859ee4f39a8..25bd727f6fa1 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -912,4 +912,21 @@
 #define MSR_VM_IGNNE                    0xc0010115
 #define MSR_VM_HSAVE_PA                 0xc0010117
 
+/* Control-flow Enforcement Technology MSRs */
+#define MSR_IA32_U_CET		0x6a0 /* user mode cet setting */
+#define MSR_IA32_S_CET		0x6a2 /* kernel mode cet setting */
+#define MSR_IA32_PL0_SSP	0x6a4 /* kernel shstk pointer */
+#define MSR_IA32_PL1_SSP	0x6a5 /* ring-1 shstk pointer */
+#define MSR_IA32_PL2_SSP	0x6a6 /* ring-2 shstk pointer */
+#define MSR_IA32_PL3_SSP	0x6a7 /* user shstk pointer */
+#define MSR_IA32_INT_SSP_TAB	0x6a8 /* exception shstk table */
+
+/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
+#define CET_SHSTK_EN		BIT_ULL(0)
+#define CET_WRSS_EN		BIT_ULL(1)
+#define CET_ENDBR_EN		BIT_ULL(2)
+#define CET_LEG_IW_EN		BIT_ULL(3)
+#define CET_NO_TRACK_EN		BIT_ULL(4)
+#define CET_WAIT_ENDBR		BIT_ULL(11)
+
 #endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..a8df907e8017 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_CET_BIT		23 /* enable Control-flow Enforcement */
+#define X86_CR4_CET		_BITUL(X86_CR4_CET_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 038e19c0019e..705fd9b94e31 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -38,6 +38,9 @@ static const char *xfeature_names[] =
 	"Processor Trace (unused)"	,
 	"Protection Keys User registers",
 	"unknown xstate feature"	,
+	"Control-flow User registers"	,
+	"Control-flow Kernel registers"	,
+	"unknown xstate feature"	,
 };
 
 static short xsave_cpuid_features[] __initdata = {
@@ -51,6 +54,9 @@ static short xsave_cpuid_features[] __initdata = {
 	X86_FEATURE_AVX512F,
 	X86_FEATURE_INTEL_PT,
 	X86_FEATURE_PKU,
+	-1,		   /* Unused */
+	X86_FEATURE_SHSTK, /* XFEATURE_CET_USER */
+	X86_FEATURE_SHSTK, /* XFEATURE_CET_KERNEL */
 };
 
 /*
@@ -318,6 +324,8 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
+	print_xstate_feature(XFEATURE_MASK_CET_USER);
+	print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
 }
 
 /*
@@ -592,6 +600,8 @@ static void check_xstate_against_struct(int nr)
 	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
+	XCHECK_SZ(sz, nr, XFEATURE_CET_USER,   struct cet_user_state);
+	XCHECK_SZ(sz, nr, XFEATURE_CET_KERNEL, struct cet_kernel_state);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
@@ -601,7 +611,8 @@ static void check_xstate_against_struct(int nr)
 	if ((nr < XFEATURE_YMM) ||
 	    (nr >= XFEATURE_MAX) ||
 	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
-	    ((nr >= XFEATURE_RSRVD_COMP_10) && (nr <= XFEATURE_LBR))) {
+	    (nr == XFEATURE_RSRVD_COMP_10) ||
+	    ((nr >= XFEATURE_RSRVD_COMP_13) && (nr <= XFEATURE_LBR))) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 	}
@@ -831,8 +842,19 @@ void __init fpu__init_system_xstate(void)
 	 * Clear XSAVE features that are disabled in the normal CPUID.
 	 */
 	for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
-		if (!boot_cpu_has(xsave_cpuid_features[i]))
-			xfeatures_mask_all &= ~BIT_ULL(i);
+		if (xsave_cpuid_features[i] == X86_FEATURE_SHSTK) {
+			/*
+			 * X86_FEATURE_SHSTK and X86_FEATURE_IBT share
+			 * same states, but can be enabled separately.
+			 */
+			if (!boot_cpu_has(X86_FEATURE_SHSTK) &&
+			    !boot_cpu_has(X86_FEATURE_IBT))
+				xfeatures_mask_all &= ~BIT_ULL(i);
+		} else {
+			if ((xsave_cpuid_features[i] == -1) ||
+			    !boot_cpu_has(xsave_cpuid_features[i]))
+				xfeatures_mask_all &= ~BIT_ULL(i);
+		}
 	}
 
 	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 04/25] x86/cet: Add control-protection fault handler
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (2 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 03/25] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 05/25] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack Yu-cheng Yu
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the Shadow Stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works in a similar way as the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
v10:
- Change CONFIG_X86_64 to CONFIG_X86_INTEL_CET.

v9:
- Add Shadow Stack pointer to the fault printout.

 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 59 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index a43366191212..47ffdb658867 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -532,6 +532,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_INTEL_CET
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 7ecf9babf0cb..395c34fd201a 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -112,6 +112,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_INTEL_CET
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf0576cd0..c572a3de1037 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 7);
+	BUILD_BUG_ON(NSIGSEGV != 8);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1f66d2d1e998..dca6fdc829f2 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -597,6 +597,65 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_INTEL_CET
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+};
+
+/*
+ * When a control protection exception occurs, send a signal
+ * to the responsible application.  Currently, control
+ * protection is only enabled for the user mode.  This
+ * exception should not come from the kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (notify_die(DIE_TRAP, "control protection fault", regs,
+		       error_code, X86_TRAP_CP, SIGSEGV) == NOTIFY_STOP)
+		return;
+	cond_local_irq_enable(regs);
+
+	if (!user_mode(regs))
+		die("kernel control protection fault", regs, error_code);
+
+	if (!static_cpu_has(X86_FEATURE_SHSTK) &&
+	    !static_cpu_has(X86_FEATURE_IBT))
+		WARN_ONCE(1, "CET is disabled but got control protection fault\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    printk_ratelimit()) {
+		unsigned int max_err;
+		unsigned long ssp;
+
+		max_err = ARRAY_SIZE(control_protection_err) - 1;
+		if ((error_code < 0) || (error_code > max_err))
+			error_code = 0;
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_info("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			tsk->comm, task_pid_nr(tsk),
+			regs->ip, regs->sp, ssp, error_code,
+			control_protection_err[error_code]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR,
+			(void __user *)uprobe_get_trap_addr(regs));
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c267181..91e10cbe3bb0 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -229,7 +229,8 @@ typedef struct siginfo {
 #define SEGV_ACCADI	5	/* ADI not enabled for mapped object */
 #define SEGV_ADIDERR	6	/* Disrupting MCD error */
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
-#define NSIGSEGV	7
+#define SEGV_CPERR	8	/* Control protection fault */
+#define NSIGSEGV	8
 
 /*
  * SIGBUS si_codes
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 05/25] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (3 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 04/25] x86/cet: Add control-protection fault handler Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 06/25] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW Yu-cheng Yu
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

Shadow Stack provides protection against function return address
corruption.  It is active when the processor supports it, the kernel has
CONFIG_X86_INTEL_SHADOW_STACK_USER, and the application is built for the
feature.  This is only implemented for the 64-bit kernel.  When it is
enabled, legacy non-shadow stack applications continue to work, but without
protection.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v10:
- Change SHSTK to shadow stack in the help text.
- Change build-time check to config-time check.
- Change ARCH_HAS_SHSTK to ARCH_HAS_SHADOW_STACK.

 arch/x86/Kconfig                      | 30 +++++++++++++++++++++++++++
 scripts/as-x86_64-has-shadow-stack.sh |  4 ++++
 2 files changed, 34 insertions(+)
 create mode 100755 scripts/as-x86_64-has-shadow-stack.sh

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..4844649ee884 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1927,6 +1927,36 @@ config X86_INTEL_TSX_MODE_AUTO
 	  side channel attacks- equals the tsx=auto command line parameter.
 endchoice
 
+config AS_HAS_SHADOW_STACK
+	def_bool $(success,$(srctree)/scripts/as-x86_64-has-shadow-stack.sh $(CC))
+	help
+	  Test the assembler for shadow stack instructions.
+
+config X86_INTEL_CET
+	def_bool n
+
+config ARCH_HAS_SHADOW_STACK
+	def_bool n
+
+config X86_INTEL_SHADOW_STACK_USER
+	prompt "Intel Shadow Stacks for user-mode"
+	def_bool n
+	depends on CPU_SUP_INTEL && X86_64
+	depends on AS_HAS_SHADOW_STACK
+	select ARCH_USES_HIGH_VMA_FLAGS
+	select X86_INTEL_CET
+	select ARCH_HAS_SHADOW_STACK
+	help
+	  Shadow Stacks provides protection against program stack
+	  corruption.  It's a hardware feature.  This only matters
+	  if you have the right hardware.  It's a security hardening
+	  feature and apps must be enabled to use it.  You get no
+	  protection "for free" on old userspace.  The hardware can
+	  support user and kernel, but this option is for user space
+	  only.
+
+	  If unsure, say y.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/scripts/as-x86_64-has-shadow-stack.sh b/scripts/as-x86_64-has-shadow-stack.sh
new file mode 100755
index 000000000000..fac1d363a1b8
--- /dev/null
+++ b/scripts/as-x86_64-has-shadow-stack.sh
@@ -0,0 +1,4 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+echo "wrussq %rax, (%rbx)" | $* -x assembler -c -
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 06/25] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (4 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 05/25] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 07/25] x86/mm: Remove _PAGE_DIRTY_HW from kernel RO pages Yu-cheng Yu
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu, Dave Hansen

Before introducing _PAGE_COW for non-hardware memory management purposes in
the next patch, rename _PAGE_DIRTY to _PAGE_DIRTY_HW and _PAGE_BIT_DIRTY to
_PAGE_BIT_DIRTY_HW to make meanings more clear.  There are no functional
changes from this patch.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
v9:
- At some places _PAGE_DIRTY were not changed to _PAGE_DIRTY_HW, because
  they will be changed again in the next patch to _PAGE_DIRTY_BITS.
  However, this causes compile issues if the next patch is not yet applied.
  Fix it by changing all _PAGE_DIRTY to _PAGE_DRITY_HW.

 arch/x86/include/asm/pgtable.h       | 18 +++++++++---------
 arch/x86/include/asm/pgtable_types.h | 11 +++++------
 arch/x86/kernel/relocate_kernel_64.S |  2 +-
 arch/x86/kvm/vmx/vmx.c               |  2 +-
 4 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b836138ce852..86b7acd221c1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -124,7 +124,7 @@ extern pmdval_t early_pmd_flags;
  */
 static inline int pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	return pte_flags(pte) & _PAGE_DIRTY_HW;
 }
 
 
@@ -163,7 +163,7 @@ static inline int pte_young(pte_t pte)
 
 static inline int pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	return pmd_flags(pmd) & _PAGE_DIRTY_HW;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -173,7 +173,7 @@ static inline int pmd_young(pmd_t pmd)
 
 static inline int pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	return pud_flags(pud) & _PAGE_DIRTY_HW;
 }
 
 static inline int pud_young(pud_t pud)
@@ -334,7 +334,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_HW);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -354,7 +354,7 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -435,7 +435,7 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
@@ -445,7 +445,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -489,7 +489,7 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_HW);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
@@ -499,7 +499,7 @@ static inline pud_t pud_wrprotect(pud_t pud)
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 816b31c68550..192e1326b3db 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -15,7 +15,7 @@
 #define _PAGE_BIT_PWT		3	/* page write through */
 #define _PAGE_BIT_PCD		4	/* page cache disabled */
 #define _PAGE_BIT_ACCESSED	5	/* was accessed (raised by CPU) */
-#define _PAGE_BIT_DIRTY		6	/* was written to (raised by CPU) */
+#define _PAGE_BIT_DIRTY_HW	6	/* was written to (raised by CPU) */
 #define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
 #define _PAGE_BIT_PAT		7	/* on 4KB pages */
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
@@ -46,7 +46,7 @@
 #define _PAGE_PWT	(_AT(pteval_t, 1) << _PAGE_BIT_PWT)
 #define _PAGE_PCD	(_AT(pteval_t, 1) << _PAGE_BIT_PCD)
 #define _PAGE_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)
-#define _PAGE_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
+#define _PAGE_DIRTY_HW	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY_HW)
 #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
 #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
 #define _PAGE_SOFTW1	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1)
@@ -74,7 +74,7 @@
 			 _PAGE_PKEY_BIT3)
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
-#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
+#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY_HW | _PAGE_ACCESSED)
 #else
 #define _PAGE_KNL_ERRATUM_MASK 0
 #endif
@@ -126,7 +126,7 @@
  * pte_modify() does modify it.
  */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
-			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
+			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_HW |	\
 			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC |  \
 			 _PAGE_UFFD_WP)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
@@ -163,7 +163,7 @@ enum page_cache_mode {
 #define __RW _PAGE_RW
 #define _USR _PAGE_USER
 #define ___A _PAGE_ACCESSED
-#define ___D _PAGE_DIRTY
+#define ___D _PAGE_DIRTY_HW
 #define ___G _PAGE_GLOBAL
 #define __NX _PAGE_NX
 
@@ -205,7 +205,6 @@ enum page_cache_mode {
 #define __PAGE_KERNEL_IO		__PAGE_KERNEL
 #define __PAGE_KERNEL_IO_NOCACHE	__PAGE_KERNEL_NOCACHE
 
-
 #ifndef __ASSEMBLY__
 
 #define __PAGE_KERNEL_ENC	(__PAGE_KERNEL    | _ENC)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index a4d9a261425b..e3bb4ff95523 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -17,7 +17,7 @@
  */
 
 #define PTR(x) (x << 3)
-#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY_HW)
 
 /*
  * control_page + KEXEC_CONTROL_CODE_MAX_SIZE
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 46ba2e03a892..f5ec6dadca4f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3605,7 +3605,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
 	/* Set up identity-mapping pagetable for EPT in real mode */
 	for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
 		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
-			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
+			_PAGE_ACCESSED | _PAGE_DIRTY_HW | _PAGE_PSE);
 		r = kvm_write_guest_page(kvm, identity_map_pfn,
 				&tmp, i * sizeof(tmp), sizeof(tmp));
 		if (r < 0)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 07/25] x86/mm: Remove _PAGE_DIRTY_HW from kernel RO pages
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (5 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 06/25] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 08/25] x86/mm: Introduce _PAGE_COW Yu-cheng Yu
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu, Christoph Hellwig

Kernel read-only PTEs are setup as _PAGE_DIRTY_HW.  Since these become
shadow stack PTEs, remove the dirty bit.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable_types.h | 6 +++---
 arch/x86/mm/pat/set_memory.c         | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 192e1326b3db..5f31f1c407b9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -193,10 +193,10 @@ enum page_cache_mode {
 #define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
-#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
+#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
+#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
-#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_LARGE	 (__PP|__RW|   0|___A|__NX|___D|_PSE|___G)
 #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
 #define __PAGE_KERNEL_WP	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index d1b2a889f035..962434fdf0d9 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1932,7 +1932,7 @@ int set_memory_nx(unsigned long addr, int numpages)
 
 int set_memory_ro(unsigned long addr, int numpages)
 {
-	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY_HW), 0);
 }
 
 int set_memory_rw(unsigned long addr, int numpages)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 08/25] x86/mm: Introduce _PAGE_COW
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (6 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 07/25] x86/mm: Remove _PAGE_DIRTY_HW from kernel RO pages Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 09/25] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

There is essentially no room left in the x86 hardware PTEs on some OSes
(not Linux).  That left the hardware architects looking for a way to
represent a new memory type (shadow stack) within the existing bits.
They chose to repurpose a lightly-used state: Write=0,Dirty=1.

The reason it's lightly used is that Dirty=1 is normally set by hardware
and cannot normally be set by hardware on a Write=0 PTE.  Software must
normally be involved to create one of these PTEs, so software can simply
opt to not create them.

But that leaves us with a Linux problem: we need to ensure we never create
Write=0,Dirty=1 PTEs.  In places where we do create them, we need to find
an alternative way to represent them _without_ using the same hardware bit
combination.  Thus, enter _PAGE_COW.  This results in the following:

(a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW)
(b) A R/O page that has been COW'ed: (R/O + _PAGE_COW)
    The user page is in a R/O VMA, and get_user_pages() needs a writable
    copy.  The page fault handler creates a copy of the page and sets
    the new copy's PTE as R/O and _PAGE_COW.
(c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW)
(d) A shared shadow stack PTE: (R/O + _PAGE_COW)
    When a shadow stack page is being shared among processes (this happens
    at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the next shadow
    stack access causes a fault, and the page is duplicated and
    _PAGE_DIRTY_HW is set again.  This is the COW equivalent for shadow
    stack pages, even though it's copy-on-access rather than copy-on-write.
(e) A page where the processor observed a Write=1 PTE, started a write, set
    Dirty=1, but then observed a Write=0 PTE.  That's possible today, but
    will not happen on processors that support shadow stack.

Use _PAGE_COW in pte_wrprotect() and _PAGE_DIRTY_HW in pte_mkwrite().
Apply the same changes to pmd and pud.

When this patch is applied, there are six free bits left in the 64-bit PTE.
There are no more free bits in the 32-bit PTE (except for PAE) and shadow
stack is not implemented for the 32-bit kernel.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
v10:
- Change _PAGE_BIT_DIRTY_SW to _PAGE_BIT_COW, as it is used for copy-on-
  write PTEs.
- Update pte_write() and treat shadow stack as writable.
- Change *_mkdirty_shstk() to *_mkwrite_shstk() as these make shadow stack
  pages writable.
- Use bit test & shift to move _PAGE_BIT_DIRTY_HW to _PAGE_BIT_COW.
- Change static_cpu_has() to cpu_feature_enabled().
- Revise commit log.

v9:
- Remove pte_move_flags() etc. and put the logic directly in
  pte_wrprotect()/pte_mkwrite() etc.
- Change compile-time conditionals to run-time checks.
- Split out pte_modify()/pmd_modify() to a new patch.
- Update comments.

 arch/x86/include/asm/pgtable.h       | 120 ++++++++++++++++++++++++---
 arch/x86/include/asm/pgtable_types.h |  41 ++++++++-
 2 files changed, 150 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 86b7acd221c1..ac4ed814be96 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -122,9 +122,9 @@ extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY_HW;
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
 }
 
 
@@ -161,9 +161,9 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY_HW;
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -171,9 +171,9 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY_HW;
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -183,6 +183,12 @@ static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
+	/*
+	 * If _PAGE_DIRTY_HW is set, the PTE must either have
+	 * _PAGE_RW or be a shadow stack PTE, which is logically writable.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY_HW);
 	return pte_flags(pte) & _PAGE_RW;
 }
 
@@ -334,7 +340,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY_HW);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -344,6 +350,17 @@ static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (RW=0,Dirty=1).  Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pte.pte |= (pte.pte & _PAGE_DIRTY_HW) >>
+			   _PAGE_BIT_DIRTY_HW << _PAGE_BIT_COW;
+		pte = pte_clear_flags(pte, _PAGE_DIRTY_HW);
+	}
+
 	return pte_clear_flags(pte, _PAGE_RW);
 }
 
@@ -354,6 +371,18 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
+	pteval_t dirty = _PAGE_DIRTY_HW;
+
+	/* Avoid creating (HW)Dirty=1,Write=0 PTEs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_COW;
+
+	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_COW);
 	return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
@@ -364,6 +393,13 @@ static inline pte_t pte_mkyoung(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pte_flags(pte) & _PAGE_COW) {
+			pte = pte_clear_flags(pte, _PAGE_COW);
+			pte = pte_set_flags(pte, _PAGE_DIRTY_HW);
+		}
+	}
+
 	return pte_set_flags(pte, _PAGE_RW);
 }
 
@@ -435,16 +471,41 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (RW=0,Dirty=1).  Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pmdval_t v = native_pmd_val(pmd);
+
+		v |= (v & _PAGE_DIRTY_HW) >> _PAGE_BIT_DIRTY_HW <<
+		     _PAGE_BIT_COW;
+		pmd = pmd_clear_flags(__pmd(v), _PAGE_DIRTY_HW);
+	}
+
 	return pmd_clear_flags(pmd, _PAGE_RW);
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
+	pmdval_t dirty = _PAGE_DIRTY_HW;
+
+	/* Avoid creating (HW)Dirty=1,Write=0 PMDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pmd_flags(pmd) & _PAGE_RW))
+		dirty = _PAGE_COW;
+
+	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_COW);
 	return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
 }
 
@@ -465,6 +526,13 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd)
 {
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pmd_flags(pmd) & _PAGE_COW) {
+			pmd = pmd_clear_flags(pmd, _PAGE_COW);
+			pmd = pmd_set_flags(pmd, _PAGE_DIRTY_HW);
+		}
+	}
+
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
 
@@ -489,17 +557,36 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY_HW);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (RW=0,Dirty=1).  Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pudval_t v = native_pud_val(pud);
+
+		v |= (v & _PAGE_DIRTY_HW) >> _PAGE_BIT_DIRTY_HW <<
+		     _PAGE_BIT_COW;
+		pud = pud_clear_flags(__pud(v), _PAGE_DIRTY_HW);
+	}
+
 	return pud_clear_flags(pud, _PAGE_RW);
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY_HW;
+
+	/* Avoid creating (HW)Dirty=1,Write=0 PUDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pud_flags(pud) & _PAGE_RW))
+		dirty = _PAGE_COW;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -519,6 +606,13 @@ static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pud_flags(pud) & _PAGE_COW) {
+			pud = pud_clear_flags(pud, _PAGE_COW);
+			pud = pud_set_flags(pud, _PAGE_DIRTY_HW);
+		}
+	}
+
 	return pud_set_flags(pud, _PAGE_RW);
 }
 
@@ -1132,6 +1226,12 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
+	/*
+	 * If _PAGE_DIRTY_HW is set, then the PMD must either have
+	 * _PAGE_RW or be a shadow stack PMD, which is logically writable.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY_HW);
 	return pmd_flags(pmd) & _PAGE_RW;
 }
 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 5f31f1c407b9..b57483567b8b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,7 +23,8 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -36,6 +37,16 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * This bit indicates a copy-on-write page, and is different from
+ * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
+ */
+#ifdef CONFIG_X86_64
+#define _PAGE_BIT_COW		_PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW		0
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -117,6 +128,34 @@
 #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * _PAGE_COW is used to separate R/O and copy-on-write PTEs created by
+ * software from the shadow stack PTE setting required by the hardware:
+ * (a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW)
+ * (b) A R/O page that has been COW'ed: (R/O +_PAGE_COW)
+ *     The user page is in a R/O VMA, and get_user_pages() needs a
+ *     writable copy.  The page fault handler creates a copy of the page
+ *     and sets the new copy's PTE as R/O and _PAGE_COW.
+ * (c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW)
+ * (d) A shared (copy-on-access) shadow stack PTE: (R/O + _PAGE_COW)
+ *     When a shadow stack page is being shared among processes (this
+ *     happens at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the
+ *     next shadow stack access causes a fault, and the page is duplicated
+ *     and _PAGE_DIRTY_HW is set again.  This is the COW equivalent for
+ *     shadow stack pages, even though it's copy-on-access rather than
+ *     copy-on-write.
+ * (e) A page where the processor observed a Write=1 PTE, started a write,
+ *     set Dirty=1, but then observed a Write=0 PTE.  That's possible
+ *     today, but will not happen on processors that support shadow stack.
+ */
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define _PAGE_COW	(_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY_HW | _PAGE_COW)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 09/25] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (7 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 08/25] x86/mm: Introduce _PAGE_COW Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 10/25] x86/mm: Update pte_modify for _PAGE_COW Yu-cheng Yu
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu, David Airlie, Joonas Lahtinen, Jani Nikula,
	Daniel Vetter, Rodrigo Vivi, Zhenyu Wang, Zhi Wang

After the introduction of _PAGE_COW, a modified page's PTE can have either
_PAGE_DIRTY_HW or _PAGE_COW.  Change _PAGE_DIRTY to _PAGE_DIRTY_BITS.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
---
 drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
index 210016192ce7..c01f4880c794 100644
--- a/drivers/gpu/drm/i915/gvt/gtt.c
+++ b/drivers/gpu/drm/i915/gvt/gtt.c
@@ -1207,7 +1207,7 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
 	}
 
 	/* Clear dirty field. */
-	se->val64 &= ~_PAGE_DIRTY;
+	se->val64 &= ~_PAGE_DIRTY_BITS;
 
 	ops->clear_pse(se);
 	ops->clear_ips(se);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 10/25] x86/mm: Update pte_modify for _PAGE_COW
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (8 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 09/25] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 11/25] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY_HW to _PAGE_COW Yu-cheng Yu
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

Pte_modify() changes a PTE to 'newprot'.  It doesn't use the pte_*()
helpers that a previous patch fixed up, so we need a new site.

Introduce fixup_dirty_pte() to set the dirty bits based on _PAGE_RW, and
apply the same changes to pmd_modify().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v10:
- Replace _PAGE_CHG_MASK approach with fixup functions.

 arch/x86/include/asm/pgtable.h | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ac4ed814be96..3bdb192a904b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -727,6 +727,21 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
+static inline pteval_t fixup_dirty_pte(pteval_t pteval)
+{
+	pte_t pte = __pte(pteval);
+
+	if (pte_dirty(pte)) {
+		pte = pte_mkclean(pte);
+
+		if (pte_flags(pte) & _PAGE_RW)
+			pte = pte_set_flags(pte, _PAGE_DIRTY_HW);
+		else
+			pte = pte_set_flags(pte, _PAGE_COW);
+	}
+	return pte_val(pte);
+}
+
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	pteval_t val = pte_val(pte), oldval = val;
@@ -737,16 +752,34 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 */
 	val &= _PAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+	val = fixup_dirty_pte(val);
 	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
 	return __pte(val);
 }
 
+static inline int pmd_write(pmd_t pmd);
+static inline pmdval_t fixup_dirty_pmd(pmdval_t pmdval)
+{
+	pmd_t pmd = __pmd(pmdval);
+
+	if (pmd_dirty(pmd)) {
+		pmd = pmd_mkclean(pmd);
+
+		if (pmd_flags(pmd) & _PAGE_RW)
+			pmd = pmd_set_flags(pmd, _PAGE_DIRTY_HW);
+		else
+			pmd = pmd_set_flags(pmd, _PAGE_COW);
+	}
+	return pmd_val(pmd);
+}
+
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
 	pmdval_t val = pmd_val(pmd), oldval = val;
 
 	val &= _HPAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+	val = fixup_dirty_pmd(val);
 	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
 	return __pmd(val);
 }
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 11/25] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY_HW to _PAGE_COW
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (9 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 10/25] x86/mm: Update pte_modify for _PAGE_COW Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 12/25] mm: Introduce VM_SHSTK for shadow stack memory Yu-cheng Yu
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

When shadow stack is introduced, [R/O + _PAGE_DIRTY_HW] PTE is reserved
for shadow stack.  Copy-on-write PTEs have [R/O + _PAGE_COW].

When a PTE goes from [R/W + _PAGE_DIRTY_HW] to [R/O + _PAGE_COW], it could
become a transient shadow stack PTE in two cases:

The first case is that some processors can start a write but end up seeing
a read-only PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE.  However, this will not occur on processors supporting
shadow stack, therefore we don't need a TLB flush here.

The second case is that when the software, without atomic, tests & replaces
_PAGE_DIRTY_HW with _PAGE_COW, a transient shadow stack PTE can exist.
This is prevented with cmpxchg.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue.  Jann Horn provided the cmpxchg solution.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
v10:
- Replace bit shift with pte_wrprotect()/pmd_wrprotect(), which use bit
  test & shift.
- Move READ_ONCE of old_pte into try_cmpxchg() loop.
- Change static_cpu_has() to cpu_feature_enabled().

v9:
- Change compile-time conditionals to runtime checks.
- Fix parameters of try_cmpxchg(): change pte_t/pmd_t to
  pte_t.pte/pmd_t.pmd.

v4:
- Implement try_cmpxchg().

 arch/x86/include/asm/pgtable.h | 52 ++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3bdb192a904b..a00d55fda5a2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1230,6 +1230,32 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+	/*
+	 * Some processors can start a write, but end up seeing a read-only
+	 * PTE by the time they get to the Dirty bit.  In this case, they
+	 * will set the Dirty bit, leaving a read-only, Dirty PTE which
+	 * looks like a shadow stack PTE.
+	 *
+	 * However, this behavior has been improved and will not occur on
+	 * processors supporting shadow stack.  Without this guarantee, a
+	 * transition to a non-present PTE and flush the TLB would be
+	 * needed.
+	 *
+	 * When changing a writable PTE to read-only and if the PTE has
+	 * _PAGE_DIRTY_HW set, move that bit to _PAGE_COW so that the
+	 * PTE is not a shadow stack PTE.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pte_t old_pte, new_pte;
+
+		do {
+			old_pte = READ_ONCE(*ptep);
+			new_pte = pte_wrprotect(old_pte);
+
+		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+		return;
+	}
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
@@ -1286,6 +1312,32 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+	/*
+	 * Some processors can start a write, but end up seeing a read-only
+	 * PMD by the time they get to the Dirty bit.  In this case, they
+	 * will set the Dirty bit, leaving a read-only, Dirty PMD which
+	 * looks like a Shadow Stack PMD.
+	 *
+	 * However, this behavior has been improved and will not occur on
+	 * processors supporting Shadow Stack.  Without this guarantee, a
+	 * transition to a non-present PMD and flush the TLB would be
+	 * needed.
+	 *
+	 * When changing a writable PMD to read-only and if the PMD has
+	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_COW so that the
+	 * PMD is not a shadow stack PMD.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pmd_t old_pmd, new_pmd;
+
+		do {
+			old_pmd = READ_ONCE(*pmdp);
+			new_pmd = pmd_wrprotect(old_pmd);
+
+		} while (!try_cmpxchg((pmdval_t *)pmdp, (pmdval_t *)&old_pmd, pmd_val(new_pmd)));
+
+		return;
+	}
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 12/25] mm: Introduce VM_SHSTK for shadow stack memory
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (10 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 11/25] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY_HW to _PAGE_COW Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 13/25] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

A Shadow Stack PTE must be read-only and have _PAGE_DIRTY set.  However,
read-only and Dirty PTEs also exist for copy-on-write (COW) pages.  These
two cases are handled differently for page faults.  Introduce VM_SHSTK to
track shadow stack VMAs.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
v9:
- Add VM_SHSTK case to arch_vma_name().
- Revise the commit log to explain why adding a new VM flag.

 arch/x86/mm/mmap.c | 2 ++
 fs/proc/task_mmu.c | 3 +++
 include/linux/mm.h | 8 ++++++++
 3 files changed, 13 insertions(+)

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index c90c20904a60..a22c6b6fc607 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
 
 const char *arch_vma_name(struct vm_area_struct *vma)
 {
+	if (vma->vm_flags & VM_SHSTK)
+		return "[shadow stack]";
 	return NULL;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5066b0251ed8..682ea6f95fa4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -663,6 +663,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_PKEY_BIT4)]	= "",
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+		[ilog2(VM_SHSTK)]	= "ss",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1983e08f5906..62f5f496a6d1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -299,11 +299,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -335,6 +337,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
 #endif
 
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+# define VM_SHSTK	VM_HIGH_ARCH_5
+#else
+# define VM_SHSTK	VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 13/25] x86/mm: Shadow Stack page fault error checking
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (11 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 12/25] mm: Introduce VM_SHSTK for shadow stack memory Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 14/25] x86/mm: Update maybe_mkwrite() for shadow stack Yu-cheng Yu
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

Shadow stack accesses are those that are performed by the CPU where it
expects to encounter a shadow stack mapping.  These accesses are performed
implicitly by CALL/RET at the site of the shadow stack pointer.  These
accesses are made explicitly by shadow stack management instructions like
WRUSSQ.

Shadow stacks accesses to shadow-stack mapping can see faults in normal,
valid operation just like regular accesses to regular mappings.  Shadow
stacks need some of the same features like delayed allocation, swap and
copy-on-write.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping.

In handling a shadow stack page fault, verify it occurs within a shadow
stack mapping.  It is always an error otherwise.  For valid shadow stack
accesses, set FAULT_FLAG_WRITE to effect copy-on-write.  Because clearing
_PAGE_DIRTY_HW (vs. _PAGE_RW) is used to trigger the fault, shadow stack
read fault and shadow stack write fault are not differentiated and both are
handled as a write access.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
v10:
-Revise commit log.

 arch/x86/include/asm/traps.h |  2 ++
 arch/x86/mm/fault.c          | 19 +++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 714b1a30e7b0..28b493c53d70 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -50,6 +50,7 @@ void __noreturn handle_stack_overflow(const char *message,
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
  *   bit 5 ==				1: protection keys block access
+ *   bit 6 ==				1: shadow stack access fault
  */
 enum x86_pf_error_code {
 	X86_PF_PROT	=		1 << 0,
@@ -58,5 +59,6 @@ enum x86_pf_error_code {
 	X86_PF_RSVD	=		1 << 3,
 	X86_PF_INSTR	=		1 << 4,
 	X86_PF_PK	=		1 << 5,
+	X86_PF_SHSTK	=		1 << 6,
 };
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 35f1498e9832..db4018d122ca 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1063,6 +1063,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
+	/*
+	 * Verify a shadow stack access is within a shadow stack VMA.
+	 * It is always an error otherwise.  Normal data access to a
+	 * shadow stack area is checked in the case followed.
+	 */
+	if (error_code & X86_PF_SHSTK) {
+		if (!(vma->vm_flags & VM_SHSTK))
+			return 1;
+		return 0;
+	}
+
 	if (error_code & X86_PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
@@ -1197,6 +1208,14 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
+	/*
+	 * Clearing _PAGE_DIRTY_HW is used to detect shadow stack access.
+	 * This method cannot distinguish shadow stack read vs. write.
+	 * For valid shadow stack accesses, set FAULT_FLAG_WRITE to effect
+	 * copy-on-write.
+	 */
+	if (hw_error_code & X86_PF_SHSTK)
+		flags |= FAULT_FLAG_WRITE;
 	if (hw_error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (hw_error_code & X86_PF_INSTR)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 14/25] x86/mm: Update maybe_mkwrite() for shadow stack
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (12 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 13/25] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 15/25] mm: Fixup places that call pte_mkwrite() directly Yu-cheng Yu
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

Shadow stack memory is writable, but its VMA has VM_SHSTK instead of
VM_WRITE.  Update maybe_mkwrite() to include the shadow stack.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/Kconfig        |  4 ++++
 arch/x86/mm/pgtable.c   | 18 ++++++++++++++++++
 include/linux/mm.h      |  2 ++
 include/linux/pgtable.h | 24 ++++++++++++++++++++++++
 mm/huge_memory.c        |  2 ++
 5 files changed, 50 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4844649ee884..e93be385cd04 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1935,6 +1935,9 @@ config AS_HAS_SHADOW_STACK
 config X86_INTEL_CET
 	def_bool n
 
+config ARCH_MAYBE_MKWRITE
+	def_bool n
+
 config ARCH_HAS_SHADOW_STACK
 	def_bool n
 
@@ -1945,6 +1948,7 @@ config X86_INTEL_SHADOW_STACK_USER
 	depends on AS_HAS_SHADOW_STACK
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select X86_INTEL_CET
+	select ARCH_MAYBE_MKWRITE
 	select ARCH_HAS_SHADOW_STACK
 	help
 	  Shadow Stacks provides protection against program stack
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..a9666b64bc05 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -610,6 +610,24 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 }
 #endif
 
+#ifdef CONFIG_ARCH_MAYBE_MKWRITE
+pte_t arch_maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_SHSTK))
+		pte = pte_mkwrite_shstk(pte);
+	return pte;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pmd_t arch_maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_SHSTK))
+		pmd = pmd_mkwrite_shstk(pmd);
+	return pmd;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_ARCH_MAYBE_MKWRITE */
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 62f5f496a6d1..9c2efefb9281 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -962,6 +962,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
 		pte = pte_mkwrite(pte);
+	else
+		pte = arch_maybe_mkwrite(pte, vma);
 	return pte;
 }
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a124c21e3204..4445d009f5ec 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1354,6 +1354,30 @@ static inline bool arch_has_pfn_modify_check(void)
 }
 #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
 
+#ifdef CONFIG_MMU
+#ifdef CONFIG_ARCH_MAYBE_MKWRITE
+pte_t arch_maybe_mkwrite(pte_t pte, struct vm_area_struct *vma);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pmd_t arch_maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#else /* !CONFIG_ARCH_MAYBE_MKWRITE */
+static inline pte_t arch_maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	return pte;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline pmd_t arch_maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	return pmd;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* CONFIG_ARCH_MAYBE_MKWRITE */
+#endif /* CONFIG_MMU */
+
 /*
  * Architecture PAGE_KERNEL_* fallbacks
  *
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ccff8472cd4..a580b5fb6e1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -464,6 +464,8 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
 		pmd = pmd_mkwrite(pmd);
+	else
+		pmd = arch_maybe_pmd_mkwrite(pmd, vma);
 	return pmd;
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 15/25] mm: Fixup places that call pte_mkwrite() directly
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (13 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 14/25] x86/mm: Update maybe_mkwrite() for shadow stack Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 16/25] mm: Add guard pages around a shadow stack Yu-cheng Yu
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

A shadow stack page is made writable by pte_mkwrite_shstk(), which sets
_PAGE_DIRTY_HW.  There are a few places that call pte_mkwrite() directly
and miss the maybe_mkwrite() fixup in the previous patch.  Fix them with
maybe_mkwrite():

- do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE directly
  and call pte_mkwrite(), which is the same as maybe_mkwrite().  Change
  them to maybe_mkwrite().

- In do_numa_page(), if the numa entry 'was-writable', then pte_mkwrite()
  is called directly.  Fix it by doing maybe_mkwrite().

- In change_pte_range(), pte_mkwrite() is called directly.  Replace it with
  maybe_mkwrite().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 mm/memory.c   | 5 ++---
 mm/migrate.c  | 3 +--
 mm/mprotect.c | 2 +-
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3a7779d9891d..db34853dd206 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3387,8 +3387,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 
 	entry = mk_pte(page, vma->vm_page_prot);
 	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
@@ -4042,7 +4041,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
 	if (was_writable)
-		pte = pte_mkwrite(pte);
+		pte = maybe_mkwrite(pte, vma);
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 34a842a8eb6a..b7199a1dc449 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2897,8 +2897,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		}
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
-		if (vma->vm_flags & VM_WRITE)
-			entry = pte_mkwrite(pte_mkdirty(entry));
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..a8edbcb3af99 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -135,7 +135,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (dirty_accountable && pte_dirty(ptent) &&
 					(pte_soft_dirty(ptent) ||
 					 !(vma->vm_flags & VM_SOFTDIRTY))) {
-				ptent = pte_mkwrite(ptent);
+				ptent = maybe_mkwrite(ptent, vma);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			pages++;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 16/25] mm: Add guard pages around a shadow stack.
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (14 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 15/25] mm: Fixup places that call pte_mkwrite() directly Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 17/25] mm/mmap: Add shadow stack pages to memory accounting Yu-cheng Yu
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

INCSSP(Q/D) increments shadow stack pointer and 'pops and discards' the
first and the last elements in the range, effectively touches those memory
areas.

The maximum moving distance by INCSSPQ is 255 * 8 = 2040 bytes and
255 * 4 = 1020 bytes by INCSSPD.  Both ranges are far from PAGE_SIZE.
Thus, putting a gap page on both ends of a shadow stack prevents INCSSP,
CALL, and RET from going beyond.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v10:
- Define ARCH_SHADOW_STACK_GUARD_GAP.

 arch/x86/include/asm/processor.h | 10 ++++++++++
 include/linux/mm.h               | 24 ++++++++++++++++++++----
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 97143d87994c..01acbd63cad8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -840,6 +840,16 @@ static inline void spin_lock_prefetch(const void *x)
 #define STACK_TOP		TASK_SIZE_LOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
+/*
+ * Shadow stack pointer is moved by CALL, JMP, and INCSSP(Q/D).  INCSSPQ
+ * moves shadow stack pointer up to 255 * 8 = ~2 KB (~1KB for INCSSPD) and
+ * touches the first and the last element in the range, which triggers a
+ * page fault if the range is not in a shadow stack.  Because of this,
+ * creating 4-KB guard pages around a shadow stack prevents these
+ * instructions from going beyond.
+ */
+#define ARCH_SHADOW_STACK_GUARD_GAP PAGE_SIZE
+
 #define INIT_THREAD  {						\
 	.addr_limit		= KERNEL_DS,			\
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c2efefb9281..d437ce0c85ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2608,6 +2608,10 @@ extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
 int __must_check write_one_page(struct page *page);
 void task_dirty_inc(struct task_struct *tsk);
 
+#ifndef ARCH_SHADOW_STACK_GUARD_GAP
+#define ARCH_SHADOW_STACK_GUARD_GAP 0
+#endif
+
 extern unsigned long stack_guard_gap;
 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
@@ -2640,9 +2644,15 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_start = vma->vm_start;
+	unsigned long gap = 0;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
+	if (vma->vm_flags & VM_GROWSDOWN)
+		gap = stack_guard_gap;
+	else if (vma->vm_flags & VM_SHSTK)
+		gap = ARCH_SHADOW_STACK_GUARD_GAP;
+
+	if (gap != 0) {
+		vm_start -= gap;
 		if (vm_start > vma->vm_start)
 			vm_start = 0;
 	}
@@ -2652,9 +2662,15 @@ static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_end = vma->vm_end;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSUP)
+		gap = stack_guard_gap;
+	else if (vma->vm_flags & VM_SHSTK)
+		gap = ARCH_SHADOW_STACK_GUARD_GAP;
 
-	if (vma->vm_flags & VM_GROWSUP) {
-		vm_end += stack_guard_gap;
+	if (gap != 0) {
+		vm_end += gap;
 		if (vm_end < vma->vm_end)
 			vm_end = -PAGE_SIZE;
 	}
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 17/25] mm/mmap: Add shadow stack pages to memory accounting
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (15 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 16/25] mm: Add guard pages around a shadow stack Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 18/25] mm: Update can_follow_write_pte() for shadow stack Yu-cheng Yu
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

Account shadow stack pages to stack memory.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v10:
- Use arch_shadow_stack_mapping() to make meaning clear.

v8:
- Change shadow stake pages from data_vm to stack_vm.

 arch/x86/mm/pgtable.c   |  7 +++++++
 include/linux/pgtable.h | 11 +++++++++++
 mm/mmap.c               |  5 +++++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index a9666b64bc05..68e98f70298b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -893,3 +893,10 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #endif /* CONFIG_X86_64 */
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
+bool arch_shadow_stack_mapping(vm_flags_t vm_flags)
+{
+	return (vm_flags & VM_SHSTK);
+}
+#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 4445d009f5ec..ea92b592c053 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1378,6 +1378,17 @@ static inline pmd_t arch_maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma
 #endif /* CONFIG_ARCH_MAYBE_MKWRITE */
 #endif /* CONFIG_MMU */
 
+#ifdef CONFIG_MMU
+#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
+bool arch_shadow_stack_mapping(vm_flags_t vm_flags);
+#else
+static inline bool arch_shadow_stack_mapping(vm_flags_t vm_flags)
+{
+	return false;
+}
+#endif /* CONFIG_ARCH_HAS_SHADOW_STACK */
+#endif /* CONFIG_MMU */
+
 /*
  * Architecture PAGE_KERNEL_* fallbacks
  *
diff --git a/mm/mmap.c b/mm/mmap.c
index 40248d84ad5f..574b3f273462 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1682,6 +1682,9 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	if (file && is_file_hugepages(file))
 		return 0;
 
+	if (arch_shadow_stack_mapping(vm_flags))
+		return 1;
+
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
@@ -3352,6 +3355,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 		mm->stack_vm += npages;
 	else if (is_data_mapping(flags))
 		mm->data_vm += npages;
+	else if (arch_shadow_stack_mapping(flags))
+		mm->stack_vm += npages;
 }
 
 static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 18/25] mm: Update can_follow_write_pte() for shadow stack
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (16 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 17/25] mm/mmap: Add shadow stack pages to memory accounting Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 19/25] mm: Re-introduce do_mmap_pgoff() Yu-cheng Yu
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

Can_follow_write_pte() ensures a read-only page is COWed by checking the
FOLL_COW flag, and uses pte_dirty() to validate the flag is still valid.

Like a writable data page, a shadow stack page is writable, and becomes
read-only during copy-on-write, but it is always dirty.  Thus, in the
can_follow_write_pte() check, it belongs to the writable page case and
should be excluded from the read-only page pte_dirty() check.  Apply
the same changes to can_follow_write_pmd().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v10:
- Reverse name changes to can_follow_write_*().

 mm/gup.c         | 8 +++++---
 mm/huge_memory.c | 8 +++++---
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ae096ea7583f..f6a640f2ad57 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -384,9 +384,11 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
  * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
  * but only after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
+					struct vm_area_struct *vma)
 {
-	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte) &&
+				  !arch_shadow_stack_mapping(vma->vm_flags));
 }
 
 /*
@@ -439,7 +441,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags, vma)) {
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a580b5fb6e1a..0f4ed61a77f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1296,9 +1296,11 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
  * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
  * but only after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags,
+					struct vm_area_struct *vma)
 {
-	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd) &&
+				  !arch_shadow_stack_mapping(vma->vm_flags));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1311,7 +1313,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags, vma))
 		goto out;
 
 	/* Avoid dumping huge zero page */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 19/25] mm: Re-introduce do_mmap_pgoff()
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (17 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 18/25] mm: Update can_follow_write_pte() for shadow stack Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 20/25] x86/cet/shstk: User-mode shadow stack support Yu-cheng Yu
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu, Peter Collingbourne, Andrew Morton

There was no more caller passing vm_flags to do_mmap(), and vm_flags was
removed from the function's input by:

    commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now.  Shadow stack allocation passes VM_SHSTK to
do_mmap().  Re-introduce the vm_flags and do_mmap_pgoff().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: linux-mm@kvack.org
---
 fs/aio.c             |  6 +++---
 fs/hugetlbfs/inode.c |  2 +-
 include/linux/fs.h   |  2 +-
 include/linux/mm.h   | 12 +++++++++++-
 ipc/shm.c            |  2 +-
 mm/mmap.c            | 16 ++++++++--------
 mm/nommu.c           |  6 +++---
 mm/shmem.c           |  2 +-
 mm/util.c            |  4 ++--
 9 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5736bff48e9e..91e7cc4a9f17 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -525,9 +525,9 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 		return -EINTR;
 	}
 
-	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
-				 PROT_READ | PROT_WRITE,
-				 MAP_SHARED, 0, &unused, NULL);
+	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
+				       PROT_READ | PROT_WRITE,
+				       MAP_SHARED, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b5c109703daa..f936bcf02cce 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -140,7 +140,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * already been checked by prepare_hugepage_range.  If you add
 	 * any error returns here, do so after setting VM_HUGETLB, so
 	 * is_vm_hugetlb_page tests below unmap_region go the right
-	 * way when do_mmap unwinds (may be important on powerpc
+	 * way when do_mmap_pgoff unwinds (may be important on powerpc
 	 * and ia64).
 	 */
 	vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e019ea2f1347..75a98288e7c5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -538,7 +538,7 @@ static inline int mapping_mapped(struct address_space *mapping)
 
 /*
  * Might pages of this file have been modified in userspace?
- * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap
+ * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff
  * marks vma as VM_SHARED if it is shared, and the file was opened for
  * writing i.e. vma may be mprotected writable even if now readonly.
  *
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d437ce0c85ac..36b239aa5aa7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2553,13 +2553,23 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf);
 extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 extern int do_madvise(unsigned long start, size_t len_in, int behavior);
 
+static inline unsigned long
+do_mmap_pgoff(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long prot, unsigned long flags,
+	unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf)
+{
+	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, uf);
+}
+
 #ifdef CONFIG_MMU
 extern int __mm_populate(unsigned long addr, unsigned long len,
 			 int ignore_errors);
diff --git a/ipc/shm.c b/ipc/shm.c
index f1ed36e3ac9f..6cf24a5994ec 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1556,7 +1556,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate, NULL);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 574b3f273462..81d4a00092da 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1030,7 +1030,7 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
  * anon_vmas, nor if same anon_vma is assigned but offsets incompatible.
  *
  * We don't check here for the merged mmap wrapping around the end of pagecache
- * indices (16TB on ia32) because do_mmap() does not permit mmap's which
+ * indices (16TB on ia32) because do_mmap_pgoff() does not permit mmap's which
  * wrap, nor mmaps which cover the final page at index -1UL.
  */
 static int
@@ -1365,11 +1365,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
  */
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
-			unsigned long flags, unsigned long pgoff,
-			unsigned long *populate, struct list_head *uf)
+			unsigned long flags, vm_flags_t vm_flags,
+			unsigned long pgoff, unsigned long *populate,
+			struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
-	vm_flags_t vm_flags;
 	int pkey = 0;
 
 	*populate = 0;
@@ -1431,7 +1431,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
@@ -2233,7 +2233,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		/*
 		 * mmap_region() will call shmem_zero_setup() to create a file,
 		 * so use shmem's get_unmapped_area in case it can be huge.
-		 * do_mmap() will clear pgoff, so match alignment.
+		 * do_mmap_pgoff() will clear pgoff, so match alignment.
 		 */
 		pgoff = 0;
 		get_area = shmem_get_unmapped_area;
@@ -3006,7 +3006,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	}
 
 	file = get_file(vma->vm_file);
-	ret = do_mmap(vma->vm_file, start, size,
+	ret = do_mmap_pgoff(vma->vm_file, start, size,
 			prot, flags, pgoff, &populate, NULL);
 	fput(file);
 out:
@@ -3226,7 +3226,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 	 * By setting it to reflect the virtual start address of the
 	 * vma, merges and splits can happen in a seamless way, just
 	 * using the existing file pgoff checks and manipulations.
-	 * Similarly in do_mmap and in do_brk.
+	 * Similarly in do_mmap_pgoff and in do_brk.
 	 */
 	if (vma_is_anonymous(vma)) {
 		BUG_ON(vma->anon_vma);
diff --git a/mm/nommu.c b/mm/nommu.c
index 75a327149af1..71a4ea828f06 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1078,6 +1078,7 @@ unsigned long do_mmap(struct file *file,
 			unsigned long len,
 			unsigned long prot,
 			unsigned long flags,
+			vm_flags_t vm_flags,
 			unsigned long pgoff,
 			unsigned long *populate,
 			struct list_head *uf)
@@ -1085,7 +1086,6 @@ unsigned long do_mmap(struct file *file,
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	struct rb_node *rb;
-	vm_flags_t vm_flags;
 	unsigned long capabilities, result;
 	int ret;
 
@@ -1104,7 +1104,7 @@ unsigned long do_mmap(struct file *file,
 
 	/* we've determined that we can make the mapping, now translate what we
 	 * now know into VMA flags */
-	vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+	vm_flags |= determine_vm_flags(file, prot, flags, capabilities);
 
 	/* we're going to need to record the mapping */
 	region = kmem_cache_zalloc(vm_region_jar, GFP_KERNEL);
@@ -1763,7 +1763,7 @@ EXPORT_SYMBOL_GPL(access_process_vm);
  *
  * Check the shared mappings on an inode on behalf of a shrinking truncate to
  * make sure that any outstanding VMAs aren't broken and then shrink the
- * vm_regions that extend beyond so that do_mmap() doesn't
+ * vm_regions that extend beyond so that do_mmap_pgoff() doesn't
  * automatically grant mappings that are too large.
  */
 int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
diff --git a/mm/shmem.c b/mm/shmem.c
index 271548ca20f3..dea76ecc849b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4246,7 +4246,7 @@ EXPORT_SYMBOL_GPL(shmem_file_setup_with_mnt);
 
 /**
  * shmem_zero_setup - setup a shared anonymous mapping
- * @vma: the vma to be mmapped is prepared by do_mmap
+ * @vma: the vma to be mmapped is prepared by do_mmap_pgoff
  */
 int shmem_zero_setup(struct vm_area_struct *vma)
 {
diff --git a/mm/util.c b/mm/util.c
index 5ef378a2a038..8d6280c05238 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -503,8 +503,8 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	if (!ret) {
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
-		ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
-			      &uf);
+		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
+				    &populate, &uf);
 		mmap_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 20/25] x86/cet/shstk: User-mode shadow stack support
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (18 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 19/25] mm: Re-introduce do_mmap_pgoff() Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 21/25] x86/cet/shstk: Handle signals for shadow stack Yu-cheng Yu
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

This patch adds basic shadow stack enabling/disabling routines.  A task's
shadow stack is allocated from memory with VM_SHSTK flag and has a fixed
size of min(RLIMIT_STACK, 4GB).

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v11:
- Modify alloc_shstk() to take flags and pass to do_mmap().
  This is to be used by an arch_prctl() introduced later.

v10:
- Change no_cet_shstk to no_user_shstk.
- Limit shadow stack size to 4 GB, and round_up to PAGE_SIZE.
- Replace checking shstk_enabled with shstk_size being zero.
- WARN_ON_ONCE() when vm_munmap() fails.

v9:
- Change cpu_feature_enabled() to static_cpu_has().
- Merge cet_disable_shstk to cet_disable_free_shstk.
- Remove the empty slot at the top of the shadow stack, as it is not
  needed.
- Move do_mmap_locked() to alloc_shstk(), which is a static function.

v6:
- Create a function do_mmap_locked() for shadow stack allocation.

v2:
- Change noshstk to no_cet_shstk.

 arch/x86/include/asm/cet.h                    |  26 ++++
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/processor.h              |   5 +
 arch/x86/kernel/Makefile                      |   2 +
 arch/x86/kernel/cet.c                         | 138 ++++++++++++++++++
 arch/x86/kernel/cpu/common.c                  |  28 ++++
 arch/x86/kernel/process.c                     |   1 +
 .../arch/x86/include/asm/disabled-features.h  |   8 +-
 8 files changed, 214 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/kernel/cet.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
new file mode 100644
index 000000000000..caac0687c8e4
--- /dev/null
+++ b/arch/x86/include/asm/cet.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CET_H
+#define _ASM_X86_CET_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+/*
+ * Per-thread CET status
+ */
+struct cet_status {
+	unsigned long	shstk_base;
+	unsigned long	shstk_size;
+};
+
+#ifdef CONFIG_X86_INTEL_CET
+int cet_setup_shstk(void);
+void cet_disable_free_shstk(struct task_struct *p);
+#else
+static inline void cet_disable_free_shstk(struct task_struct *p) {}
+#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 4ea8584682f9..a0e1b24cfa02 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -56,6 +56,12 @@
 # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
 #endif
 
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define DISABLE_SHSTK	0
+#else
+#define DISABLE_SHSTK	(1<<(X86_FEATURE_SHSTK & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -75,7 +81,7 @@
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
-#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
+#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 01acbd63cad8..8c874d6ce871 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -27,6 +27,7 @@ struct vm86;
 #include <asm/unwind_hints.h>
 #include <asm/vmxfeatures.h>
 #include <asm/vdso/processor.h>
+#include <asm/cet.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -542,6 +543,10 @@ struct thread_struct {
 
 	unsigned int		sig_on_uaccess_err:1;
 
+#ifdef CONFIG_X86_INTEL_CET
+	struct cet_status	cet;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index e77261db2391..76f27f518266 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -145,6 +145,8 @@ obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
 obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
 obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
+obj-$(CONFIG_X86_INTEL_CET)		+= cet.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..a1be8ccee1cc
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * cet.c - Control-flow Enforcement (CET)
+ *
+ * Copyright (c) 2019, Intel Corporation.
+ * Yu-cheng Yu <yu-cheng.yu@intel.com>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <asm/msr.h>
+#include <asm/user.h>
+#include <asm/fpu/internal.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/cet.h>
+
+static void start_update_msrs(void)
+{
+	fpregs_lock();
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		__fpregs_load_activate();
+}
+
+static void end_update_msrs(void)
+{
+	fpregs_unlock();
+}
+
+static unsigned long cet_get_shstk_addr(void)
+{
+	struct fpu *fpu = &current->thread.fpu;
+	unsigned long ssp = 0;
+
+	fpregs_lock();
+
+	if (fpregs_state_valid(fpu, smp_processor_id())) {
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+	} else {
+		struct cet_user_state *p;
+
+		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
+		if (p)
+			ssp = p->user_ssp;
+	}
+
+	fpregs_unlock();
+	return ssp;
+}
+
+static unsigned long alloc_shstk(unsigned long size, int flags)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, populate;
+
+	/* VM_SHSTK requires MAP_ANONYMOUS, MAP_PRIVATE */
+	flags |= MAP_ANONYMOUS | MAP_PRIVATE;
+
+	mmap_write_lock(mm);
+	addr = do_mmap(NULL, 0, size, PROT_READ, flags, VM_SHSTK, 0,
+		       &populate, NULL);
+	mmap_write_unlock(mm);
+
+	if (populate)
+		mm_populate(addr, populate);
+
+	return addr;
+}
+
+int cet_setup_shstk(void)
+{
+	unsigned long addr, size;
+	struct cet_status *cet = &current->thread.cet;
+
+	if (!static_cpu_has(X86_FEATURE_SHSTK))
+		return -EOPNOTSUPP;
+
+	size = round_up(min(rlimit(RLIMIT_STACK), 1UL << 32), PAGE_SIZE);
+	addr = alloc_shstk(size, 0);
+
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	cet->shstk_base = addr;
+	cet->shstk_size = size;
+
+	start_update_msrs();
+	wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+	wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
+	end_update_msrs();
+	return 0;
+}
+
+void cet_disable_free_shstk(struct task_struct *tsk)
+{
+	struct cet_status *cet = &tsk->thread.cet;
+
+	if (!static_cpu_has(X86_FEATURE_SHSTK) ||
+	    !cet->shstk_size || !cet->shstk_base)
+		return;
+
+	if (!tsk->mm || (tsk->mm != current->mm))
+		return;
+
+	if (tsk == current) {
+		u64 msr_val;
+
+		start_update_msrs();
+		rdmsrl(MSR_IA32_U_CET, msr_val);
+		wrmsrl(MSR_IA32_U_CET, msr_val & ~CET_SHSTK_EN);
+		wrmsrl(MSR_IA32_PL3_SSP, 0);
+		end_update_msrs();
+	}
+
+	while (1) {
+		int r;
+
+		r = vm_munmap(cet->shstk_base, cet->shstk_size);
+
+		/*
+		 * Retry if mmap_lock is not available.
+		 */
+		if (r == -EINTR) {
+			cond_resched();
+			continue;
+		}
+
+		WARN_ON_ONCE(r);
+		break;
+	}
+	cet->shstk_base = 0;
+	cet->shstk_size = 0;
+}
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c5d6f17d9b9d..5f60ddaabc46 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -56,6 +56,7 @@
 #include <asm/microcode_intel.h>
 #include <asm/intel-family.h>
 #include <asm/cpu_device_id.h>
+#include <asm/cet.h>
 #include <asm/uv/uv.h>
 
 #include "cpu.h"
@@ -509,6 +510,32 @@ static __init int setup_disable_pku(char *arg)
 __setup("nopku", setup_disable_pku);
 #endif /* CONFIG_X86_64 */
 
+static __always_inline void setup_cet(struct cpuinfo_x86 *c)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+	    !cpu_feature_enabled(X86_FEATURE_IBT))
+		return;
+
+	cr4_set_bits(X86_CR4_CET);
+}
+
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+static __init int setup_disable_shstk(char *s)
+{
+	/* require an exact match without trailing characters */
+	if (s[0] != '\0')
+		return 0;
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		return 1;
+
+	setup_clear_cpu_cap(X86_FEATURE_SHSTK);
+	pr_info("x86: 'no_user_shstk' specified, disabling user Shadow Stack\n");
+	return 1;
+}
+__setup("no_user_shstk", setup_disable_shstk);
+#endif
+
 /*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
@@ -1544,6 +1571,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
 	x86_init_rdrand(c);
 	setup_pku(c);
+	setup_cet(c);
 
 	/*
 	 * Clear/Set all flags overridden by options, need do it
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 994d8393f2f7..e41f1a468ee3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -42,6 +42,7 @@
 #include <asm/spec-ctrl.h>
 #include <asm/io_bitmap.h>
 #include <asm/proto.h>
+#include <asm/cet.h>
 
 #include "process.h"
 
diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h
index 4ea8584682f9..a0e1b24cfa02 100644
--- a/tools/arch/x86/include/asm/disabled-features.h
+++ b/tools/arch/x86/include/asm/disabled-features.h
@@ -56,6 +56,12 @@
 # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
 #endif
 
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define DISABLE_SHSTK	0
+#else
+#define DISABLE_SHSTK	(1<<(X86_FEATURE_SHSTK & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -75,7 +81,7 @@
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
-#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
+#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 21/25] x86/cet/shstk: Handle signals for shadow stack
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (19 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 20/25] x86/cet/shstk: User-mode shadow stack support Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 22/25] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND properties Yu-cheng Yu
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

To deliver a signal, create a shadow stack restore token and put a restore
token and the signal restorer address on the shadow stack.  For sigreturn,
verify the token and restore the shadow stack pointer.

Introduce WRUSS, which is a kernel-mode instruction but writes directly to
user shadow stack.  It is used to construct the user signal stack as
described above.

Introduce a signal context extension struct 'sc_ext', which is used to save
shadow stack restore token address and WAIT_ENDBR status.  WAIT_ENDBR will
be introduced later in the Indirect Branch Tracking (IBT) series, but add
that into sc_ext now to keep the struct stable in case the IBT series is
applied later.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v10:
- Combine with WRUSS instruction patch, since it is used only here.
- Revise signal restore code to the latest supervisor states handling.
  Move shadow stack restore token checking out of the fast path.

v9:
- Update CET MSR access according to XSAVES supervisor state changes.
- Add 'wait_endbr' to struct 'sc_ext'.
- Update and simplify signal frame allocation, setup, and restoration.
- Update commit log text.

v2:
- Move CET status from sigcontext to a separate struct sc_ext, which is
  located above the fpstate on the signal frame.
- Add a restore token for sigreturn address.

 arch/x86/ia32/ia32_signal.c            |  17 +++
 arch/x86/include/asm/cet.h             |   8 ++
 arch/x86/include/asm/fpu/internal.h    |  10 ++
 arch/x86/include/asm/special_insns.h   |  32 ++++++
 arch/x86/include/uapi/asm/sigcontext.h |   9 ++
 arch/x86/kernel/cet.c                  | 152 +++++++++++++++++++++++++
 arch/x86/kernel/fpu/signal.c           | 100 ++++++++++++++++
 arch/x86/kernel/signal.c               |  10 ++
 8 files changed, 338 insertions(+)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 81cf22398cd1..cec9cf0a00cf 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -35,6 +35,7 @@
 #include <asm/sigframe.h>
 #include <asm/sighandling.h>
 #include <asm/smap.h>
+#include <asm/cet.h>
 
 static inline void reload_segments(struct sigcontext_32 *sc)
 {
@@ -205,6 +206,7 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
 				 void __user **fpstate)
 {
 	unsigned long sp, fx_aligned, math_size;
+	void __user *restorer = NULL;
 
 	/* Default to using normal stack */
 	sp = regs->sp;
@@ -218,8 +220,23 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
 		 ksig->ka.sa.sa_restorer)
 		sp = (unsigned long) ksig->ka.sa.sa_restorer;
 
+	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+		restorer = ksig->ka.sa.sa_restorer;
+	} else if (current->mm->context.vdso) {
+		if (ksig->ka.sa.sa_flags & SA_SIGINFO)
+			restorer = current->mm->context.vdso +
+				vdso_image_32.sym___kernel_rt_sigreturn;
+		else
+			restorer = current->mm->context.vdso +
+				vdso_image_32.sym___kernel_sigreturn;
+	}
+
 	sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size);
 	*fpstate = (struct _fpstate_32 __user *) sp;
+
+	if (save_cet_to_sigframe(1, *fpstate, (unsigned long)restorer))
+		return (void __user *) -1L;
+
 	if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned,
 				     math_size) < 0)
 		return (void __user *) -1L;
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index caac0687c8e4..56fe08eebae6 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -6,6 +6,8 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct sc_ext;
+
 /*
  * Per-thread CET status
  */
@@ -17,8 +19,14 @@ struct cet_status {
 #ifdef CONFIG_X86_INTEL_CET
 int cet_setup_shstk(void);
 void cet_disable_free_shstk(struct task_struct *p);
+int cet_verify_rstor_token(bool ia32, unsigned long ssp, unsigned long *new_ssp);
+void cet_restore_signal(struct sc_ext *sc);
+int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
 #else
 static inline void cet_disable_free_shstk(struct task_struct *p) {}
+static inline void cet_restore_signal(struct sc_ext *sc) { return; }
+static inline int cet_setup_signal(bool ia32, unsigned long rstor,
+				   struct sc_ext *sc) { return -EINVAL; }
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 0a460f2a3f90..cd4249b37f45 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -442,6 +442,16 @@ static inline void copy_kernel_to_fpregs(union fpregs_state *fpstate)
 	__copy_kernel_to_fpregs(fpstate, -1);
 }
 
+#ifdef CONFIG_X86_INTEL_CET
+extern int save_cet_to_sigframe(int ia32, void __user *fp,
+				unsigned long restorer);
+#else
+static inline int save_cet_to_sigframe(int ia32, void __user *fp,
+				unsigned long restorer)
+{
+	return 0;
+}
+#endif
 extern int copy_fpstate_to_sigframe(void __user *buf, void __user *fp, int size);
 
 /*
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 59a3e13204c3..21c42cef12a9 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -232,6 +232,38 @@ static inline void clwb(volatile void *__p)
 		: [pax] "a" (p));
 }
 
+#ifdef CONFIG_X86_INTEL_CET
+#if defined(CONFIG_IA32_EMULATION) || defined(CONFIG_X86_X32)
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+	asm_volatile_goto("1: wrussd %1, (%0)\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: "r" (addr), "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EPERM;
+}
+#else
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+	WARN_ONCE(1, "%s used but not supported.\n", __func__);
+	return -EFAULT;
+}
+#endif
+
+static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
+{
+	asm_volatile_goto("1: wrussq %1, (%0)\n"
+			  _ASM_EXTABLE(1b, %l[fail])
+			  :: "r" (addr), "r" (val)
+			  :: fail);
+	return 0;
+fail:
+	return -EPERM;
+}
+#endif /* CONFIG_X86_INTEL_CET */
+
 #define nop() asm volatile ("nop")
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/include/uapi/asm/sigcontext.h b/arch/x86/include/uapi/asm/sigcontext.h
index 844d60eb1882..cf2d55db3be4 100644
--- a/arch/x86/include/uapi/asm/sigcontext.h
+++ b/arch/x86/include/uapi/asm/sigcontext.h
@@ -196,6 +196,15 @@ struct _xstate {
 	/* New processor state extensions go here: */
 };
 
+/*
+ * Located at the end of sigcontext->fpstate, aligned to 8.
+ */
+struct sc_ext {
+	unsigned long total_size;
+	unsigned long ssp;
+	unsigned long wait_endbr;
+};
+
 /*
  * The 32-bit signal frame:
  */
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index a1be8ccee1cc..a08a956ac3aa 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -19,6 +19,8 @@
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/types.h>
 #include <asm/cet.h>
+#include <asm/special_insns.h>
+#include <uapi/asm/sigcontext.h>
 
 static void start_update_msrs(void)
 {
@@ -72,6 +74,80 @@ static unsigned long alloc_shstk(unsigned long size, int flags)
 	return addr;
 }
 
+#define TOKEN_MODE_MASK	3UL
+#define TOKEN_MODE_64	1UL
+#define IS_TOKEN_64(token) ((token & TOKEN_MODE_MASK) == TOKEN_MODE_64)
+#define IS_TOKEN_32(token) ((token & TOKEN_MODE_MASK) == 0)
+
+/*
+ * Verify the restore token at the address of 'ssp' is
+ * valid and then set shadow stack pointer according to the
+ * token.
+ */
+int cet_verify_rstor_token(bool ia32, unsigned long ssp,
+			   unsigned long *new_ssp)
+{
+	unsigned long token;
+
+	*new_ssp = 0;
+
+	if (!IS_ALIGNED(ssp, 8))
+		return -EINVAL;
+
+	if (get_user(token, (unsigned long __user *)ssp))
+		return -EFAULT;
+
+	/* Is 64-bit mode flag correct? */
+	if (!ia32 && !IS_TOKEN_64(token))
+		return -EINVAL;
+	else if (ia32 && !IS_TOKEN_32(token))
+		return -EINVAL;
+
+	token &= ~TOKEN_MODE_MASK;
+
+	/*
+	 * Restore address properly aligned?
+	 */
+	if ((!ia32 && !IS_ALIGNED(token, 8)) || !IS_ALIGNED(token, 4))
+		return -EINVAL;
+
+	/*
+	 * Token was placed properly?
+	 */
+	if ((ALIGN_DOWN(token, 8) - 8) != ssp)
+		return -EINVAL;
+
+	*new_ssp = token;
+	return 0;
+}
+
+/*
+ * Create a restore token on the shadow stack.
+ * A token is always 8-byte and aligned to 8.
+ */
+static int create_rstor_token(bool ia32, unsigned long ssp,
+			      unsigned long *new_ssp)
+{
+	unsigned long addr;
+
+	*new_ssp = 0;
+
+	if ((!ia32 && !IS_ALIGNED(ssp, 8)) || !IS_ALIGNED(ssp, 4))
+		return -EINVAL;
+
+	addr = ALIGN_DOWN(ssp, 8) - 8;
+
+	/* Is the token for 64-bit? */
+	if (!ia32)
+		ssp |= TOKEN_MODE_64;
+
+	if (write_user_shstk_64(addr, ssp))
+		return -EFAULT;
+
+	*new_ssp = addr;
+	return 0;
+}
+
 int cet_setup_shstk(void)
 {
 	unsigned long addr, size;
@@ -136,3 +212,79 @@ void cet_disable_free_shstk(struct task_struct *tsk)
 	cet->shstk_base = 0;
 	cet->shstk_size = 0;
 }
+
+/*
+ * Called from __fpu__restore_sig() and XSAVES buffer is protected by
+ * set_thread_flag(TIF_NEED_FPU_LOAD) in the slow path.
+ */
+void cet_restore_signal(struct sc_ext *sc_ext)
+{
+	struct cet_user_state *cet_user_state;
+	struct cet_status *cet = &current->thread.cet;
+	u64 msr_val = 0;
+
+	if (!static_cpu_has(X86_FEATURE_SHSTK))
+		return;
+
+	cet_user_state = get_xsave_addr(&current->thread.fpu.state.xsave,
+					XFEATURE_CET_USER);
+	if (!cet_user_state)
+		return;
+
+	if (cet->shstk_size) {
+		if (test_thread_flag(TIF_NEED_FPU_LOAD))
+			cet_user_state->user_ssp = sc_ext->ssp;
+		else
+			wrmsrl(MSR_IA32_PL3_SSP, sc_ext->ssp);
+
+		msr_val |= CET_SHSTK_EN;
+	}
+
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		cet_user_state->user_cet = msr_val;
+	else
+		wrmsrl(MSR_IA32_U_CET, msr_val);
+}
+
+/*
+ * Setup the shadow stack for the signal handler: first,
+ * create a restore token to keep track of the current ssp,
+ * and then the return address of the signal handler.
+ */
+int cet_setup_signal(bool ia32, unsigned long rstor_addr, struct sc_ext *sc_ext)
+{
+	struct cet_status *cet = &current->thread.cet;
+	unsigned long ssp = 0, new_ssp = 0;
+	int err;
+
+	if (cet->shstk_size) {
+		if (!rstor_addr)
+			return -EINVAL;
+
+		ssp = cet_get_shstk_addr();
+		err = create_rstor_token(ia32, ssp, &new_ssp);
+		if (err)
+			return err;
+
+		if (ia32) {
+			ssp = new_ssp - sizeof(u32);
+			err = write_user_shstk_32(ssp, (unsigned int)rstor_addr);
+		} else {
+			ssp = new_ssp - sizeof(u64);
+			err = write_user_shstk_64(ssp, rstor_addr);
+		}
+
+		if (err)
+			return err;
+
+		sc_ext->ssp = new_ssp;
+	}
+
+	if (ssp) {
+		start_update_msrs();
+		wrmsrl(MSR_IA32_PL3_SSP, ssp);
+		end_update_msrs();
+	}
+
+	return 0;
+}
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index a4ec65317a7f..d02ea8c11128 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -52,6 +52,74 @@ static inline int check_for_xstate(struct fxregs_state __user *buf,
 	return 0;
 }
 
+#ifdef CONFIG_X86_INTEL_CET
+int save_cet_to_sigframe(int ia32, void __user *fp, unsigned long restorer)
+{
+	int err = 0;
+
+	if (!current->thread.cet.shstk_size)
+		return 0;
+
+	if (fp) {
+		struct sc_ext ext = {0, 0, 0};
+
+		err = cet_setup_signal(ia32, restorer, &ext);
+		if (!err) {
+			void __user *p = fp;
+
+			ext.total_size = sizeof(ext);
+
+			if (ia32)
+				p += sizeof(struct fregs_state);
+
+			p += fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+			p = (void __user *)ALIGN((unsigned long)p, 8);
+
+			if (copy_to_user(p, &ext, sizeof(ext)))
+				return -EFAULT;
+		}
+	}
+
+	return err;
+}
+
+static int get_cet_from_sigframe(int ia32, void __user *fp, struct sc_ext *ext)
+{
+	int err = 0;
+
+	memset(ext, 0, sizeof(*ext));
+
+	if (!current->thread.cet.shstk_size)
+		return 0;
+
+	if (fp) {
+		void __user *p = fp;
+
+		if (ia32)
+			p += sizeof(struct fregs_state);
+
+		p += fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+		p = (void __user *)ALIGN((unsigned long)p, 8);
+
+		if (copy_from_user(ext, p, sizeof(*ext)))
+			return -EFAULT;
+
+		if (ext->total_size != sizeof(*ext))
+			return -EFAULT;
+
+		if (current->thread.cet.shstk_size)
+			err = cet_verify_rstor_token(ia32, ext->ssp, &ext->ssp);
+	}
+
+	return err;
+}
+#else
+static int get_cet_from_sigframe(int ia32, void __user *fp, struct sc_ext *ext)
+{
+	return 0;
+}
+#endif
+
 /*
  * Signal frame handlers.
  */
@@ -295,6 +363,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 	struct task_struct *tsk = current;
 	struct fpu *fpu = &tsk->thread.fpu;
 	struct user_i387_ia32_struct env;
+	struct sc_ext sc_ext;
 	u64 user_xfeatures = 0;
 	int fx_only = 0;
 	int ret = 0;
@@ -335,6 +404,10 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 	if ((unsigned long)buf_fx % 64)
 		fx_only = 1;
 
+	ret = get_cet_from_sigframe(ia32_fxstate, buf, &sc_ext);
+	if (ret)
+		return ret;
+
 	if (!ia32_fxstate) {
 		/*
 		 * Attempt to restore the FPU registers directly from user
@@ -349,6 +422,8 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		pagefault_enable();
 		if (!ret) {
 
+			cet_restore_signal(&sc_ext);
+
 			/*
 			 * Restore supervisor states: previous context switch
 			 * etc has done XSAVES and saved the supervisor states
@@ -423,6 +498,8 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		if (unlikely(init_bv))
 			copy_kernel_to_xregs(&init_fpstate.xsave, init_bv);
 
+		cet_restore_signal(&sc_ext);
+
 		/*
 		 * Restore previously saved supervisor xstates along with
 		 * copied-in user xstates.
@@ -491,12 +568,35 @@ int fpu__restore_sig(void __user *buf, int ia32_frame)
 	return __fpu__restore_sig(buf, buf_fx, size);
 }
 
+#ifdef CONFIG_X86_INTEL_CET
+static unsigned long fpu__alloc_sigcontext_ext(unsigned long sp)
+{
+	struct cet_status *cet = &current->thread.cet;
+
+	/*
+	 * sigcontext_ext is at: fpu + fpu_user_xstate_size +
+	 * FP_XSTATE_MAGIC2_SIZE, then aligned to 8.
+	 */
+	if (cet->shstk_size)
+		sp -= (sizeof(struct sc_ext) + 8);
+
+	return sp;
+}
+#else
+static unsigned long fpu__alloc_sigcontext_ext(unsigned long sp)
+{
+	return sp;
+}
+#endif
+
 unsigned long
 fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
 		     unsigned long *buf_fx, unsigned long *size)
 {
 	unsigned long frame_size = xstate_sigframe_size();
 
+	sp = fpu__alloc_sigcontext_ext(sp);
+
 	*buf_fx = sp = round_down(sp - frame_size, 64);
 	if (ia32_frame && use_fxsr()) {
 		frame_size += sizeof(struct fregs_state);
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index d5fa494c2304..c5f465f5aeb4 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -46,6 +46,7 @@
 #include <asm/syscall.h>
 #include <asm/sigframe.h>
 #include <asm/signal.h>
+#include <asm/cet.h>
 
 #ifdef CONFIG_X86_64
 /*
@@ -239,6 +240,9 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
 	unsigned long buf_fx = 0;
 	int onsigstack = on_sig_stack(sp);
 	int ret;
+#ifdef CONFIG_X86_64
+	void __user *restorer = NULL;
+#endif
 
 	/* redzone */
 	if (IS_ENABLED(CONFIG_X86_64))
@@ -270,6 +274,12 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
 	if (onsigstack && !likely(on_sig_stack(sp)))
 		return (void __user *)-1L;
 
+#ifdef CONFIG_X86_64
+	if (ka->sa.sa_flags & SA_RESTORER)
+		restorer = ka->sa.sa_restorer;
+	ret = save_cet_to_sigframe(0, *fpstate, (unsigned long)restorer);
+#endif
+
 	/* save i387 and extended state */
 	ret = copy_fpstate_to_sigframe(*fpstate, (void __user *)buf_fx, math_size);
 	if (ret < 0)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 22/25] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND properties
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (20 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 21/25] x86/cet/shstk: Handle signals for shadow stack Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 23/25] ELF: Introduce arch_setup_elf_property() Yu-cheng Yu
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

An ELF file's .note.gnu.property indicates architecture features of the
file.. Introduce feature definitions for Shadow Stack and Indirect Branch
Tracking.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 include/uapi/linux/elf.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 22220945a5fd..ca5875f384f6 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -454,4 +454,13 @@ typedef struct elf64_note {
 /* Bits for GNU_PROPERTY_AARCH64_FEATURE_1_BTI */
 #define GNU_PROPERTY_AARCH64_FEATURE_1_BTI	(1U << 0)
 
+/* .note.gnu.property types for x86: */
+#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002
+
+/* Bits for GNU_PROPERTY_X86_FEATURE_1_AND */
+#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001
+#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002
+#define GNU_PROPERTY_X86_FEATURE_1_INVAL ~(GNU_PROPERTY_X86_FEATURE_1_IBT | \
+					    GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+
 #endif /* _UAPI_LINUX_ELF_H */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 23/25] ELF: Introduce arch_setup_elf_property()
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (21 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 22/25] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND properties Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 24/25] x86/cet/shstk: Handle thread shadow stack Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for " Yu-cheng Yu
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu, Mark Brown, Catalin Marinas

An ELF file's .note.gnu.property indicates arch features supported by the
file.  These features are extracted by arch_parse_elf_property() and stored
in 'arch_elf_state'.  Introduce arch_setup_elf_property() for enabling such
features.  The first use-case of this function is shadow stack.

ARM64 is the other arch that has ARCH_USER_GNU_PROPERTY and arch_parse_elf_
property().  Add arch_setup_elf_property() for it.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Martin <Dave.Martin@arm.com>
---
v11:
- Combine three patches of arch_setup_elf_property() into one.
- Add empty arch_setup_elf_property() for arm64.

v9:
- Change cpu_feature_enabled() to static_cpu_has().

 arch/arm64/include/asm/elf.h |  5 +++++
 arch/x86/Kconfig             |  2 ++
 arch/x86/include/asm/elf.h   | 13 +++++++++++++
 arch/x86/kernel/process_64.c | 32 ++++++++++++++++++++++++++++++++
 fs/binfmt_elf.c              |  4 ++++
 include/linux/elf.h          |  6 ++++++
 6 files changed, 62 insertions(+)

diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index 8d1c8dcb87fd..d37bc7915935 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -281,6 +281,11 @@ static inline int arch_parse_elf_property(u32 type, const void *data,
 	return 0;
 }
 
+static inline int arch_setup_elf_property(struct arch_elf_state *arch)
+{
+	return 0;
+}
+
 static inline int arch_elf_pt_proc(void *ehdr, void *phdr,
 				   struct file *f, bool is_interp,
 				   struct arch_elf_state *state)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e93be385cd04..6b6dad011763 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1950,6 +1950,8 @@ config X86_INTEL_SHADOW_STACK_USER
 	select X86_INTEL_CET
 	select ARCH_MAYBE_MKWRITE
 	select ARCH_HAS_SHADOW_STACK
+	select ARCH_USE_GNU_PROPERTY
+	select ARCH_BINFMT_ELF_STATE
 	help
 	  Shadow Stacks provides protection against program stack
 	  corruption.  It's a hardware feature.  This only matters
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index b9a5d488f1a5..0e1be2a13359 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -385,6 +385,19 @@ extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
 					      int uses_interp);
 #define compat_arch_setup_additional_pages compat_arch_setup_additional_pages
 
+#ifdef CONFIG_ARCH_BINFMT_ELF_STATE
+struct arch_elf_state {
+	unsigned int gnu_property;
+};
+
+#define INIT_ARCH_ELF_STATE {	\
+	.gnu_property = 0,	\
+}
+
+#define arch_elf_pt_proc(ehdr, phdr, elf, interp, state) (0)
+#define arch_check_elf(ehdr, interp, interp_ehdr, state) (0)
+#endif
+
 /* Do not change the values. See get_align_mask() */
 enum align_flags {
 	ALIGN_VA_32	= BIT(0),
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9afefe325acb..fd4644865a3b 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -837,3 +837,35 @@ unsigned long KSTK_ESP(struct task_struct *task)
 {
 	return task_pt_regs(task)->sp;
 }
+
+#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
+int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
+			     bool compat, struct arch_elf_state *state)
+{
+	if (type != GNU_PROPERTY_X86_FEATURE_1_AND)
+		return 0;
+
+	if (datasz != sizeof(unsigned int))
+		return -ENOEXEC;
+
+	state->gnu_property = *(unsigned int *)data;
+	return 0;
+}
+
+int arch_setup_elf_property(struct arch_elf_state *state)
+{
+	int r = 0;
+
+	if (!IS_ENABLED(CONFIG_X86_INTEL_CET))
+		return r;
+
+	memset(&current->thread.cet, 0, sizeof(struct cet_status));
+
+	if (static_cpu_has(X86_FEATURE_SHSTK)) {
+		if (state->gnu_property & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+			r = cet_setup_shstk();
+	}
+
+	return r;
+}
+#endif
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 13d053982dd7..2b4cfc256895 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1217,6 +1217,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 
 	set_binfmt(&elf_format);
 
+	retval = arch_setup_elf_property(&arch_state);
+	if (retval < 0)
+		goto out;
+
 #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
 	retval = arch_setup_additional_pages(bprm, !!interpreter);
 	if (retval < 0)
diff --git a/include/linux/elf.h b/include/linux/elf.h
index 5d5b0321da0b..4827695ca415 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -82,9 +82,15 @@ static inline int arch_parse_elf_property(u32 type, const void *data,
 {
 	return 0;
 }
+
+static inline int arch_setup_elf_property(struct arch_elf_state *arch)
+{
+	return 0;
+}
 #else
 extern int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
 				   bool compat, struct arch_elf_state *arch);
+extern int arch_setup_elf_property(struct arch_elf_state *arch);
 #endif
 
 #ifdef CONFIG_ARCH_HAVE_ELF_PROT
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 24/25] x86/cet/shstk: Handle thread shadow stack
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (22 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 23/25] ELF: Introduce arch_setup_elf_property() Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:25 ` [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for " Yu-cheng Yu
  24 siblings, 0 replies; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

The kernel allocates (and frees on thread exit) a new shadow stack for a
pthread child.

    It is possible for the kernel to complete the clone syscall and set the
    child's shadow stack pointer to NULL and let the child thread allocate
    a shadow stack for itself.  There are two issues in this approach: It
    is not compatible with existing code that does inline syscall and it
    cannot handle signals before the child can successfully allocate a
    shadow stack.

A 64-bit shadow stack has a size of min(RLIMIT_STACK, 4 GB).  A compat-mode
thread shadow stack has a size of 1/4 min(RLIMIT_STACK, 4 GB).  This allows
more threads to run in a 32-bit address space.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v10:
- Limit shadow stack size to 4 GB.

 arch/x86/include/asm/cet.h         |  2 ++
 arch/x86/include/asm/mmu_context.h |  3 +++
 arch/x86/kernel/cet.c              | 41 ++++++++++++++++++++++++++++++
 arch/x86/kernel/process.c          |  7 +++++
 4 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 56fe08eebae6..71dc92acd2f2 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -18,11 +18,13 @@ struct cet_status {
 
 #ifdef CONFIG_X86_INTEL_CET
 int cet_setup_shstk(void);
+int cet_setup_thread_shstk(struct task_struct *p);
 void cet_disable_free_shstk(struct task_struct *p);
 int cet_verify_rstor_token(bool ia32, unsigned long ssp, unsigned long *new_ssp);
 void cet_restore_signal(struct sc_ext *sc);
 int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
 #else
+static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
 static inline void cet_disable_free_shstk(struct task_struct *p) {}
 static inline void cet_restore_signal(struct sc_ext *sc) { return; }
 static inline int cet_setup_signal(bool ia32, unsigned long rstor,
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index d98016b83755..e3ca4397258b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -11,6 +11,7 @@
 
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
+#include <asm/cet.h>
 #include <asm/debugreg.h>
 
 extern atomic64_t last_mm_ctx_id;
@@ -142,6 +143,8 @@ do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		cet_disable_free_shstk(tsk);	\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index a08a956ac3aa..b30c61a66c8e 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -172,6 +172,47 @@ int cet_setup_shstk(void)
 	return 0;
 }
 
+int cet_setup_thread_shstk(struct task_struct *tsk)
+{
+	unsigned long addr, size;
+	struct cet_user_state *state;
+	struct cet_status *cet = &tsk->thread.cet;
+
+	if (!cet->shstk_size)
+		return 0;
+
+	state = get_xsave_addr(&tsk->thread.fpu.state.xsave,
+			       XFEATURE_CET_USER);
+
+	if (!state)
+		return -EINVAL;
+
+	/* Cap shadow stack size to 4 GB */
+	size = min(rlimit(RLIMIT_STACK), 1UL << 32);
+
+	/*
+	 * Compat-mode pthreads share a limited address space.
+	 * If each function call takes an average of four slots
+	 * stack space, we need 1/4 of stack size for shadow stack.
+	 */
+	if (in_compat_syscall())
+		size /= 4;
+	size = round_up(size, PAGE_SIZE);
+	addr = alloc_shstk(size, 0);
+
+	if (IS_ERR_VALUE(addr)) {
+		cet->shstk_base = 0;
+		cet->shstk_size = 0;
+		return PTR_ERR((void *)addr);
+	}
+
+	fpu__prepare_write(&tsk->thread.fpu);
+	state->user_ssp = (u64)(addr + size);
+	cet->shstk_base = addr;
+	cet->shstk_size = size;
+	return 0;
+}
+
 void cet_disable_free_shstk(struct task_struct *tsk)
 {
 	struct cet_status *cet = &tsk->thread.cet;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e41f1a468ee3..b6fe5b061841 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -109,6 +109,7 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	cet_disable_free_shstk(tsk);
 	fpu__drop(fpu);
 }
 
@@ -181,6 +182,12 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	if (clone_flags & CLONE_SETTLS)
 		ret = set_new_tls(p, tls);
 
+#ifdef CONFIG_X86_64
+	/* Allocate a new shadow stack for pthread */
+	if (!ret && (clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM)
+		ret = cet_setup_thread_shstk(p);
+#endif
+
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
                   ` (23 preceding siblings ...)
  2020-08-25  0:25 ` [PATCH v11 24/25] x86/cet/shstk: Handle thread shadow stack Yu-cheng Yu
@ 2020-08-25  0:25 ` Yu-cheng Yu
  2020-08-25  0:36   ` Andy Lutomirski
  24 siblings, 1 reply; 89+ messages in thread
From: Yu-cheng Yu @ 2020-08-25  0:25 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang
  Cc: Yu-cheng Yu

arch_prctl(ARCH_X86_CET_STATUS, u64 *args)
    Get CET feature status.

    The parameter 'args' is a pointer to a user buffer.  The kernel returns
    the following information:

    *args = shadow stack/IBT status
    *(args + 1) = shadow stack base address
    *(args + 2) = shadow stack size

arch_prctl(ARCH_X86_CET_DISABLE, u64 features)
    Disable CET features specified in 'features'.  Return -EPERM if CET is
    locked.

arch_prctl(ARCH_X86_CET_LOCK)
    Lock in CET features.

arch_prctl(ARCH_X86_CET_MMAP_SHSTK, u64 *args)
    Allocate a new shadow stack.

    The parameter 'args' is a pointer to a user buffer.

    *args = desired size
    *(args + 1) = MAP_32BIT or MAP_POPULATE

    On returning, *args is the allocated shadow stack address.

Also change do_arch_prctl_common()'s parameter 'cpuid_enabled' to
'arg2', as it is now also passed to prctl_cet().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
v11:
- Check input for invalid features.
- Fix prctl_cet() return values.
- Change ARCH_X86_CET_ALLOC_SHSTK to ARCH_X86_CET_MMAP_SHSTK to take
  MAP_32BIT, MAP_POPULATE as inputs.

v10:
- Verify CET is enabled before handling arch_prctl.
- Change input parameters from unsigned long to u64, to make it clear they
  are 64-bit.

 arch/x86/include/asm/cet.h              |  4 +
 arch/x86/include/uapi/asm/prctl.h       |  5 ++
 arch/x86/kernel/Makefile                |  2 +-
 arch/x86/kernel/cet.c                   | 26 +++++++
 arch/x86/kernel/cet_prctl.c             | 98 +++++++++++++++++++++++++
 arch/x86/kernel/process.c               |  6 +-
 tools/arch/x86/include/uapi/asm/prctl.h |  5 ++
 7 files changed, 142 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/kernel/cet_prctl.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 71dc92acd2f2..f7eb197998ad 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -14,16 +14,20 @@ struct sc_ext;
 struct cet_status {
 	unsigned long	shstk_base;
 	unsigned long	shstk_size;
+	unsigned int	locked:1;
 };
 
 #ifdef CONFIG_X86_INTEL_CET
+int prctl_cet(int option, u64 arg2);
 int cet_setup_shstk(void);
 int cet_setup_thread_shstk(struct task_struct *p);
+unsigned long cet_alloc_shstk(unsigned long size, int flags);
 void cet_disable_free_shstk(struct task_struct *p);
 int cet_verify_rstor_token(bool ia32, unsigned long ssp, unsigned long *new_ssp);
 void cet_restore_signal(struct sc_ext *sc);
 int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
 #else
+static inline int prctl_cet(int option, u64 arg2) { return -EINVAL; }
 static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
 static inline void cet_disable_free_shstk(struct task_struct *p) {}
 static inline void cet_restore_signal(struct sc_ext *sc) { return; }
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..3aaac13cdc87 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -14,4 +14,9 @@
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
 
+#define ARCH_X86_CET_STATUS		0x3001
+#define ARCH_X86_CET_DISABLE		0x3002
+#define ARCH_X86_CET_LOCK		0x3003
+#define ARCH_X86_CET_MMAP_SHSTK		0x3004
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 76f27f518266..97556e4204d6 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -145,7 +145,7 @@ obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
 obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
 obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
-obj-$(CONFIG_X86_INTEL_CET)		+= cet.o
+obj-$(CONFIG_X86_INTEL_CET)		+= cet.o cet_prctl.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index b30c61a66c8e..2bf1a6b6abb6 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -148,6 +148,32 @@ static int create_rstor_token(bool ia32, unsigned long ssp,
 	return 0;
 }
 
+unsigned long cet_alloc_shstk(unsigned long len, int flags)
+{
+	unsigned long token;
+	unsigned long addr, ssp;
+
+	addr = alloc_shstk(round_up(len, PAGE_SIZE), flags);
+
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
+	/* Restore token is 8 bytes and aligned to 8 bytes */
+	ssp = addr + len;
+	token = ssp;
+
+	if (!in_ia32_syscall())
+		token |= TOKEN_MODE_64;
+	ssp -= 8;
+
+	if (write_user_shstk_64(ssp, token)) {
+		vm_munmap(addr, len);
+		return -EINVAL;
+	}
+
+	return addr;
+}
+
 int cet_setup_shstk(void)
 {
 	unsigned long addr, size;
diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
new file mode 100644
index 000000000000..cc49eef08ab0
--- /dev/null
+++ b/arch/x86/kernel/cet_prctl.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/errno.h>
+#include <linux/uaccess.h>
+#include <linux/prctl.h>
+#include <linux/compat.h>
+#include <linux/mman.h>
+#include <linux/elfcore.h>
+#include <asm/processor.h>
+#include <asm/prctl.h>
+#include <asm/cet.h>
+
+/* See Documentation/x86/intel_cet.rst. */
+
+static int copy_status_to_user(struct cet_status *cet, u64 arg2)
+{
+	u64 buf[3] = {0, 0, 0};
+
+	if (cet->shstk_size) {
+		buf[0] |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
+		buf[1] = (u64)cet->shstk_base;
+		buf[2] = (u64)cet->shstk_size;
+	}
+
+	return copy_to_user((u64 __user *)arg2, buf, sizeof(buf));
+}
+
+static int handle_mmap_shstk(u64 arg2)
+{
+	u64 buf[3];
+	unsigned long addr, size;
+	int allowed_flags;
+
+	if (copy_from_user(buf, (unsigned long __user *)arg2, sizeof(buf)))
+		return -EFAULT;
+
+	size = buf[0];
+
+	/*
+	 * Check invalid flags
+	 */
+	allowed_flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_32BIT | MAP_POPULATE;
+
+	if (buf[1] & ~allowed_flags)
+		return -EINVAL;
+
+	addr = cet_alloc_shstk(size, buf[1]);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	if (put_user(addr, (u64 __user *)arg2)) {
+		vm_munmap(addr, size);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+int prctl_cet(int option, u64 arg2)
+{
+	struct cet_status *cet;
+
+	/*
+	 * GLIBC's ENOTSUPP == EOPNOTSUPP == 95, and it does not recognize
+	 * the kernel's ENOTSUPP (524).  So return EOPNOTSUPP here.
+	 */
+	if (!IS_ENABLED(CONFIG_X86_INTEL_CET))
+		return -EOPNOTSUPP;
+
+	cet = &current->thread.cet;
+
+	if (option == ARCH_X86_CET_STATUS)
+		return copy_status_to_user(cet, arg2);
+
+	if (!static_cpu_has(X86_FEATURE_SHSTK))
+		return -EOPNOTSUPP;
+
+	switch (option) {
+	case ARCH_X86_CET_DISABLE:
+		if (cet->locked)
+			return -EPERM;
+		if (arg2 & GNU_PROPERTY_X86_FEATURE_1_INVAL)
+			return -EINVAL;
+		if (arg2 & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+			cet_disable_free_shstk(current);
+		return 0;
+
+	case ARCH_X86_CET_LOCK:
+		cet->locked = 1;
+		return 0;
+
+	case ARCH_X86_CET_MMAP_SHSTK:
+		return handle_mmap_shstk(arg2);
+
+	default:
+		return -ENOSYS;
+	}
+}
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6fe5b061841..5a657a9774dc 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -980,14 +980,14 @@ unsigned long get_wchan(struct task_struct *p)
 }
 
 long do_arch_prctl_common(struct task_struct *task, int option,
-			  unsigned long cpuid_enabled)
+			  unsigned long arg2)
 {
 	switch (option) {
 	case ARCH_GET_CPUID:
 		return get_cpuid_mode();
 	case ARCH_SET_CPUID:
-		return set_cpuid_mode(task, cpuid_enabled);
+		return set_cpuid_mode(task, arg2);
 	}
 
-	return -EINVAL;
+	return prctl_cet(option, arg2);
 }
diff --git a/tools/arch/x86/include/uapi/asm/prctl.h b/tools/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..3aaac13cdc87 100644
--- a/tools/arch/x86/include/uapi/asm/prctl.h
+++ b/tools/arch/x86/include/uapi/asm/prctl.h
@@ -14,4 +14,9 @@
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
 
+#define ARCH_X86_CET_STATUS		0x3001
+#define ARCH_X86_CET_DISABLE		0x3002
+#define ARCH_X86_CET_LOCK		0x3003
+#define ARCH_X86_CET_MMAP_SHSTK		0x3004
+
 #endif /* _ASM_X86_PRCTL_H */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-25  0:25 ` [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for " Yu-cheng Yu
@ 2020-08-25  0:36   ` Andy Lutomirski
  2020-08-25 18:43     ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Andy Lutomirski @ 2020-08-25  0:36 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Dave Martin, Weijiang Yang

On Mon, Aug 24, 2020 at 5:30 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:

> arch_prctl(ARCH_X86_CET_MMAP_SHSTK, u64 *args)
>     Allocate a new shadow stack.
>
>     The parameter 'args' is a pointer to a user buffer.
>
>     *args = desired size
>     *(args + 1) = MAP_32BIT or MAP_POPULATE
>
>     On returning, *args is the allocated shadow stack address.

This is hideous.  Would this be better as a new syscall?

--Andy

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-25  0:36   ` Andy Lutomirski
@ 2020-08-25 18:43     ` Yu, Yu-cheng
  2020-08-25 19:19       ` Dave Hansen
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-08-25 18:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang

On 8/24/2020 5:36 PM, Andy Lutomirski wrote:
> On Mon, Aug 24, 2020 at 5:30 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> 
>> arch_prctl(ARCH_X86_CET_MMAP_SHSTK, u64 *args)
>>      Allocate a new shadow stack.
>>
>>      The parameter 'args' is a pointer to a user buffer.
>>
>>      *args = desired size
>>      *(args + 1) = MAP_32BIT or MAP_POPULATE
>>
>>      On returning, *args is the allocated shadow stack address.
> 
> This is hideous.  Would this be better as a new syscall?

Could you point out why this is hideous, so that I can modify the 
arch_prctl?

I think this is more arch-specific.  Even if it becomes a new syscall, 
we still need to pass the same parameters.

Yu-cheng

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-25 18:43     ` Yu, Yu-cheng
@ 2020-08-25 19:19       ` Dave Hansen
  2020-08-25 21:04         ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2020-08-25 19:19 UTC (permalink / raw)
  To: Yu, Yu-cheng, Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang

On 8/25/20 11:43 AM, Yu, Yu-cheng wrote:
>>> arch_prctl(ARCH_X86_CET_MMAP_SHSTK, u64 *args)
>>>      Allocate a new shadow stack.
>>>
>>>      The parameter 'args' is a pointer to a user buffer.
>>>
>>>      *args = desired size
>>>      *(args + 1) = MAP_32BIT or MAP_POPULATE
>>>
>>>      On returning, *args is the allocated shadow stack address.
>>
>> This is hideous.  Would this be better as a new syscall?
> 
> Could you point out why this is hideous, so that I can modify the
> arch_prctl?

Passing values in memory is hideous when we don't have to.  A syscall
would let you have separate arguments for size and flags and would also
let you have a nice return value instead of needing to do that in memory
too.

> I think this is more arch-specific.  Even if it becomes a new syscall,
> we still need to pass the same parameters.

Right, but without the copying in and out of memory.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-25 19:19       ` Dave Hansen
@ 2020-08-25 21:04         ` Yu, Yu-cheng
  2020-08-25 23:20           ` Dave Hansen
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-08-25 21:04 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang

On 8/25/2020 12:19 PM, Dave Hansen wrote:
> On 8/25/20 11:43 AM, Yu, Yu-cheng wrote:
>>>> arch_prctl(ARCH_X86_CET_MMAP_SHSTK, u64 *args)
>>>>       Allocate a new shadow stack.
>>>>
>>>>       The parameter 'args' is a pointer to a user buffer.
>>>>
>>>>       *args = desired size
>>>>       *(args + 1) = MAP_32BIT or MAP_POPULATE
>>>>
>>>>       On returning, *args is the allocated shadow stack address.
>>>
>>> This is hideous.  Would this be better as a new syscall?
>>
>> Could you point out why this is hideous, so that I can modify the
>> arch_prctl?
> 
> Passing values in memory is hideous when we don't have to.  A syscall
> would let you have separate arguments for size and flags and would also
> let you have a nice return value instead of needing to do that in memory
> too.

That is a good justification.

> 
>> I think this is more arch-specific.  Even if it becomes a new syscall,
>> we still need to pass the same parameters.
> 
> Right, but without the copying in and out of memory.
> 
Linux-api is already on the Cc list.  Do we need to add more people to 
get some agreements for the syscall?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-25 21:04         ` Yu, Yu-cheng
@ 2020-08-25 23:20           ` Dave Hansen
  2020-08-25 23:34             ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2020-08-25 23:20 UTC (permalink / raw)
  To: Yu, Yu-cheng, Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang

On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
>>> I think this is more arch-specific.  Even if it becomes a new syscall,
>>> we still need to pass the same parameters.
>>
>> Right, but without the copying in and out of memory.
>>
> Linux-api is already on the Cc list.  Do we need to add more people to
> get some agreements for the syscall?
What kind of agreement are you looking for?  I'd suggest just coding it
up and posting the patches.  Adding syscalls really is really pretty
straightforward and isn't much code at all.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-25 23:20           ` Dave Hansen
@ 2020-08-25 23:34             ` Yu, Yu-cheng
  2020-08-26 16:46               ` Dave Martin
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-08-25 23:34 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang

On 8/25/2020 4:20 PM, Dave Hansen wrote:
> On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
>>>> I think this is more arch-specific.  Even if it becomes a new syscall,
>>>> we still need to pass the same parameters.
>>>
>>> Right, but without the copying in and out of memory.
>>>
>> Linux-api is already on the Cc list.  Do we need to add more people to
>> get some agreements for the syscall?
> What kind of agreement are you looking for?  I'd suggest just coding it
> up and posting the patches.  Adding syscalls really is really pretty
> straightforward and isn't much code at all.
> 

Sure, I will do that.

Thanks,
Yu-cheng

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-25 23:34             ` Yu, Yu-cheng
@ 2020-08-26 16:46               ` Dave Martin
  2020-08-26 16:51                 ` Florian Weimer
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Martin @ 2020-08-26 16:46 UTC (permalink / raw)
  To: Yu, Yu-cheng
  Cc: Dave Hansen, Andy Lutomirski, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On Tue, Aug 25, 2020 at 04:34:27PM -0700, Yu, Yu-cheng wrote:
> On 8/25/2020 4:20 PM, Dave Hansen wrote:
> >On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
> >>>>I think this is more arch-specific.  Even if it becomes a new syscall,
> >>>>we still need to pass the same parameters.
> >>>
> >>>Right, but without the copying in and out of memory.
> >>>
> >>Linux-api is already on the Cc list.  Do we need to add more people to
> >>get some agreements for the syscall?
> >What kind of agreement are you looking for?  I'd suggest just coding it
> >up and posting the patches.  Adding syscalls really is really pretty
> >straightforward and isn't much code at all.
> >
> 
> Sure, I will do that.

Alternatively, would a regular prctl() work here?

arch_prctl() feels like a historical weirdness for x86 -- other arches
all seem to be using regular prctl(), which allows for 4 args.  I don't
know the history behind the difference here.

(Since prctl() and arch_prctl() use non-clashing command numbers, I had
wondered whether it would be worth just merging the x86 calls in with
the rest and making the two calls aliases.  That's one for later,
though...)

Cheers
---Dave

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-26 16:46               ` Dave Martin
@ 2020-08-26 16:51                 ` Florian Weimer
  2020-08-26 17:04                   ` Andy Lutomirski
  2020-08-26 17:08                   ` Dave Martin
  0 siblings, 2 replies; 89+ messages in thread
From: Florian Weimer @ 2020-08-26 16:51 UTC (permalink / raw)
  To: Dave Martin
  Cc: Yu, Yu-cheng, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

* Dave Martin:

> On Tue, Aug 25, 2020 at 04:34:27PM -0700, Yu, Yu-cheng wrote:
>> On 8/25/2020 4:20 PM, Dave Hansen wrote:
>> >On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
>> >>>>I think this is more arch-specific.  Even if it becomes a new syscall,
>> >>>>we still need to pass the same parameters.
>> >>>
>> >>>Right, but without the copying in and out of memory.
>> >>>
>> >>Linux-api is already on the Cc list.  Do we need to add more people to
>> >>get some agreements for the syscall?
>> >What kind of agreement are you looking for?  I'd suggest just coding it
>> >up and posting the patches.  Adding syscalls really is really pretty
>> >straightforward and isn't much code at all.
>> >
>> 
>> Sure, I will do that.
>
> Alternatively, would a regular prctl() work here?

Is this something appliation code has to call, or just the dynamic
loader?

prctl in glibc is a variadic function, so if there's a mismatch between
the kernel/userspace syscall convention and the userspace calling
convention (for variadic functions) for specific types, it can't be made
to work in a generic way.

The loader can use inline assembly for system calls and does not have
this issue, but applications would be implcated by it.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-26 16:51                 ` Florian Weimer
@ 2020-08-26 17:04                   ` Andy Lutomirski
  2020-08-26 18:49                     ` Yu, Yu-cheng
  2020-08-26 17:08                   ` Dave Martin
  1 sibling, 1 reply; 89+ messages in thread
From: Andy Lutomirski @ 2020-08-26 17:04 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Dave Martin, Yu, Yu-cheng, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On Wed, Aug 26, 2020 at 9:52 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Dave Martin:
>
> > On Tue, Aug 25, 2020 at 04:34:27PM -0700, Yu, Yu-cheng wrote:
> >> On 8/25/2020 4:20 PM, Dave Hansen wrote:
> >> >On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
> >> >>>>I think this is more arch-specific.  Even if it becomes a new syscall,
> >> >>>>we still need to pass the same parameters.
> >> >>>
> >> >>>Right, but without the copying in and out of memory.
> >> >>>
> >> >>Linux-api is already on the Cc list.  Do we need to add more people to
> >> >>get some agreements for the syscall?
> >> >What kind of agreement are you looking for?  I'd suggest just coding it
> >> >up and posting the patches.  Adding syscalls really is really pretty
> >> >straightforward and isn't much code at all.
> >> >
> >>
> >> Sure, I will do that.
> >
> > Alternatively, would a regular prctl() work here?
>
> Is this something appliation code has to call, or just the dynamic
> loader?
>
> prctl in glibc is a variadic function, so if there's a mismatch between
> the kernel/userspace syscall convention and the userspace calling
> convention (for variadic functions) for specific types, it can't be made
> to work in a generic way.
>
> The loader can use inline assembly for system calls and does not have
> this issue, but applications would be implcated by it.
>

I would expect things like Go and various JITs to call it directly.

If we wanted to be fancy and add a potentially more widely useful
syscall, how about:

mmap_special(void *addr, size_t length, int prot, int flags, int type);

Where type is something like MMAP_SPECIAL_X86_SHSTK.  Fundamentally,
this is really just mmap() except that we want to map something a bit
magical, and we don't want to require opening a device node to do it.

--Andy

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-26 16:51                 ` Florian Weimer
  2020-08-26 17:04                   ` Andy Lutomirski
@ 2020-08-26 17:08                   ` Dave Martin
  2020-08-27 13:18                     ` Florian Weimer
  1 sibling, 1 reply; 89+ messages in thread
From: Dave Martin @ 2020-08-26 17:08 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Yu, Yu-cheng, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On Wed, Aug 26, 2020 at 06:51:48PM +0200, Florian Weimer wrote:
> * Dave Martin:
> 
> > On Tue, Aug 25, 2020 at 04:34:27PM -0700, Yu, Yu-cheng wrote:
> >> On 8/25/2020 4:20 PM, Dave Hansen wrote:
> >> >On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
> >> >>>>I think this is more arch-specific.  Even if it becomes a new syscall,
> >> >>>>we still need to pass the same parameters.
> >> >>>
> >> >>>Right, but without the copying in and out of memory.
> >> >>>
> >> >>Linux-api is already on the Cc list.  Do we need to add more people to
> >> >>get some agreements for the syscall?
> >> >What kind of agreement are you looking for?  I'd suggest just coding it
> >> >up and posting the patches.  Adding syscalls really is really pretty
> >> >straightforward and isn't much code at all.
> >> >
> >> 
> >> Sure, I will do that.
> >
> > Alternatively, would a regular prctl() work here?
> 
> Is this something appliation code has to call, or just the dynamic
> loader?
> 
> prctl in glibc is a variadic function, so if there's a mismatch between
> the kernel/userspace syscall convention and the userspace calling
> convention (for variadic functions) for specific types, it can't be made
> to work in a generic way.
>
> The loader can use inline assembly for system calls and does not have
> this issue, but applications would be implcated by it.

To the extent that this is a problem, libc's prctl() wrapper has to
handle it already.  New prctl() calls tend to demand precisely 4
arguments and require unused arguments to be 0, but this is more down to
policy rather than because anything breaks otherwise.

You're right that this has implications: for i386, libc probably pulls
more arguments off the stack than are really there in some situations.
This isn't a new problem though.  There are already generic prctls with
fewer than 4 args that are used on x86.

Merging the actual prctl() and arch_prctl() syscalls doesn't acutally
stop libc from retaining separate wrappers if they have different
argument marshaling requirements in some corner cases.


There might be some underlying reason by x86 has its own call and nobody
else followed the same model, but I don't know what it is.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-26 17:04                   ` Andy Lutomirski
@ 2020-08-26 18:49                     ` Yu, Yu-cheng
  2020-08-26 19:43                       ` H.J. Lu
  2020-08-26 19:57                       ` Dave Hansen
  0 siblings, 2 replies; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-08-26 18:49 UTC (permalink / raw)
  To: Andy Lutomirski, Florian Weimer
  Cc: Dave Martin, Dave Hansen, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H.J. Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 8/26/2020 10:04 AM, Andy Lutomirski wrote:
> On Wed, Aug 26, 2020 at 9:52 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Dave Martin:
>>
>>> On Tue, Aug 25, 2020 at 04:34:27PM -0700, Yu, Yu-cheng wrote:
>>>> On 8/25/2020 4:20 PM, Dave Hansen wrote:
>>>>> On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
>>>>>>>> I think this is more arch-specific.  Even if it becomes a new syscall,
>>>>>>>> we still need to pass the same parameters.
>>>>>>>
>>>>>>> Right, but without the copying in and out of memory.
>>>>>>>
>>>>>> Linux-api is already on the Cc list.  Do we need to add more people to
>>>>>> get some agreements for the syscall?
>>>>> What kind of agreement are you looking for?  I'd suggest just coding it
>>>>> up and posting the patches.  Adding syscalls really is really pretty
>>>>> straightforward and isn't much code at all.
>>>>>
>>>>
>>>> Sure, I will do that.
>>>
>>> Alternatively, would a regular prctl() work here?
>>
>> Is this something appliation code has to call, or just the dynamic
>> loader?
>>
>> prctl in glibc is a variadic function, so if there's a mismatch between
>> the kernel/userspace syscall convention and the userspace calling
>> convention (for variadic functions) for specific types, it can't be made
>> to work in a generic way.
>>
>> The loader can use inline assembly for system calls and does not have
>> this issue, but applications would be implcated by it.
>>
> 
> I would expect things like Go and various JITs to call it directly.
> 
> If we wanted to be fancy and add a potentially more widely useful
> syscall, how about:
> 
> mmap_special(void *addr, size_t length, int prot, int flags, int type);
> 
> Where type is something like MMAP_SPECIAL_X86_SHSTK.  Fundamentally,
> this is really just mmap() except that we want to map something a bit
> magical, and we don't want to require opening a device node to do it.
> 

One benefit of MMAP_SPECIAL_* is there are more free bits than MAP_*.
Does ARM have similar needs for memory mapping, Dave?

Yu-cheng

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-26 18:49                     ` Yu, Yu-cheng
@ 2020-08-26 19:43                       ` H.J. Lu
  2020-08-26 19:57                       ` Dave Hansen
  1 sibling, 0 replies; 89+ messages in thread
From: H.J. Lu @ 2020-08-26 19:43 UTC (permalink / raw)
  To: Yu, Yu-cheng
  Cc: Andy Lutomirski, Florian Weimer, Dave Martin, Dave Hansen,
	X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Wed, Aug 26, 2020 at 11:49 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>
> On 8/26/2020 10:04 AM, Andy Lutomirski wrote:
> > On Wed, Aug 26, 2020 at 9:52 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Dave Martin:
> >>
> >>> On Tue, Aug 25, 2020 at 04:34:27PM -0700, Yu, Yu-cheng wrote:
> >>>> On 8/25/2020 4:20 PM, Dave Hansen wrote:
> >>>>> On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
> >>>>>>>> I think this is more arch-specific.  Even if it becomes a new syscall,
> >>>>>>>> we still need to pass the same parameters.
> >>>>>>>
> >>>>>>> Right, but without the copying in and out of memory.
> >>>>>>>
> >>>>>> Linux-api is already on the Cc list.  Do we need to add more people to
> >>>>>> get some agreements for the syscall?
> >>>>> What kind of agreement are you looking for?  I'd suggest just coding it
> >>>>> up and posting the patches.  Adding syscalls really is really pretty
> >>>>> straightforward and isn't much code at all.
> >>>>>
> >>>>
> >>>> Sure, I will do that.
> >>>
> >>> Alternatively, would a regular prctl() work here?
> >>
> >> Is this something appliation code has to call, or just the dynamic
> >> loader?
> >>
> >> prctl in glibc is a variadic function, so if there's a mismatch between
> >> the kernel/userspace syscall convention and the userspace calling
> >> convention (for variadic functions) for specific types, it can't be made
> >> to work in a generic way.
> >>
> >> The loader can use inline assembly for system calls and does not have
> >> this issue, but applications would be implcated by it.
> >>
> >
> > I would expect things like Go and various JITs to call it directly.
> >
> > If we wanted to be fancy and add a potentially more widely useful
> > syscall, how about:
> >
> > mmap_special(void *addr, size_t length, int prot, int flags, int type);
> >
> > Where type is something like MMAP_SPECIAL_X86_SHSTK.  Fundamentally,
> > this is really just mmap() except that we want to map something a bit
> > magical, and we don't want to require opening a device node to do it.
> >
>
> One benefit of MMAP_SPECIAL_* is there are more free bits than MAP_*.
> Does ARM have similar needs for memory mapping, Dave?
>

arch/arm64/include/uapi/asm/mman.h:

#define PROT_BTI   0x10     /* BTI guarded page */

-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-26 18:49                     ` Yu, Yu-cheng
  2020-08-26 19:43                       ` H.J. Lu
@ 2020-08-26 19:57                       ` Dave Hansen
  2020-08-27 13:26                         ` H.J. Lu
  1 sibling, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2020-08-26 19:57 UTC (permalink / raw)
  To: Yu, Yu-cheng, Andy Lutomirski, Florian Weimer
  Cc: Dave Martin, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, open list:DOCUMENTATION, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On 8/26/20 11:49 AM, Yu, Yu-cheng wrote:
>> I would expect things like Go and various JITs to call it directly.
>>
>> If we wanted to be fancy and add a potentially more widely useful
>> syscall, how about:
>>
>> mmap_special(void *addr, size_t length, int prot, int flags, int type);
>>
>> Where type is something like MMAP_SPECIAL_X86_SHSTK.  Fundamentally,
>> this is really just mmap() except that we want to map something a bit
>> magical, and we don't want to require opening a device node to do it.
> 
> One benefit of MMAP_SPECIAL_* is there are more free bits than MAP_*.
> Does ARM have similar needs for memory mapping, Dave?

No idea.

But, mmap_special() is *basically* mmap2() with extra-big flags space.
I suspect it will grow some more uses on top of shadow stacks.  It could
have, for instance, been used to allocate MPX bounds tables.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-26 17:08                   ` Dave Martin
@ 2020-08-27 13:18                     ` Florian Weimer
  2020-08-27 13:28                       ` H.J. Lu
  0 siblings, 1 reply; 89+ messages in thread
From: Florian Weimer @ 2020-08-27 13:18 UTC (permalink / raw)
  To: Dave Martin
  Cc: Yu, Yu-cheng, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

* Dave Martin:

> You're right that this has implications: for i386, libc probably pulls
> more arguments off the stack than are really there in some situations.
> This isn't a new problem though.  There are already generic prctls with
> fewer than 4 args that are used on x86.

As originally posted, glibc prctl would have to know that it has to pull
an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
then the u64 argument is a problem for arch_prctl as well.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-26 19:57                       ` Dave Hansen
@ 2020-08-27 13:26                         ` H.J. Lu
  2020-09-01 10:28                           ` Dave Martin
  0 siblings, 1 reply; 89+ messages in thread
From: H.J. Lu @ 2020-08-27 13:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu, Yu-cheng, Andy Lutomirski, Florian Weimer, Dave Martin,
	X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Wed, Aug 26, 2020 at 12:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 8/26/20 11:49 AM, Yu, Yu-cheng wrote:
> >> I would expect things like Go and various JITs to call it directly.
> >>
> >> If we wanted to be fancy and add a potentially more widely useful
> >> syscall, how about:
> >>
> >> mmap_special(void *addr, size_t length, int prot, int flags, int type);
> >>
> >> Where type is something like MMAP_SPECIAL_X86_SHSTK.  Fundamentally,
> >> this is really just mmap() except that we want to map something a bit
> >> magical, and we don't want to require opening a device node to do it.
> >
> > One benefit of MMAP_SPECIAL_* is there are more free bits than MAP_*.
> > Does ARM have similar needs for memory mapping, Dave?
>
> No idea.
>
> But, mmap_special() is *basically* mmap2() with extra-big flags space.
> I suspect it will grow some more uses on top of shadow stacks.  It could
> have, for instance, been used to allocate MPX bounds tables.

There is no reason we can't use

long arch_prctl (int, unsigned long, unsigned long, unsigned long, ..);

for ARCH_X86_CET_MMAP_SHSTK.   We just need to use

syscall (SYS_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, ...);

-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 13:18                     ` Florian Weimer
@ 2020-08-27 13:28                       ` H.J. Lu
  2020-08-27 13:36                         ` Florian Weimer
  0 siblings, 1 reply; 89+ messages in thread
From: H.J. Lu @ 2020-08-27 13:28 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Dave Martin, Yu, Yu-cheng, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Dave Martin:
>
> > You're right that this has implications: for i386, libc probably pulls
> > more arguments off the stack than are really there in some situations.
> > This isn't a new problem though.  There are already generic prctls with
> > fewer than 4 args that are used on x86.
>
> As originally posted, glibc prctl would have to know that it has to pull
> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
> then the u64 argument is a problem for arch_prctl as well.
>

Argument of ARCH_X86_CET_DISABLE is int and passed in register.

-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 13:28                       ` H.J. Lu
@ 2020-08-27 13:36                         ` Florian Weimer
  2020-08-27 14:07                           ` H.J. Lu
  2020-08-27 18:13                           ` Yu, Yu-cheng
  0 siblings, 2 replies; 89+ messages in thread
From: Florian Weimer @ 2020-08-27 13:36 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Dave Martin, Yu, Yu-cheng, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

* H. J. Lu:

> On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Dave Martin:
>>
>> > You're right that this has implications: for i386, libc probably pulls
>> > more arguments off the stack than are really there in some situations.
>> > This isn't a new problem though.  There are already generic prctls with
>> > fewer than 4 args that are used on x86.
>>
>> As originally posted, glibc prctl would have to know that it has to pull
>> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
>> then the u64 argument is a problem for arch_prctl as well.
>>
>
> Argument of ARCH_X86_CET_DISABLE is int and passed in register.

The commit message and the C source say otherwise, I think (not sure
about the C source, not a kernel hacker).

Thanks,
Florian


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 13:36                         ` Florian Weimer
@ 2020-08-27 14:07                           ` H.J. Lu
  2020-08-27 14:08                             ` H.J. Lu
  2020-08-27 18:13                           ` Yu, Yu-cheng
  1 sibling, 1 reply; 89+ messages in thread
From: H.J. Lu @ 2020-08-27 14:07 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Dave Martin, Yu, Yu-cheng, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Thu, Aug 27, 2020 at 6:36 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Dave Martin:
> >>
> >> > You're right that this has implications: for i386, libc probably pulls
> >> > more arguments off the stack than are really there in some situations.
> >> > This isn't a new problem though.  There are already generic prctls with
> >> > fewer than 4 args that are used on x86.
> >>
> >> As originally posted, glibc prctl would have to know that it has to pull
> >> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
> >> then the u64 argument is a problem for arch_prctl as well.
> >>
> >
> > Argument of ARCH_X86_CET_DISABLE is int and passed in register.
>
> The commit message and the C source say otherwise, I think (not sure
> about the C source, not a kernel hacker).

It should read:

arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)

-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 14:07                           ` H.J. Lu
@ 2020-08-27 14:08                             ` H.J. Lu
  2020-09-01 17:49                               ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: H.J. Lu @ 2020-08-27 14:08 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Dave Martin, Yu, Yu-cheng, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Thu, Aug 27, 2020 at 7:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, Aug 27, 2020 at 6:36 AM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * H. J. Lu:
> >
> > > On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
> > >>
> > >> * Dave Martin:
> > >>
> > >> > You're right that this has implications: for i386, libc probably pulls
> > >> > more arguments off the stack than are really there in some situations.
> > >> > This isn't a new problem though.  There are already generic prctls with
> > >> > fewer than 4 args that are used on x86.
> > >>
> > >> As originally posted, glibc prctl would have to know that it has to pull
> > >> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
> > >> then the u64 argument is a problem for arch_prctl as well.
> > >>
> > >
> > > Argument of ARCH_X86_CET_DISABLE is int and passed in register.
> >
> > The commit message and the C source say otherwise, I think (not sure
> > about the C source, not a kernel hacker).
>
> It should read:
>
> arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
>

Or

arch_prctl(ARCH_X86_CET_DISABLE, unsigned int features)


-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 13:36                         ` Florian Weimer
  2020-08-27 14:07                           ` H.J. Lu
@ 2020-08-27 18:13                           ` Yu, Yu-cheng
  2020-08-27 18:56                             ` Andy Lutomirski
  1 sibling, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-08-27 18:13 UTC (permalink / raw)
  To: Florian Weimer, H.J. Lu
  Cc: Dave Martin, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 8/27/2020 6:36 AM, Florian Weimer wrote:
> * H. J. Lu:
> 
>> On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>
>>> * Dave Martin:
>>>
>>>> You're right that this has implications: for i386, libc probably pulls
>>>> more arguments off the stack than are really there in some situations.
>>>> This isn't a new problem though.  There are already generic prctls with
>>>> fewer than 4 args that are used on x86.
>>>
>>> As originally posted, glibc prctl would have to know that it has to pull
>>> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
>>> then the u64 argument is a problem for arch_prctl as well.
>>>
>>
>> Argument of ARCH_X86_CET_DISABLE is int and passed in register.
> 
> The commit message and the C source say otherwise, I think (not sure
> about the C source, not a kernel hacker).
> 

H.J. Lu suggested that we fix x86 arch_prctl() to take four arguments, 
and then keep MMAP_SHSTK as an arch_prctl().  Because now the map flags 
and size are all in registers, this also solves problems being pointed 
out earlier.  Without a wrapper, the shadow stack mmap call (from user 
space) will be:

syscall(_NR_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, size, MAP_32BIT).

I think this would be a nice alternative to another new syscall.

If this looks good to everyone, I can send out new patches as response 
to my current version, and then after all issues fixed, send v12.

Thanks,
Yu-cheng

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 18:13                           ` Yu, Yu-cheng
@ 2020-08-27 18:56                             ` Andy Lutomirski
  2020-08-27 19:33                               ` Yu, Yu-cheng
  2020-08-27 19:37                               ` H.J. Lu
  0 siblings, 2 replies; 89+ messages in thread
From: Andy Lutomirski @ 2020-08-27 18:56 UTC (permalink / raw)
  To: Yu, Yu-cheng
  Cc: Florian Weimer, H.J. Lu, Dave Martin, Dave Hansen,
	Andy Lutomirski, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, open list:DOCUMENTATION, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang



> On Aug 27, 2020, at 11:13 AM, Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
> 
> On 8/27/2020 6:36 AM, Florian Weimer wrote:
>> * H. J. Lu:
>>>> On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>>> 
>>>>> * Dave Martin:
>>>>> 
>>>>>> You're right that this has implications: for i386, libc probably pulls
>>>>>> more arguments off the stack than are really there in some situations.
>>>>>> This isn't a new problem though.  There are already generic prctls with
>>>>>> fewer than 4 args that are used on x86.
>>>>> 
>>>>> As originally posted, glibc prctl would have to know that it has to pull
>>>>> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
>>>>> then the u64 argument is a problem for arch_prctl as well.
>>>>> 
>>> 
>>> Argument of ARCH_X86_CET_DISABLE is int and passed in register.
>> The commit message and the C source say otherwise, I think (not sure
>> about the C source, not a kernel hacker).
> 
> H.J. Lu suggested that we fix x86 arch_prctl() to take four arguments, and then keep MMAP_SHSTK as an arch_prctl().  Because now the map flags and size are all in registers, this also solves problems being pointed out earlier.  Without a wrapper, the shadow stack mmap call (from user space) will be:
> 
> syscall(_NR_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, size, MAP_32BIT).

I admit I don’t see a show stopping technical reason we can’t add arguments to an existing syscall, but I’m pretty sure it’s unprecedented, and it doesn’t seem like a good idea.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 18:56                             ` Andy Lutomirski
@ 2020-08-27 19:33                               ` Yu, Yu-cheng
  2020-08-27 19:37                               ` H.J. Lu
  1 sibling, 0 replies; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-08-27 19:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Florian Weimer, H.J. Lu, Dave Martin, Dave Hansen,
	Andy Lutomirski, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, open list:DOCUMENTATION, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On 8/27/2020 11:56 AM, Andy Lutomirski wrote:
> 
> 
>> On Aug 27, 2020, at 11:13 AM, Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>>
>> On 8/27/2020 6:36 AM, Florian Weimer wrote:
>>> * H. J. Lu:
>>>>> On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>>>>
>>>>>> * Dave Martin:
>>>>>>
>>>>>>> You're right that this has implications: for i386, libc probably pulls
>>>>>>> more arguments off the stack than are really there in some situations.
>>>>>>> This isn't a new problem though.  There are already generic prctls with
>>>>>>> fewer than 4 args that are used on x86.
>>>>>>
>>>>>> As originally posted, glibc prctl would have to know that it has to pull
>>>>>> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
>>>>>> then the u64 argument is a problem for arch_prctl as well.
>>>>>>
>>>>
>>>> Argument of ARCH_X86_CET_DISABLE is int and passed in register.
>>> The commit message and the C source say otherwise, I think (not sure
>>> about the C source, not a kernel hacker).
>>
>> H.J. Lu suggested that we fix x86 arch_prctl() to take four arguments, and then keep MMAP_SHSTK as an arch_prctl().  Because now the map flags and size are all in registers, this also solves problems being pointed out earlier.  Without a wrapper, the shadow stack mmap call (from user space) will be:
>>
>> syscall(_NR_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, size, MAP_32BIT).
> 
> I admit I don’t see a show stopping technical reason we can’t add arguments to an existing syscall, but I’m pretty sure it’s unprecedented, and it doesn’t seem like a good idea.
> 

There are nine existing arch_prctl calls now.  If the concern is the 
extra new arguments getting misused, we can mask them out for the 
existing calls.  Otherwise, I have not seen anything that can break.

Yu-cheng

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 18:56                             ` Andy Lutomirski
  2020-08-27 19:33                               ` Yu, Yu-cheng
@ 2020-08-27 19:37                               ` H.J. Lu
  2020-08-28  1:35                                 ` Andy Lutomirski
  1 sibling, 1 reply; 89+ messages in thread
From: H.J. Lu @ 2020-08-27 19:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Yu, Yu-cheng, Florian Weimer, Dave Martin, Dave Hansen,
	Andy Lutomirski, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, open list:DOCUMENTATION, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On Thu, Aug 27, 2020 at 11:56 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Aug 27, 2020, at 11:13 AM, Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
> >
> > On 8/27/2020 6:36 AM, Florian Weimer wrote:
> >> * H. J. Lu:
> >>>> On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>>>>
> >>>>> * Dave Martin:
> >>>>>
> >>>>>> You're right that this has implications: for i386, libc probably pulls
> >>>>>> more arguments off the stack than are really there in some situations.
> >>>>>> This isn't a new problem though.  There are already generic prctls with
> >>>>>> fewer than 4 args that are used on x86.
> >>>>>
> >>>>> As originally posted, glibc prctl would have to know that it has to pull
> >>>>> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
> >>>>> then the u64 argument is a problem for arch_prctl as well.
> >>>>>
> >>>
> >>> Argument of ARCH_X86_CET_DISABLE is int and passed in register.
> >> The commit message and the C source say otherwise, I think (not sure
> >> about the C source, not a kernel hacker).
> >
> > H.J. Lu suggested that we fix x86 arch_prctl() to take four arguments, and then keep MMAP_SHSTK as an arch_prctl().  Because now the map flags and size are all in registers, this also solves problems being pointed out earlier.  Without a wrapper, the shadow stack mmap call (from user space) will be:
> >
> > syscall(_NR_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, size, MAP_32BIT).
>
> I admit I don’t see a show stopping technical reason we can’t add arguments to an existing syscall, but I’m pretty sure it’s unprecedented, and it doesn’t seem like a good idea.

prctl prototype is:

extern int prctl (int __option, ...)

and implemented in kernel as:

      int prctl(int option, unsigned long arg2, unsigned long arg3,
                 unsigned long arg4, unsigned long arg5);

Not all prctl operations take all 5 arguments.   It also applies
to arch_prctl.  It is quite normal for different operations of
arch_prctl to take different numbers of arguments.

-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 19:37                               ` H.J. Lu
@ 2020-08-28  1:35                                 ` Andy Lutomirski
  2020-08-28  1:44                                   ` H.J. Lu
  0 siblings, 1 reply; 89+ messages in thread
From: Andy Lutomirski @ 2020-08-28  1:35 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Yu, Yu-cheng, Florian Weimer, Dave Martin, Dave Hansen,
	Andy Lutomirski, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, open list:DOCUMENTATION, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On Thu, Aug 27, 2020 at 12:38 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, Aug 27, 2020 at 11:56 AM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >
> >
> > > On Aug 27, 2020, at 11:13 AM, Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
> > >
> > > On 8/27/2020 6:36 AM, Florian Weimer wrote:
> > >> * H. J. Lu:
> > >>>> On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
> > >>>>>
> > >>>>> * Dave Martin:
> > >>>>>
> > >>>>>> You're right that this has implications: for i386, libc probably pulls
> > >>>>>> more arguments off the stack than are really there in some situations.
> > >>>>>> This isn't a new problem though.  There are already generic prctls with
> > >>>>>> fewer than 4 args that are used on x86.
> > >>>>>
> > >>>>> As originally posted, glibc prctl would have to know that it has to pull
> > >>>>> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
> > >>>>> then the u64 argument is a problem for arch_prctl as well.
> > >>>>>
> > >>>
> > >>> Argument of ARCH_X86_CET_DISABLE is int and passed in register.
> > >> The commit message and the C source say otherwise, I think (not sure
> > >> about the C source, not a kernel hacker).
> > >
> > > H.J. Lu suggested that we fix x86 arch_prctl() to take four arguments, and then keep MMAP_SHSTK as an arch_prctl().  Because now the map flags and size are all in registers, this also solves problems being pointed out earlier.  Without a wrapper, the shadow stack mmap call (from user space) will be:
> > >
> > > syscall(_NR_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, size, MAP_32BIT).
> >
> > I admit I don’t see a show stopping technical reason we can’t add arguments to an existing syscall, but I’m pretty sure it’s unprecedented, and it doesn’t seem like a good idea.
>
> prctl prototype is:
>
> extern int prctl (int __option, ...)
>
> and implemented in kernel as:
>
>       int prctl(int option, unsigned long arg2, unsigned long arg3,
>                  unsigned long arg4, unsigned long arg5);
>
> Not all prctl operations take all 5 arguments.   It also applies
> to arch_prctl.  It is quite normal for different operations of
> arch_prctl to take different numbers of arguments.

If by "quite normal" you mean "does not happen", then I agree.

In any event, I will not have anything to do with a patch that changes
an existing syscall signature unless Linus personally acks it.  So if
you want to email him and linux-abi, be my guest.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-28  1:35                                 ` Andy Lutomirski
@ 2020-08-28  1:44                                   ` H.J. Lu
  2020-08-28  6:23                                     ` Florian Weimer
  0 siblings, 1 reply; 89+ messages in thread
From: H.J. Lu @ 2020-08-28  1:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Yu, Yu-cheng, Florian Weimer, Dave Martin, Dave Hansen, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Thu, Aug 27, 2020 at 6:35 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Aug 27, 2020 at 12:38 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Thu, Aug 27, 2020 at 11:56 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > >
> > >
> > >
> > > > On Aug 27, 2020, at 11:13 AM, Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
> > > >
> > > > On 8/27/2020 6:36 AM, Florian Weimer wrote:
> > > >> * H. J. Lu:
> > > >>>> On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
> > > >>>>>
> > > >>>>> * Dave Martin:
> > > >>>>>
> > > >>>>>> You're right that this has implications: for i386, libc probably pulls
> > > >>>>>> more arguments off the stack than are really there in some situations.
> > > >>>>>> This isn't a new problem though.  There are already generic prctls with
> > > >>>>>> fewer than 4 args that are used on x86.
> > > >>>>>
> > > >>>>> As originally posted, glibc prctl would have to know that it has to pull
> > > >>>>> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
> > > >>>>> then the u64 argument is a problem for arch_prctl as well.
> > > >>>>>
> > > >>>
> > > >>> Argument of ARCH_X86_CET_DISABLE is int and passed in register.
> > > >> The commit message and the C source say otherwise, I think (not sure
> > > >> about the C source, not a kernel hacker).
> > > >
> > > > H.J. Lu suggested that we fix x86 arch_prctl() to take four arguments, and then keep MMAP_SHSTK as an arch_prctl().  Because now the map flags and size are all in registers, this also solves problems being pointed out earlier.  Without a wrapper, the shadow stack mmap call (from user space) will be:
> > > >
> > > > syscall(_NR_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, size, MAP_32BIT).
> > >
> > > I admit I don’t see a show stopping technical reason we can’t add arguments to an existing syscall, but I’m pretty sure it’s unprecedented, and it doesn’t seem like a good idea.
> >
> > prctl prototype is:
> >
> > extern int prctl (int __option, ...)
> >
> > and implemented in kernel as:
> >
> >       int prctl(int option, unsigned long arg2, unsigned long arg3,
> >                  unsigned long arg4, unsigned long arg5);
> >
> > Not all prctl operations take all 5 arguments.   It also applies
> > to arch_prctl.  It is quite normal for different operations of
> > arch_prctl to take different numbers of arguments.
>
> If by "quite normal" you mean "does not happen", then I agree.
>
> In any event, I will not have anything to do with a patch that changes
> an existing syscall signature unless Linus personally acks it.  So if
> you want to email him and linux-abi, be my guest.

Can you think of ANY issues of passing more arguments to arch_prctl?
syscall () provided by glibc always passes 6 arguments to the kernel.
Arguments are already in the registers.  What kind of problems do
you see?

-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-28  1:44                                   ` H.J. Lu
@ 2020-08-28  6:23                                     ` Florian Weimer
  2020-08-28 11:37                                       ` H.J. Lu
  0 siblings, 1 reply; 89+ messages in thread
From: Florian Weimer @ 2020-08-28  6:23 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Andy Lutomirski, Yu, Yu-cheng, Dave Martin, Dave Hansen, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

* H. J. Lu:

> Can you think of ANY issues of passing more arguments to arch_prctl?

On x32, the glibc arch_prctl system call wrapper only passes two
arguments to the kernel, and applications have no way of detecting that.
musl only passes two arguments on all architectures.  It happens to work
anyway with default compiler flags, but that's an accident.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-28  6:23                                     ` Florian Weimer
@ 2020-08-28 11:37                                       ` H.J. Lu
  2020-08-28 17:39                                         ` Andy Lutomirski
  0 siblings, 1 reply; 89+ messages in thread
From: H.J. Lu @ 2020-08-28 11:37 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, Yu, Yu-cheng, Dave Martin, Dave Hansen, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Thu, Aug 27, 2020 at 11:24 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > Can you think of ANY issues of passing more arguments to arch_prctl?
>
> On x32, the glibc arch_prctl system call wrapper only passes two
> arguments to the kernel, and applications have no way of detecting that.
> musl only passes two arguments on all architectures.  It happens to work
> anyway with default compiler flags, but that's an accident.

In the current glibc, there is no arch_prctl wrapper for i386.  There are
arch_prctl wrappers with 2 arguments for x86-64 and x32.  But this isn't an
issue for glibc since glibc is both the provider and the user of the new
arch_prctl extension.  Besides,

long syscall(long number, ...);

is always available.

-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-28 11:37                                       ` H.J. Lu
@ 2020-08-28 17:39                                         ` Andy Lutomirski
  2020-08-28 17:45                                           ` H.J. Lu
  0 siblings, 1 reply; 89+ messages in thread
From: Andy Lutomirski @ 2020-08-28 17:39 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Florian Weimer, Andy Lutomirski, Yu, Yu-cheng, Dave Martin,
	Dave Hansen, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, open list:DOCUMENTATION, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On Fri, Aug 28, 2020 at 4:38 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, Aug 27, 2020 at 11:24 PM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * H. J. Lu:
> >
> > > Can you think of ANY issues of passing more arguments to arch_prctl?
> >
> > On x32, the glibc arch_prctl system call wrapper only passes two
> > arguments to the kernel, and applications have no way of detecting that.
> > musl only passes two arguments on all architectures.  It happens to work
> > anyway with default compiler flags, but that's an accident.
>
> In the current glibc, there is no arch_prctl wrapper for i386.  There are
> arch_prctl wrappers with 2 arguments for x86-64 and x32.  But this isn't an
> issue for glibc since glibc is both the provider and the user of the new
> arch_prctl extension.  Besides,
>
> long syscall(long number, ...);
>
> is always available.

Userspace is probably full of tools and libraries that contain tables
of system calls and their signatures.  Think tracing, audit, container
management, etc.  I don't know how they will react to the addition of
new arguments.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-28 17:39                                         ` Andy Lutomirski
@ 2020-08-28 17:45                                           ` H.J. Lu
  0 siblings, 0 replies; 89+ messages in thread
From: H.J. Lu @ 2020-08-28 17:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Florian Weimer, Yu, Yu-cheng, Dave Martin, Dave Hansen, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Fri, Aug 28, 2020 at 10:39 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, Aug 28, 2020 at 4:38 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Thu, Aug 27, 2020 at 11:24 PM Florian Weimer <fweimer@redhat.com> wrote:
> > >
> > > * H. J. Lu:
> > >
> > > > Can you think of ANY issues of passing more arguments to arch_prctl?
> > >
> > > On x32, the glibc arch_prctl system call wrapper only passes two
> > > arguments to the kernel, and applications have no way of detecting that.
> > > musl only passes two arguments on all architectures.  It happens to work
> > > anyway with default compiler flags, but that's an accident.
> >
> > In the current glibc, there is no arch_prctl wrapper for i386.  There are
> > arch_prctl wrappers with 2 arguments for x86-64 and x32.  But this isn't an
> > issue for glibc since glibc is both the provider and the user of the new
> > arch_prctl extension.  Besides,
> >
> > long syscall(long number, ...);
> >
> > is always available.
>
> Userspace is probably full of tools and libraries that contain tables
> of system calls and their signatures.  Think tracing, audit, container
> management, etc.  I don't know how they will react to the addition of
> new arguments.

Yes, they need to be updated to understand other new operations
added for CET.

-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 13:26                         ` H.J. Lu
@ 2020-09-01 10:28                           ` Dave Martin
  2020-09-01 17:23                             ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Martin @ 2020-09-01 10:28 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Dave Hansen, Yu, Yu-cheng, Andy Lutomirski, Florian Weimer,
	X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Thu, Aug 27, 2020 at 06:26:11AM -0700, H.J. Lu wrote:
> On Wed, Aug 26, 2020 at 12:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 8/26/20 11:49 AM, Yu, Yu-cheng wrote:
> > >> I would expect things like Go and various JITs to call it directly.
> > >>
> > >> If we wanted to be fancy and add a potentially more widely useful
> > >> syscall, how about:
> > >>
> > >> mmap_special(void *addr, size_t length, int prot, int flags, int type);
> > >>
> > >> Where type is something like MMAP_SPECIAL_X86_SHSTK.  Fundamentally,
> > >> this is really just mmap() except that we want to map something a bit
> > >> magical, and we don't want to require opening a device node to do it.
> > >
> > > One benefit of MMAP_SPECIAL_* is there are more free bits than MAP_*.
> > > Does ARM have similar needs for memory mapping, Dave?
> >
> > No idea.
> >
> > But, mmap_special() is *basically* mmap2() with extra-big flags space.
> > I suspect it will grow some more uses on top of shadow stacks.  It could
> > have, for instance, been used to allocate MPX bounds tables.
> 
> There is no reason we can't use
> 
> long arch_prctl (int, unsigned long, unsigned long, unsigned long, ..);
> 
> for ARCH_X86_CET_MMAP_SHSTK.   We just need to use
> 
> syscall (SYS_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, ...);


For arm64 (and sparc etc.) we continue to use the regular mmap/mprotect
family of calls.  One or two additional arch-specific mmap flags are
sufficient for now.

Is x86 definitely not going to fit within those calls?

For now, I can't see what arg[2] is used for (and hence the type
argument of mmap_special()), but I haven't dug through the whole series.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-01 10:28                           ` Dave Martin
@ 2020-09-01 17:23                             ` Yu, Yu-cheng
  2020-09-01 17:45                               ` Andy Lutomirski
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-01 17:23 UTC (permalink / raw)
  To: Dave Martin, H.J. Lu
  Cc: Dave Hansen, Andy Lutomirski, Florian Weimer, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/1/2020 3:28 AM, Dave Martin wrote:
> On Thu, Aug 27, 2020 at 06:26:11AM -0700, H.J. Lu wrote:
>> On Wed, Aug 26, 2020 at 12:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>>
>>> On 8/26/20 11:49 AM, Yu, Yu-cheng wrote:
>>>>> I would expect things like Go and various JITs to call it directly.
>>>>>
>>>>> If we wanted to be fancy and add a potentially more widely useful
>>>>> syscall, how about:
>>>>>
>>>>> mmap_special(void *addr, size_t length, int prot, int flags, int type);
>>>>>
>>>>> Where type is something like MMAP_SPECIAL_X86_SHSTK.  Fundamentally,
>>>>> this is really just mmap() except that we want to map something a bit
>>>>> magical, and we don't want to require opening a device node to do it.
>>>>
>>>> One benefit of MMAP_SPECIAL_* is there are more free bits than MAP_*.
>>>> Does ARM have similar needs for memory mapping, Dave?
>>>
>>> No idea.
>>>
>>> But, mmap_special() is *basically* mmap2() with extra-big flags space.
>>> I suspect it will grow some more uses on top of shadow stacks.  It could
>>> have, for instance, been used to allocate MPX bounds tables.
>>
>> There is no reason we can't use
>>
>> long arch_prctl (int, unsigned long, unsigned long, unsigned long, ..);
>>
>> for ARCH_X86_CET_MMAP_SHSTK.   We just need to use
>>
>> syscall (SYS_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, ...);
> 
> 
> For arm64 (and sparc etc.) we continue to use the regular mmap/mprotect
> family of calls.  One or two additional arch-specific mmap flags are
> sufficient for now.
> 
> Is x86 definitely not going to fit within those calls?

That can work for x86.  Andy, what if we create PROT_SHSTK, which can 
been seen only from the user.  Once in kernel, it is translated to 
VM_SHSTK.  One question for mremap/mprotect is, do we allow a normal 
data area to become shadow stack?

> 
> For now, I can't see what arg[2] is used for (and hence the type
> argument of mmap_special()), but I haven't dug through the whole series.

If we use the approach above, then we don't need arch_prctl changes.

Thanks,
Yu-cheng

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-01 17:23                             ` Yu, Yu-cheng
@ 2020-09-01 17:45                               ` Andy Lutomirski
  2020-09-01 18:11                                 ` Dave Hansen
  0 siblings, 1 reply; 89+ messages in thread
From: Andy Lutomirski @ 2020-09-01 17:45 UTC (permalink / raw)
  To: Yu, Yu-cheng
  Cc: Dave Martin, H.J. Lu, Dave Hansen, Andy Lutomirski,
	Florian Weimer, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, open list:DOCUMENTATION, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On Tue, Sep 1, 2020 at 10:23 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>
> On 9/1/2020 3:28 AM, Dave Martin wrote:
> > On Thu, Aug 27, 2020 at 06:26:11AM -0700, H.J. Lu wrote:
> >> On Wed, Aug 26, 2020 at 12:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >>>
> >>> On 8/26/20 11:49 AM, Yu, Yu-cheng wrote:
> >>>>> I would expect things like Go and various JITs to call it directly.
> >>>>>
> >>>>> If we wanted to be fancy and add a potentially more widely useful
> >>>>> syscall, how about:
> >>>>>
> >>>>> mmap_special(void *addr, size_t length, int prot, int flags, int type);
> >>>>>
> >>>>> Where type is something like MMAP_SPECIAL_X86_SHSTK.  Fundamentally,
> >>>>> this is really just mmap() except that we want to map something a bit
> >>>>> magical, and we don't want to require opening a device node to do it.
> >>>>
> >>>> One benefit of MMAP_SPECIAL_* is there are more free bits than MAP_*.
> >>>> Does ARM have similar needs for memory mapping, Dave?
> >>>
> >>> No idea.
> >>>
> >>> But, mmap_special() is *basically* mmap2() with extra-big flags space.
> >>> I suspect it will grow some more uses on top of shadow stacks.  It could
> >>> have, for instance, been used to allocate MPX bounds tables.
> >>
> >> There is no reason we can't use
> >>
> >> long arch_prctl (int, unsigned long, unsigned long, unsigned long, ..);
> >>
> >> for ARCH_X86_CET_MMAP_SHSTK.   We just need to use
> >>
> >> syscall (SYS_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, ...);
> >
> >
> > For arm64 (and sparc etc.) we continue to use the regular mmap/mprotect
> > family of calls.  One or two additional arch-specific mmap flags are
> > sufficient for now.
> >
> > Is x86 definitely not going to fit within those calls?
>
> That can work for x86.  Andy, what if we create PROT_SHSTK, which can
> been seen only from the user.  Once in kernel, it is translated to
> VM_SHSTK.  One question for mremap/mprotect is, do we allow a normal
> data area to become shadow stack?

I'm unconvinced that we want to use a somewhat precious PROT_ or VM_
bit for this.  Using a flag bit makes sense if we expect anyone to
ever map an fd or similar as a shadow stack, but that seems a bit odd
in the first place.  To me, it seems more logical for a shadow stack
to be a special sort of mapping with a special vm_ops, not a normal
mapping with a special flag set.  Although I realize that we want
shadow stacks to work like anonymous memory with respect to fork().
Dave?

--Andy

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-08-27 14:08                             ` H.J. Lu
@ 2020-09-01 17:49                               ` Yu, Yu-cheng
  2020-09-01 17:50                                 ` Florian Weimer
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-01 17:49 UTC (permalink / raw)
  To: H.J. Lu, Florian Weimer
  Cc: Dave Martin, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 8/27/2020 7:08 AM, H.J. Lu wrote:
> On Thu, Aug 27, 2020 at 7:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Thu, Aug 27, 2020 at 6:36 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>
>>> * H. J. Lu:
>>>
>>>> On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>>>
>>>>> * Dave Martin:
>>>>>
>>>>>> You're right that this has implications: for i386, libc probably pulls
>>>>>> more arguments off the stack than are really there in some situations.
>>>>>> This isn't a new problem though.  There are already generic prctls with
>>>>>> fewer than 4 args that are used on x86.
>>>>>
>>>>> As originally posted, glibc prctl would have to know that it has to pull
>>>>> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
>>>>> then the u64 argument is a problem for arch_prctl as well.
>>>>>
>>>>
>>>> Argument of ARCH_X86_CET_DISABLE is int and passed in register.
>>>
>>> The commit message and the C source say otherwise, I think (not sure
>>> about the C source, not a kernel hacker).
>>
>> It should read:
>>
>> arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
>>
> 
> Or
> 
> arch_prctl(ARCH_X86_CET_DISABLE, unsigned int features)
> 

Like other arch_prctl()'s, this parameter was 'unsigned long' earlier. 
The idea was, since this arch_prctl is only implemented for the 64-bit 
kernel, we wanted it to look as 64-bit only.  I will change it back to 
'unsigned long'.

Yu-cheng

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-01 17:49                               ` Yu, Yu-cheng
@ 2020-09-01 17:50                                 ` Florian Weimer
  2020-09-01 17:58                                   ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Florian Weimer @ 2020-09-01 17:50 UTC (permalink / raw)
  To: Yu, Yu-cheng
  Cc: H.J. Lu, Dave Martin, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

* Yu-cheng Yu:

> Like other arch_prctl()'s, this parameter was 'unsigned long'
> earlier. The idea was, since this arch_prctl is only implemented for
> the 64-bit kernel, we wanted it to look as 64-bit only.  I will change
> it back to 'unsigned long'.

What about x32?  In general, long is rather problematic for x32.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-01 17:50                                 ` Florian Weimer
@ 2020-09-01 17:58                                   ` Yu, Yu-cheng
  2020-09-01 18:17                                     ` Florian Weimer
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-01 17:58 UTC (permalink / raw)
  To: Florian Weimer
  Cc: H.J. Lu, Dave Martin, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/1/2020 10:50 AM, Florian Weimer wrote:
> * Yu-cheng Yu:
> 
>> Like other arch_prctl()'s, this parameter was 'unsigned long'
>> earlier. The idea was, since this arch_prctl is only implemented for
>> the 64-bit kernel, we wanted it to look as 64-bit only.  I will change
>> it back to 'unsigned long'.
> 
> What about x32?  In general, long is rather problematic for x32.

The problem is the size of 'long', right?
Because this parameter is passed in a register, and only the lower bits 
are used, x32 works as well.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-01 17:45                               ` Andy Lutomirski
@ 2020-09-01 18:11                                 ` Dave Hansen
  2020-09-02 13:58                                   ` Dave Martin
       [not found]                                   ` <46dffdfd-92f8-0f05-6164-945f217b0958@intel.com>
  0 siblings, 2 replies; 89+ messages in thread
From: Dave Hansen @ 2020-09-01 18:11 UTC (permalink / raw)
  To: Andy Lutomirski, Yu, Yu-cheng
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/1/20 10:45 AM, Andy Lutomirski wrote:
>>> For arm64 (and sparc etc.) we continue to use the regular mmap/mprotect
>>> family of calls.  One or two additional arch-specific mmap flags are
>>> sufficient for now.
>>>
>>> Is x86 definitely not going to fit within those calls?
>> That can work for x86.  Andy, what if we create PROT_SHSTK, which can
>> been seen only from the user.  Once in kernel, it is translated to
>> VM_SHSTK.  One question for mremap/mprotect is, do we allow a normal
>> data area to become shadow stack?
> I'm unconvinced that we want to use a somewhat precious PROT_ or VM_
> bit for this.  Using a flag bit makes sense if we expect anyone to
> ever map an fd or similar as a shadow stack, but that seems a bit odd
> in the first place.  To me, it seems more logical for a shadow stack
> to be a special sort of mapping with a special vm_ops, not a normal
> mapping with a special flag set.  Although I realize that we want
> shadow stacks to work like anonymous memory with respect to fork().
> Dave?

I actually don't like the idea of *creating* mappings much.

I think the pkey model has worked out pretty well where we separate
creating the mapping from doing something *to* it, like changing
protections.  For instance, it would be nice if we could preserve things
like using hugetlbfs or heck even doing KSM for shadow stacks.

If we're *creating* mappings, we've pretty much ruled out things like
hugetlbfs.

Something like mprotect_shstk() would allow an implementation today that
only works on anonymous memory *and* sets up a special vm_ops.  But, the
same exact ABI could do wonky stuff in the future if we decided we
wanted to do shadow stacks on DAX or hugetlbfs or whatever.

I don't really like the idea of PROT_SHSTK those are plumbed into a
bunch of interfaces.  But, I also can't deny that it seems to be working
fine for the arm64 folks.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-01 17:58                                   ` Yu, Yu-cheng
@ 2020-09-01 18:17                                     ` Florian Weimer
  2020-09-01 18:19                                       ` H.J. Lu
  2020-09-01 18:24                                       ` Yu, Yu-cheng
  0 siblings, 2 replies; 89+ messages in thread
From: Florian Weimer @ 2020-09-01 18:17 UTC (permalink / raw)
  To: Yu, Yu-cheng
  Cc: H.J. Lu, Dave Martin, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

* Yu-cheng Yu:

> On 9/1/2020 10:50 AM, Florian Weimer wrote:
>> * Yu-cheng Yu:
>> 
>>> Like other arch_prctl()'s, this parameter was 'unsigned long'
>>> earlier. The idea was, since this arch_prctl is only implemented for
>>> the 64-bit kernel, we wanted it to look as 64-bit only.  I will change
>>> it back to 'unsigned long'.
>> What about x32?  In general, long is rather problematic for x32.
>
> The problem is the size of 'long', right?
> Because this parameter is passed in a register, and only the lower
> bits are used, x32 works as well.

The userspace calling convention leaves the upper 32-bit undefined.
Therefore, this only works by accident if the kernel does not check that
the upper 32-bit are zero, which is probably a kernel bug.

It's unclear to me what you are trying to accomplish.  Why do you want
to use unsigned long here?  The correct type appears to be unsigned int.
This correctly expresses that the upper 32 bits of the register do not
matter.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-01 18:17                                     ` Florian Weimer
@ 2020-09-01 18:19                                       ` H.J. Lu
  2020-09-01 18:24                                       ` Yu, Yu-cheng
  1 sibling, 0 replies; 89+ messages in thread
From: H.J. Lu @ 2020-09-01 18:19 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Yu, Yu-cheng, Dave Martin, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Tue, Sep 1, 2020 at 11:17 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Yu-cheng Yu:
>
> > On 9/1/2020 10:50 AM, Florian Weimer wrote:
> >> * Yu-cheng Yu:
> >>
> >>> Like other arch_prctl()'s, this parameter was 'unsigned long'
> >>> earlier. The idea was, since this arch_prctl is only implemented for
> >>> the 64-bit kernel, we wanted it to look as 64-bit only.  I will change
> >>> it back to 'unsigned long'.
> >> What about x32?  In general, long is rather problematic for x32.
> >
> > The problem is the size of 'long', right?
> > Because this parameter is passed in a register, and only the lower
> > bits are used, x32 works as well.
>
> The userspace calling convention leaves the upper 32-bit undefined.
> Therefore, this only works by accident if the kernel does not check that
> the upper 32-bit are zero, which is probably a kernel bug.
>
> It's unclear to me what you are trying to accomplish.  Why do you want
> to use unsigned long here?  The correct type appears to be unsigned int.
> This correctly expresses that the upper 32 bits of the register do not
> matter.
>

unsigned int is the correct type since only the lower 32 bits are used.

-- 
H.J.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-01 18:17                                     ` Florian Weimer
  2020-09-01 18:19                                       ` H.J. Lu
@ 2020-09-01 18:24                                       ` Yu, Yu-cheng
  1 sibling, 0 replies; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-01 18:24 UTC (permalink / raw)
  To: Florian Weimer
  Cc: H.J. Lu, Dave Martin, Dave Hansen, Andy Lutomirski, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/1/2020 11:17 AM, Florian Weimer wrote:
> * Yu-cheng Yu:
> 
>> On 9/1/2020 10:50 AM, Florian Weimer wrote:
>>> * Yu-cheng Yu:
>>>
>>>> Like other arch_prctl()'s, this parameter was 'unsigned long'
>>>> earlier. The idea was, since this arch_prctl is only implemented for
>>>> the 64-bit kernel, we wanted it to look as 64-bit only.  I will change
>>>> it back to 'unsigned long'.
>>> What about x32?  In general, long is rather problematic for x32.
>>
>> The problem is the size of 'long', right?
>> Because this parameter is passed in a register, and only the lower
>> bits are used, x32 works as well.
> 
> The userspace calling convention leaves the upper 32-bit undefined.
> Therefore, this only works by accident if the kernel does not check that
> the upper 32-bit are zero, which is probably a kernel bug.
> 
> It's unclear to me what you are trying to accomplish.  Why do you want
> to use unsigned long here?  The correct type appears to be unsigned int.
> This correctly expresses that the upper 32 bits of the register do not
> matter.

Yes, you are right.  I will make it 'unsigned int'.

Thanks,
Yu-cheng

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-01 18:11                                 ` Dave Hansen
@ 2020-09-02 13:58                                   ` Dave Martin
       [not found]                                   ` <46dffdfd-92f8-0f05-6164-945f217b0958@intel.com>
  1 sibling, 0 replies; 89+ messages in thread
From: Dave Martin @ 2020-09-02 13:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Yu, Yu-cheng, H.J. Lu, Florian Weimer, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Tue, Sep 01, 2020 at 11:11:37AM -0700, Dave Hansen wrote:
> On 9/1/20 10:45 AM, Andy Lutomirski wrote:
> >>> For arm64 (and sparc etc.) we continue to use the regular mmap/mprotect
> >>> family of calls.  One or two additional arch-specific mmap flags are
> >>> sufficient for now.
> >>>
> >>> Is x86 definitely not going to fit within those calls?
> >> That can work for x86.  Andy, what if we create PROT_SHSTK, which can
> >> been seen only from the user.  Once in kernel, it is translated to
> >> VM_SHSTK.  One question for mremap/mprotect is, do we allow a normal
> >> data area to become shadow stack?
> > I'm unconvinced that we want to use a somewhat precious PROT_ or VM_
> > bit for this.  Using a flag bit makes sense if we expect anyone to
> > ever map an fd or similar as a shadow stack, but that seems a bit odd
> > in the first place.  To me, it seems more logical for a shadow stack
> > to be a special sort of mapping with a special vm_ops, not a normal
> > mapping with a special flag set.  Although I realize that we want
> > shadow stacks to work like anonymous memory with respect to fork().
> > Dave?
> 
> I actually don't like the idea of *creating* mappings much.
> 
> I think the pkey model has worked out pretty well where we separate
> creating the mapping from doing something *to* it, like changing
> protections.  For instance, it would be nice if we could preserve things
> like using hugetlbfs or heck even doing KSM for shadow stacks.
> 
> If we're *creating* mappings, we've pretty much ruled out things like
> hugetlbfs.
> 
> Something like mprotect_shstk() would allow an implementation today that
> only works on anonymous memory *and* sets up a special vm_ops.  But, the
> same exact ABI could do wonky stuff in the future if we decided we
> wanted to do shadow stacks on DAX or hugetlbfs or whatever.
> 
> I don't really like the idea of PROT_SHSTK those are plumbed into a
> bunch of interfaces.  But, I also can't deny that it seems to be working
> fine for the arm64 folks.

Note, there are some rough edges, such as what happens when someone
calls mprotect() on memory marked with PROT_BTI.  Unless the caller
knows whether PROT_BTI should be set for that page, the flag may get
unintentionally cleared.  Since the flag only applies to text pages
though, it's not _that_ much of a concern.  Software that deals with
writable text pages is also usually involved in generating the code and
so will know about PROT_BTI.  That's was the theory anyway.

In the longer term, it might be preferable to have a mprotect2() that
can leave some flags unmodified, and that doesn't silently ignore
unknown flags (at least one of mmap or mprotect does; I don't recall
which).  We attempt didn't go this far, for now.

For arm64 it seemed fairly natural for the BTI flag to be a PROT_ flag,
but I don't know enough detail about x86 shstk to know whether it's a
natural fit there.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
       [not found]                                   ` <46dffdfd-92f8-0f05-6164-945f217b0958@intel.com>
@ 2020-09-08 17:57                                     ` Dave Hansen
  2020-09-08 18:25                                       ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2020-09-08 17:57 UTC (permalink / raw)
  To: Yu, Yu-cheng, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/8/20 10:50 AM, Yu, Yu-cheng wrote:
> What about this:
> 
> - Do not add any new syscall or arch_prctl for creating a new shadow stack.
> 
> - Add a new arch_prctl that can turn an anonymous mapping to a shadow
> stack mapping.
> 
> This allows the application to do whatever is necessary.  It can even
> allow GDB or JIT code to create or fix a call stack.

Fine with me.  But, it's going to effectively be

	arch_prctl(PR_CONVERT_TO_SHS..., addr, len);

when it could just as easily be:

	madvise(addr, len, MADV_SHSTK...);

Or a new syscall.  The only question in my mind is whether we want to do
something generic that we can use for other similar things in the
future, like:

	madvise2(addr, len, flags, MADV2_SHSTK...);

I don't really feel strongly about it, though.  Could you please share
your logic on why you want a prctl() as opposed to a whole new syscall?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-08 17:57                                     ` Dave Hansen
@ 2020-09-08 18:25                                       ` Yu, Yu-cheng
  2020-09-09 22:08                                         ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-08 18:25 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/8/2020 10:57 AM, Dave Hansen wrote:
> On 9/8/20 10:50 AM, Yu, Yu-cheng wrote:
>> What about this:
>>
>> - Do not add any new syscall or arch_prctl for creating a new shadow stack.
>>
>> - Add a new arch_prctl that can turn an anonymous mapping to a shadow
>> stack mapping.
>>
>> This allows the application to do whatever is necessary.  It can even
>> allow GDB or JIT code to create or fix a call stack.
> 
> Fine with me.  But, it's going to effectively be
> 
> 	arch_prctl(PR_CONVERT_TO_SHS..., addr, len);
> 
> when it could just as easily be:
> 
> 	madvise(addr, len, MADV_SHSTK...);
> 
> Or a new syscall.  The only question in my mind is whether we want to do
> something generic that we can use for other similar things in the
> future, like:
> 
> 	madvise2(addr, len, flags, MADV2_SHSTK...);
> 
> I don't really feel strongly about it, though.  Could you please share
> your logic on why you want a prctl() as opposed to a whole new syscall?
> 

A new syscall is more intrusive, I think.  When creating a new shadow 
stack, the kernel also installs a restore token on the top of the new 
shadow stack, and it is somewhat x86-specific.  So far no other arch's 
need this.

Yes, madvise is better if the kernel only needs to change the mapping. 
The application itself can create the restore token before calling 
madvise().

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-08 18:25                                       ` Yu, Yu-cheng
@ 2020-09-09 22:08                                         ` Yu, Yu-cheng
  2020-09-09 22:59                                           ` Dave Hansen
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-09 22:08 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/8/2020 11:25 AM, Yu, Yu-cheng wrote:
> On 9/8/2020 10:57 AM, Dave Hansen wrote:
>> On 9/8/20 10:50 AM, Yu, Yu-cheng wrote:
>>> What about this:
>>>
>>> - Do not add any new syscall or arch_prctl for creating a new shadow 
>>> stack.
>>>
>>> - Add a new arch_prctl that can turn an anonymous mapping to a shadow
>>> stack mapping.
>>>
>>> This allows the application to do whatever is necessary.  It can even
>>> allow GDB or JIT code to create or fix a call stack.
>>
>> Fine with me.  But, it's going to effectively be
>>
>>     arch_prctl(PR_CONVERT_TO_SHS..., addr, len);
>>
>> when it could just as easily be:
>>
>>     madvise(addr, len, MADV_SHSTK...);
>>
>> Or a new syscall.  The only question in my mind is whether we want to do
>> something generic that we can use for other similar things in the
>> future, like:
>>
>>     madvise2(addr, len, flags, MADV2_SHSTK...);
>>
>> I don't really feel strongly about it, though.  Could you please share
>> your logic on why you want a prctl() as opposed to a whole new syscall?
>>
> 
> A new syscall is more intrusive, I think.  When creating a new shadow 
> stack, the kernel also installs a restore token on the top of the new 
> shadow stack, and it is somewhat x86-specific.  So far no other arch's 
> need this.
> 
> Yes, madvise is better if the kernel only needs to change the mapping. 
> The application itself can create the restore token before calling 
> madvise().

After looking at this more, I found the changes are more similar to 
mprotect() than madvise().  We are going to change an anonymous mapping 
to a read-only mapping, and add the VM_SHSTK flag to it.  Would an 
x86-specific mprotect(PROT_SHSTK) make more sense?

One alternative would be requiring a read-only mapping for 
madvise(MADV_SHSTK).  But that is inconvenient for the application.

Yu-cheng

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-09 22:08                                         ` Yu, Yu-cheng
@ 2020-09-09 22:59                                           ` Dave Hansen
  2020-09-09 23:07                                             ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2020-09-09 22:59 UTC (permalink / raw)
  To: Yu, Yu-cheng, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/9/20 3:08 PM, Yu, Yu-cheng wrote:
> After looking at this more, I found the changes are more similar to
> mprotect() than madvise().  We are going to change an anonymous mapping
> to a read-only mapping, and add the VM_SHSTK flag to it.  Would an
> x86-specific mprotect(PROT_SHSTK) make more sense?
> 
> One alternative would be requiring a read-only mapping for
> madvise(MADV_SHSTK).  But that is inconvenient for the application.

Why?  It's just:

	mmap()/malloc();
	mprotect(PROT_READ);
	madvise(MADV_SHSTK);

vs.

	mmap()/malloc();
	mprotect(PROT_SHSTK);

I'm not sure a single syscall counts as inconvenient.

I don't quite think we should use a PROT_ bit for this.  It seems like
the kind of thing that could be fragile and break existing expectations.
 I don't care _that_ strongly though.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-09 22:59                                           ` Dave Hansen
@ 2020-09-09 23:07                                             ` Yu, Yu-cheng
  2020-09-09 23:11                                               ` Dave Hansen
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-09 23:07 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/9/2020 3:59 PM, Dave Hansen wrote:
> On 9/9/20 3:08 PM, Yu, Yu-cheng wrote:
>> After looking at this more, I found the changes are more similar to
>> mprotect() than madvise().  We are going to change an anonymous mapping
>> to a read-only mapping, and add the VM_SHSTK flag to it.  Would an
>> x86-specific mprotect(PROT_SHSTK) make more sense?
>>
>> One alternative would be requiring a read-only mapping for
>> madvise(MADV_SHSTK).  But that is inconvenient for the application.
> 
> Why?  It's just:
> 
> 	mmap()/malloc();
> 	mprotect(PROT_READ);
> 	madvise(MADV_SHSTK);
> 
> vs.
> 
> 	mmap()/malloc();
> 	mprotect(PROT_SHSTK);
> 
> I'm not sure a single syscall counts as inconvenient.
> 
> I don't quite think we should use a PROT_ bit for this.  It seems like
> the kind of thing that could be fragile and break existing expectations.
>   I don't care _that_ strongly though.
> 
What if a writable mapping is passed to madvise(MADV_SHSTK)?  Should 
that be rejected?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-09 23:07                                             ` Yu, Yu-cheng
@ 2020-09-09 23:11                                               ` Dave Hansen
  2020-09-09 23:25                                                 ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2020-09-09 23:11 UTC (permalink / raw)
  To: Yu, Yu-cheng, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/9/20 4:07 PM, Yu, Yu-cheng wrote:
> What if a writable mapping is passed to madvise(MADV_SHSTK)?  Should
> that be rejected?

It doesn't matter to me.  Even if it's readable, it _stops_ being even
directly readable after it's a shadow stack, right?  I don't think
writes are special in any way.  If anything, we *want* it to be writable
because that indicates that it can be written to, and we will want to
write to it soon.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-09 23:11                                               ` Dave Hansen
@ 2020-09-09 23:25                                                 ` Yu, Yu-cheng
  2020-09-09 23:29                                                   ` Dave Hansen
  0 siblings, 1 reply; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-09 23:25 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/9/2020 4:11 PM, Dave Hansen wrote:
> On 9/9/20 4:07 PM, Yu, Yu-cheng wrote:
>> What if a writable mapping is passed to madvise(MADV_SHSTK)?  Should
>> that be rejected?
> 
> It doesn't matter to me.  Even if it's readable, it _stops_ being even
> directly readable after it's a shadow stack, right?  I don't think
> writes are special in any way.  If anything, we *want* it to be writable
> because that indicates that it can be written to, and we will want to
> write to it soon.
> 
But in a PROT_WRITE mapping, all the pte's have _PAGE_BIT_RW set.  To 
change them to shadow stack, we need to clear that bit from the pte's.
That will be like mprotect_fixup()/change_protection_range().


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-09 23:25                                                 ` Yu, Yu-cheng
@ 2020-09-09 23:29                                                   ` Dave Hansen
  2020-09-09 23:45                                                     ` Yu, Yu-cheng
  2020-09-11 22:59                                                     ` Yu-cheng Yu
  0 siblings, 2 replies; 89+ messages in thread
From: Dave Hansen @ 2020-09-09 23:29 UTC (permalink / raw)
  To: Yu, Yu-cheng, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/9/20 4:25 PM, Yu, Yu-cheng wrote:
> On 9/9/2020 4:11 PM, Dave Hansen wrote:
>> On 9/9/20 4:07 PM, Yu, Yu-cheng wrote:
>>> What if a writable mapping is passed to madvise(MADV_SHSTK)?  Should
>>> that be rejected?
>>
>> It doesn't matter to me.  Even if it's readable, it _stops_ being even
>> directly readable after it's a shadow stack, right?  I don't think
>> writes are special in any way.  If anything, we *want* it to be writable
>> because that indicates that it can be written to, and we will want to
>> write to it soon.
>>
> But in a PROT_WRITE mapping, all the pte's have _PAGE_BIT_RW set.  To
> change them to shadow stack, we need to clear that bit from the pte's.
> That will be like mprotect_fixup()/change_protection_range().

The page table hardware bits don't matter.  The user-visible protection
effects matter.

For instance, we have PROT_EXEC, which *CLEARS* a hardware NX PTE bit.
The PROT_ permissions are independent of the hardware.

I don't think the interface should be influenced at *all* by what whacko
PTE bit combinations we have to set to get the behavior.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-09 23:29                                                   ` Dave Hansen
@ 2020-09-09 23:45                                                     ` Yu, Yu-cheng
  2020-09-11 22:59                                                     ` Yu-cheng Yu
  1 sibling, 0 replies; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-09 23:45 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/9/2020 4:29 PM, Dave Hansen wrote:
> On 9/9/20 4:25 PM, Yu, Yu-cheng wrote:
>> On 9/9/2020 4:11 PM, Dave Hansen wrote:
>>> On 9/9/20 4:07 PM, Yu, Yu-cheng wrote:
>>>> What if a writable mapping is passed to madvise(MADV_SHSTK)?  Should
>>>> that be rejected?
>>>
>>> It doesn't matter to me.  Even if it's readable, it _stops_ being even
>>> directly readable after it's a shadow stack, right?  I don't think
>>> writes are special in any way.  If anything, we *want* it to be writable
>>> because that indicates that it can be written to, and we will want to
>>> write to it soon.
>>>
>> But in a PROT_WRITE mapping, all the pte's have _PAGE_BIT_RW set.  To
>> change them to shadow stack, we need to clear that bit from the pte's.
>> That will be like mprotect_fixup()/change_protection_range().
> 
> The page table hardware bits don't matter.  The user-visible protection
> effects matter.
> 
> For instance, we have PROT_EXEC, which *CLEARS* a hardware NX PTE bit.
> The PROT_ permissions are independent of the hardware.

Same for shadow stack here.  We consider shadow stack "writable", but we 
want to clear _PAGE_BIT_RW, which is set for the PTEs of the other type 
of "writable" mapping.  To change a writable data mapping to a writable 
shadow stack mapping, we need to call change_protection_range().

> 
> I don't think the interface should be influenced at *all* by what whacko
> PTE bit combinations we have to set to get the behavior.
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-09 23:29                                                   ` Dave Hansen
  2020-09-09 23:45                                                     ` Yu, Yu-cheng
@ 2020-09-11 22:59                                                     ` Yu-cheng Yu
  2020-09-14 14:50                                                       ` [NEEDS-REVIEW] " Dave Hansen
  1 sibling, 1 reply; 89+ messages in thread
From: Yu-cheng Yu @ 2020-09-11 22:59 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Wed, 2020-09-09 at 16:29 -0700, Dave Hansen wrote:
> On 9/9/20 4:25 PM, Yu, Yu-cheng wrote:
> > On 9/9/2020 4:11 PM, Dave Hansen wrote:
> > > On 9/9/20 4:07 PM, Yu, Yu-cheng wrote:
> > > > What if a writable mapping is passed to madvise(MADV_SHSTK)?  Should
> > > > that be rejected?
> > > 
> > > It doesn't matter to me.  Even if it's readable, it _stops_ being even
> > > directly readable after it's a shadow stack, right?  I don't think
> > > writes are special in any way.  If anything, we *want* it to be writable
> > > because that indicates that it can be written to, and we will want to
> > > write to it soon.
> > > 
> > But in a PROT_WRITE mapping, all the pte's have _PAGE_BIT_RW set.  To
> > change them to shadow stack, we need to clear that bit from the pte's.
> > That will be like mprotect_fixup()/change_protection_range().
> 
> The page table hardware bits don't matter.  The user-visible protection
> effects matter.
> 
> For instance, we have PROT_EXEC, which *CLEARS* a hardware NX PTE bit.
> The PROT_ permissions are independent of the hardware.
> 
> I don't think the interface should be influenced at *all* by what whacko
> PTE bit combinations we have to set to get the behavior.

Here are the changes if we take the mprotect(PROT_SHSTK) approach.
Any comments/suggestions?

---
 arch/x86/include/uapi/asm/mman.h | 26 +++++++++++++++++++++++++-
 mm/mprotect.c                    | 11 +++++++++++
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index d4a8d0424bfb..024f006fcfe8 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -4,6 +4,8 @@
 
 #define MAP_32BIT	0x40		/* only give out 32bit addresses */
 
+#define PROT_SHSTK	0x10		/* shadow stack pages */
+
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 /*
  * Take the 4 protection key bits out of the vma->vm_flags
@@ -19,13 +21,35 @@
 		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
 
-#define arch_calc_vm_prot_bits(prot, key) (		\
+#define pkey_vm_prot_bits(prot, key) (			\
 		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
 		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
 		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
+#else
+#define pkey_vm_prot_bits(prot, key)
 #endif
 
+#define shstk_vm_prot_bits(prot) ( \
+		(static_cpu_has(X86_FEATURE_SHSTK) && (prot & PROT_SHSTK)) ? \
+		VM_SHSTK : 0)
+
+#define arch_calc_vm_prot_bits(prot, key) \
+		(pkey_vm_prot_bits(prot, key) | shstk_vm_prot_bits(prot))
+
 #include <asm-generic/mman.h>
 
+static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
+{
+	unsigned long supported = PROT_READ | PROT_EXEC | PROT_SEM;
+
+	if (static_cpu_has(X86_FEATURE_SHSTK) && (prot & PROT_SHSTK))
+		supported |= PROT_SHSTK;
+	else
+		supported |= PROT_WRITE;
+
+	return (prot & ~supported) == 0;
+}
+#define arch_validate_prot arch_validate_prot
+
 #endif /* _ASM_X86_MMAN_H */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a8edbcb3af99..520bd8caa005 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -571,6 +571,17 @@ static int do_mprotect_pkey(unsigned long start, size_t
len,
 				goto out;
 		}
 	}
+
+	/*
+	 * Only anonymous mapping is suitable for shadow stack.
+	 */
+	if (prot & PROT_SHSTK) {
+		if (vma->vm_file) {
+			error = -EINVAL;
+			goto out;
+		}
+	}
+
 	if (start > vma->vm_start)
 		prev = vma;
 
-- 


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-11 22:59                                                     ` Yu-cheng Yu
@ 2020-09-14 14:50                                                       ` Dave Hansen
  2020-09-14 18:31                                                         ` Andy Lutomirski
       [not found]                                                         ` <bf2ab309-f8c4-83da-1c0a-5684e5bc5c82@intel.com>
  0 siblings, 2 replies; 89+ messages in thread
From: Dave Hansen @ 2020-09-14 14:50 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/11/20 3:59 PM, Yu-cheng Yu wrote:
...
> Here are the changes if we take the mprotect(PROT_SHSTK) approach.
> Any comments/suggestions?

I still don't like it. :)

I'll also be much happier when there's a proper changelog to accompany
this which also spells out the alternatives any why they suck so much.

> diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
> index d4a8d0424bfb..024f006fcfe8 100644
> --- a/arch/x86/include/uapi/asm/mman.h
> +++ b/arch/x86/include/uapi/asm/mman.h
> @@ -4,6 +4,8 @@
>  
>  #define MAP_32BIT	0x40		/* only give out 32bit addresses */
>  
> +#define PROT_SHSTK	0x10		/* shadow stack pages */
> +
>  #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>  /*
>   * Take the 4 protection key bits out of the vma->vm_flags
> @@ -19,13 +21,35 @@
>  		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
>  		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
>  
> -#define arch_calc_vm_prot_bits(prot, key) (		\
> +#define pkey_vm_prot_bits(prot, key) (			\
>  		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
>  		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
>  		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
>  		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
> +#else
> +#define pkey_vm_prot_bits(prot, key)
>  #endif

My inner compiler doesn't think this will compile:

	( | shstk_vm_prot_bits(prot))


> +#define shstk_vm_prot_bits(prot) ( \
> +		(static_cpu_has(X86_FEATURE_SHSTK) && (prot & PROT_SHSTK)) ? \
> +		VM_SHSTK : 0)

Why do you need to filter PROT_SHSTK twice.  Won't the prot passed in
here be filtered by arch_validate_prot()?

> +#define arch_calc_vm_prot_bits(prot, key) \
> +		(pkey_vm_prot_bits(prot, key) | shstk_vm_prot_bits(prot))
> +

IMNHO, this is eminently more readable if you do:

#define arch_calc_vm_prot_bits(prot, key)	\
		(shstk_vm_prot_bits(prot))	\
		  pkey_vm_prot_bits(prot, key))

BTW, can these be static inlines?  I forget if I had a good reason for
making them #defines.

> +static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
> +{
> +	unsigned long supported = PROT_READ | PROT_EXEC | PROT_SEM;
> +
> +	if (static_cpu_has(X86_FEATURE_SHSTK) && (prot & PROT_SHSTK))
> +		supported |= PROT_SHSTK;
> +	else
> +		supported |= PROT_WRITE;

I generally like to make the common case dirt simple to understand.
That would probably be:

	unsigned long supported = PROT_READ | PROT_WRITE |
				  PROT_EXEC | PROT_SEM;

	if (static_cpu_has(X86_FEATURE_SHSTK) && (prot & PROT_SHSTK)) {
		supported |= PROT_SHSTK;
		// Comment about why SHSTK and WRITE
		// are mutually exclusive.

		supported &= ~PROT_WRITE;
	}

>  #endif /* _ASM_X86_MMAN_H */
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index a8edbcb3af99..520bd8caa005 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -571,6 +571,17 @@ static int do_mprotect_pkey(unsigned long start, size_t
> len,
>  				goto out;
>  		}
>  	}
> +
> +	/*
> +	 * Only anonymous mapping is suitable for shadow stack.
> +	 */

Why?

> +	if (prot & PROT_SHSTK) {
> +		if (vma->vm_file) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +	}

You can also save a couple of lines there.  The two conditions are
pretty small.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-14 14:50                                                       ` [NEEDS-REVIEW] " Dave Hansen
@ 2020-09-14 18:31                                                         ` Andy Lutomirski
  2020-09-14 20:44                                                           ` Yu, Yu-cheng
                                                                             ` (2 more replies)
       [not found]                                                         ` <bf2ab309-f8c4-83da-1c0a-5684e5bc5c82@intel.com>
  1 sibling, 3 replies; 89+ messages in thread
From: Andy Lutomirski @ 2020-09-14 18:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, Andy Lutomirski, Dave Martin, H.J. Lu,
	Florian Weimer, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, open list:DOCUMENTATION, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

> On Sep 14, 2020, at 7:50 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/11/20 3:59 PM, Yu-cheng Yu wrote:
> ...
>> Here are the changes if we take the mprotect(PROT_SHSTK) approach.
>> Any comments/suggestions?
>
> I still don't like it. :)
>
> I'll also be much happier when there's a proper changelog to accompany
> this which also spells out the alternatives any why they suck so much.
>

Let’s take a step back here. Ignoring the precise API, what exactly is
a shadow stack from the perspective of a Linux user program?

The simplest answer is that it’s just memory that happens to have
certain protections.  This enables all kinds of shenanigans.  A
program could map a memfd twice, once as shadow stack and once as
non-shadow-stack, and change its control flow.  Similarly, a program
could mprotect its shadow stack, modify it, and mprotect it back.  In
some threat models, though could be seen as a WRSS bypass.  (Although
if an attacker can coerce a process to call mprotect(), the game is
likely mostly over anyway.)

But we could be more restrictive, or perhaps we could allow user code
to opt into more restrictions.  For example, we could have shadow
stacks be special memory that cannot be written from usermode by any
means other than ptrace() and friends, WRSS, and actual shadow stack
usage.

What is the goal?

No matter what we do, the effects of calling vfork() are going to be a
bit odd with SHSTK enabled.  I suppose we could disallow this, but
that seems likely to cause its own issues.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-14 18:31                                                         ` Andy Lutomirski
@ 2020-09-14 20:44                                                           ` Yu, Yu-cheng
  2020-09-14 21:14                                                           ` Dave Hansen
  2021-09-14  1:33                                                           ` [NEEDS-REVIEW] " Edgecombe, Rick P
  2 siblings, 0 replies; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-14 20:44 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/14/2020 11:31 AM, Andy Lutomirski wrote:
>> On Sep 14, 2020, at 7:50 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 9/11/20 3:59 PM, Yu-cheng Yu wrote:
>> ...
>>> Here are the changes if we take the mprotect(PROT_SHSTK) approach.
>>> Any comments/suggestions?
>>
>> I still don't like it. :)
>>
>> I'll also be much happier when there's a proper changelog to accompany
>> this which also spells out the alternatives any why they suck so much.
>>
> 
> Let’s take a step back here. Ignoring the precise API, what exactly is
> a shadow stack from the perspective of a Linux user program?
> 
> The simplest answer is that it’s just memory that happens to have
> certain protections.  This enables all kinds of shenanigans.  A
> program could map a memfd twice, once as shadow stack and once as
> non-shadow-stack, and change its control flow.  Similarly, a program
> could mprotect its shadow stack, modify it, and mprotect it back.  In

What if we do the following:

- If the mapping has VM_SHARED, it cannot be turned to shadow stack. 
Shadow stack cannot be shared anyway.

- Only allow an anonymous mapping to be converted to shadow stack, but 
not the other way.

> some threat models, though could be seen as a WRSS bypass.  (Although
> if an attacker can coerce a process to call mprotect(), the game is
> likely mostly over anyway.)
> 
> But we could be more restrictive, or perhaps we could allow user code
> to opt into more restrictions.  For example, we could have shadow
> stacks be special memory that cannot be written from usermode by any
> means other than ptrace() and friends, WRSS, and actual shadow stack
> usage.
> 
> What is the goal?

There primary goal is to allocate/mmap a shadow stack from user space.

> 
> No matter what we do, the effects of calling vfork() are going to be a
> bit odd with SHSTK enabled.  I suppose we could disallow this, but
> that seems likely to cause its own issues.
> 

Do you mean vfork() has issues with call/return?  That is taken care of 
in GLIBC.  Or do you mean it has issues with mprotect(PROT_SHSTK)?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-14 18:31                                                         ` Andy Lutomirski
  2020-09-14 20:44                                                           ` Yu, Yu-cheng
@ 2020-09-14 21:14                                                           ` Dave Hansen
  2020-09-16 13:52                                                             ` Andy Lutomirski
  2021-09-14  1:33                                                           ` [NEEDS-REVIEW] " Edgecombe, Rick P
  2 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2020-09-14 21:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Yu-cheng Yu, Dave Martin, H.J. Lu, Florian Weimer, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/14/20 11:31 AM, Andy Lutomirski wrote:
> No matter what we do, the effects of calling vfork() are going to be a
> bit odd with SHSTK enabled.  I suppose we could disallow this, but
> that seems likely to cause its own issues.

What's odd about it?  If you're a vfork()'d child, you can't touch the
stack at all, right?  If you do, you or your parent will probably die a
horrible death.

The extra shadow stacks sanity checks means we'll probably see shadow
stack exceptions instead of the slightly more chaotic death without them.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
       [not found]                                                         ` <bf2ab309-f8c4-83da-1c0a-5684e5bc5c82@intel.com>
@ 2020-09-15 19:08                                                           ` Yu-cheng Yu
  2020-09-15 19:24                                                             ` Dave Hansen
  0 siblings, 1 reply; 89+ messages in thread
From: Yu-cheng Yu @ 2020-09-15 19:08 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On Mon, 2020-09-14 at 17:12 -0700, Yu, Yu-cheng wrote:
> On 9/14/2020 7:50 AM, Dave Hansen wrote:
> > On 9/11/20 3:59 PM, Yu-cheng Yu wrote:
> > ...
> > > Here are the changes if we take the mprotect(PROT_SHSTK) approach.
> > > Any comments/suggestions?
> > 
> > I still don't like it. :)
> > 
> > I'll also be much happier when there's a proper changelog to accompany
> > this which also spells out the alternatives any why they suck so much.

[...]

I revised it.  If this turns out needing more work/discussion, we can split it
out from the shadow stack series.

Thanks,
Yu-cheng

------

From 114f3108207e3eb15fe13b3ff43d1ac273b5072a Mon Sep 17 00:00:00 2001
From: Yu-cheng Yu <yu-cheng.yu@intel.com>
Date: Thu, 10 Sep 2020 16:41:30 -0700
Subject: [PATCH] mm: Introduce PROT_SHSTK for shadow stack

There are three possible options to create a shadow stack allocation API:
an arch_prctl, a new syscall, or adding PROT_SHSTK to mmap()/mprotect().
Each has its advantages and compromises.

An arch_prctl() is the least intrusive.  However, the existing x86
arch_prctl() takes only two parameters.  Multiple parameters must be
passed in a memory buffer.  There is a proposal to pass more parameters in
registers [1], but no active discussion on that.

A new syscall minimizes compatibility issues and offers an extensible frame
work to other architectures, but this will likely result in some overlap of
mmap()/mprotect().

The introduction of PROT_SHSTK to mmap()/mprotect() takes advantage of
existing APIs.  The x86-specific PROT_SHSTK is translated to VM_SHSTK, and
a shadow stack mapping is created without reinventing the wheel.  There are
potential pitfalls though.  The most obvious one would be using this as a
bypass to shadow stack protection.  However, the attacker would have to get
to the syscall first.

Since arch_calc_vm_prot_bits() is modified, I have moved arch_vm_get_page
_prot() and arch_calc_vm_prot_bits() to x86/include/asm/mman.h.
This will be more consistent with other architectures.

[1] https://lore.kernel.org/lkml/20200828121624.108243-1-hjl.tools@gmail.com/

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/mman.h      | 63 ++++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/mman.h | 28 ++------------
 mm/mprotect.c                    | 10 +++++
 3 files changed, 77 insertions(+), 24 deletions(-)
 create mode 100644 arch/x86/include/asm/mman.h

diff --git a/arch/x86/include/asm/mman.h b/arch/x86/include/asm/mman.h
new file mode 100644
index 000000000000..2f19e429c9e2
--- /dev/null
+++ b/arch/x86/include/asm/mman.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_MMAN_H
+#define _ASM_X86_MMAN_H
+
+#include <asm/cpufeature.h>
+#include <uapi/asm/mman.h>
+
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ *
+ * Only override these if Protection Keys are available
+ * (which is only on 64-bit).
+ */
+#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
+		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define pkey_vm_prot_bits(prot, key) (			\
+		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
+		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
+		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
+		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
+#else
+#define pkey_vm_prot_bits(prot, key) (0)
+#endif
+
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+	unsigned long pkey)
+{
+	unsigned long vm_prot_bits = pkey_vm_prot_bits(prot, pkey);
+
+	vm_prot_bits |= ((prot & PROT_SHSTK) ? VM_SHSTK : 0);
+	return vm_prot_bits;
+}
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
+
+static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
+{
+	unsigned long supported = PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM;
+
+	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER) &&
+	    static_cpu_has(X86_FEATURE_SHSTK) && (prot & PROT_SHSTK)) {
+		supported |= PROT_SHSTK;
+
+		/*
+		 * A shadow stack mapping is indirectly writable by only
+		 * the CALL and WRUSS instructions, but not other write
+		 * instructions).  PROT_SHSTK and PROT_WRITE are mutually
+		 * exclusive.
+		 */
+		supported &= ~PROT_WRITE;
+	}
+
+	return (prot & ~supported) == 0;
+}
+#define arch_validate_prot arch_validate_prot
+
+#endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index d4a8d0424bfb..39bb7db344a6 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -1,31 +1,11 @@
 /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-#ifndef _ASM_X86_MMAN_H
-#define _ASM_X86_MMAN_H
+#ifndef _UAPI_ASM_X86_MMAN_H
+#define _UAPI_ASM_X86_MMAN_H
 
 #define MAP_32BIT	0x40		/* only give out 32bit addresses */
 
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
-/*
- * Take the 4 protection key bits out of the vma->vm_flags
- * value and turn them in to the bits that we can put in
- * to a pte.
- *
- * Only override these if Protection Keys are available
- * (which is only on 64-bit).
- */
-#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
-		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
-		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
-		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
-		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
-
-#define arch_calc_vm_prot_bits(prot, key) (		\
-		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
-		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
-		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
-		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
-#endif
+#define PROT_SHSTK	0x10		/* shadow stack pages */
 
 #include <asm-generic/mman.h>
 
-#endif /* _ASM_X86_MMAN_H */
+#endif /* _UAPI_ASM_X86_MMAN_H */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a8edbcb3af99..27a1b7a4e89c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -571,6 +571,16 @@ static int do_mprotect_pkey(unsigned long start, size_t
len,
 				goto out;
 		}
 	}
+
+	/*
+	 * Only anonymous mapping is suitable for shadow stack.  A function
+	 * call stack should not be backed by a file.
+	 */
+	if ((prot & PROT_SHSTK) && (vma->vm_file)) {
+		error = -EINVAL;
+		goto out;
+	}
+
 	if (start > vma->vm_start)
 		prev = vma;
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-15 19:08                                                           ` Yu-cheng Yu
@ 2020-09-15 19:24                                                             ` Dave Hansen
  2020-09-15 20:16                                                               ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2020-09-15 19:24 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/15/20 12:08 PM, Yu-cheng Yu wrote:
> On Mon, 2020-09-14 at 17:12 -0700, Yu, Yu-cheng wrote:
>> On 9/14/2020 7:50 AM, Dave Hansen wrote:
>>> On 9/11/20 3:59 PM, Yu-cheng Yu wrote:
>>> ...
>>>> Here are the changes if we take the mprotect(PROT_SHSTK) approach.
>>>> Any comments/suggestions?
>>> I still don't like it. :)
>>>
>>> I'll also be much happier when there's a proper changelog to accompany
>>> this which also spells out the alternatives any why they suck so much.
> [...]
> 
> I revised it.  If this turns out needing more work/discussion, we can split it
> out from the shadow stack series.

Where does that leave things?  You only get shadow stacks for
single-threaded apps which have the ELF bits set?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-15 19:24                                                             ` Dave Hansen
@ 2020-09-15 20:16                                                               ` Yu, Yu-cheng
  0 siblings, 0 replies; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-15 20:16 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/15/2020 12:24 PM, Dave Hansen wrote:
> On 9/15/20 12:08 PM, Yu-cheng Yu wrote:
>> On Mon, 2020-09-14 at 17:12 -0700, Yu, Yu-cheng wrote:
>>> On 9/14/2020 7:50 AM, Dave Hansen wrote:
>>>> On 9/11/20 3:59 PM, Yu-cheng Yu wrote:
>>>> ...
>>>>> Here are the changes if we take the mprotect(PROT_SHSTK) approach.
>>>>> Any comments/suggestions?
>>>> I still don't like it. :)
>>>>
>>>> I'll also be much happier when there's a proper changelog to accompany
>>>> this which also spells out the alternatives any why they suck so much.
>> [...]
>>
>> I revised it.  If this turns out needing more work/discussion, we can split it
>> out from the shadow stack series.
> 
> Where does that leave things?  You only get shadow stacks for
> single-threaded apps which have the ELF bits set?
> 

As long as the system supports shadow stack, any application can 
mmap()/mprotect() a shadow stack.  A pthread can allocate a shadow stack 
too.  However, only shadow stack-enabled programs can activate/use the 
shadow stack.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-14 21:14                                                           ` Dave Hansen
@ 2020-09-16 13:52                                                             ` Andy Lutomirski
  2020-09-16 19:25                                                               ` Yu, Yu-cheng
  0 siblings, 1 reply; 89+ messages in thread
From: Andy Lutomirski @ 2020-09-16 13:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Yu-cheng Yu, Dave Martin, H.J. Lu,
	Florian Weimer, X86 ML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, open list:DOCUMENTATION, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Weijiang Yang

On Mon, Sep 14, 2020 at 2:14 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/14/20 11:31 AM, Andy Lutomirski wrote:
> > No matter what we do, the effects of calling vfork() are going to be a
> > bit odd with SHSTK enabled.  I suppose we could disallow this, but
> > that seems likely to cause its own issues.
>
> What's odd about it?  If you're a vfork()'d child, you can't touch the
> stack at all, right?  If you do, you or your parent will probably die a
> horrible death.
>

An evil program could vfork(), have the child do a bunch of returns
and a bunch of calls, and exit.  The net effect would be to change the
parent's shadow stack contents.  In a sufficiently strict model, this
is potentially problematic.

The question is: how much do we want to protect userspace from itself?

--Andy

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-16 13:52                                                             ` Andy Lutomirski
@ 2020-09-16 19:25                                                               ` Yu, Yu-cheng
  0 siblings, 0 replies; 89+ messages in thread
From: Yu, Yu-cheng @ 2020-09-16 19:25 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: Dave Martin, H.J. Lu, Florian Weimer, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar,
	Vedvyas Shanbhogue, Weijiang Yang

On 9/16/2020 6:52 AM, Andy Lutomirski wrote:
> On Mon, Sep 14, 2020 at 2:14 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 9/14/20 11:31 AM, Andy Lutomirski wrote:
>>> No matter what we do, the effects of calling vfork() are going to be a
>>> bit odd with SHSTK enabled.  I suppose we could disallow this, but
>>> that seems likely to cause its own issues.
>>
>> What's odd about it?  If you're a vfork()'d child, you can't touch the
>> stack at all, right?  If you do, you or your parent will probably die a
>> horrible death.
>>
> 
> An evil program could vfork(), have the child do a bunch of returns
> and a bunch of calls, and exit.  The net effect would be to change the
> parent's shadow stack contents.  In a sufficiently strict model, this
> is potentially problematic.

When a vfork child returns, its parent's shadow stack pointer is where 
it was before the child starts.  To move the shadow stack pointer and 
re-use the content left by the child, the parent needs to use CALL, RET, 
INCSSP, or RSTORSSP.  This seems to be difficult.

> 
> The question is: how much do we want to protect userspace from itself?
 >

If any issue comes up, people can always find ways to counter it.

> --Andy
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2020-09-14 18:31                                                         ` Andy Lutomirski
  2020-09-14 20:44                                                           ` Yu, Yu-cheng
  2020-09-14 21:14                                                           ` Dave Hansen
@ 2021-09-14  1:33                                                           ` Edgecombe, Rick P
  2021-09-14  9:53                                                             ` Borislav Petkov
  2021-09-20 16:48                                                             ` Andy Lutomirski
  2 siblings, 2 replies; 89+ messages in thread
From: Edgecombe, Rick P @ 2021-09-14  1:33 UTC (permalink / raw)
  To: Lutomirski, Andy, Hansen, Dave
  Cc: bsingharora, hpa, esyr, peterz, rdunlap, keescook, Yu, Yu-cheng,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, linux-arch,
	kcc, bp, oleg, hjl.tools, pavel, linux-doc, Yang, Weijiang, arnd,
	Moreira, Joao, tglx, mike.kravetz, x86, tarasmadan, Dave.Martin,
	vedvyas.shanbhogue, mingo, Shankar, Ravi V, corbet, linux-kernel,
	linux-api, gorcunov

On Mon, 2020-09-14 at 11:31 -0700, Andy Lutomirski wrote:
> > On Sep 14, 2020, at 7:50 AM, Dave Hansen <dave.hansen@intel.com>
> > wrote:
> > 
> > On 9/11/20 3:59 PM, Yu-cheng Yu wrote:
> > ...
> > > Here are the changes if we take the mprotect(PROT_SHSTK)
> > > approach.
> > > Any comments/suggestions?
> > 
> > I still don't like it. :)
> > 
> > I'll also be much happier when there's a proper changelog to
> > accompany
> > this which also spells out the alternatives any why they suck so
> > much.
> > 
> 
> Let’s take a step back here. Ignoring the precise API, what exactly
> is
> a shadow stack from the perspective of a Linux user program?
> 
> The simplest answer is that it’s just memory that happens to have
> certain protections.  This enables all kinds of shenanigans.  A
> program could map a memfd twice, once as shadow stack and once as
> non-shadow-stack, and change its control flow.  Similarly, a program
> could mprotect its shadow stack, modify it, and mprotect it back.  In
> some threat models, though could be seen as a WRSS bypass.  (Although
> if an attacker can coerce a process to call mprotect(), the game is
> likely mostly over anyway.)
> 
> But we could be more restrictive, or perhaps we could allow user code
> to opt into more restrictions.  For example, we could have shadow
> stacks be special memory that cannot be written from usermode by any
> means other than ptrace() and friends, WRSS, and actual shadow stack
> usage.
> 
> What is the goal?
> 
> No matter what we do, the effects of calling vfork() are going to be
> a
> bit odd with SHSTK enabled.  I suppose we could disallow this, but
> that seems likely to cause its own issues.

Hi,

Resurrecting this old thread to highlight a consequence of the design
change that came out of it. I am going to be taking over this series
from Yu-cheng, and wanted to check if people would be interested in re-
visiting this interface.

The consequence I wanted to highlight, is that making userspace be
responsible for mapping memory as shadow stack, also requires moving
the writing of the restore token to userspace for glibc ucontext
operations. Since these operations involve creating/pivoting to new
stacks in userspace, ucontext cet support involves also creating a new
shadow stack. For normal thread stacks, the kernel has always done the
shadow stack allocation and so it is never writable (in the normal
sense) from userspace. But after this change makecontext() now first
has to mmap() writable memory, then write the restore token, then
mprotect() it as shadow stack. See the glibc changes to support
PROT_SHADOW_STACK here[0].

The writable window leaves an opening for an attacker to create an
arbitrary shadow stack that could be pivoted to later by tweaking the
ucontext_t structure. To try to see how much this matters, we have done
a small test that uses this window to ROP from writes in another
thread during the makecontext()/setcontext() window. (offensive work
credit to Joao on CC). This would require a real app to already to be
using ucontext in the course of normal runtime.

The original prctl solution prevents this case since the kernel did the
allocation and restore token setup, but of course it had other issues.
The other ideas discussed previously were a new syscall, or some sort
of new madvise() operation that could be involved in setting up shadow
stack, such that it is never writable in userspace. Or, simpler but
uglier, tweaking the existing PROT_SHADOW_STACK based solution by
making core mmap code write a restore token.

Those are still probably not enough to stop all ROP given extreme
threat models, as Andy alluded. But to me, this one seemed maybe worth
trying to improve. Especially thinking to avoid future ABI changes if
it becomes a problem. Would anyone like to see attempts to tighten this
up by revisiting this shadow stack allocation interface?

Thanks,

Rick

[0] 
https://gitlab.com/x86-glibc/glibc/-/commit/98eba0b829a1884c7d7bf76b0b39725919a9b6af#92c86763df6845c5b29664ac2ff40277678fd2c5_0_1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2021-09-14  1:33                                                           ` [NEEDS-REVIEW] " Edgecombe, Rick P
@ 2021-09-14  9:53                                                             ` Borislav Petkov
  2021-09-20 16:48                                                             ` Andy Lutomirski
  1 sibling, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2021-09-14  9:53 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Lutomirski, Andy, Hansen, Dave, bsingharora, hpa, esyr, peterz,
	rdunlap, keescook, Yu, Yu-cheng, dave.hansen, linux-mm, fweimer,
	nadav.amit, jannh, linux-arch, kcc, oleg, hjl.tools, pavel,
	linux-doc, Yang, Weijiang, arnd, Moreira, Joao, tglx,
	mike.kravetz, x86, tarasmadan, Dave.Martin, vedvyas.shanbhogue,
	mingo, Shankar, Ravi V, corbet, linux-kernel, linux-api,
	gorcunov

On Tue, Sep 14, 2021 at 01:33:02AM +0000, Edgecombe, Rick P wrote:
> The original prctl solution prevents this case since the kernel did the
> allocation and restore token setup, but of course it had other issues.
> The other ideas discussed previously were a new syscall, or some sort
> of new madvise() operation that could be involved in setting up shadow
> stack, such that it is never writable in userspace.

If I had to choose - and this is only my 2¢ anyway - I'd opt for this
until there's a really good reason for allowing shstk programs to fiddle
with their own shstk. Maybe there is but allowing them to do that sounds
to me like: "ew, why do we go to all this trouble to have shadow stacks
if programs would be allowed to fumble with it themselves? Might as well
not do shadow stacks at all."

And if/when there is a good reason, the API should be defined and
discussed properly at first, before we expose it to luserspace, ofc.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2021-09-14  1:33                                                           ` [NEEDS-REVIEW] " Edgecombe, Rick P
  2021-09-14  9:53                                                             ` Borislav Petkov
@ 2021-09-20 16:48                                                             ` Andy Lutomirski
  2021-09-23 23:32                                                               ` Edgecombe, Rick P
  1 sibling, 1 reply; 89+ messages in thread
From: Andy Lutomirski @ 2021-09-20 16:48 UTC (permalink / raw)
  To: Rick P Edgecombe, Dave Hansen
  Cc: Balbir Singh, H. Peter Anvin, Eugene Syromiatnikov,
	Peter Zijlstra (Intel),
	Randy Dunlap, Kees Cook, Yu-cheng Yu, Dave Hansen, linux-mm,
	Florian Weimer, Nadav Amit, Jann Horn, linux-arch, kcc,
	Borislav Petkov, Oleg Nesterov, H.J. Lu, Pavel Machek, linux-doc,
	Weijiang Yang, Arnd Bergmann, Moreira, Joao, Thomas Gleixner,
	Mike Kravetz, the arch/x86 maintainers, tarasmadan, Dave Martin,
	vedvyas.shanbhogue, Ingo Molnar, Shankar, Ravi V,
	Jonathan Corbet, Linux Kernel Mailing List, Linux API,
	Cyrill Gorcunov



On Mon, Sep 13, 2021, at 6:33 PM, Edgecombe, Rick P wrote:
> On Mon, 2020-09-14 at 11:31 -0700, Andy Lutomirski wrote:
> > > On Sep 14, 2020, at 7:50 AM, Dave Hansen <dave.hansen@intel.com>
> > > wrote:
> > > 
> > > On 9/11/20 3:59 PM, Yu-cheng Yu wrote:
> > > ...
> > > > Here are the changes if we take the mprotect(PROT_SHSTK)
> > > > approach.
> > > > Any comments/suggestions?
> > > 
> > > I still don't like it. :)
> > > 
> > > I'll also be much happier when there's a proper changelog to
> > > accompany
> > > this which also spells out the alternatives any why they suck so
> > > much.
> > > 
> > 
> > Let’s take a step back here. Ignoring the precise API, what exactly
> > is
> > a shadow stack from the perspective of a Linux user program?
> > 
> > The simplest answer is that it’s just memory that happens to have
> > certain protections.  This enables all kinds of shenanigans.  A
> > program could map a memfd twice, once as shadow stack and once as
> > non-shadow-stack, and change its control flow.  Similarly, a program
> > could mprotect its shadow stack, modify it, and mprotect it back.  In
> > some threat models, though could be seen as a WRSS bypass.  (Although
> > if an attacker can coerce a process to call mprotect(), the game is
> > likely mostly over anyway.)
> > 
> > But we could be more restrictive, or perhaps we could allow user code
> > to opt into more restrictions.  For example, we could have shadow
> > stacks be special memory that cannot be written from usermode by any
> > means other than ptrace() and friends, WRSS, and actual shadow stack
> > usage.
> > 
> > What is the goal?
> > 
> > No matter what we do, the effects of calling vfork() are going to be
> > a
> > bit odd with SHSTK enabled.  I suppose we could disallow this, but
> > that seems likely to cause its own issues.
> 
> Hi,
> 
> Resurrecting this old thread to highlight a consequence of the design
> change that came out of it. I am going to be taking over this series
> from Yu-cheng, and wanted to check if people would be interested in re-
> visiting this interface.
> 
> The consequence I wanted to highlight, is that making userspace be
> responsible for mapping memory as shadow stack, also requires moving
> the writing of the restore token to userspace for glibc ucontext
> operations. Since these operations involve creating/pivoting to new
> stacks in userspace, ucontext cet support involves also creating a new
> shadow stack. For normal thread stacks, the kernel has always done the
> shadow stack allocation and so it is never writable (in the normal
> sense) from userspace. But after this change makecontext() now first
> has to mmap() writable memory, then write the restore token, then
> mprotect() it as shadow stack. See the glibc changes to support
> PROT_SHADOW_STACK here[0].
> 
> The writable window leaves an opening for an attacker to create an
> arbitrary shadow stack that could be pivoted to later by tweaking the
> ucontext_t structure. To try to see how much this matters, we have done
> a small test that uses this window to ROP from writes in another
> thread during the makecontext()/setcontext() window. (offensive work
> credit to Joao on CC). This would require a real app to already to be
> using ucontext in the course of normal runtime.

My general opinion here (take this with a grain of salt -- I haven't paged back in every single detail) is that the kernel should make it straightforward for a libc to do the right thing without nasty races, cross-thread coordination, or unnecessary permission to write to the stack.  I *also* think that it should be possible for userspace to manage its own shadow stack allocation if it wants to, since I'm sure there will be JIT or green thread or other use cases that want to do crazy things that we fail to anticipate with in-kernel magic.

So perhaps we should keep the explicit allocation and free operations, have a way to opt-in to WRSS being flipped on, but also do our best to have API that handle the known cases well.

Does that make sense?  Can we have both approaches work in the same kernel?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack
  2021-09-20 16:48                                                             ` Andy Lutomirski
@ 2021-09-23 23:32                                                               ` Edgecombe, Rick P
  0 siblings, 0 replies; 89+ messages in thread
From: Edgecombe, Rick P @ 2021-09-23 23:32 UTC (permalink / raw)
  To: Lutomirski, Andy, Hansen, Dave
  Cc: bsingharora, hpa, esyr, peterz, rdunlap, keescook, Yu, Yu-cheng,
	dave.hansen, linux-mm, fweimer, nadav.amit, jannh, kcc,
	linux-arch, bp, oleg, hjl.tools, Yang, Weijiang, linux-doc,
	pavel, arnd, Moreira, Joao, tglx, mike.kravetz, x86, tarasmadan,
	Dave.Martin, vedvyas.shanbhogue, Shankar, Ravi V, mingo, corbet,
	linux-kernel, linux-api, gorcunov

On Mon, 2021-09-20 at 09:48 -0700, Andy Lutomirski wrote:
> My general opinion here (take this with a grain of salt -- I haven't
> paged back in every single detail) is that the kernel should make it
> straightforward for a libc to do the right thing without nasty races,
> cross-thread coordination, or unnecessary permission to write to the
> stack.  I *also* think that it should be possible for userspace to
> manage its own shadow stack allocation if it wants to, since I'm sure
> there will be JIT or green thread or other use cases that want to do
> crazy things that we fail to anticipate with in-kernel magic.
> 
> So perhaps we should keep the explicit allocation and free
> operations, have a way to opt-in to WRSS being flipped on, but also
> do our best to have API that handle the known cases well.
> 
> Does that make sense?  Can we have both approaches work in the same
> kernel?

I think so. I'll take a look at adding a prctl to enable WRSS. Since
there already is ARCH_X86_CET_DISABLE to disable CET, it doesn't seem
like it should escalate anything. And ARCH_X86_CET_LOCK can prevent
turning it on if desired.

Thanks,

Rick

^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2021-09-23 23:32 UTC | newest]

Thread overview: 89+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-25  0:25 [PATCH v11 00/25] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 01/25] Documentation/x86: Add CET description Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 02/25] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 03/25] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 04/25] x86/cet: Add control-protection fault handler Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 05/25] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 06/25] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 07/25] x86/mm: Remove _PAGE_DIRTY_HW from kernel RO pages Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 08/25] x86/mm: Introduce _PAGE_COW Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 09/25] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 10/25] x86/mm: Update pte_modify for _PAGE_COW Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 11/25] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY_HW to _PAGE_COW Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 12/25] mm: Introduce VM_SHSTK for shadow stack memory Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 13/25] x86/mm: Shadow Stack page fault error checking Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 14/25] x86/mm: Update maybe_mkwrite() for shadow stack Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 15/25] mm: Fixup places that call pte_mkwrite() directly Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 16/25] mm: Add guard pages around a shadow stack Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 17/25] mm/mmap: Add shadow stack pages to memory accounting Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 18/25] mm: Update can_follow_write_pte() for shadow stack Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 19/25] mm: Re-introduce do_mmap_pgoff() Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 20/25] x86/cet/shstk: User-mode shadow stack support Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 21/25] x86/cet/shstk: Handle signals for shadow stack Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 22/25] binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND properties Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 23/25] ELF: Introduce arch_setup_elf_property() Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 24/25] x86/cet/shstk: Handle thread shadow stack Yu-cheng Yu
2020-08-25  0:25 ` [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for " Yu-cheng Yu
2020-08-25  0:36   ` Andy Lutomirski
2020-08-25 18:43     ` Yu, Yu-cheng
2020-08-25 19:19       ` Dave Hansen
2020-08-25 21:04         ` Yu, Yu-cheng
2020-08-25 23:20           ` Dave Hansen
2020-08-25 23:34             ` Yu, Yu-cheng
2020-08-26 16:46               ` Dave Martin
2020-08-26 16:51                 ` Florian Weimer
2020-08-26 17:04                   ` Andy Lutomirski
2020-08-26 18:49                     ` Yu, Yu-cheng
2020-08-26 19:43                       ` H.J. Lu
2020-08-26 19:57                       ` Dave Hansen
2020-08-27 13:26                         ` H.J. Lu
2020-09-01 10:28                           ` Dave Martin
2020-09-01 17:23                             ` Yu, Yu-cheng
2020-09-01 17:45                               ` Andy Lutomirski
2020-09-01 18:11                                 ` Dave Hansen
2020-09-02 13:58                                   ` Dave Martin
     [not found]                                   ` <46dffdfd-92f8-0f05-6164-945f217b0958@intel.com>
2020-09-08 17:57                                     ` Dave Hansen
2020-09-08 18:25                                       ` Yu, Yu-cheng
2020-09-09 22:08                                         ` Yu, Yu-cheng
2020-09-09 22:59                                           ` Dave Hansen
2020-09-09 23:07                                             ` Yu, Yu-cheng
2020-09-09 23:11                                               ` Dave Hansen
2020-09-09 23:25                                                 ` Yu, Yu-cheng
2020-09-09 23:29                                                   ` Dave Hansen
2020-09-09 23:45                                                     ` Yu, Yu-cheng
2020-09-11 22:59                                                     ` Yu-cheng Yu
2020-09-14 14:50                                                       ` [NEEDS-REVIEW] " Dave Hansen
2020-09-14 18:31                                                         ` Andy Lutomirski
2020-09-14 20:44                                                           ` Yu, Yu-cheng
2020-09-14 21:14                                                           ` Dave Hansen
2020-09-16 13:52                                                             ` Andy Lutomirski
2020-09-16 19:25                                                               ` Yu, Yu-cheng
2021-09-14  1:33                                                           ` [NEEDS-REVIEW] " Edgecombe, Rick P
2021-09-14  9:53                                                             ` Borislav Petkov
2021-09-20 16:48                                                             ` Andy Lutomirski
2021-09-23 23:32                                                               ` Edgecombe, Rick P
     [not found]                                                         ` <bf2ab309-f8c4-83da-1c0a-5684e5bc5c82@intel.com>
2020-09-15 19:08                                                           ` Yu-cheng Yu
2020-09-15 19:24                                                             ` Dave Hansen
2020-09-15 20:16                                                               ` Yu, Yu-cheng
2020-08-26 17:08                   ` Dave Martin
2020-08-27 13:18                     ` Florian Weimer
2020-08-27 13:28                       ` H.J. Lu
2020-08-27 13:36                         ` Florian Weimer
2020-08-27 14:07                           ` H.J. Lu
2020-08-27 14:08                             ` H.J. Lu
2020-09-01 17:49                               ` Yu, Yu-cheng
2020-09-01 17:50                                 ` Florian Weimer
2020-09-01 17:58                                   ` Yu, Yu-cheng
2020-09-01 18:17                                     ` Florian Weimer
2020-09-01 18:19                                       ` H.J. Lu
2020-09-01 18:24                                       ` Yu, Yu-cheng
2020-08-27 18:13                           ` Yu, Yu-cheng
2020-08-27 18:56                             ` Andy Lutomirski
2020-08-27 19:33                               ` Yu, Yu-cheng
2020-08-27 19:37                               ` H.J. Lu
2020-08-28  1:35                                 ` Andy Lutomirski
2020-08-28  1:44                                   ` H.J. Lu
2020-08-28  6:23                                     ` Florian Weimer
2020-08-28 11:37                                       ` H.J. Lu
2020-08-28 17:39                                         ` Andy Lutomirski
2020-08-28 17:45                                           ` H.J. Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).