* [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection
@ 2021-08-04  4:32 ira.weiny
  2021-08-04  4:32 ` [PATCH V7 01/18] x86/pkeys: Create pkeys_common.h ira.weiny
                   ` (17 more replies)
  0 siblings, 18 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

NOTE: x86 maintainers, I'm submitting this for ack/review by Dave Hansen and
Dan Williams.  Feel free to ignore it, but we have had a lot of internal debate
on a number of design decisions, so we would like the remaining reviews to be
public such that everyone can see the remaining debate and decisions.

Furthermore, this gives a public reference for Rick to build other PKS use
cases on.


PKS/PMEM Stray write protection
===============================

This series is broken into 2 parts.

	1) Introduce Protection Key Supervisor (PKS)
	2) Use PKS to protect PMEM from stray writes

Introduce Protection Key Supervisor (PKS)
-----------------------------------------

PKS enables protections on 'domains' of supervisor pages, limiting supervisor
mode access to those pages beyond the normal paging protections.  PKS works in
a similar fashion to user space pkeys (PKU).  As with PKU, supervisor pkeys are
checked in addition to the normal paging protections, and access or writes can
be disabled via an MSR update without TLB flushes when permissions change.

Also like PKU, a page mapping is assigned to a domain by setting pkey bits in
the page table entry for that mapping.

Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
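
The register layout mirrors PKRU: 2 bits per pkey, an Access Disable (AD)
bit and a Write Disable (WD) bit.  As a minimal sketch (not part of the
series; the series wraps the MSR write in a cached write_pkrs() helper),
composing and writing a PKRS value with the PKR_* defines added in this
series would look like:

	/* Write-disable pkey 2 and access-disable pkey 3; all other
	 * pkeys, including pkey 0, remain fully accessible. */
	u32 pkrs = PKR_WD_KEY(2) | PKR_AD_KEY(3);

	wrmsrl(MSR_IA32_PKRS, pkrs);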

XSAVE is not supported for the PKRS MSR.  Therefore the implementation
saves/restores the MSR across context switches and during exceptions.  Nested
exceptions are supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections.

The other keys (1-15) are statically allocated by kernel users: each adds an
entry to 'enum pks_pkey_consumers' and a corresponding default value to
consumer_defaults[] in create_initial_pkrs_value().  This patch series
allocates a single key for use by persistent memory stray write protection.
When the number of users grows larger, sharing of keys will need to be
resolved depending on the needs of the users at that time.
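
As a sketch of what a future consumer would add (the consumer name below
is hypothetical; the enum and consumer_defaults[] are introduced in the
PKS setup patch of this series):

	/* include/linux/pkeys.h: allocate a new supervisor pkey */
	enum pks_pkey_consumers {
		PKS_KEY_DEFAULT = 0,	/* Must be 0 for default PTE values */
		PKS_KEY_MY_DRIVER,	/* hypothetical new consumer */
		PKS_KEY_NR_CONSUMERS
	};

	/* arch/x86/mm/pkeys.c: create_initial_pkrs_value()
	 * Only needed when the default should not be Access Disabled. */
	consumer_defaults[PKS_KEY_MY_DRIVER] = PKR_WD_BIT;	/* read-only */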

More usage details can be found in the documentation.

The following are key attributes of PKS.

	1) Fast switching of permissions
		1a) Prevents access without page table manipulations
		1b) No TLB flushes required
	2) Works on a per thread basis

PKS is available with 4 and 5 level paging.  Like PKU, it consumes 4 bits from
the PTE to store the pkey within the entry.


Use PKS to protect PMEM from stray writes
-----------------------------------------

DAX leverages the direct-map to enable 'struct page' services for PMEM.  Given
that PMEM capacity may be an order of magnitude larger than System RAM, it
presents a large vulnerability surface to stray writes.  Such a stray write
becomes a silent data corruption bug.

Given that PMEM access from the kernel is limited to a constrained set of
locations (PMEM driver, Filesystem-DAX, and direct-I/O), it is amenable to PKS
protection.  Set up an infrastructure for extra device access protection. Then
implement the protection using the new Protection Keys Supervisor (PKS) on
architectures which support it.

Because PMEM pages are all associated with a struct dev_pagemap, the flag
indicating protected memory can be stored there.  All PMEM is protected by the
same pkey, so a single flag is all that is needed to indicate protection.
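
A sketch of the resulting check (the flag name here is hypothetical; the
real flag is added by the memremap_pages patches later in the series):

	/* Sketch: is this ZONE_DEVICE page covered by PKS protection? */
	static bool devmap_needs_pks(struct page *page)
	{
		struct dev_pagemap *pgmap = page->pgmap;

		return pgmap && (pgmap->flags & PGMAP_PROTECTION); /* hypothetical flag */
	}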

General access in the kernel is supported by modifying the kmap infrastructure
to detect whether a page is PMEM and PKS protected.  If so, kmap_local_page()
and kmap_atomic() enable access until the corresponding unmap calls are made.
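
For example, a typical thread-local access would look like the following
sketch (page, offset, src, and len are placeholders):

	/* The mapping enables PKS access for this thread when the page is
	 * a protected devmap page; unmap disables the access again. */
	void *addr = kmap_local_page(page);

	memcpy(addr + offset, src, len);

	kunmap_local(addr);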

This implementation avoids supporting kmap()/kunmap() for a number of
reasons.  First, kmap() was never really intended to create long term
mappings.  Second, no known kernel users of PMEM use kmap().  Third, PKS is a
thread local mechanism.

Originally this series modified many of the kmap call sites to indicate they
were thread local.[1]  An attempt to support kmap()[2] was also made.  But now
that kmap_local_page() has been developed[3] and is in more widespread use,
kmap() should be safe to leave unsupported and is considered an invalid access.

Handling invalid access to these pages is configurable via a new module
parameter, memremap.pks_fault_mode.  Two modes are supported.

	'relaxed' (default) -- WARN_ONCE, disable the protection and allow
	                       access

	'strict' -- prevent any unguarded access to a protected dev_pagemap
		    range

The fault handler detects the PMEM fault and applies the above configuration
to the faulting thread.  The kmap call is a special case: it is considered an
invalid access, but the configuration is applied early, before any access,
such that the kmap code path can be better evaluated and fixed.
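
For example, the stricter mode can be selected on the kernel command line
(per the kernel-parameters.txt addition in this series):

	memremap.pks_fault_mode=strict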


[1] https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/

[2] https://lore.kernel.org/lkml/87mtycqcjf.fsf@nanos.tec.linutronix.de/

[3] https://lore.kernel.org/lkml/20210128061503.1496847-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210210062221.3023586-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210205170030.856723-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210217024826.3466046-1-ira.weiny@intel.com/

[4] https://lore.kernel.org/lkml/20201106232908.364581-1-ira.weiny@intel.com/

[5] https://lore.kernel.org/lkml/20210322053020.2287058-1-ira.weiny@intel.com/

[6] https://lore.kernel.org/lkml/20210331191405.341999-1-ira.weiny@intel.com/


Fenghua Yu (1):
  x86/pks: Add PKS kernel API

Ira Weiny (16):
  x86/pkeys: Create pkeys_common.h
  x86/fpu: Refactor arch_set_user_pkey_access()
  x86/pks: Add additional PKEY helper macros
  x86/pks: Add PKS defines and Kconfig options
  x86/pks: Add PKS setup code
  x86/fault: Adjust WARN_ON for PKey fault
  x86/pks: Preserve the PKRS MSR on context switch
  x86/entry: Preserve PKRS MSR across exceptions
  x86/pks: Introduce pks_abandon_protections()
  x86/pks: Add PKS Test code
  memremap_pages: Add access protection via supervisor Protection Keys
    (PKS)
  memremap_pages: Add memremap.pks_fault_mode
  kmap: Add stray access protection for devmap pages
  dax: Stray access protection for dax_direct_access()
  nvdimm/pmem: Enable stray access protection
  devdax: Enable stray access protection

Rick Edgecombe (1):
  x86/pks: Add PKS fault callbacks

 .../admin-guide/kernel-parameters.txt         |  14 +
 Documentation/core-api/protection-keys.rst    | 153 +++-
 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/calling.h                      |  26 +
 arch/x86/entry/common.c                       |  56 ++
 arch/x86/entry/entry_64.S                     |  22 +-
 arch/x86/entry/entry_64_compat.S              |   6 +-
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/msr-index.h              |   1 +
 arch/x86/include/asm/pgtable_types.h          |  12 +
 arch/x86/include/asm/pkeys.h                  |   2 +
 arch/x86/include/asm/pkeys_common.h           |  19 +
 arch/x86/include/asm/pkru.h                   |  16 +-
 arch/x86/include/asm/pks.h                    |  67 ++
 arch/x86/include/asm/processor-flags.h        |   2 +
 arch/x86/include/asm/processor.h              |  19 +-
 arch/x86/include/uapi/asm/processor-flags.h   |   2 +
 arch/x86/kernel/cpu/common.c                  |   2 +
 arch/x86/kernel/fpu/xstate.c                  |  22 +-
 arch/x86/kernel/head_64.S                     |   7 +-
 arch/x86/kernel/process.c                     |   3 +
 arch/x86/kernel/process_64.c                  |   3 +
 arch/x86/mm/fault.c                           |  82 +-
 arch/x86/mm/pkeys.c                           | 277 +++++-
 drivers/dax/device.c                          |   2 +
 drivers/dax/super.c                           |  54 ++
 drivers/md/dm-writecache.c                    |   8 +-
 drivers/nvdimm/pmem.c                         |  55 +-
 fs/dax.c                                      |   8 +
 fs/fuse/virtio_fs.c                           |   2 +
 include/linux/dax.h                           |   8 +
 include/linux/highmem-internal.h              |   5 +
 include/linux/memremap.h                      |   1 +
 include/linux/mm.h                            |  88 ++
 include/linux/pgtable.h                       |   4 +
 include/linux/pkeys.h                         |  36 +
 include/linux/sched.h                         |   7 +
 init/init_task.c                              |   3 +
 kernel/entry/common.c                         |  14 +-
 kernel/fork.c                                 |   3 +
 lib/Kconfig.debug                             |  13 +
 lib/Makefile                                  |   3 +
 lib/pks/Makefile                              |   3 +
 lib/pks/pks_test.c                            | 864 ++++++++++++++++++
 mm/Kconfig                                    |  26 +
 mm/memremap.c                                 | 158 ++++
 tools/testing/selftests/x86/Makefile          |   2 +-
 tools/testing/selftests/x86/test_pks.c        | 157 ++++
 49 files changed, 2261 insertions(+), 86 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h
 create mode 100644 arch/x86/include/asm/pks.h
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

-- 
2.28.0.rc0.12.gb6a658bd00c9


* [PATCH V7 01/18] x86/pkeys: Create pkeys_common.h
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 02/18] x86/fpu: Refactor arch_set_user_pkey_access() ira.weiny
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions and can share common defines.  Specifically PKS and PKU
each have:

	1. A single control register
	2. The same number of keys
	3. The same number of bits in the register per key
	4. Access and Write disable in the same bit locations

Given the above, share all the macros that synthesize and manipulate
register values between the two features.  Share these defines by moving
them into a new header, changing their names to reflect the common use,
and including the header where needed.

Also, while editing the code, remove the use of 'we' from the comments
being touched.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/include/asm/pkeys_common.h | 11 +++++++++++
 arch/x86/include/asm/pkru.h         | 18 ++++++------------
 arch/x86/kernel/fpu/xstate.c        |  8 ++++----
 arch/x86/mm/pkeys.c                 | 14 ++++++--------
 4 files changed, 27 insertions(+), 24 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h

diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
new file mode 100644
index 000000000000..f3277717faeb
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_COMMON_H
+#define _ASM_X86_PKEYS_COMMON_H
+
+#define PKR_AD_BIT 0x1
+#define PKR_WD_BIT 0x2
+#define PKR_BITS_PER_PKEY 2
+
+#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_COMMON_H */
diff --git a/arch/x86/include/asm/pkru.h b/arch/x86/include/asm/pkru.h
index ccc539faa5bb..a74325b0d1df 100644
--- a/arch/x86/include/asm/pkru.h
+++ b/arch/x86/include/asm/pkru.h
@@ -3,10 +3,7 @@
 #define _ASM_X86_PKRU_H
 
 #include <asm/fpu/xstate.h>
-
-#define PKRU_AD_BIT 0x1
-#define PKRU_WD_BIT 0x2
-#define PKRU_BITS_PER_PKEY 2
+#include <asm/pkeys_common.h>
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 extern u32 init_pkru_value;
@@ -18,18 +15,15 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+	return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-	/*
-	 * Access-disable disables writes too so we need to check
-	 * both bits here.
-	 */
-	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+	/* Access-disable disables writes too so check both bits here. */
+	return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
 }
 
 static inline u32 read_pkru(void)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8def1b7f8fb..6af0c80ad425 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -933,11 +933,11 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	if (WARN_ON_ONCE(pkey >= arch_max_pkey()))
 		return -EINVAL;
 
-	/* Set the bits we need in PKRU:  */
+	/* Set the bits needed in PKRU:  */
 	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKRU_AD_BIT;
+		new_pkru_bits |= PKR_AD_BIT;
 	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKRU_WD_BIT;
+		new_pkru_bits |= PKR_WD_BIT;
 
 	/* Shift the bits in to the correct place in PKRU for pkey: */
 	pkey_shift = pkey * PKRU_BITS_PER_PKEY;
@@ -945,7 +945,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 
 	/* Get old PKRU and mask off any old bits in place: */
 	old_pkru = read_pkru();
-	old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
 
 	/* Write old part along with new part: */
 	write_pkru(old_pkru | new_pkru_bits);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index e44e938885b7..aa7042f272fb 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -110,19 +110,17 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey
 	return vma_pkey(vma);
 }
 
-#define PKRU_AD_KEY(pkey)	(PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY))
-
 /*
  * Make the default PKRU value (at execve() time) as restrictive
  * as possible.  This ensures that any threads clone()'d early
  * in the process's lifetime will not accidentally get access
  * to data which is pkey-protected later on.
  */
-u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) |
-		      PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) |
-		      PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) |
-		      PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) |
-		      PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15);
+u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) |
+		      PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) |
+		      PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) |
+		      PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) |
+		      PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15);
 
 static ssize_t init_pkru_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
@@ -155,7 +153,7 @@ static ssize_t init_pkru_write_file(struct file *file,
 	 * up immediately if someone attempts to disable access
 	 * or writes to pkey 0.
 	 */
-	if (new_init_pkru & (PKRU_AD_BIT|PKRU_WD_BIT))
+	if (new_init_pkru & (PKR_AD_BIT|PKR_WD_BIT))
 		return -EINVAL;
 
 	WRITE_ONCE(init_pkru_value, new_init_pkru);
-- 
2.28.0.rc0.12.gb6a658bd00c9


* [PATCH V7 02/18] x86/fpu: Refactor arch_set_user_pkey_access()
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
  2021-08-04  4:32 ` [PATCH V7 01/18] x86/pkeys: Create pkeys_common.h ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-11-25 14:23   ` Thomas Gleixner
  2021-08-04  4:32 ` [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros ira.weiny
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Both PKU and PKS update their register values in the same way.  They can
therefore share the update code.

Define a helper, update_pkey_val(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.

Use that helper in arch_set_user_pkey_access().
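
For illustration (not part of the patch), write-disabling pkey 2 in an
otherwise fully-permissive register reduces to:

	u32 val = update_pkey_val(0, 2, PKEY_DISABLE_WRITE);
	/* pkey 2 occupies bits 4-5, so val == PKR_WD_BIT << 4 == 0x20 */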

Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/include/asm/pkeys.h |  2 ++
 arch/x86/kernel/fpu/xstate.c | 22 ++++------------------
 arch/x86/mm/pkeys.c          | 23 +++++++++++++++++++++++
 3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 5c7bcaa79623..597f19e4525b 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -133,4 +133,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 	return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 6af0c80ad425..4f95ab38a23c 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -915,8 +915,7 @@ EXPORT_SYMBOL_GPL(get_xsave_addr);
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 			      unsigned long init_val)
 {
-	u32 old_pkru, new_pkru_bits = 0;
-	int pkey_shift;
+	u32 pkru;
 
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
@@ -933,22 +932,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	if (WARN_ON_ONCE(pkey >= arch_max_pkey()))
 		return -EINVAL;
 
-	/* Set the bits needed in PKRU:  */
-	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKR_AD_BIT;
-	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKR_WD_BIT;
-
-	/* Shift the bits in to the correct place in PKRU for pkey: */
-	pkey_shift = pkey * PKRU_BITS_PER_PKEY;
-	new_pkru_bits <<= pkey_shift;
-
-	/* Get old PKRU and mask off any old bits in place: */
-	old_pkru = read_pkru();
-	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
-	/* Write old part along with new part: */
-	write_pkru(old_pkru | new_pkru_bits);
+	pkru = read_pkru();
+	pkru = update_pkey_val(pkru, pkey, init_val);
+	write_pkru(pkru);
 
 	return 0;
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index aa7042f272fb..ca2e20b18645 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -190,3 +190,26 @@ static __init int setup_init_pkru(char *opt)
 	return 1;
 }
 __setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Replace disable bits for @pkey with values from @flags
+ *
+ * Kernel users use the same flags as user space:
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ */
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
+{
+	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
+
+	/*  Mask out old bit values */
+	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
+
+	/*  Or in new values */
+	if (flags & PKEY_DISABLE_ACCESS)
+		pk_reg |= PKR_AD_BIT << pkey_shift;
+	if (flags & PKEY_DISABLE_WRITE)
+		pk_reg |= PKR_WD_BIT << pkey_shift;
+
+	return pk_reg;
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9


* [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
  2021-08-04  4:32 ` [PATCH V7 01/18] x86/pkeys: Create pkeys_common.h ira.weiny
  2021-08-04  4:32 ` [PATCH V7 02/18] x86/fpu: Refactor arch_set_user_pkey_access() ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-11-25 14:25   ` Thomas Gleixner
  2021-08-04  4:32 ` [PATCH V7 04/18] x86/pks: Add PKS defines and Kconfig options ira.weiny
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Avoid open coding shift and mask operations by defining and using helper
macros for PKey operations.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/include/asm/pkeys_common.h | 6 +++++-
 arch/x86/include/asm/pkru.h         | 6 ++----
 arch/x86/mm/pkeys.c                 | 8 +++-----
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index f3277717faeb..8a3c6d2e6a8a 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -6,6 +6,10 @@
 #define PKR_WD_BIT 0x2
 #define PKR_BITS_PER_PKEY 2
 
-#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+#define PKR_PKEY_SHIFT(pkey) (pkey * PKR_BITS_PER_PKEY)
+#define PKR_PKEY_MASK(pkey)  (((1 << PKR_BITS_PER_PKEY) - 1) << PKR_PKEY_SHIFT(pkey))
+
+#define PKR_AD_KEY(pkey)     (PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
+#define PKR_WD_KEY(pkey)     (PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
 
 #endif /*_ASM_X86_PKEYS_COMMON_H */
diff --git a/arch/x86/include/asm/pkru.h b/arch/x86/include/asm/pkru.h
index a74325b0d1df..fb44ff542028 100644
--- a/arch/x86/include/asm/pkru.h
+++ b/arch/x86/include/asm/pkru.h
@@ -15,15 +15,13 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
-	return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
+	return !(pkru & PKR_AD_KEY(pkey));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
 	/* Access-disable disables writes too so check both bits here. */
-	return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
+	return !(pkru & (PKR_AD_KEY(pkey) | PKR_WD_KEY(pkey)));
 }
 
 static inline u32 read_pkru(void)
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index ca2e20b18645..75437aa8fc56 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -200,16 +200,14 @@ __setup("init_pkru=", setup_init_pkru);
  */
 u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
 {
-	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
-
 	/*  Mask out old bit values */
-	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
+	pk_reg &= ~PKR_PKEY_MASK(pkey);
 
 	/*  Or in new values */
 	if (flags & PKEY_DISABLE_ACCESS)
-		pk_reg |= PKR_AD_BIT << pkey_shift;
+		pk_reg |= PKR_AD_KEY(pkey);
 	if (flags & PKEY_DISABLE_WRITE)
-		pk_reg |= PKR_WD_BIT << pkey_shift;
+		pk_reg |= PKR_WD_KEY(pkey);
 
 	return pk_reg;
 }
-- 
2.28.0.rc0.12.gb6a658bd00c9


* [PATCH V7 04/18] x86/pks: Add PKS defines and Kconfig options
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (2 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 05/18] x86/pks: Add PKS setup code ira.weiny
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Andy Lutomirski, H. Peter Anvin,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
specific, manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Define the PKS CPU feature bits.

Add the Kconfig option ARCH_HAS_SUPERVISOR_PKEYS to indicate to kernel
consumers that an architecture supports supervisor pkeys.

Introduce ARCH_ENABLE_SUPERVISOR_PKEYS to allow architectures to avoid
PKS code unless a kernel consumer is configured.

ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first kernel use case
sets it.
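
As a sketch (the consumer symbol below is hypothetical), a use case's
Kconfig entry selects the option where the architecture provides
supervisor pkeys:

	config EXAMPLE_PKS_USER			# hypothetical consumer
		bool "Example PKS consumer"
		depends on ARCH_HAS_SUPERVISOR_PKEYS
		select ARCH_ENABLE_SUPERVISOR_PKEYS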

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/Kconfig                            | 1 +
 arch/x86/include/asm/cpufeatures.h          | 1 +
 arch/x86/include/asm/disabled-features.h    | 8 +++++++-
 arch/x86/include/uapi/asm/processor-flags.h | 2 ++
 mm/Kconfig                                  | 4 ++++
 5 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 49270655e827..d0a7d19aa245 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1837,6 +1837,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 	depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select ARCH_HAS_PKEYS
+	select ARCH_HAS_SUPERVISOR_PKEYS
 	help
 	  Memory Protection Keys provides a mechanism for enforcing
 	  page-based protections, but without requiring modification of the
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..80c357f638fd 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -365,6 +365,7 @@
 #define X86_FEATURE_MOVDIR64B		(16*32+28) /* MOVDIR64B instruction */
 #define X86_FEATURE_ENQCMD		(16*32+29) /* ENQCMD and ENQCMDS instructions */
 #define X86_FEATURE_SGX_LC		(16*32+30) /* Software Guard Extensions Launch Control */
+#define X86_FEATURE_PKS			(16*32+31) /* Protection Keys for Supervisor pages */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV	(17*32+ 0) /* MCA overflow recovery support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8f28fafa98b3..66fdad8f3941 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,6 +44,12 @@
 # define DISABLE_OSPKE		(1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+# define DISABLE_PKS		0
+#else
+# define DISABLE_PKS		(1<<(X86_FEATURE_PKS & 31))
+#endif
+
 #ifdef CONFIG_X86_5LEVEL
 # define DISABLE_LA57	0
 #else
@@ -85,7 +91,7 @@
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD)
+			 DISABLE_ENQCMD|DISABLE_PKS)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK19	0
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT		24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS		_BITUL(X86_CR4_PKS_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/mm/Kconfig b/mm/Kconfig
index 40a9bfcd5062..e0d29c655ade 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -818,6 +818,10 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config ARCH_HAS_SUPERVISOR_PKEYS
+	bool
+config ARCH_ENABLE_SUPERVISOR_PKEYS
+	bool
 
 config PERCPU_STATS
 	bool "Collect percpu memory statistics"
-- 
2.28.0.rc0.12.gb6a658bd00c9


* [PATCH V7 05/18] x86/pks: Add PKS setup code
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (3 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 04/18] x86/pks: Add PKS defines and Kconfig options ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-11-25 15:15   ` Thomas Gleixner
  2021-08-04  4:32 ` [PATCH V7 06/18] x86/fault: Adjust WARN_ON for PKey fault ira.weiny
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Peter Zijlstra, Fenghua Yu, Hansen, Dave,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	H. Peter Anvin, Rick Edgecombe, x86, linux-kernel, nvdimm,
	linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
specific, manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Add setup code and the lowest level of PKS MSR write support.  Pkey
values are allocated statically via the pks_pkey_consumers enumeration.
create_initial_pkrs_value() builds the initial protection values for
each pkey.  Users who need a default value other than Access Disabled
should update consumer_defaults[].

The PKRS value is cached per-cpu to avoid the overhead of the MSR write
if the value has not changed.

That said, it should be noted that the underlying WRMSR(MSR_IA32_PKRS)
is not serializing but still maintains ordering properties similar to
WRPKRU.  The current SDM section on PKRS needs updating but should be
the same as that of WRPKRU.  So to quote from the WRPKRU text:

	WRPKRU will never execute transiently. Memory accesses affected
	by PKRU register will not execute (even transiently) until all
	prior executions of WRPKRU have completed execution and updated
	the PKRU register.

write_pkrs() contributed by Peter Zijlstra.
create_initial_pkrs_value() contributed by Dave Hansen.

setup_pks() is an internal x86 function call.  Introduce asm/pks.h to
declare functions and internal structures such as this.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Co-developed-by: "Hansen, Dave" <dave.hansen@intel.com>
Signed-off-by: "Hansen, Dave" <dave.hansen@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7
	Create a dynamic pkrs_initial_value in early init code.
	Clean up comments
	Add comment to macro guard
---
 arch/x86/include/asm/msr-index.h    |  1 +
 arch/x86/include/asm/pkeys_common.h |  4 ++
 arch/x86/include/asm/pks.h          | 15 ++++++
 arch/x86/kernel/cpu/common.c        |  2 +
 arch/x86/mm/pkeys.c                 | 75 +++++++++++++++++++++++++++++
 include/linux/pkeys.h               |  8 +++
 6 files changed, 105 insertions(+)
 create mode 100644 arch/x86/include/asm/pks.h

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..c986eb1f36a9 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -767,6 +767,7 @@
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 
+#define MSR_IA32_PKRS			0x000006E1
 
 #define MSR_TSX_FORCE_ABORT		0x0000010F
 
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 8a3c6d2e6a8a..079a8be9686b 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -2,14 +2,18 @@
 #ifndef _ASM_X86_PKEYS_COMMON_H
 #define _ASM_X86_PKEYS_COMMON_H
 
+#define PKR_RW_BIT 0x0
 #define PKR_AD_BIT 0x1
 #define PKR_WD_BIT 0x2
 #define PKR_BITS_PER_PKEY 2
 
+#define PKS_NUM_PKEYS 16
+
 #define PKR_PKEY_SHIFT(pkey) (pkey * PKR_BITS_PER_PKEY)
 #define PKR_PKEY_MASK(pkey)  (((1 << PKR_BITS_PER_PKEY) - 1) << PKR_PKEY_SHIFT(pkey))
 
 #define PKR_AD_KEY(pkey)     (PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
 #define PKR_WD_KEY(pkey)     (PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
+#define PKR_VALUE(pkey, val) (val << PKR_PKEY_SHIFT(pkey))
 
 #endif /*_ASM_X86_PKEYS_COMMON_H */
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
new file mode 100644
index 000000000000..5d7067ada8fb
--- /dev/null
+++ b/arch/x86/include/asm/pks.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKS_H
+#define _ASM_X86_PKS_H
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+void setup_pks(void);
+
+#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+static inline void setup_pks(void) { }
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+#endif /* _ASM_X86_PKS_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 64b805bd6a54..abb32bd32f53 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -59,6 +59,7 @@
 #include <asm/cpu_device_id.h>
 #include <asm/uv/uv.h>
 #include <asm/sigframe.h>
+#include <asm/pks.h>
 
 #include "cpu.h"
 
@@ -1590,6 +1591,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
 	x86_init_rdrand(c);
 	setup_pku(c);
+	setup_pks();
 
 	/*
 	 * Clear/Set all flags overridden by options, need do it
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 75437aa8fc56..fbffbced81b5 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -211,3 +211,78 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
 
 	return pk_reg;
 }
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+static DEFINE_PER_CPU(u32, pkrs_cache);
+u32 __read_mostly pkrs_init_value;
+
+/*
+ * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can
+ * be checked quickly.
+ *
+ * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
+ * serializing but still maintains ordering properties similar to WRPKRU.
+ * The current SDM section on PKRS needs updating but should be the same as
+ * that of WRPKRU.  So to quote from the WRPKRU text:
+ *
+ *     WRPKRU will never execute transiently. Memory accesses
+ *     affected by PKRU register will not execute (even transiently)
+ *     until all prior executions of WRPKRU have completed execution
+ *     and updated the PKRU register.
+ */
+void write_pkrs(u32 new_pkrs)
+{
+	u32 *pkrs;
+
+	if (!static_cpu_has(X86_FEATURE_PKS))
+		return;
+
+	pkrs = get_cpu_ptr(&pkrs_cache);
+	if (*pkrs != new_pkrs) {
+		*pkrs = new_pkrs;
+		wrmsrl(MSR_IA32_PKRS, new_pkrs);
+	}
+	put_cpu_ptr(pkrs);
+}
+
+/*
+ * Build a default PKRS value from the array specified by consumers
+ */
+static int __init create_initial_pkrs_value(void)
+{
+	/* All users get Access Disabled unless changed below */
+	u8 consumer_defaults[PKS_NUM_PKEYS] = {
+		[0 ... PKS_NUM_PKEYS-1] = PKR_AD_BIT
+	};
+	int i;
+
+	consumer_defaults[PKS_KEY_DEFAULT] = PKR_RW_BIT;
+
+	/* Ensure the number of consumers is less than the number of keys */
+	BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS);
+
+	pkrs_init_value = 0;
+
+	/* Fill the defaults for the consumers */
+	for (i = 0; i < PKS_NUM_PKEYS; i++)
+		pkrs_init_value |= PKR_VALUE(i, consumer_defaults[i]);
+
+	return 0;
+}
+early_initcall(create_initial_pkrs_value);
+
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ * Configure PKS if the CPU supports the feature.
+ */
+void setup_pks(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	write_pkrs(pkrs_init_value);
+	cr4_set_bits(X86_CR4_PKS);
+}
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 6beb26b7151d..580238388f0c 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -46,4 +46,12 @@ static inline bool arch_pkeys_enabled(void)
 
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+enum pks_pkey_consumers {
+	PKS_KEY_DEFAULT = 0, /* Must be 0 for default PTE values */
+	PKS_KEY_NR_CONSUMERS
+};
+extern u32 pkrs_init_value;
+#endif
+
 #endif /* _LINUX_PKEYS_H */
-- 
2.28.0.rc0.12.gb6a658bd00c9


* [PATCH V7 06/18] x86/fault: Adjust WARN_ON for PKey fault
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (4 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 05/18] x86/pks: Add PKS setup code ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 07/18] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Andy Lutomirski, H. Peter Anvin,
	Fenghua Yu, Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Previously if a Protection key fault occurred it indicated something
very wrong because user page mappings are not supposed to be in the
kernel address space.

Now PKey faults may happen on kernel mappings if the feature is enabled.

Remove the warning in the fault path and allow the oops to occur without
extra debugging if PKS is enabled.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/mm/fault.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b2eefdefc108..e133c0ed72a0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1141,11 +1141,15 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		   unsigned long address)
 {
 	/*
-	 * Protection keys exceptions only happen on user pages.  We
-	 * have no user pages in the kernel portion of the address
-	 * space, so do not expect them here.
+	 * X86_PF_PK (Protection key exceptions) may occur on kernel addresses
+	 * when PKS (PKeys Supervisor) is enabled.
+	 *
+	 * However, if PKS is not enabled WARN if this exception is seen
+	 * because there are no user pages in the kernel portion of the address
+	 * space.
 	 */
-	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
+	WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) &&
+		     (hw_error_code & X86_PF_PK));
 
 #ifdef CONFIG_X86_32
 	/*
-- 
2.28.0.rc0.12.gb6a658bd00c9


* [PATCH V7 07/18] x86/pks: Preserve the PKRS MSR on context switch
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (5 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 06/18] x86/fault: Adjust WARN_ON for PKey fault ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-11-25 15:25   ` Thomas Gleixner
  2021-08-04  4:32 ` [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Andy Lutomirski, H. Peter Anvin,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is defined as a per-logical-processor register.  This
isolates memory access by logical CPU.  Unfortunately, the MSR is not
managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
context switch.

Define a saved PKRS value in the task struct.  Initialize all tasks with
the pkrs_init_value and call pkrs_write_current() to set the MSR to the
saved task value on schedule in.

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7
	Move definitions from asm/processor.h to asm/pks.h
	s/INIT_PKRS_VALUE/pkrs_init_value
	Change pks_init_task()/pks_sched_in() to functions
	s/pks_sched_in/pks_write_current to be used more generically
	later in the series
---
 arch/x86/include/asm/pks.h       |  4 ++++
 arch/x86/include/asm/processor.h | 19 ++++++++++++++++++-
 arch/x86/kernel/process.c        |  3 +++
 arch/x86/kernel/process_64.c     |  3 +++
 arch/x86/mm/pkeys.c              | 16 ++++++++++++++++
 5 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index 5d7067ada8fb..e7727086cec2 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -5,10 +5,14 @@
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
 void setup_pks(void);
+void pkrs_write_current(void);
+void pks_init_task(struct task_struct *task);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void setup_pks(void) { }
+static inline void pkrs_write_current(void) { }
+static inline void pks_init_task(struct task_struct *task) { }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f3020c54e2cb..a6cb7d152c62 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -502,6 +502,12 @@ struct thread_struct {
 	unsigned long		cr2;
 	unsigned long		trap_nr;
 	unsigned long		error_code;
+
+#ifdef	CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+	/* Saved Protection key register for supervisor mappings */
+	u32			saved_pkrs;
+#endif
+
 #ifdef CONFIG_VM86
 	/* Virtual 86 mode info */
 	struct vm86		*vm86;
@@ -768,7 +774,18 @@ static inline void spin_lock_prefetch(const void *x)
 #define KSTK_ESP(task)		(task_pt_regs(task)->sp)
 
 #else
-#define INIT_THREAD { }
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+/*
+ * Early task gets full permissions, the restrictive value is set in
+ * pks_init_task()
+ */
+#define INIT_THREAD  {					\
+	.saved_pkrs = 0,				\
+}
+#else
+#define INIT_THREAD  { }
+#endif
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1d9463e3096b..c792ac5f33a2 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -43,6 +43,7 @@
 #include <asm/io_bitmap.h>
 #include <asm/proto.h>
 #include <asm/frame.h>
+#include <asm/pks.h>
 
 #include "process.h"
 
@@ -223,6 +224,8 @@ void flush_thread(void)
 
 	fpu_flush_thread();
 	pkru_flush_thread();
+
+	pks_init_task(tsk);
 }
 
 void disable_TSC(void)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ec0d836a13b1..8bd1f039e5bf 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -59,6 +59,7 @@
 /* Not included via unistd.h */
 #include <asm/unistd_32_ia32.h>
 #endif
+#include <asm/pks.h>
 
 #include "process.h"
 
@@ -658,6 +659,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	/* Load the Intel cache allocation PQR MSR. */
 	resctrl_sched_in();
 
+	pkrs_write_current();
+
 	return prev_p;
 }
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index fbffbced81b5..eca01dc8d7ac 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -284,5 +284,21 @@ void setup_pks(void)
 	write_pkrs(pkrs_init_value);
 	cr4_set_bits(X86_CR4_PKS);
 }
+;
+
+/*
+ * PKRS is only temporarily changed during specific code paths.  Only a
+ * preemption during these windows away from the default value would
+ * require updating the MSR.  write_pkrs() handles this optimization.
+ */
+void pkrs_write_current(void)
+{
+	write_pkrs(current->thread.saved_pkrs);
+}
+
+void pks_init_task(struct task_struct *task)
+{
+	task->thread.saved_pkrs = pkrs_init_value;
+}
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
-- 
2.28.0.rc0.12.gb6a658bd00c9


* [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (6 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 07/18] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-11-13  0:50   ` Ira Weiny
  2021-11-25 14:12   ` Thomas Gleixner
  2021-08-04  4:32 ` [PATCH V7 09/18] x86/pks: Add PKS kernel API ira.weiny
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Peter Zijlstra, Thomas Gleixner, Andy Lutomirski,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is not managed by XSAVE.  It is preserved through a context
switch but this support leaves exception handling code open to memory
accesses during exceptions.

Two possible places for preserving this state were considered:
irqentry_state_t and pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]  However, Andy
came up with a way to hide additional values on the stack which can be
accessed as "extended_pt_regs".[3]  This method allows any place which
has a struct pt_regs to get access to the extra information; no extra
information is added to irq_state; and pt_regs is left intact for
compatibility with outside tools like BPF.

To simplify, the assembly code only adds space on the stack.  The
setting or use of any needed values is left to the C code.  While some
entry points may not use this space, it is still added wherever pt_regs
is passed to the C code, for consistency.

Each nested exception gets another copy of this extended space allowing
for any number of levels of exception handling.

In the assembly, a macro is defined to allow a central place to add
space for other uses should the need arise.

Finally export pkrs_{save|restore}_irq to the common code to allow
it to preserve the current task's PKRS in the new extended pt_regs if
enabled.

Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
aided in the development of the patch.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t
[3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7:
	Rebased to 5.14 entry code
	declare write_pkrs() in pks.h
	s/INIT_PKRS_VALUE/pkrs_init_value
	Remove unnecessary INIT_PKRS_VALUE def
	s/pkrs_save_set_irq/pkrs_save_irq/
		The initial value for exceptions is best managed
		completely within the pkey code.
---
 arch/x86/entry/calling.h               | 26 +++++++++++++
 arch/x86/entry/common.c                | 54 ++++++++++++++++++++++++++
 arch/x86/entry/entry_64.S              | 22 ++++++-----
 arch/x86/entry/entry_64_compat.S       |  6 +--
 arch/x86/include/asm/pks.h             | 18 +++++++++
 arch/x86/include/asm/processor-flags.h |  2 +
 arch/x86/kernel/head_64.S              |  7 ++--
 arch/x86/mm/fault.c                    |  3 ++
 include/linux/pkeys.h                  | 11 +++++-
 kernel/entry/common.c                  | 14 ++++++-
 10 files changed, 143 insertions(+), 20 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index a4c061fb7c6e..a2f94677c3fd 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -63,6 +63,32 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
+/*
+ * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
+ * @cfunc:		C function to be called
+ *
+ * This will ensure that extended_ptregs is added and removed as needed during
+ * a call into C code.
+ */
+.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+	/* add space for extended_pt_regs */
+	subq    $EXTENDED_PT_REGS_SIZE, %rsp
+#endif
+	.if \annotate_retpoline_safe == 1
+		ANNOTATE_RETPOLINE_SAFE
+	.endif
+	call	\cfunc
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+	/* remove space for extended_pt_regs */
+	addq    $EXTENDED_PT_REGS_SIZE, %rsp
+#endif
+.endm
+
+.macro call_ext_ptregs cfunc
+	__call_ext_ptregs \cfunc, annotate_retpoline_safe=0
+.endm
+
 .macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..a0d1d5519dba 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
 #include <linux/nospec.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
+#include <linux/pkeys.h>
 
 #ifdef CONFIG_XEN_PV
 #include <xen/xen-ops.h>
@@ -34,6 +35,7 @@
 #include <asm/io_bitmap.h>
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
+#include <asm/pks.h>
 
 #ifdef CONFIG_X86_64
 
@@ -252,6 +254,56 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code)
+{
+	struct extended_pt_regs *ept_regs = extended_pt_regs(regs);
+
+	if (cpu_feature_enabled(X86_FEATURE_PKS) && (error_code & X86_PF_PK))
+		pr_alert("PKRS: 0x%x\n", ept_regs->thread_pkrs);
+}
+
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory save the current
+ * thread value and set the PKRS value to be used during the exception.
+ */
+void pkrs_save_irq(struct pt_regs *regs)
+{
+	struct extended_pt_regs *ept_regs;
+
+	BUILD_BUG_ON(sizeof(struct extended_pt_regs)
+			!= EXTENDED_PT_REGS_SIZE
+				+ sizeof(struct pt_regs));
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	ept_regs = extended_pt_regs(regs);
+	ept_regs->thread_pkrs = current->thread.saved_pkrs;
+	write_pkrs(pkrs_init_value);
+}
+
+void pkrs_restore_irq(struct pt_regs *regs)
+{
+	struct extended_pt_regs *ept_regs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	ept_regs = extended_pt_regs(regs);
+	write_pkrs(ept_regs->thread_pkrs);
+	current->thread.saved_pkrs = ept_regs->thread_pkrs;
+}
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+		/* Normally called by irqentry_exit, restore pkrs here */
+		pkrs_restore_irq(regs);
 		irqentry_exit_cond_resched();
 		instrumentation_end();
 		restore_inhcall(inhcall);
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e38a4cf795d9..1c390975a3de 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -332,7 +332,7 @@ SYM_CODE_END(ret_from_fork)
 		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
 	.endif
 
-	call	\cfunc
+	call_ext_ptregs \cfunc
 
 	jmp	error_return
 .endm
@@ -435,7 +435,7 @@ SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
-	call	\cfunc
+	call_ext_ptregs \cfunc
 
 	jmp	paranoid_exit
 
@@ -496,7 +496,7 @@ SYM_CODE_START(\asmsym)
 	 * stack.
 	 */
 	movq	%rsp, %rdi		/* pt_regs pointer */
-	call	vc_switch_off_ist
+	call_ext_ptregs vc_switch_off_ist
 	movq	%rax, %rsp		/* Switch to new stack */
 
 	UNWIND_HINT_REGS
@@ -507,7 +507,7 @@ SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
-	call	kernel_\cfunc
+	call_ext_ptregs kernel_\cfunc
 
 	/*
 	 * No need to switch back to the IST stack. The current stack is either
@@ -542,7 +542,7 @@ SYM_CODE_START(\asmsym)
 	movq	%rsp, %rdi		/* pt_regs pointer into first argument */
 	movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
 	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
-	call	\cfunc
+	call_ext_ptregs \cfunc
 
 	jmp	paranoid_exit
 
@@ -781,7 +781,7 @@ SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
 	UNWIND_HINT_REGS
 
-	call	xen_pv_evtchn_do_upcall
+	call_ext_ptregs xen_pv_evtchn_do_upcall
 
 	jmp	error_return
 SYM_CODE_END(exc_xen_hypervisor_callback)
@@ -987,7 +987,7 @@ SYM_CODE_START_LOCAL(error_entry)
 	/* Put us onto the real thread stack. */
 	popq	%r12				/* save return addr in %12 */
 	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
-	call	sync_regs
+	call_ext_ptregs sync_regs
 	movq	%rax, %rsp			/* switch stack */
 	ENCODE_FRAME_POINTER
 	pushq	%r12
@@ -1042,7 +1042,7 @@ SYM_CODE_START_LOCAL(error_entry)
 	 * as if we faulted immediately after IRET.
 	 */
 	mov	%rsp, %rdi
-	call	fixup_bad_iret
+	call_ext_ptregs fixup_bad_iret
 	mov	%rax, %rsp
 	jmp	.Lerror_entry_from_usermode_after_swapgs
 SYM_CODE_END(error_entry)
@@ -1148,7 +1148,7 @@ SYM_CODE_START(asm_exc_nmi)
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	exc_nmi
+	call_ext_ptregs exc_nmi
 
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
@@ -1184,6 +1184,8 @@ SYM_CODE_START(asm_exc_nmi)
 	 * +---------------------------------------------------------+
 	 * | pt_regs                                                 |
 	 * +---------------------------------------------------------+
+	 * | (Optionally) extended_pt_regs                           |
+	 * +---------------------------------------------------------+
 	 *
 	 * The "original" frame is used by hardware.  Before re-enabling
 	 * NMIs, we need to be done with it, and we need to leave enough
@@ -1360,7 +1362,7 @@ end_repeat_nmi:
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	exc_nmi
+	call_ext_ptregs exc_nmi
 
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 0051cf5c792d..53254d29d5c7 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -136,7 +136,7 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
 .Lsysenter_flags_fixed:
 
 	movq	%rsp, %rdi
-	call	do_SYSENTER_32
+	call_ext_ptregs do_SYSENTER_32
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -253,7 +253,7 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 	UNWIND_HINT_REGS
 
 	movq	%rsp, %rdi
-	call	do_fast_syscall_32
+	call_ext_ptregs do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -410,6 +410,6 @@ SYM_CODE_START(entry_INT80_compat)
 	cld
 
 	movq	%rsp, %rdi
-	call	do_int80_syscall_32
+	call_ext_ptregs do_int80_syscall_32
 	jmp	swapgs_restore_regs_and_return_to_usermode
 SYM_CODE_END(entry_INT80_compat)
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index e7727086cec2..76960ec71b4b 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -4,15 +4,33 @@
 
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
+struct extended_pt_regs {
+	u32 thread_pkrs;
+	/* Keep stack 8 byte aligned */
+	u32 pad;
+	struct pt_regs pt_regs;
+};
+
 void setup_pks(void);
 void pkrs_write_current(void);
 void pks_init_task(struct task_struct *task);
+void write_pkrs(u32 new_pkrs);
+
+static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs)
+{
+	return container_of(regs, struct extended_pt_regs, pt_regs);
+}
+
+void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void setup_pks(void) { }
 static inline void pkrs_write_current(void) { }
 static inline void pks_init_task(struct task_struct *task) { }
+static inline void write_pkrs(u32 new_pkrs) { }
+static inline void show_extended_regs_oops(struct pt_regs *regs,
+					   unsigned long error_code) { }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 02c2cbda4a74..4a41fc4cf028 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -53,4 +53,6 @@
 # define X86_CR3_PTI_PCID_USER_BIT	11
 #endif
 
+#define EXTENDED_PT_REGS_SIZE 8
+
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d8b3ebd2bb85..90e76178b6b4 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -319,8 +319,7 @@ SYM_CODE_START_NOALIGN(vc_boot_ghcb)
 	movq    %rsp, %rdi
 	movq	ORIG_RAX(%rsp), %rsi
 	movq	initial_vc_handler(%rip), %rax
-	ANNOTATE_RETPOLINE_SAFE
-	call	*%rax
+	__call_ext_ptregs *%rax, annotate_retpoline_safe=1
 
 	/* Unwind pt_regs */
 	POP_REGS
@@ -397,7 +396,7 @@ SYM_CODE_START_LOCAL(early_idt_handler_common)
 	UNWIND_HINT_REGS
 
 	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
-	call do_early_exception
+	call_ext_ptregs do_early_exception
 
 	decl early_recursion_flag(%rip)
 	jmp restore_regs_and_return_to_kernel
@@ -421,7 +420,7 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
 	/* Call C handler */
 	movq    %rsp, %rdi
 	movq	ORIG_RAX(%rsp), %rsi
-	call    do_vc_no_ghcb
+	call_ext_ptregs do_vc_no_ghcb
 
 	/* Unwind pt_regs */
 	POP_REGS
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e133c0ed72a0..a4ce7cef0260 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -32,6 +32,7 @@
 #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
 #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
+#include <asm/pks.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -547,6 +548,8 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 		 (error_code & X86_PF_PK)    ? "protection keys violation" :
 					       "permissions violation");
 
+	show_extended_regs_oops(regs, error_code);
+
 	if (!(error_code & X86_PF_USER) && user_mode(regs)) {
 		struct desc_ptr idt, gdt;
 		u16 ldtr, tr;
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 580238388f0c..76eb19a37942 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -52,6 +52,15 @@ enum pks_pkey_consumers {
 	PKS_KEY_NR_CONSUMERS
 };
 extern u32 pkrs_init_value;
-#endif
+
+void pkrs_save_irq(struct pt_regs *regs);
+void pkrs_restore_irq(struct pt_regs *regs);
+
+#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+static inline void pkrs_save_irq(struct pt_regs *regs) { }
+static inline void pkrs_restore_irq(struct pt_regs *regs) { }
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bf16395b9e13..aa0b1e8dd742 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/pkeys.h>
 
 #include "common.h"
 
@@ -364,7 +365,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 		instrumentation_end();
 
 		ret.exit_rcu = true;
-		return ret;
+		goto done;
 	}
 
 	/*
@@ -379,6 +380,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	trace_hardirqs_off_finish();
 	instrumentation_end();
 
+done:
+	pkrs_save_irq(regs);
 	return ret;
 }
 
@@ -404,7 +407,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
 		irqentry_exit_to_user_mode(regs);
-	} else if (!regs_irqs_disabled(regs)) {
+		return;
+	}
+
+	pkrs_restore_irq(regs);
+
+	if (!regs_irqs_disabled(regs)) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
 		 * carefully and needs the same ordering of lockdep/tracing
@@ -458,11 +466,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 	ftrace_nmi_enter();
 	instrumentation_end();
 
+	pkrs_save_irq(regs);
 	return irq_state;
 }
 
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
+	pkrs_restore_irq(regs);
 	instrumentation_begin();
 	ftrace_nmi_exit();
 	if (irq_state.lockdep) {
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 09/18] x86/pks: Add PKS kernel API
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (7 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 10/18] x86/pks: Introduce pks_abandon_protections() ira.weiny
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Fenghua Yu, Sean Christopherson, Ira Weiny, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Peter Zijlstra, Andy Lutomirski,
	H. Peter Anvin, Rick Edgecombe, x86, linux-kernel, nvdimm,
	linux-mm

From: Fenghua Yu <fenghua.yu@intel.com>

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.  Violating those
protections creates a fault which by default will oops.

Each kernel user defines a PKS_KEY_* key value which identifies a PKS
domain to be used exclusively by that kernel user.  This API is then
used to control which pages are part of that domain and the current
threads protection of those pages.

4 new functions are added: pks_enabled(), pks_mk_noaccess(),
pks_mk_readonly(), and pks_mk_readwrite().  2 new macros are added:
PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).

Update the protection key documentation to cover pkeys on supervisor
pages.  This includes how to reserve a key and set the default
permissions on that key.
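
As a purely illustrative sketch (not part of this patch), a consumer that had
reserved a hypothetical key PKS_KEY_MY_FEATURE (as in the documentation example
below) might use this API roughly as follows:

        #include <linux/pkeys.h>
        #include <linux/string.h>
        #include <linux/vmalloc.h>

        /* Map pages into the PKS domain by setting the pkey in the PTEs */
        static void *my_feature_map_protected(unsigned long size)
        {
                return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
                                            GFP_KERNEL,
                                            PAGE_KERNEL_PKEY(PKS_KEY_MY_FEATURE),
                                            0, NUMA_NO_NODE,
                                            __builtin_return_address(0));
        }

        /* Temporarily open the domain for the current thread only */
        static void my_feature_write(void *dst, const void *src, size_t len)
        {
                if (!pks_enabled()) {           /* hardware without PKS */
                        memcpy(dst, src, len);
                        return;
                }
                pks_mk_readwrite(PKS_KEY_MY_FEATURE);
                memcpy(dst, src, len);
                pks_mk_noaccess(PKS_KEY_MY_FEATURE);
        }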

Cc: Sean Christopherson <seanjc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>

---
Change for V7
	Add pks_enabled() to allow users more dynamic choice on PKS use.
	Update documentation for key allocation
	Remove dynamic key allocation, keys will be allocated statically
	now.
	Add expected CPU generation support to documentation
---
 Documentation/core-api/protection-keys.rst | 121 ++++++++++++++++++---
 arch/x86/include/asm/pgtable_types.h       |  12 ++
 arch/x86/mm/pkeys.c                        |  66 +++++++++++
 include/linux/pgtable.h                    |   4 +
 include/linux/pkeys.h                      |  14 +++
 5 files changed, 199 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..6420a60666fc 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,30 @@
 Memory Protection Keys
 ======================
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.
 
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later.  It will also be available in future
+non-server Intel parts and future AMD processors.
 
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+Protection Keys for Supervisor pages (PKS) has been documented in the SDM
+since May 2020.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.  User and Supervisor pages are
+treated separately.
+
+Protections for each page are controlled with per-CPU registers for each type
+of page (User and Supervisor).  Each of these 32-bit registers stores two
+separate bits (Access Disable and Write Disable) for each key.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+For Userspace the register is user-accessible (rdpkru/wrpkru).  For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being a CPU register, pkeys are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register.  The feature is only available in 64-bit mode,
@@ -30,8 +35,11 @@ even though there is theoretically space in the PAE PTEs.  These
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-========
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+============================
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +106,80 @@ with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==========================
+
+Similar to user space pkeys, supervisor pkeys allow additional protections to
+be defined for supervisor mappings.  Unlike user space pkeys, violations of
+these protections result in a kernel oops.
+
+Supervisor Memory Protection Keys (PKS) is a feature which is found on Intel's
+Sapphire Rapids (and later) "Scalable Processor" Server CPUs.  It will also be
+available in future non-server Intel parts.
+
+There is also some support in qemu: https://www.qemu.org/2021/04/30/qemu-6-0-0/
+
+Kernel users intending to use PKS support should depend on
+ARCH_HAS_SUPERVISOR_PKEYS, and add their config to ARCH_ENABLE_SUPERVISOR_PKEYS
+to turn on this support within the core.
+
+Users reserve a key value by adding an entry to the enum pks_pkey_consumers and
+defining the initial protections in the consumer_defaults[] array.
+
+For example, a key for 'MY_FEATURE' with a default of Write Disabled would be
+configured as follows.
+
+::
+
+        enum pks_pkey_consumers
+        {
+	        PKS_KEY_DEFAULT,
+	        PKS_KEY_MY_FEATURE,
+	        PKS_KEY_NR_CONSUMERS
+        }
+
+        ...
+        consumer_defaults[PKS_KEY_DEFAULT]     = 0;
+        consumer_defaults[PKS_KEY_MY_FEATURE]  = PKR_DISABLE_WRITE;
+        ...
+
+The following interface is used to manipulate the 'protection domain' defined
+by a pkey within the kernel.  Setting a pkey value in a supervisor PTE adds
+this additional protection to the page.
+
+::
+
+        #define PAGE_KERNEL_PKEY(pkey)
+        #define _PAGE_PKEY(pkey)
+        bool pks_enabled(void);
+        void pks_mk_noaccess(int pkey);
+        void pks_mk_readonly(int pkey);
+        void pks_mk_readwrite(int pkey);
+
+pks_enabled() allows users to know if PKS is configured and available on the
+current running system.
+
+Kernel users must set the pkey in the page table entries for the mappings they
+want to protect.  This can be done with PAGE_KERNEL_PKEY() or _PAGE_PKEY().
+
+The pks_mk*() family of calls allows individual threads to change the
+protections for the domain identified by the pkey parameter.  3 states are
+available: pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which
+set the access to none, read, and read/write respectively.
+
+The interface sets (Access Disabled (AD=1)) for all keys not in use.
+
+It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing
+but still maintains ordering properties similar to WRPKRU.  Thus it is safe to
+immediately use a mapping when the pks_mk*() functions return.
+
+Older versions of the SDM on PKRS may be wrong with regard to this
+serialization.  The text should be the same as that of WRPKRU.  From the WRPKRU
+text:
+
+	WRPKRU will never execute transiently. Memory accesses
+	affected by PKRU register will not execute (even transiently)
+	until all prior executions of WRPKRU have completed execution
+	and updated the PKRU register.
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..3f866e730456 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -71,6 +71,12 @@
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0)
+#else
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
 #else
@@ -226,6 +232,12 @@ enum page_cache_mode {
 #define PAGE_KERNEL_IO		__pgprot_mask(__PAGE_KERNEL_IO)
 #define PAGE_KERNEL_IO_NOCACHE	__pgprot_mask(__PAGE_KERNEL_IO_NOCACHE)
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define PAGE_KERNEL_PKEY(pkey)	__pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey))
+#else
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 #endif	/* __ASSEMBLY__ */
 
 /*         xwr */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index eca01dc8d7ac..146a665d1bf3 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -3,6 +3,9 @@
  * Intel Memory Protection Keys management
  * Copyright (c) 2015, Intel Corporation.
  */
+#undef pr_fmt
+#define pr_fmt(fmt) "x86/pkeys: " fmt
+
 #include <linux/debugfs.h>		/* debugfs_create_u32()		*/
 #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
 #include <linux/pkeys.h>                /* PKEY_*                       */
@@ -10,6 +13,7 @@
 
 #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
 #include <asm/mmu_context.h>            /* vma_pkey()                   */
+#include <asm/pks.h>
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
@@ -301,4 +305,66 @@ void pks_init_task(struct task_struct *task)
 	task->thread.saved_pkrs = pkrs_init_value;
 }
 
+bool pks_enabled(void)
+{
+	return cpu_feature_enabled(X86_FEATURE_PKS);
+}
+
+/*
+ * Do not call this directly, see pks_mk*() below.
+ *
+ * @pkey: Key for the domain to change
+ * @protection: protection bits to be used
+ *
+ * Protection utilizes the same protection bits specified for User pkeys
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ *
+ */
+static inline void pks_update_protection(int pkey, unsigned long protection)
+{
+	current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
+						     pkey, protection);
+	pkrs_write_current();
+}
+
+/**
+ * pks_mk_noaccess() - Disable all access to the domain
+ * @pkey: the pkey for which the access should change.
+ *
+ * Disable all access to the domain specified by pkey.  This is not a global
+ * update and only affects the current running thread.
+ */
+void pks_mk_noaccess(int pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
+}
+EXPORT_SYMBOL_GPL(pks_mk_noaccess);
+
+/**
+ * pks_mk_readonly() - Make the domain Read only
+ * @pkey: the pkey for which the access should change.
+ *
+ * Allow read access to the domain specified by pkey.  This is not a global
+ * update and only affects the current running thread.
+ */
+void pks_mk_readonly(int pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_WRITE);
+}
+EXPORT_SYMBOL_GPL(pks_mk_readonly);
+
+/**
+ * pks_mk_readwrite() - Make the domain Read/Write
+ * @pkey: the pkey for which the access should change.
+ *
+ * Allow all access, read and write, to the domain specified by pkey.  This is
+ * not a global update and only affects the current running thread.
+ */
+void pks_mk_readwrite(int pkey)
+{
+	pks_update_protection(pkey, 0);
+}
+EXPORT_SYMBOL_GPL(pks_mk_readwrite);
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index d147480cdefc..eba1a9f9d124 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1526,6 +1526,10 @@ static inline bool arch_has_pfn_modify_check(void)
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif
 
+#ifndef PAGE_KERNEL_PKEY
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 /*
  * Page Table Modification bits for pgtbl_mod_mask.
  *
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 76eb19a37942..b9919ed4d300 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -56,11 +56,25 @@ extern u32 pkrs_init_value;
 void pkrs_save_irq(struct pt_regs *regs);
 void pkrs_restore_irq(struct pt_regs *regs);
 
+bool pks_enabled(void);
+void pks_mk_noaccess(int pkey);
+void pks_mk_readonly(int pkey);
+void pks_mk_readwrite(int pkey);
+
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void pkrs_save_irq(struct pt_regs *regs) { }
 static inline void pkrs_restore_irq(struct pt_regs *regs) { }
 
+static inline bool pks_enabled(void)
+{
+	return false;
+}
+
+static inline void pks_mk_noaccess(int pkey) {}
+static inline void pks_mk_readonly(int pkey) {}
+static inline void pks_mk_readwrite(int pkey) {}
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 10/18] x86/pks: Introduce pks_abandon_protections()
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (8 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 09/18] x86/pks: Add PKS kernel API ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 11/18] x86/pks: Add PKS Test code ira.weiny
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Unanticipated access to PMEM by otherwise correct kernel code would be
very disruptive to an otherwise working system.  Such access could be
through valid uses such as kmap().  In this use case PMEM protections
will require the ability to abandon all protections of a pkey on all
threads system wide.

Introduce pks_abandon_protections() to allow a user to mask off the
protection values of a pkey.  The new mask filters through to all
threads of the system as they are scheduled in and, for threads already
running, overrides the value should they take a PKS fault.

Update pkrs_write_current(), pks_init_task(), and
pkrs_{save|restore}_irq() to account for pkrs_pkey_allowed_mask.

Add handle_abandoned_pks_value() to adjust any already running threads
which may fault on an abandoned pkey.
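
As a purely illustrative sketch (not part of this patch), a consumer could use
this relief valve roughly as follows; the key name and the detection helper are
hypothetical:

        /* Give up the PKS protections for this key, system wide */
        if (unexpected_access_to_protected_mapping())
                pks_abandon_protections(PKS_KEY_MY_FEATURE);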

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7
	New patch
	Significant internal review from Dan Williams and Rick Edgecombe
---
 Documentation/core-api/protection-keys.rst |  7 +++-
 arch/x86/entry/common.c                    |  6 ++-
 arch/x86/include/asm/pks.h                 |  5 +++
 arch/x86/mm/fault.c                        | 24 ++++++-----
 arch/x86/mm/pkeys.c                        | 49 ++++++++++++++++++++++
 include/linux/pkeys.h                      |  2 +
 6 files changed, 80 insertions(+), 13 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 6420a60666fc..202088634fa3 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -157,6 +157,7 @@ this additional protection to the page.
         void pks_mk_noaccess(int pkey);
         void pks_mk_readonly(int pkey);
         void pks_mk_readwrite(int pkey);
+        void pks_abandon_protections(int pkey);
 
 pks_enabled() allows users to know if PKS is configured and available on the
 current running system.
@@ -169,7 +170,11 @@ protections for the domain identified by the pkey parameter.  3 states are
 available: pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which
 set the access to none, read, and read/write respectively.
 
-The interface sets (Access Disabled (AD=1)) for all keys not in use.
+The interface sets Access Disabled for all keys not in use.  The
+pks_abandon_protections() call removes the protections for the specified key,
+leaving its pages fully accessible, thus abandoning the key's protections.
+There is no way to reverse this.  As such pks_abandon_protections() is intended
+to provide a 'relief valve' if the PKS protections should prove too restrictive.
 
 It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing
 but still maintains ordering properties similar to WRPKRU.  Thus it is safe to
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a0d1d5519dba..717091910ebc 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -37,6 +37,8 @@
 #include <asm/irq_stack.h>
 #include <asm/pks.h>
 
+extern u32 pkrs_pkey_allowed_mask;
+
 #ifdef CONFIG_X86_64
 
 static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
@@ -287,7 +289,7 @@ void pkrs_save_irq(struct pt_regs *regs)
 
 	ept_regs = extended_pt_regs(regs);
 	ept_regs->thread_pkrs = current->thread.saved_pkrs;
-	write_pkrs(pkrs_init_value);
+	write_pkrs(pkrs_init_value & pkrs_pkey_allowed_mask);
 }
 
 void pkrs_restore_irq(struct pt_regs *regs)
@@ -298,8 +300,8 @@ void pkrs_restore_irq(struct pt_regs *regs)
 		return;
 
 	ept_regs = extended_pt_regs(regs);
-	write_pkrs(ept_regs->thread_pkrs);
 	current->thread.saved_pkrs = ept_regs->thread_pkrs;
+	pkrs_write_current();
 }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index 76960ec71b4b..ed293ef4509e 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -22,6 +22,7 @@ static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs)
 }
 
 void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code);
+int handle_abandoned_pks_value(struct pt_regs *regs);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
@@ -31,6 +32,10 @@ static inline void pks_init_task(struct task_struct *task) { }
 static inline void write_pkrs(u32 new_pkrs) { }
 static inline void show_extended_regs_oops(struct pt_regs *regs,
 					   unsigned long error_code) { }
+static inline int handle_abandoned_pks_value(struct pt_regs *regs)
+{
+	return 0;
+}
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a4ce7cef0260..bf3353d8e011 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1143,16 +1143,20 @@ static void
 do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		   unsigned long address)
 {
-	/*
-	 * X86_PF_PK (Protection key exceptions) may occur on kernel addresses
-	 * when PKS (PKeys Supervisor) is enabled.
-	 *
-	 * However, if PKS is not enabled WARN if this exception is seen
-	 * because there are no user pages in the kernel portion of the address
-	 * space.
-	 */
-	WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) &&
-		     (hw_error_code & X86_PF_PK));
+	if (hw_error_code & X86_PF_PK) {
+		/*
+		 * X86_PF_PK (Protection key exceptions) may occur on kernel
+		 * addresses when PKS (PKeys Supervisor) is enabled.
+		 *
+		 * However, if PKS is not enabled WARN if this exception is
+		 * seen because there are no user pages in the kernel portion
+		 * of the address space.
+		 */
+		WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS));
+
+		if (handle_abandoned_pks_value(regs))
+			return;
+	}
 
 #ifdef CONFIG_X86_32
 	/*
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 146a665d1bf3..56d37840186b 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -221,6 +221,26 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
 static DEFINE_PER_CPU(u32, pkrs_cache);
 u32 __read_mostly pkrs_init_value;
 
+/*
+ * Define a mask of pkeys which are allowed, ie have not been abandoned.
+ * Default is all keys are allowed.
+ */
+#define PKRS_ALLOWED_MASK_DEFAULT 0xffffffff
+u32 __read_mostly pkrs_pkey_allowed_mask;
+
+int handle_abandoned_pks_value(struct pt_regs *regs)
+{
+	struct extended_pt_regs *ept_regs;
+	u32 old;
+
+	ept_regs = extended_pt_regs(regs);
+	old = ept_regs->thread_pkrs;
+	ept_regs->thread_pkrs &= pkrs_pkey_allowed_mask;
+
+	/* If something changed retry the fault */
+	return (ept_regs->thread_pkrs != old);
+}
+
 /*
  * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can
  * be checked quickly.
@@ -267,6 +287,7 @@ static int __init create_initial_pkrs_value(void)
 	BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS);
 
 	pkrs_init_value = 0;
+	pkrs_pkey_allowed_mask = PKRS_ALLOWED_MASK_DEFAULT;
 
 	/* Fill the defaults for the consumers */
 	for (i = 0; i < PKS_NUM_PKEYS; i++)
@@ -297,12 +318,14 @@ void setup_pks(void)
  */
 void pkrs_write_current(void)
 {
+	current->thread.saved_pkrs &= pkrs_pkey_allowed_mask;
 	write_pkrs(current->thread.saved_pkrs);
 }
 
 void pks_init_task(struct task_struct *task)
 {
 	task->thread.saved_pkrs = pkrs_init_value;
+	task->thread.saved_pkrs &= pkrs_pkey_allowed_mask;
 }
 
 bool pks_enabled(void)
@@ -367,4 +390,30 @@ void pks_mk_readwrite(int pkey)
 }
 EXPORT_SYMBOL_GPL(pks_mk_readwrite);
 
+/**
+ * pks_abandon_protections() - Force readwrite (no protections) for the
+ *                             specified pkey
+ * @pkey: The pkey to force
+ *
+ * Force the value of the pkey to readwrite (no protections) thus abandoning
+ * protections for this key.  This is a permanent change and has no
+ * corresponding reversal function.
+ *
+ * This also updates the current running thread.
+ */
+void pks_abandon_protections(int pkey)
+{
+	u32 old_mask, new_mask;
+
+	do {
+		old_mask = READ_ONCE(pkrs_pkey_allowed_mask);
+		new_mask = update_pkey_val(old_mask, pkey, 0);
+	} while (unlikely(
+		 cmpxchg(&pkrs_pkey_allowed_mask, old_mask, new_mask) != old_mask));
+
+	/* Update the local thread as well. */
+	pks_update_protection(pkey, 0);
+}
+EXPORT_SYMBOL_GPL(pks_abandon_protections);
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index b9919ed4d300..4d22ccd971fc 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -60,6 +60,7 @@ bool pks_enabled(void);
 void pks_mk_noaccess(int pkey);
 void pks_mk_readonly(int pkey);
 void pks_mk_readwrite(int pkey);
+void pks_abandon_protections(int pkey);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
@@ -74,6 +75,7 @@ static inline bool pks_enabled(void)
 static inline void pks_mk_noaccess(int pkey) {}
 static inline void pks_mk_readonly(int pkey) {}
 static inline void pks_mk_readwrite(int pkey) {}
+static inline void pks_abandon_protections(int pkey) {}
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 11/18] x86/pks: Add PKS Test code
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (9 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 10/18] x86/pks: Introduce pks_abandon_protections() ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 12/18] x86/pks: Add PKS fault callbacks ira.weiny
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Andy Lutomirski, H. Peter Anvin,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

The core PKS functionality provides an interface for kernel users to
reserve a key and set up page tables with that key.

Define test code under CONFIG_PKS_TEST which exercises the core
functionality of PKS via a debugfs entry.  Basic checks can be triggered
on boot with a kernel command line option while both basic and
preemption checks can be triggered with separate debugfs values.  [See
the comment at the top of pks_test.c for details on the values which can
be used and what tests they run.]
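
As an illustrative sketch only (the selftest below automates this and more), a
userspace program can trigger a run and read back the result, assuming debugfs
is mounted at /sys/kernel/debug:

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                char result[32] = "";
                int fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);

                if (fd < 0)
                        return 1;
                write(fd, "0", 1);        /* run the single pkey tests */
                lseek(fd, 0, SEEK_SET);
                read(fd, result, sizeof(result) - 1);
                printf("%s", result);     /* "PASS" or "FAIL" */
                close(fd);
                return 0;
        }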

CONFIG_PKS_TEST enables ARCH_ENABLE_SUPERVISOR_PKEYS but cannot
co-exist with any GENERAL_PKS_USER.  This is because the test code
iterates through all the keys, which is not useful in general kernel
configs.  General PKS users should select GENERAL_PKS_USER, which
disables PKS_TEST as well as enabling ARCH_ENABLE_SUPERVISOR_PKEYS.

A PKey is not reserved for this test and the test code defines its own
PKS_KEY_PKS_TEST.

To test pks_abandon_protections(), each test requires the thread to be
re-run after resetting the abandoned mask value.  Do this by allowing
the test code access to the abandoned mask value.

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7
	Add testing for pks_abandon_protections()
	Adjust pkrs_init_value
	Adjust for new defines
	Clean up comments
        Adjust test for static allocation of pkeys
        Use lookup_address() instead of follow_pte()
		follow_pte only works on IO and raw PFN mappings, use
		lookup_address() instead.  lookup_address() is
		constrained to architectures which support it.
---
 Documentation/core-api/protection-keys.rst |   6 +-
 arch/x86/include/asm/pks.h                 |  18 +
 arch/x86/mm/fault.c                        |   8 +
 arch/x86/mm/pkeys.c                        |  18 +-
 lib/Kconfig.debug                          |  13 +
 lib/Makefile                               |   3 +
 lib/pks/Makefile                           |   3 +
 lib/pks/pks_test.c                         | 864 +++++++++++++++++++++
 mm/Kconfig                                 |   5 +-
 tools/testing/selftests/x86/Makefile       |   2 +-
 tools/testing/selftests/x86/test_pks.c     | 157 ++++
 11 files changed, 1092 insertions(+), 5 deletions(-)
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 202088634fa3..8cf7eaaed3e5 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -122,8 +122,8 @@ available in future non-server Intel parts.
 There is also some support in qemu: https://www.qemu.org/2021/04/30/qemu-6-0-0/
 
 Kernel users intending to use PKS support should depend on
-ARCH_HAS_SUPERVISOR_PKEYS, and add their config to ARCH_ENABLE_SUPERVISOR_PKEYS
-to turn on this support within the core.
+ARCH_HAS_SUPERVISOR_PKEYS, and add their config to GENERAL_PKS_USER to turn on
+this support within the core.
 
 Users reserve a key value by adding an entry to the enum pks_pkey_consumers and
 defining the initial protections in the consumer_defaults[] array.
@@ -188,3 +188,5 @@ text:
 	affected by PKRU register will not execute (even transiently)
 	until all prior executions of WRPKRU have completed execution
 	and updated the PKRU register.
+
+Example code can be found in lib/pks/pks_test.c
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index ed293ef4509e..e28413cc410d 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -39,4 +39,22 @@ static inline int handle_abandoned_pks_value(struct pt_regs *regs)
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
+
+#ifdef CONFIG_PKS_TEST
+
+#define __static_or_pks_test
+
+bool pks_test_callback(struct pt_regs *regs);
+
+#else /* !CONFIG_PKS_TEST */
+
+#define __static_or_pks_test static
+
+static inline bool pks_test_callback(struct pt_regs *regs)
+{
+	return false;
+}
+
+#endif /* CONFIG_PKS_TEST */
+
 #endif /* _ASM_X86_PKS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index bf3353d8e011..3780ed0f9597 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1154,6 +1154,14 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		 */
 		WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS));
 
+		/*
+		 * If a protection key exception occurs it could be because a PKS test
+		 * is running.  If so, pks_test_callback() will clear the protection
+		 * mechanism and return true to indicate the fault was handled.
+		 */
+		if (pks_test_callback(regs))
+			return;
+
 		if (handle_abandoned_pks_value(regs))
 			return;
 	}
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 56d37840186b..c7358662ec07 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -218,7 +218,7 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
 
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
-static DEFINE_PER_CPU(u32, pkrs_cache);
+__static_or_pks_test DEFINE_PER_CPU(u32, pkrs_cache);
 u32 __read_mostly pkrs_init_value;
 
 /*
@@ -289,6 +289,22 @@ static int __init create_initial_pkrs_value(void)
 	pkrs_init_value = 0;
 	pkrs_pkey_allowed_mask = PKRS_ALLOWED_MASK_DEFAULT;
 
+	/*
+	 * PKS_TEST is mutually exclusive to any real users of PKS so define a PKS_TEST
+	 * appropriate value.
+	 *
+	 * NOTE: PKey 0 must still be fully permissive for normal kernel mappings to
+	 * work correctly.
+	 */
+	if (IS_ENABLED(CONFIG_PKS_TEST)) {
+		pkrs_init_value = (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \
+				   PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \
+				   PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \
+				   PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
+				   PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15));
+		return 0;
+	}
+
 	/* Fill the defaults for the consumers */
 	for (i = 0; i < PKS_NUM_PKEYS; i++)
 		pkrs_init_value |= PKR_VALUE(i, consumer_defaults[i]);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 831212722924..28579084649d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2650,6 +2650,19 @@ config HYPERV_TESTING
 	help
 	  Select this option to enable Hyper-V vmbus testing.
 
+config PKS_TEST
+	bool "PKey (S)upervisor testing"
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+	depends on !GENERAL_PKS_USER
+	help
+	  Select this option to enable testing of PKS core software and
+	  hardware.  The PKS core provides a mechanism to allocate keys as well
+	  as maintain the protection settings across context switches.
+
+	  Answer N if you don't know what supervisor keys are.
+
+	  If unsure, say N.
+
 endmenu # "Kernel Testing and Coverage"
 
 source "Documentation/Kconfig"
diff --git a/lib/Makefile b/lib/Makefile
index 5efd1b435a37..fc31f2d6d8e4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -360,3 +360,6 @@ obj-$(CONFIG_CMDLINE_KUNIT_TEST) += cmdline_kunit.o
 obj-$(CONFIG_SLUB_KUNIT_TEST) += slub_kunit.o
 
 obj-$(CONFIG_GENERIC_LIB_DEVMEM_IS_ALLOWED) += devmem_is_allowed.o
+
+# PKS test
+obj-y += pks/
diff --git a/lib/pks/Makefile b/lib/pks/Makefile
new file mode 100644
index 000000000000..9daccba4f7c4
--- /dev/null
+++ b/lib/pks/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_PKS_TEST) += pks_test.o
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
new file mode 100644
index 000000000000..679edd487360
--- /dev/null
+++ b/lib/pks/pks_test.c
@@ -0,0 +1,864 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2020 Intel Corporation. All rights reserved.
+ *
+ * Implement PKS testing
+ * Access to run this test can be with a command line parameter
+ * ("pks-test-on-boot") or more detailed tests can be triggered through:
+ *
+ *    /sys/kernel/debug/x86/run_pks
+ *
+ *  debugfs controls are:
+ *
+ *  '0' -- Run access tests with a single pkey
+ *  '1' -- Set up the pkey register with no access for the pkey allocated to
+ *         this fd
+ *  '2' -- Check that the pkey register updated in '1' is still the same.
+ *         (To be used after a forced context switch.)
+ *  '3' -- Allocate all pkeys possible and run tests on each pkey allocated.
+ *         DEFAULT when run at boot.
+ *  '4' -- The same as '0' with additional kernel debugging
+ *  '5' -- The same as '3' with additional kernel debugging
+ *  '6' -- Test abandoning a pkey
+ *  '9' -- Set up and fault on a PKS protected page.  This will crash the
+ *         kernel and requires the option to be specified 2 times in a row.
+ *
+ *  Closing the fd will cleanup and release the pkey, to exercise context
+ *  switch testing a user space program is provided in:
+ *
+ *          .../tools/testing/selftests/x86/test_pks.c
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/debugfs.h>
+#include <linux/delay.h>
+#include <linux/entry-common.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/percpu-defs.h>
+#include <linux/pgtable.h>
+#include <linux/pkeys.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+
+#include <asm/ptrace.h>       /* for struct pt_regs */
+#include <asm/pkeys_common.h>
+#include <asm/processor.h>
+#include <asm/pks.h>
+
+/*
+ * PKS testing uses all pkeys but define 1 key to use for some tests.  Any
+ * value from [1-PKS_NUM_PKEYS) will work.
+ */
+#define PKS_KEY_PKS_TEST 1
+#define PKS_TEST_MEM_SIZE (PAGE_SIZE)
+
+#define RUN_ALLOCATE            "0"
+#define ARM_CTX_SWITCH          "1"
+#define CHECK_CTX_SWITCH        "2"
+#define RUN_ALLOCATE_ALL        "3"
+#define RUN_ALLOCATE_DEBUG      "4"
+#define RUN_ALLOCATE_ALL_DEBUG  "5"
+#define RUN_DISABLE_TEST        "6"
+#define RUN_CRASH_TEST          "9"
+
+/* The testing needs some knowledge of the internals */
+DECLARE_PER_CPU(u32, pkrs_cache);
+extern u32 pkrs_pkey_allowed_mask;
+
+/*
+ * run_on_boot defaults to false; checkpatch complains about explicitly
+ * initializing it to false, so don't.
+ */
+static bool run_on_boot;
+static struct dentry *pks_test_dentry;
+static bool run_9;
+
+/*
+ * The following globals must be protected for brief periods while the fault
+ * handler checks/updates them.
+ */
+static DEFINE_SPINLOCK(test_lock);
+static int test_armed_key;
+static unsigned long prev_cnt;
+static unsigned long fault_cnt;
+
+struct pks_test_ctx {
+	bool pass;
+	bool pks_cpu_enabled;
+	bool debug;
+	int pkey;
+	char data[64];
+};
+static struct pks_test_ctx *test_exception_ctx;
+
+static bool check_pkey_val(u32 pk_reg, int pkey, u32 expected)
+{
+	pk_reg = (pk_reg & PKR_PKEY_MASK(pkey)) >> PKR_PKEY_SHIFT(pkey);
+	return (pk_reg == expected);
+}
+
+/*
+ * Check if the register @pkey value matches @expected value
+ *
+ * Both the cached and actual MSR must match.
+ */
+static bool check_pkrs(int pkey, u32 expected)
+{
+	bool ret = true;
+	u64 pkrs;
+	u32 *tmp_cache;
+
+	tmp_cache = get_cpu_ptr(&pkrs_cache);
+	if (!check_pkey_val(*tmp_cache, pkey, expected))
+		ret = false;
+	put_cpu_ptr(tmp_cache);
+
+	rdmsrl(MSR_IA32_PKRS, pkrs);
+	if (!check_pkey_val(pkrs, pkey, expected))
+		ret = false;
+
+	return ret;
+}
+
+static void check_exception(u32 thread_pkrs)
+{
+	/* Check the thread saved state */
+	if (!check_pkey_val(thread_pkrs, test_armed_key, PKEY_DISABLE_WRITE)) {
+		pr_err("     FAIL: checking ept_regs->thread_pkrs\n");
+		test_exception_ctx->pass = false;
+	}
+
+	/* Check the exception state */
+	if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) {
+		pr_err("     FAIL: PKRS cache and MSR\n");
+		test_exception_ctx->pass = false;
+	}
+
+	/*
+	 * Ensure an update can occur during exception without affecting the
+	 * interrupted thread.  The interrupted thread is checked after
+	 * exception...
+	 */
+	pks_mk_readwrite(test_armed_key);
+	if (!check_pkrs(test_armed_key, 0)) {
+		pr_err("     FAIL: exception did not change register to 0\n");
+		test_exception_ctx->pass = false;
+	}
+	pks_mk_noaccess(test_armed_key);
+	if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) {
+		pr_err("     FAIL: exception did not change register to 0x%x\n",
+			PKEY_DISABLE_ACCESS);
+		test_exception_ctx->pass = false;
+	}
+}
+
+/**
+ * pks_test_callback() is exported so that the fault handler can detect
+ * and report back status of intentional faults.
+ *
+ * NOTE: It clears the protection key from the page such that the fault handler
+ * will not re-trigger.
+ */
+bool pks_test_callback(struct pt_regs *regs)
+{
+	struct extended_pt_regs *ept_regs = extended_pt_regs(regs);
+	bool armed = (test_armed_key != 0);
+
+	if (test_exception_ctx) {
+		check_exception(ept_regs->thread_pkrs);
+		/*
+		 * Stop this check directly within the exception because the
+		 * fault handler clean up code will call again while checking
+		 * the PMD entry and there is no need to check this again.
+		 */
+		test_exception_ctx = NULL;
+	}
+
+	if (armed) {
+		/* Enable read and write to stop faults */
+		ept_regs->thread_pkrs = update_pkey_val(ept_regs->thread_pkrs,
+							test_armed_key, 0);
+		fault_cnt++;
+	}
+
+	return armed;
+}
+
+static bool exception_caught(void)
+{
+	bool ret = (fault_cnt != prev_cnt);
+
+	prev_cnt = fault_cnt;
+	return ret;
+}
+
+static void report_pkey_settings(void *info)
+{
+	u8 pkey;
+	unsigned long long msr = 0;
+	unsigned int cpu = smp_processor_id();
+	struct pks_test_ctx *ctx = info;
+
+	rdmsrl(MSR_IA32_PKRS, msr);
+
+	pr_info("for CPU %d : 0x%llx\n", cpu, msr);
+
+	if (ctx->debug) {
+		for (pkey = 0; pkey < PKS_NUM_PKEYS; pkey++) {
+			int ad, wd;
+
+			ad = (msr >> PKR_PKEY_SHIFT(pkey)) & PKEY_DISABLE_ACCESS;
+			wd = (msr >> PKR_PKEY_SHIFT(pkey)) & PKEY_DISABLE_WRITE;
+			pr_info("   %u: A:%d W:%d\n", pkey, ad, wd);
+		}
+	}
+}
+
+enum pks_access_mode {
+	PKS_TEST_NO_ACCESS,
+	PKS_TEST_RDWR,
+	PKS_TEST_RDONLY
+};
+
+static char *get_mode_str(enum pks_access_mode mode)
+{
+	switch (mode) {
+	case PKS_TEST_NO_ACCESS:
+		return "No Access";
+	case PKS_TEST_RDWR:
+		return "Read Write";
+	case PKS_TEST_RDONLY:
+		return "Read Only";
+	default:
+		pr_err("BUG in test invalid mode\n");
+		break;
+	}
+
+	return "";
+}
+
+struct pks_access_test {
+	enum pks_access_mode mode;
+	bool write;
+	bool exception;
+};
+
+static struct pks_access_test pkey_test_ary[] = {
+	/*  disable both */
+	{ PKS_TEST_NO_ACCESS, true,  true },
+	{ PKS_TEST_NO_ACCESS, false, true },
+
+	/*  enable both */
+	{ PKS_TEST_RDWR, true,  false },
+	{ PKS_TEST_RDWR, false, false },
+
+	/*  enable read only */
+	{ PKS_TEST_RDONLY, true,  true },
+	{ PKS_TEST_RDONLY, false, false },
+};
+
+static int test_it(struct pks_test_ctx *ctx, struct pks_access_test *test,
+		   void *ptr, bool forced_sched)
+{
+	bool exception;
+	int ret = 0;
+
+	spin_lock(&test_lock);
+	WRITE_ONCE(test_armed_key, ctx->pkey);
+
+	if (test->write)
+		memcpy(ptr, ctx->data, 8);
+	else
+		memcpy(ctx->data, ptr, 8);
+
+	exception = exception_caught();
+
+	WRITE_ONCE(test_armed_key, 0);
+	spin_unlock(&test_lock);
+
+	/*
+	 * After a forced schedule the allowed mask should be applied on
+	 * sched_in and therefore no exception should ever be seen.
+	 */
+	if (forced_sched && exception) {
+		pr_err("pkey test FAILED: mode %s; write %s; exception %s != %s; sched TRUE\n",
+			get_mode_str(test->mode),
+			test->write ? "TRUE" : "FALSE",
+			test->exception ? "TRUE" : "FALSE",
+			exception ? "TRUE" : "FALSE");
+		ret = -EFAULT;
+	} else if (test->exception != exception) {
+		pr_err("pkey test FAILED: mode %s; write %s; exception %s != %s\n",
+			get_mode_str(test->mode),
+			test->write ? "TRUE" : "FALSE",
+			test->exception ? "TRUE" : "FALSE",
+			exception ? "TRUE" : "FALSE");
+		ret = -EFAULT;
+	}
+
+	return ret;
+}
+
+static int run_access_test(struct pks_test_ctx *ctx,
+			   struct pks_access_test *test,
+			   void *ptr,
+			   bool forced_sched)
+{
+	switch (test->mode) {
+	case PKS_TEST_NO_ACCESS:
+		pks_mk_noaccess(ctx->pkey);
+		break;
+	case PKS_TEST_RDWR:
+		pks_mk_readwrite(ctx->pkey);
+		break;
+	case PKS_TEST_RDONLY:
+		pks_mk_readonly(ctx->pkey);
+		break;
+	default:
+		pr_err("BUG in test invalid mode\n");
+		break;
+	}
+
+	return test_it(ctx, test, ptr, forced_sched);
+}
+
+static void *alloc_test_page(int pkey)
+{
+	return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START, VMALLOC_END,
+				    GFP_KERNEL, PAGE_KERNEL_PKEY(pkey), 0,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+
+static void test_mem_access(struct pks_test_ctx *ctx)
+{
+	int i, rc;
+	u8 pkey;
+	void *ptr = NULL;
+	pte_t *ptep = NULL;
+	unsigned int level;
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("Failed to vmalloc page???\n");
+		ctx->pass = false;
+		return;
+	}
+
+	ptep = lookup_address((unsigned long)ptr, &level);
+	if (!ptep) {
+		pr_err("Failed to lookup address???\n");
+		ctx->pass = false;
+		goto done;
+	}
+
+	pr_info("lookup address ptr %p ptep %p\n",
+		ptr, ptep);
+
+	pkey = pte_flags_pkey(ptep->pte);
+	pr_info("ptep flags 0x%lx pkey %u\n",
+		(unsigned long)ptep->pte, pkey);
+
+	if (pkey != ctx->pkey) {
+		pr_err("invalid pkey found: %u, test_pkey: %u\n",
+			pkey, ctx->pkey);
+		ctx->pass = false;
+		goto done;
+	}
+
+	if (!ctx->pks_cpu_enabled) {
+		pr_err("not CPU enabled; skipping access tests...\n");
+		ctx->pass = true;
+		goto done;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(pkey_test_ary); i++) {
+		rc = run_access_test(ctx, &pkey_test_ary[i], ptr, false);
+
+		/* only saving the last error is fine */
+		if (rc)
+			ctx->pass = false;
+	}
+
+done:
+	vfree(ptr);
+}
+
+static void pks_run_test(struct pks_test_ctx *ctx)
+{
+	ctx->pass = true;
+
+	pr_info("\n");
+	pr_info("\n");
+	pr_info("     ***** BEGIN: Testing (CPU enabled : %s) *****\n",
+		ctx->pks_cpu_enabled ? "TRUE" : "FALSE");
+
+	if (ctx->pks_cpu_enabled)
+		on_each_cpu(report_pkey_settings, ctx, 1);
+
+	pr_info("           BEGIN: pkey %d Testing\n", ctx->pkey);
+	test_mem_access(ctx);
+	pr_info("           END: PAGE_KERNEL_PKEY Testing : %s\n",
+		ctx->pass ? "PASS" : "FAIL");
+
+	pr_info("     ***** END: Testing *****\n");
+	pr_info("\n");
+	pr_info("\n");
+}
+
+static ssize_t pks_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	struct pks_test_ctx *ctx = file->private_data;
+	char buf[32];
+	unsigned int len;
+
+	if (!ctx)
+		len = sprintf(buf, "not run\n");
+	else
+		len = sprintf(buf, "%s\n", ctx->pass ? "PASS" : "FAIL");
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static struct pks_test_ctx *alloc_ctx(u8 pkey)
+{
+	struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+
+	if (!ctx) {
+		pr_err("Failed to allocate memory for test context\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pkey = pkey;
+	ctx->pks_cpu_enabled = cpu_feature_enabled(X86_FEATURE_PKS);
+	sprintf(ctx->data, "%s", "DEADBEEF");
+	return ctx;
+}
+
+static void free_ctx(struct pks_test_ctx *ctx)
+{
+	kfree(ctx);
+}
+
+static void run_exception_test(void)
+{
+	void *ptr = NULL;
+	bool pass = true;
+	struct pks_test_ctx *ctx;
+
+	pr_info("     ***** BEGIN: exception checking\n");
+
+	ctx = alloc_ctx(PKS_KEY_PKS_TEST);
+	if (IS_ERR(ctx)) {
+		pr_err("     FAIL: no context\n");
+		pass = false;
+		goto result;
+	}
+	ctx->pass = true;
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("     FAIL: no vmalloc page\n");
+		pass = false;
+		goto free_context;
+	}
+
+	pks_mk_readonly(ctx->pkey);
+
+	spin_lock(&test_lock);
+	WRITE_ONCE(test_exception_ctx, ctx);
+	WRITE_ONCE(test_armed_key, ctx->pkey);
+
+	memcpy(ptr, ctx->data, 8);
+
+	if (!exception_caught()) {
+		pr_err("     FAIL: did not get an exception\n");
+		pass = false;
+	}
+
+	/*
+	 * NOTE The exception code has to enable access (b00) to keep the fault
+	 * from looping forever.  Therefore full access is seen here rather
+	 * than write disabled.
+	 *
+	 * Furthermore, check_exception() disabled access during the exception,
+	 * so this also verifies that the thread's saved value was properly
+	 * restored after the exception.
+	 */
+	if (!check_pkrs(test_armed_key, 0)) {
+		pr_err("     FAIL: PKRS not restored\n");
+		pass = false;
+	}
+
+	if (!ctx->pass)
+		pass = false;
+
+	WRITE_ONCE(test_armed_key, 0);
+	spin_unlock(&test_lock);
+
+	vfree(ptr);
+free_context:
+	free_ctx(ctx);
+result:
+	pr_info("     ***** END: exception checking : %s\n",
+		 pass ? "PASS" : "FAIL");
+}
+
+static struct pks_access_test abandon_test_ary[] = {
+	/*  disable both */
+	{ PKS_TEST_NO_ACCESS, true,  false },
+	{ PKS_TEST_NO_ACCESS, false, false },
+
+	/*  enable both */
+	{ PKS_TEST_RDWR, true,  false },
+	{ PKS_TEST_RDWR, false, false },
+
+	/*  enable read only */
+	{ PKS_TEST_RDONLY, true,  false },
+	{ PKS_TEST_RDONLY, false, false },
+};
+
+static DEFINE_SPINLOCK(abandoned_test_lock);
+struct shared_data {
+	struct pks_test_ctx *ctx;
+	void *kmap_addr;
+	struct pks_access_test *test;
+	bool thread_running;
+	bool sched_thread;
+};
+
+static int abandoned_test_main(void *d)
+{
+	struct shared_data *data = d;
+	struct pks_test_ctx *ctx = data->ctx;
+
+	spin_lock(&abandoned_test_lock);
+	data->thread_running = true;
+	spin_unlock(&abandoned_test_lock);
+
+	while (!kthread_should_stop()) {
+		spin_lock(&abandoned_test_lock);
+		if (data->kmap_addr) {
+			pr_info("     Thread ->saved_pkrs Before 0x%x (%d)\n",
+				current->thread.saved_pkrs, ctx->pkey);
+			if (data->sched_thread)
+				msleep(20);
+			if (run_access_test(ctx, data->test, data->kmap_addr,
+					    data->sched_thread))
+				ctx->pass = false;
+			pr_info("     Thread Remote ->saved_pkrs After 0x%x (%d)\n",
+				current->thread.saved_pkrs, ctx->pkey);
+			data->kmap_addr = NULL;
+		}
+		spin_unlock(&abandoned_test_lock);
+	}
+
+	return 0;
+}
+
+static void run_abandon_pkey_test(struct pks_test_ctx *ctx,
+				  struct pks_access_test *test,
+				  void *ptr,
+				  bool sched_thread)
+{
+	struct task_struct *other_task;
+	struct shared_data data;
+	bool running = false;
+
+	pr_info("checking...  mode %s; write %s\n",
+			get_mode_str(test->mode), test->write ? "TRUE" : "FALSE");
+
+	pkrs_pkey_allowed_mask = 0xffffffff;
+
+	memset(&data, 0, sizeof(data));
+	data.ctx = ctx;
+	data.thread_running = false;
+	data.sched_thread = sched_thread;
+	other_task = kthread_run(abandoned_test_main, &data, "PKRS abandoned test");
+	if (IS_ERR(other_task)) {
+		pr_err("     FAIL: Failed to start thread\n");
+		ctx->pass = false;
+		return;
+	}
+
+	while (!running) {
+		spin_lock(&abandoned_test_lock);
+		running = data.thread_running;
+		spin_unlock(&abandoned_test_lock);
+	}
+
+	spin_lock(&abandoned_test_lock);
+	pr_info("Local ->saved_pkrs Before 0x%x (%d)\n",
+		current->thread.saved_pkrs, ctx->pkey);
+	pks_abandon_protections(ctx->pkey);
+	data.test = test;
+	data.kmap_addr = ptr;
+	spin_unlock(&abandoned_test_lock);
+
+	while (data.kmap_addr)
+		msleep(20);
+
+	pr_info("Local ->saved_pkrs After 0x%x (%d)\n",
+		current->thread.saved_pkrs, ctx->pkey);
+
+	kthread_stop(other_task);
+}
+
+static void run_abandoned_test(void)
+{
+	struct pks_test_ctx *ctx;
+	bool pass = true;
+	void *ptr;
+	int i;
+
+	pr_info("     ***** BEGIN: abandoned pkey checking\n");
+
+	ctx = alloc_ctx(PKS_KEY_PKS_TEST);
+	if (IS_ERR(ctx)) {
+		pr_err("     FAIL: no context\n");
+		pass = false;
+		goto result;
+	}
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("     FAIL: no vmalloc page\n");
+		pass = false;
+		goto free_context;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(abandon_test_ary); i++) {
+		ctx->pass = true;
+		run_abandon_pkey_test(ctx, &abandon_test_ary[i], ptr, false);
+		/* sticky fail */
+		if (!ctx->pass)
+			pass = ctx->pass;
+
+		ctx->pass = true;
+		run_abandon_pkey_test(ctx, &abandon_test_ary[i], ptr, true);
+		/* sticky fail */
+		if (!ctx->pass)
+			pass = ctx->pass;
+	}
+
+	/* Force re-enable all keys */
+	pkrs_pkey_allowed_mask = 0xffffffff;
+
+	vfree(ptr);
+free_context:
+	free_ctx(ctx);
+result:
+	pr_info("     ***** END: abandoned pkey checking : %s\n",
+		 pass ? "PASS" : "FAIL");
+}
+
+static void run_all(bool debug)
+{
+	struct pks_test_ctx *ctx[PKS_NUM_PKEYS];
+	static char name[PKS_NUM_PKEYS][64];
+	int i;
+
+	for (i = 1; i < PKS_NUM_PKEYS; i++) {
+		sprintf(name[i], "pks ctx %d", i);
+		ctx[i] = alloc_ctx(i);
+		if (!IS_ERR(ctx[i]))
+			ctx[i]->debug = debug;
+	}
+
+	for (i = 1; i < PKS_NUM_PKEYS; i++) {
+		if (!IS_ERR(ctx[i]))
+			pks_run_test(ctx[i]);
+	}
+
+	for (i = 1; i < PKS_NUM_PKEYS; i++) {
+		if (!IS_ERR(ctx[i]))
+			free_ctx(ctx[i]);
+	}
+
+	run_exception_test();
+
+	run_abandoned_test();
+}
+
+static void crash_it(void)
+{
+	struct pks_test_ctx *ctx;
+	void *ptr;
+
+	pr_warn("     ***** BEGIN: Unhandled fault test *****\n");
+
+	ctx = alloc_ctx(PKS_KEY_PKS_TEST);
+	if (IS_ERR(ctx)) {
+		pr_err("Failed to allocate context???\n");
+		return;
+	}
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("Failed to vmalloc page???\n");
+		ctx->pass = false;
+		return;
+	}
+
+	pks_mk_noaccess(ctx->pkey);
+
+	spin_lock(&test_lock);
+	WRITE_ONCE(test_armed_key, 0);
+	/* This purposely faults */
+	memcpy(ptr, ctx->data, 8);
+	spin_unlock(&test_lock);
+
+	vfree(ptr);
+	free_ctx(ctx);
+}
+
+static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
+			      size_t count, loff_t *ppos)
+{
+	char buf[2];
+	struct pks_test_ctx *ctx = file->private_data;
+
+	if (copy_from_user(buf, user_buf, 1))
+		return -EFAULT;
+	buf[1] = '\0';
+
+	/*
+	 * WARNING: Test "9" will crash the kernel.
+	 *
+	 * Arm the test and print a warning.  A second "9" will run the test.
+	 */
+	if (!strcmp(buf, RUN_CRASH_TEST)) {
+		if (run_9) {
+			crash_it();
+			run_9 = false;
+		} else {
+			pr_warn("CAUTION: Test 9 will crash in the kernel.\n");
+			pr_warn("         Specify 9 a second time to run\n");
+			pr_warn("         run any other test to clear\n");
+			run_9 = true;
+		}
+	} else {
+		run_9 = false;
+	}
+
+	/*
+	 * Test "3" will test allocating all keys. Do it first without
+	 * using "ctx".
+	 */
+	if (!strcmp(buf, RUN_ALLOCATE_ALL))
+		run_all(false);
+	if (!strcmp(buf, RUN_ALLOCATE_ALL_DEBUG))
+		run_all(true);
+
+	if (!strcmp(buf, RUN_DISABLE_TEST))
+		run_abandoned_test();
+
+	/*
+	 * This context is only required if the file is held open for the below
+	 * tests.  Otherwise the context just gets freed in pks_release_file.
+	 */
+	if (!ctx) {
+		ctx = alloc_ctx(PKS_KEY_PKS_TEST);
+		if (IS_ERR(ctx))
+			return -ENOMEM;
+		file->private_data = ctx;
+	}
+
+	if (!strcmp(buf, RUN_ALLOCATE)) {
+		ctx->debug = false;
+		pks_run_test(ctx);
+	}
+	if (!strcmp(buf, RUN_ALLOCATE_DEBUG)) {
+		ctx->debug = true;
+		pks_run_test(ctx);
+	}
+
+	/* start of context switch test */
+	if (!strcmp(buf, ARM_CTX_SWITCH)) {
+		unsigned long reg_pkrs;
+		int access;
+
+		/* Ensure a known state to test context switch */
+		pks_mk_readwrite(ctx->pkey);
+
+		rdmsrl(MSR_IA32_PKRS, reg_pkrs);
+
+		access = (reg_pkrs >> PKR_PKEY_SHIFT(ctx->pkey)) &
+			  PKEY_ACCESS_MASK;
+		pr_info("Context switch armed : pkey %d: 0x%x reg: 0x%lx\n",
+			ctx->pkey, access, reg_pkrs);
+	}
+
+	/* After context switch msr should be restored */
+	if (!strcmp(buf, CHECK_CTX_SWITCH) && ctx->pks_cpu_enabled) {
+		unsigned long reg_pkrs;
+		int access;
+
+		rdmsrl(MSR_IA32_PKRS, reg_pkrs);
+
+		access = (reg_pkrs >> PKR_PKEY_SHIFT(ctx->pkey)) &
+			  PKEY_ACCESS_MASK;
+		if (access != 0) {
+			ctx->pass = false;
+			pr_err("Context switch check failed: pkey %d: 0x%x reg: 0x%lx\n",
+				ctx->pkey, access, reg_pkrs);
+		} else {
+			pr_err("Context switch check passed: pkey %d: 0x%x reg: 0x%lx\n",
+				ctx->pkey, access, reg_pkrs);
+		}
+	}
+
+	return count;
+}
+
+static int pks_release_file(struct inode *inode, struct file *file)
+{
+	struct pks_test_ctx *ctx = file->private_data;
+
+	if (!ctx)
+		return 0;
+
+	free_ctx(ctx);
+	return 0;
+}
+
+static const struct file_operations fops_init_pks = {
+	.read = pks_read_file,
+	.write = pks_write_file,
+	.llseek = default_llseek,
+	.release = pks_release_file,
+};
+
+static int __init parse_pks_test_options(char *str)
+{
+	run_on_boot = true;
+
+	return 0;
+}
+early_param("pks-test-on-boot", parse_pks_test_options);
+
+static int __init pks_test_init(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_PKS)) {
+		if (run_on_boot)
+			run_all(true);
+
+		pks_test_dentry = debugfs_create_file("run_pks", 0600, arch_debugfs_dir,
+						      NULL, &fops_init_pks);
+	}
+
+	return 0;
+}
+late_initcall(pks_test_init);
+
+static void __exit pks_test_exit(void)
+{
+	debugfs_remove(pks_test_dentry);
+	pr_info("test exit\n");
+}
diff --git a/mm/Kconfig b/mm/Kconfig
index e0d29c655ade..ea6ffee69f55 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -820,8 +820,11 @@ config ARCH_HAS_PKEYS
 	bool
 config ARCH_HAS_SUPERVISOR_PKEYS
 	bool
+config GENERAL_PKS_USER
+	def_bool n
 config ARCH_ENABLE_SUPERVISOR_PKEYS
-	bool
+	def_bool y
+	depends on PKS_TEST || GENERAL_PKS_USER
 
 config PERCPU_STATS
 	bool "Collect percpu memory statistics"
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index b4142cd1c5c2..b2f852f0e7e1 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -13,7 +13,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
 			check_initial_reg_state sigreturn iopl ioperm \
 			test_vsyscall mov_ss_trap \
-			syscall_arg_fault fsgsbase_restore sigaltstack
+			syscall_arg_fault fsgsbase_restore sigaltstack test_pks
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
new file mode 100644
index 000000000000..c12b38760c9c
--- /dev/null
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2020 Intel Corporation. All rights reserved.
+ *
+ * User space tool to test PKS operations.  Accesses test code through
+ * <debugfs>/x86/run_pks when CONFIG_PKS_TEST is enabled.
+ *
+ */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <unistd.h>
+#include <assert.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <string.h>
+#include <stdbool.h>
+
+#define PKS_TEST_FILE "/sys/kernel/debug/x86/run_pks"
+
+#define RUN_ALLOCATE            "0"
+#define SETUP_CTX_SWITCH        "1"
+#define CHECK_CTX_SWITCH        "2"
+#define RUN_ALLOCATE_ALL        "3"
+#define RUN_ALLOCATE_DEBUG      "4"
+#define RUN_ALLOCATE_ALL_DEBUG  "5"
+#define RUN_DISABLE_TEST        "6"
+#define RUN_CRASH_TEST          "9"
+
+int main(int argc, char *argv[])
+{
+	cpu_set_t cpuset;
+	char result[32];
+	pid_t pid;
+	int fd;
+	int setup_done[2];
+	int switch_done[2];
+	int cpu = 0;
+	int rc = 0;
+	int c;
+	bool debug = false;
+
+	while (1) {
+		int option_index = 0;
+		static struct option long_options[] = {
+			{"debug",  no_argument,	  0,  0 },
+			{0,	   0,		  0,  0 }
+		};
+
+		c = getopt_long(argc, argv, "", long_options, &option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 0:
+			debug = true;
+			break;
+		}
+	}
+
+	if (optind < argc)
+		cpu = strtoul(argv[optind], NULL, 0);
+
+	if (cpu >= sysconf(_SC_NPROCESSORS_ONLN)) {
+		printf("CPU %d is invalid\n", cpu);
+		cpu = sysconf(_SC_NPROCESSORS_ONLN) - 1;
+		printf("   running on max CPU: %d\n", cpu);
+	}
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	/* Pin both processes to the same CPU so they context switch with each other. */
+	sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
+
+	if (pipe(setup_done))
+		printf("Failed to create pipe\n");
+	if (pipe(switch_done))
+		printf("Failed to create pipe\n");
+
+	pid = fork();
+	if (pid == 0) {
+		char done = 'y';
+
+		fd = open(PKS_TEST_FILE, O_RDWR);
+		if (fd < 0) {
+			printf("cannot open %s\n", PKS_TEST_FILE);
+			return -1;
+		}
+
+		cpu = sched_getcpu();
+		printf("Child running on cpu %d...\n", cpu);
+
+		/* Allocate test_pkey1 and run test. */
+		if (debug)
+			write(fd, RUN_ALLOCATE_DEBUG, 1);
+		else
+			write(fd, RUN_ALLOCATE, 1);
+
+		/* Arm for context switch test */
+		write(fd, SETUP_CTX_SWITCH, 1);
+
+		printf("   tell parent to go\n");
+		write(setup_done[1], &done, sizeof(done));
+
+		/* Context switch out... */
+		printf("   Waiting for parent...\n");
+		read(switch_done[0], &done, sizeof(done));
+
+		/* Check msr restored */
+		printf("Checking result\n");
+		write(fd, CHECK_CTX_SWITCH, 1);
+
+		read(fd, result, 10);
+		printf("   #PF, context switch, pkey allocation and free tests: %s\n", result);
+		if (strncmp(result, "PASS", 10)) {
+			rc = -1;
+			done = 'F';
+		}
+
+		/* Signal result */
+		write(setup_done[1], &done, sizeof(done));
+	} else {
+		char done = 'y';
+
+		read(setup_done[0], &done, sizeof(done));
+		cpu = sched_getcpu();
+		printf("Parent running on cpu %d\n", cpu);
+
+		fd = open(PKS_TEST_FILE, O_RDWR);
+		if (fd < 0) {
+			printf("cannot open %s\n", PKS_TEST_FILE);
+			return -1;
+		}
+
+		/* run test with alternate pkey */
+		if (debug)
+			write(fd, RUN_ALLOCATE_DEBUG, 1);
+		else
+			write(fd, RUN_ALLOCATE, 1);
+
+		printf("   Signaling child.\n");
+		write(switch_done[1], &done, sizeof(done));
+
+		/* Wait for result */
+		read(setup_done[0], &done, sizeof(done));
+		if (done == 'F')
+			rc = -1;
+	}
+
+	close(fd);
+
+	return rc;
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 12/18] x86/pks: Add PKS fault callbacks
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (10 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 11/18] x86/pks: Add PKS Test code ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-11 21:18   ` Edgecombe, Rick P
  2021-08-04  4:32 ` [PATCH V7 13/18] memremap_pages: Add access protection via supervisor Protection Keys (PKS) ira.weiny
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Rick Edgecombe, Ira Weiny, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Andy Lutomirski, H. Peter Anvin,
	Fenghua Yu, x86, linux-kernel, nvdimm, linux-mm

From: Rick Edgecombe <rick.p.edgecombe@intel.com>

Some PKS keys will want special handling on accesses that violate their
permissions. One of these is PMEM which will want to have a mode that
logs the access violation, disables protection, and continues rather
than oops the machine.

Since PKS faults do not provide the actual key that faulted, this
information needs to be recovered by walking the page tables and
extracting it from the leaf entry.

This infrastructure could be used to implement abandoned pkeys, but adds
support in a separate call such that abandoned pkeys are handled more
quickly by skipping the page table walk.

In pkeys.c, define a new api for setting callbacks for individual pkeys.
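
To make the shape of such a consumer concrete, here is a rough sketch
(it mirrors the documentation example added below; 'MY_FEATURE', its key
and its policy are illustrative only, not part of this series):

	#ifdef CONFIG_MY_FEATURE
	/* Return true if the fault was handled and execution may continue */
	static bool my_feature_pks_fault_callback(unsigned long address,
						  bool write)
	{
		pr_warn_once("Unexpected %s of protected page at 0x%lx\n",
			     write ? "write" : "access", address);
		pks_abandon_protections(PKS_KEY_MY_FEATURE);
		return true;
	}
	#endif

along with a matching [PKS_KEY_MY_FEATURE] entry in pks_key_callbacks[]
in arch/x86/mm/pkeys.c.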

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
Changes for V7:
	New patch
---
 Documentation/core-api/protection-keys.rst | 27 +++++++++++-
 arch/x86/include/asm/pks.h                 |  7 +++
 arch/x86/mm/fault.c                        | 51 ++++++++++++++++++++++
 arch/x86/mm/pkeys.c                        | 13 ++++++
 include/linux/pkeys.h                      |  2 +
 5 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 8cf7eaaed3e5..bbf81b12e67d 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -113,7 +113,8 @@ Kernel API for PKS support
 
 Similar to user space pkeys, supervisor pkeys allow additional protections to
 be defined for a supervisor mappings.  Unlike user space pkeys, violations of
-these protections result in a a kernel oops.
+these protections result in a kernel oops unless a PKS fault handler is
+provided which handles the fault.
 
 Supervisor Memory Protection Keys (PKS) is a feature which is found on Intel's
 Sapphire Rapids (and later) "Scalable Processor" Server CPUs.  It will also be
@@ -145,6 +146,30 @@ Disabled.
         consumer_defaults[PKS_KEY_MY_FEATURE]  = PKR_DISABLE_WRITE;
         ...
 
+
+Users may also provide a fault handler which can handle a fault differently
+than an oops.  Continuing our example from above, if 'MY_FEATURE' wants to
+define a handler it can do so by adding the corresponding entry to the
+pks_key_callbacks array.
+
+::
+
+        #ifdef CONFIG_MY_FEATURE
+        bool my_feature_pks_fault_callback(unsigned long address, bool write)
+        {
+                if (my_feature_fault_is_ok)
+                        return true;
+                return false;
+        }
+        #endif
+
+        static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = {
+                [PKS_KEY_DEFAULT]            = NULL,
+        #ifdef CONFIG_MY_FEATURE
+                [PKS_KEY_MY_FEATURE]         = my_feature_pks_fault_callback,
+        #endif
+        };
+
 The following interface is used to manipulate the 'protection domain' defined
 by a pkey within the kernel.  Setting a pkey value in a supervisor PTE adds
 this additional protection to the page.
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index e28413cc410d..3de5089d379d 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -23,6 +23,7 @@ static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs)
 
 void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code);
 int handle_abandoned_pks_value(struct pt_regs *regs);
+bool handle_pks_key_callback(unsigned long address, bool write, u16 key);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
@@ -36,6 +37,12 @@ static inline int handle_abandoned_pks_value(struct pt_regs *regs)
 {
 	return 0;
 }
+static inline bool handle_pks_key_fault(struct pt_regs *regs,
+					unsigned long hw_error_code,
+					unsigned long address)
+{
+	return false;
+}
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 3780ed0f9597..7a8c807006c7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1134,6 +1134,54 @@ bool fault_in_kernel_space(unsigned long address)
 	return address >= TASK_SIZE_MAX;
 }
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+bool handle_pks_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
+			  unsigned long address)
+{
+	bool write = (hw_error_code & X86_PF_WRITE);
+	pgd_t pgd;
+	p4d_t p4d;
+	pud_t pud;
+	pmd_t pmd;
+	pte_t pte;
+
+	pgd = READ_ONCE(*(init_mm.pgd + pgd_index(address)));
+	if (!pgd_present(pgd))
+		return false;
+
+	p4d = READ_ONCE(*p4d_offset(&pgd, address));
+	if (!p4d_present(p4d))
+		return false;
+
+	if (p4d_large(p4d))
+		return handle_pks_key_callback(address, write,
+					       pte_flags_pkey(p4d_val(p4d)));
+
+	pud = READ_ONCE(*pud_offset(&p4d, address));
+	if (!pud_present(pud))
+		return false;
+
+	if (pud_large(pud))
+		return handle_pks_key_callback(address, write,
+					      pte_flags_pkey(pud_val(pud)));
+
+	pmd = READ_ONCE(*pmd_offset(&pud, address));
+	if (!pmd_present(pmd))
+		return false;
+
+	if (pmd_large(pmd))
+		return handle_pks_key_callback(address, write,
+					      pte_flags_pkey(pmd_val(pmd)));
+
+	pte = READ_ONCE(*pte_offset_kernel(&pmd, address));
+	if (!pte_present(pte))
+		return false;
+
+	return handle_pks_key_callback(address, write,
+				      pte_flags_pkey(pte_val(pte)));
+}
+#endif
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1164,6 +1212,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 
 		if (handle_abandoned_pks_value(regs))
 			return;
+
+		if (handle_pks_key_fault(regs, hw_error_code, address))
+			return;
 	}
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index c7358662ec07..f0166725a128 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -241,6 +241,19 @@ int handle_abandoned_pks_value(struct pt_regs *regs)
 	return (ept_regs->thread_pkrs != old);
 }
 
+static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { 0 };
+
+bool handle_pks_key_callback(unsigned long address, bool write, u16 key)
+{
+	if (key >= PKS_KEY_NR_CONSUMERS)
+		return false;
+
+	if (pks_key_callbacks[key])
+		return pks_key_callbacks[key](address, write);
+
+	return false;
+}
+
 /*
  * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can
  * be checked quickly.
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 4d22ccd971fc..549fa01d7da3 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -62,6 +62,8 @@ void pks_mk_readonly(int pkey);
 void pks_mk_readwrite(int pkey);
 void pks_abandon_protections(int pkey);
 
+typedef bool (*pks_key_callback)(unsigned long address, bool write);
+
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void pkrs_save_irq(struct pt_regs *regs) { }
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 13/18] memremap_pages: Add access protection via supervisor Protection Keys (PKS)
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (11 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 12/18] x86/pks: Add PKS fault callbacks ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode ira.weiny
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

The persistent memory (PMEM) driver uses the memremap_pages facility to
provide 'struct page' metadata (vmemmap) for PMEM.  Given that PMEM
capacity may be orders of magnitude higher than that of System RAM, it
presents a large vulnerability surface to stray writes.  Unlike stray
writes to System RAM, which may result in a crash or other undesirable
behavior, stray writes to PMEM additionally are more likely to result in
permanent data loss. Reboot is not a remediation for PMEM corruption
like it is for System RAM.

Given that PMEM access from the kernel is limited to a constrained set
of locations (PMEM driver, Filesystem-DAX, and direct-I/O to a DAX
page), it is amenable to supervisor pkey protection.  Set up an
infrastructure for thread local access protection. Then implement the
protection using the new Protection Keys Supervisor (PKS) on
architectures that support it.

To enable this extra protection, memremap_pages() users should check for
protection support via pgmap_protection_enabled() and, if enabled,
specify PGMAP_PROTECTION in (struct dev_pagemap)->flags to request the
access protection.
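
As a minimal sketch of that opt-in (mirroring what the pmem and device
dax patches later in the series do; error handling elided):

	if (pgmap_protection_enabled())
		pgmap->flags |= PGMAP_PROTECTION;
	addr = devm_memremap_pages(dev, pgmap);

memremap_pages() itself rejects PGMAP_PROTECTION with -EINVAL when the
protection is not available.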

NOTE: The names pgmap_protection_enabled() and PGMAP_PROTECTION were
specifically chosen to isolate the implementation of the protection from
higher level users.

Kernel code intending to access this memory can do so through 4 new
calls: pgmap_mk_{readwrite,noaccess}() and
__pgmap_mk_{readwrite,noaccess}().

The pgmap_mk_*() calls take a page parameter while the __pgmap_mk_*()
calls take the dev_pagemap object directly.  pgmap_mk_*() take care of
checking whether the page is a pgmap managed page and are safe for any
user who holds a reference on the page.

All changes in the protections must be through the above calls.  They
abstract the protection implementation (currently the PKS api) from the
upper layer users.

Furthermore, the calls are nestable by the use of a per task reference
count.  This ensures that the first call to re-enable protection does
not 'break' the last access of the device memory.

NOTE: There are no code paths which directly nest these calls.  For this
reason multiple reviewers, including Dan and Thomas, asked why this
reference counting was needed at this level rather than in a higher
level call such as kmap_{atomic,local_page}().  The reason is that
pgmap_mk_readwrite() can nest with kmap_{atomic,local_page}().
Therefore push this reference counting to the lower level.
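
For illustration, a rough sketch of a kernel path using the page based
calls (the source buffer and length names are made up):

	/* caller holds a reference on a protected devmap page */
	pgmap_mk_readwrite(page);
	memcpy(page_address(page), src, len);
	pgmap_mk_noaccess(page);

Both calls are no-ops for pages which are not pgmap protected, and the
per-task count lets such a region nest inside the
kmap_{atomic,local_page}() conversions added later in the series.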

Access to device memory during exceptions (#PF) is expected only from
user faults.  Therefore there is no need to maintain the reference count
when entering or exiting exceptions.  However, reference counting will
occur during the exception.  Recall that protection is automatically
enabled during exceptions by the PKS core.[1]

A default of (NVDIMM_PFN && ARCH_HAS_SUPERVISOR_PKEYS) was suggested but
logically that is the same as saying default 'yes' because both
NVDIMM_PFN and ARCH_HAS_SUPERVISOR_PKEYS are required.  Therefore a
default of 'yes' was used.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

[1] https://lore.kernel.org/lkml/20210401225833.566238-9-ira.weiny@intel.com/

---
Changes for V7
	Add __pgmap_mk_*() calls to allow users who have a dev_pagemap
		to call directly into that layer of the API
	Add pgmap_protection_enabled() and fail memremap_pages() if
		protection is requested and pgmap_protection_enabled()
		is false
	s/PGMAP_PKEY_PROTECT/PGMAP_PROTECTION
		This helps to isolate the implementation details of the
		protection from the higher layers.
	s/dev_page_access_ref/pgmap_prot_count
	s/DEV_PAGEMAP_PROTECTION/DEVMAP_ACCESS_PROTECTION
	Adjust Kconfig dependency and default
	Address feedback from Dan Williams
		Add requirement comment to devmap_protected
		Make pgmap_mk_* static inline
		Change to devmap_protected
		Change config to DEV_PAGEMAP_PROTECTION
	Remove dynamic key use from memremap
		This greatly simplifies turning on PKS when requested by
		the remapping code
		#define a static key for pmem use
---
 arch/x86/mm/pkeys.c      |  3 +-
 include/linux/memremap.h |  1 +
 include/linux/mm.h       | 62 ++++++++++++++++++++++++++++++++++
 include/linux/pkeys.h    |  1 +
 include/linux/sched.h    |  7 ++++
 init/init_task.c         |  3 ++
 kernel/fork.c            |  3 ++
 mm/Kconfig               | 18 ++++++++++
 mm/memremap.c            | 73 ++++++++++++++++++++++++++++++++++++++++
 9 files changed, 170 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f0166725a128..cdebc2018888 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -294,7 +294,8 @@ static int __init create_initial_pkrs_value(void)
 	};
 	int i;
 
-	consumer_defaults[PKS_KEY_DEFAULT] = PKR_RW_BIT;
+	consumer_defaults[PKS_KEY_DEFAULT]          = PKR_RW_BIT;
+	consumer_defaults[PKS_KEY_PGMAP_PROTECTION] = PKR_AD_BIT;
 
 	/* Ensure the number of consumers is less than the number of keys */
 	BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c0e9d35889e8..53dc97823418 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -90,6 +90,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROTECTION	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..d3c1a3ecca87 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1198,6 +1198,68 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_pgmap_protection_static_key);
+
+/*
+ * devmap_protected() requires a reference on the page to ensure there are no
+ * races with dev_pagemap tear down.
+ */
+static inline bool devmap_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_pgmap_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROTECTION)
+		return true;
+	return false;
+}
+
+void __pgmap_mk_readwrite(struct dev_pagemap *pgmap);
+void __pgmap_mk_noaccess(struct dev_pagemap *pgmap);
+
+static inline bool pgmap_check_pgmap_prot(struct page *page)
+{
+	if (!devmap_protected(page))
+		return false;
+
+	/*
+	 * There is no known use case to change permissions in an irq for pgmap
+	 * pages
+	 */
+	WARN_ON_ONCE(in_interrupt());
+	return true;
+}
+
+static inline void pgmap_mk_readwrite(struct page *page)
+{
+	if (!pgmap_check_pgmap_prot(page))
+		return;
+	__pgmap_mk_readwrite(page->pgmap);
+}
+static inline void pgmap_mk_noaccess(struct page *page)
+{
+	if (!pgmap_check_pgmap_prot(page))
+		return;
+	__pgmap_mk_noaccess(page->pgmap);
+}
+
+bool pgmap_protection_enabled(void);
+
+#else
+
+static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { }
+static inline void __pgmap_mk_noaccess(struct dev_pagemap *pgmap) { }
+static inline void pgmap_mk_readwrite(struct page *page) { }
+static inline void pgmap_mk_noaccess(struct page *page) { }
+static inline bool pgmap_protection_enabled(void)
+{
+	return false;
+}
+
+#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 549fa01d7da3..c06b47264c5d 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -49,6 +49,7 @@ static inline bool arch_pkeys_enabled(void)
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 enum pks_pkey_consumers {
 	PKS_KEY_DEFAULT = 0, /* Must be 0 for default PTE values */
+	PKS_KEY_PGMAP_PROTECTION,
 	PKS_KEY_NR_CONSUMERS
 };
 extern u32 pkrs_init_value;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec8d07d88641..2d035d9981b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,13 @@ struct task_struct {
 	struct llist_head               kretprobe_instances;
 #endif
 
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	/*
+	 * NOTE: pgmap_prot_count is modified within a single thread of
+	 * execution.  So it does not need to be atomic_t.
+	 */
+	u32                             pgmap_prot_count;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index 562f2ef8d157..f628ad552ee3 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -213,6 +213,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	.pgmap_prot_count = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index bc94b2cc5995..7f7b946f4f2e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -956,6 +956,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	tsk->pgmap_prot_count = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index ea6ffee69f55..201d41269a36 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -790,6 +790,24 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config DEVMAP_ACCESS_PROTECTION
+	bool "Access protection for memremap_pages()"
+	depends on NVDIMM_PFN
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+	select GENERAL_PKS_USER
+	default y
+
+	help
+	  Enable extra protections on device memory.  This protects against
+	  unintended access to device memory such as stray writes.  This feature is
+	  particularly useful to protect against corruption of persistent
+	  memory.
+
+	  This depends on architecture support of supervisor PKeys and has no
+	  overhead if the architecture does not support them.
+
+	  If you have persistent memory say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 15a074ffb8d7..a05de8714916 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,6 +6,7 @@
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
 #include <linux/pfn_t.h>
+#include <linux/pkeys.h>
 #include <linux/swap.h>
 #include <linux/mmzone.h>
 #include <linux/swapops.h>
@@ -63,6 +64,68 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+/*
+ * Note; all devices which have asked for protections share the same key.  The
+ * key may, or may not, have been provided by the core.  If not, protection
+ * will be disabled.  The key acquisition is attempted when the first ZONE
+ * DEVICE requests it and freed when all zones have been unmapped.
+ *
+ * Also this must be EXPORT_SYMBOL rather than EXPORT_SYMBOL_GPL because it is
+ * intended to be used in the kmap API.
+ */
+DEFINE_STATIC_KEY_FALSE(dev_pgmap_protection_static_key);
+EXPORT_SYMBOL(dev_pgmap_protection_static_key);
+
+static void devmap_protection_enable(void)
+{
+	static_branch_inc(&dev_pgmap_protection_static_key);
+}
+
+static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot)
+{
+	pgprotval_t val;
+
+	val = pgprot_val(prot);
+	return __pgprot(val | _PAGE_PKEY(PKS_KEY_PGMAP_PROTECTION));
+}
+
+static void devmap_protection_disable(void)
+{
+	static_branch_dec(&dev_pgmap_protection_static_key);
+}
+
+void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
+{
+	if (!current->pgmap_prot_count++)
+		pks_mk_readwrite(PKS_KEY_PGMAP_PROTECTION);
+}
+EXPORT_SYMBOL_GPL(__pgmap_mk_readwrite);
+
+void __pgmap_mk_noaccess(struct dev_pagemap *pgmap)
+{
+	if (!--current->pgmap_prot_count)
+		pks_mk_noaccess(PKS_KEY_PGMAP_PROTECTION);
+}
+EXPORT_SYMBOL_GPL(__pgmap_mk_noaccess);
+
+bool pgmap_protection_enabled(void)
+{
+	return pks_enabled();
+}
+EXPORT_SYMBOL_GPL(pgmap_protection_enabled);
+
+#else /* !CONFIG_DEVMAP_ACCESS_PROTECTION */
+
+static void devmap_protection_enable(void) { }
+static void devmap_protection_disable(void) { }
+
+static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot)
+{
+	return prot;
+}
+#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct range *range)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -181,6 +244,9 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put(pgmap);
+
+	if (pgmap->flags & PGMAP_PROTECTION)
+		devmap_protection_disable();
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -329,6 +395,13 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
 		return ERR_PTR(-EINVAL);
 
+	if (pgmap->flags & PGMAP_PROTECTION) {
+		if (!pgmap_protection_enabled())
+			return ERR_PTR(-EINVAL);
+		devmap_protection_enable();
+		params.pgprot = devmap_protection_adjust_pgprot(params.pgprot);
+	}
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (12 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 13/18] memremap_pages: Add access protection via supervisor Protection Keys (PKS) ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:57   ` Randy Dunlap
  2021-08-11 19:01   ` Edgecombe, Rick P
  2021-08-04  4:32 ` [PATCH V7 15/18] kmap: Add stray access protection for devmap pages ira.weiny
                   ` (3 subsequent siblings)
  17 siblings, 2 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Some systems may be using pmem in unanticipated ways.  As such it is
possible a code path may violate the restrictions of the PMEM PKS
protections.

In order to provide a more seamless integration of the PMEM PKS feature,
provide a pks_fault_mode parameter that allows for a relaxed mode should
a previously working feature start to fault on PKS protected PMEM.

2 modes are available:

	'relaxed' (default) -- WARN_ONCE, abandon the protections, and
	continue to operate.

	'strict' -- BUG_ON or fault indicating the error.  This is the
	most protective of the PMEM memory but may be undesirable in
	some configurations.

NOTE: There was some debate about if a 3rd mode called 'silent' should
be available.  'silent' would be the same as 'relaxed' but not print any
output.  While 'silent' is nice for admins to reduce console/log output
it would result in less motivation to fix invalid access to the
protected pmem pages.  Therefore, 'silent' is left out.

In addition, kmap() is known to not work with this protection.  Provide
a new call, pgmap_protection_flag_invalid(), which gives better
debugging for missed kmap() users.  This call also respects the
pks_fault_mode settings.
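
As a sketch of the intended use (kmap() in a later patch is the real
caller; the helper below is hypothetical), a code path known to be
incompatible flags itself before handing out the address:

	static void *my_global_map(struct page *page)
	{
		pgmap_protection_flag_invalid(page);
		return page_address(page);
	}

In 'relaxed' mode this warns once and abandons the protections so the
caller keeps working; in 'strict' mode it BUG()s at the call site rather
than waiting for a later fault.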

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7
	Leverage Rick Edgecombe's fault callback infrastructure to relax invalid
		uses and prevent crashes
	From Dan Williams
		Use sysfs_* calls for parameter
		Make pgmap_disable_protection inline
		Remove pfn from warn output
	Remove silent parameter option
---
 .../admin-guide/kernel-parameters.txt         | 14 +++
 arch/x86/mm/pkeys.c                           |  8 +-
 include/linux/mm.h                            | 26 ++++++
 mm/memremap.c                                 | 85 +++++++++++++++++++
 4 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdb22006f713..7902fce7f1da 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4081,6 +4081,20 @@
 	pirq=		[SMP,APIC] Manual mp-table setup
 			See Documentation/x86/i386/IO-APIC.rst.
 
+	memremap.pks_fault_mode=	[X86] Control the behavior of page map
+			protection violations.  Violations may not be an actual
+			use of the memory but simply an attempt to map it in an
+			incompatible way.
+			(depends on CONFIG_DEVMAP_ACCESS_PROTECTION
+
+			Format: { relaxed | strict }
+
+			relaxed - Print a warning, disable the protection and
+				  continue execution.
+			strict - Stop kernel execution via BUG_ON or fault
+
+			default: relaxed
+
 	plip=		[PPT,NET] Parallel port network link
 			Format: { parport<nr> | timid | 0 }
 			See also Documentation/admin-guide/parport.rst.
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index cdebc2018888..201004586c2b 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -9,6 +9,7 @@
 #include <linux/debugfs.h>		/* debugfs_create_u32()		*/
 #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
 #include <linux/pkeys.h>                /* PKEY_*                       */
+#include <linux/mm.h>                   /* fault callback               */
 #include <uapi/asm-generic/mman-common.h>
 
 #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
@@ -241,7 +242,12 @@ int handle_abandoned_pks_value(struct pt_regs *regs)
 	return (ept_regs->thread_pkrs != old);
 }
 
-static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { 0 };
+static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = {
+	[PKS_KEY_DEFAULT]            = NULL,
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	[PKS_KEY_PGMAP_PROTECTION]   = pgmap_pks_fault_callback,
+#endif
+};
 
 bool handle_pks_key_callback(unsigned long address, bool write, u16 key)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d3c1a3ecca87..c13c7af7cad3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1216,6 +1216,7 @@ static inline bool devmap_protected(struct page *page)
 	return false;
 }
 
+void __pgmap_protection_flag_invalid(struct dev_pagemap *pgmap);
 void __pgmap_mk_readwrite(struct dev_pagemap *pgmap);
 void __pgmap_mk_noaccess(struct dev_pagemap *pgmap);
 
@@ -1232,6 +1233,27 @@ static inline bool pgmap_check_pgmap_prot(struct page *page)
 	return true;
 }
 
+/*
+ * pgmap_protection_flag_invalid - Check and flag an invalid use of a pgmap
+ *                                 protected page
+ *
+ * There are code paths which are known to not be compatible with pgmap
+ * protections.  pgmap_protection_flag_invalid() is provided as a 'relief
+ * valve' to be used in those functions which are known to be incompatible.
+ *
+ * Thus an invalid code path can more precisely flag what code contains the
+ * bug vs just flagging a fault.  Like the fault handler code this abandons the
+ * use of the PKS key and optionally allows the calling code path to continue
+ * based on the configuration of the memremap.pks_fault_mode command line
+ * (and/or sysfs) option.
+ */
+static inline void pgmap_protection_flag_invalid(struct page *page)
+{
+	if (!pgmap_check_pgmap_prot(page))
+		return;
+	__pgmap_protection_flag_invalid(page->pgmap);
+}
+
 static inline void pgmap_mk_readwrite(struct page *page)
 {
 	if (!pgmap_check_pgmap_prot(page))
@@ -1247,10 +1269,14 @@ static inline void pgmap_mk_noaccess(struct page *page)
 
 bool pgmap_protection_enabled(void);
 
+bool pgmap_pks_fault_callback(unsigned long address, bool write);
+
 #else
 
 static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { }
 static inline void __pgmap_mk_noaccess(struct dev_pagemap *pgmap) { }
+
+static inline void pgmap_protection_flag_invalid(struct page *page) { }
 static inline void pgmap_mk_readwrite(struct page *page) { }
 static inline void pgmap_mk_noaccess(struct page *page) { }
 static inline bool pgmap_protection_enabled(void)
diff --git a/mm/memremap.c b/mm/memremap.c
index a05de8714916..930b360bad86 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -95,6 +95,91 @@ static void devmap_protection_disable(void)
 	static_branch_dec(&dev_pgmap_protection_static_key);
 }
 
+/*
+ * Ignore the checkpatch warning because the typedef allows
+ * param_check_pks_fault_modes to automatically check the passed value.
+ */
+typedef enum {
+	PKS_MODE_STRICT  = 0,
+	PKS_MODE_RELAXED = 1,
+} pks_fault_modes;
+
+pks_fault_modes pks_fault_mode = PKS_MODE_RELAXED;
+
+static int param_set_pks_fault_mode(const char *val, const struct kernel_param *kp)
+{
+	int ret = -EINVAL;
+
+	if (sysfs_streq(val, "relaxed")) {
+		pks_fault_mode = PKS_MODE_RELAXED;
+		ret = 0;
+	} else if (sysfs_streq(val, "strict")) {
+		pks_fault_mode = PKS_MODE_STRICT;
+		ret = 0;
+	}
+
+	return ret;
+}
+
+static int param_get_pks_fault_mode(char *buffer, const struct kernel_param *kp)
+{
+	int ret = 0;
+
+	switch (pks_fault_mode) {
+	case PKS_MODE_STRICT:
+		ret = sysfs_emit(buffer, "strict\n");
+		break;
+	case PKS_MODE_RELAXED:
+		ret = sysfs_emit(buffer, "relaxed\n");
+		break;
+	default:
+		ret = sysfs_emit(buffer, "<unknown>\n");
+		break;
+	}
+
+	return ret;
+}
+
+static const struct kernel_param_ops param_ops_pks_fault_modes = {
+	.set = param_set_pks_fault_mode,
+	.get = param_get_pks_fault_mode,
+};
+
+#define param_check_pks_fault_modes(name, p) \
+	__param_check(name, p, pks_fault_modes)
+module_param(pks_fault_mode, pks_fault_modes, 0644);
+
+static void pgmap_abandon_protection(void)
+{
+	static bool protections_abandoned = false;
+
+	if (!protections_abandoned) {
+		protections_abandoned = true;
+		pks_abandon_protections(PKS_KEY_PGMAP_PROTECTION);
+	}
+}
+
+void __pgmap_protection_flag_invalid(struct dev_pagemap *pgmap)
+{
+	BUG_ON(pks_fault_mode == PKS_MODE_STRICT);
+
+	WARN_ONCE(1, "Page map protection disabled");
+	pgmap_abandon_protection();
+}
+EXPORT_SYMBOL_GPL(__pgmap_protection_flag_invalid);
+
+bool pgmap_pks_fault_callback(unsigned long address, bool write)
+{
+	/* In strict mode just let the fault handler oops */
+	if (pks_fault_mode == PKS_MODE_STRICT)
+		return false;
+
+	WARN_ONCE(1, "Page map protection disabled");
+	pgmap_abandon_protection();
+	return true;
+}
+EXPORT_SYMBOL_GPL(pgmap_pks_fault_callback);
+
 void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
 {
 	if (!current->pgmap_prot_count++)
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 15/18] kmap: Add stray access protection for devmap pages
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (13 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 16/18] dax: Stray access protection for dax_direct_access() ira.weiny
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Andy Lutomirski, H. Peter Anvin,
	Fenghua Yu, Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Enable PKS protection for devmap pages.  The devmap protection facility
wants to co-opt kmap_{local_page,atomic}() to mediate access to PKS
protected pages.

kmap() allows for global mappings to be established, while the PKS
facility depends on thread-local access. For this reason kmap() is not
supported, but it leaves a policy decision for what to do when kmap()
is attempted on a protected devmap page.

Neither of the 2 current DAX-capable filesystems (ext4 and xfs) perform
such global mappings.  The bulk of device drivers that would handle
devmap pages are not using kmap().  Any future filesystems that gain DAX
support, or device drivers wanting to support devmap protected pages
will need to move to kmap_local_page().  In the meantime, to handle these
kmap() users, call pgmap_protection_flag_invalid() to flag any invalid
use of potentially protected pages.  This allows better debugging of
invalid uses vs catching faults later on when the address is used.
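
For reference, the conversion such users would make looks roughly like
this (names illustrative):

	/* Not supported on protected devmap pages: */
	addr = kmap(page);
	...
	kunmap(page);

	/* Preferred; thread local and relaxes/restores the protection: */
	addr = kmap_local_page(page);
	...
	kunmap_local(addr);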

Direct-map exposure is already mitigated by default on HIGHMEM systems
because by definition HIGHMEM systems do not have large capacities of
memory in the direct map.  Therefore, to reduce complexity HIGHMEM
systems are not supported.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/highmem-internal.h | 5 +++++
 mm/Kconfig                       | 1 +
 2 files changed, 6 insertions(+)

diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
index 7902c7d8b55f..f88bc14a643b 100644
--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -142,6 +142,7 @@ static inline struct page *kmap_to_page(void *addr)
 static inline void *kmap(struct page *page)
 {
 	might_sleep();
+	pgmap_protection_flag_invalid(page);
 	return page_address(page);
 }
 
@@ -157,6 +158,7 @@ static inline void kunmap(struct page *page)
 
 static inline void *kmap_local_page(struct page *page)
 {
+	pgmap_mk_readwrite(page);
 	return page_address(page);
 }
 
@@ -175,12 +177,14 @@ static inline void __kunmap_local(void *addr)
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
 	kunmap_flush_on_unmap(addr);
 #endif
+	pgmap_mk_noaccess(kmap_to_page(addr));
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
 	preempt_disable();
 	pagefault_disable();
+	pgmap_mk_readwrite(page);
 	return page_address(page);
 }
 
@@ -199,6 +203,7 @@ static inline void __kunmap_atomic(void *addr)
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
 	kunmap_flush_on_unmap(addr);
 #endif
+	pgmap_mk_noaccess(kmap_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/mm/Kconfig b/mm/Kconfig
index 201d41269a36..4184d0a7531d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,7 @@ config DEVMAP_ACCESS_PROTECTION
 	bool "Access protection for memremap_pages()"
 	depends on NVDIMM_PFN
 	depends on ARCH_HAS_SUPERVISOR_PKEYS
+	depends on !HIGHMEM
 	select GENERAL_PKS_USER
 	default y
 
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 16/18] dax: Stray access protection for dax_direct_access()
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (14 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 15/18] kmap: Add stray access protection for devmap pages ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 17/18] nvdimm/pmem: Enable stray access protection ira.weiny
  2021-08-04  4:32 ` [PATCH V7 18/18] devdax: " ira.weiny
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

dax_direct_access() provides a way to obtain the direct map address of
PMEM memory.  Coordinate PKS protection with dax_direct_access() of
protected devmap pages.

Introduce 3 new calls: dax_{map_protected,mk_readwrite,mk_noaccess}().
These 3 calls do not have to be implemented by the dax provider if no
protection is implemented.

Single threads of execution can use dax_mk_{readwrite,noaccess}() to
relax the protection of the dax device and allow direct use of the kaddr
returned from dax_direct_access().  dax_mk_{readwrite,noaccess}() must
be used within the dax_read_[un]lock() protected region.  And they only
need to be used to guard actual access to the memory pointed to.  Other
uses of dax_direct_access() do not need to use these guards.
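
A minimal sketch of the expected pattern (mirroring the fs/dax.c hunks
below; buffer names and error handling are elided):

	id = dax_read_lock();
	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, &pfn);
	...
	dax_mk_readwrite(dax_dev);
	memcpy(kaddr, buf, len);	/* direct use of the mapping */
	dax_mk_noaccess(dax_dev);
	dax_read_unlock(id);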

For users who require a permanent address to the dax device, such as the
DM write cache, dax_map_protected() indicates that the dax device has
additional protections.  In this case the user chooses to create its own
mapping of the memory.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7
	Rework cover letter.
	Do not include a FS_DAX_LIMITED restriction for dcss.  It  will
		simply not implement the protection and there is no need
		to special case this.
		Clean up commit message because I did not originally
		understand the nuance of the s390 device.
	Introduce dax_{protected,mk_readwrite,mk_noaccess}()
	From Dan Williams
		Remove old clean up cruft from previous versions
		Remove map_protected
	Remove 'global' parameters all calls
---
 drivers/dax/super.c        | 54 ++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-writecache.c |  8 +++++-
 fs/dax.c                   |  8 ++++++
 fs/fuse/virtio_fs.c        |  2 ++
 include/linux/dax.h        |  8 ++++++
 5 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 44736cbd446e..dc05c89102d0 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -296,6 +296,8 @@ EXPORT_SYMBOL_GPL(dax_attribute_group);
  * @pgoff: offset in pages from the start of the device to translate
  * @nr_pages: number of consecutive pages caller can handle relative to @pfn
  * @kaddr: output parameter that returns a virtual address mapping of pfn
+ *         Direct access through this pointer must be guarded by calls to
+ *         dax_mk_{readwrite,noaccess}()
  * @pfn: output parameter that returns an absolute pfn translation of @pgoff
  *
  * Return: negative errno if an error occurs, otherwise the number of
@@ -389,6 +391,58 @@ void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 #endif
 EXPORT_SYMBOL_GPL(dax_flush);
 
+bool dax_map_protected(struct dax_device *dax_dev)
+{
+	if (!dax_alive(dax_dev))
+		return false;
+
+	if (dax_dev->ops->map_protected)
+		return dax_dev->ops->map_protected(dax_dev);
+	return false;
+}
+EXPORT_SYMBOL_GPL(dax_map_protected);
+
+/**
+ * dax_mk_readwrite() - make protected dax devices read/write
+ * @dax_dev: the dax device representing the memory to access
+ *
+ * Any access of the kaddr memory returned from dax_direct_access() must be
+ * guarded by dax_mk_readwrite() and dax_mk_noaccess().  This ensures that any
+ * dax devices which have additional protections are allowed to relax those
+ * protections for the thread using this memory.
+ *
+ * NOTE these calls must be contained within a single thread of execution and
+ * both must be guarded by dax_read_lock()  Which is also a requirement for
+ * dax_direct_access() anyway.
+ */
+void dax_mk_readwrite(struct dax_device *dax_dev)
+{
+	if (!dax_alive(dax_dev))
+		return;
+
+	if (dax_dev->ops->mk_readwrite)
+		dax_dev->ops->mk_readwrite(dax_dev);
+}
+EXPORT_SYMBOL_GPL(dax_mk_readwrite);
+
+/**
+ * dax_mk_noaccess() - restore protection to dax devices if needed
+ * @dax_dev: the dax device representing the memory to access
+ *
+ * See dax_direct_access() and dax_mk_readwrite()
+ *
+ * NOTE Must be called prior to dax_read_unlock()
+ */
+void dax_mk_noaccess(struct dax_device *dax_dev)
+{
+	if (!dax_alive(dax_dev))
+		return;
+
+	if (dax_dev->ops->mk_noaccess)
+		dax_dev->ops->mk_noaccess(dax_dev);
+}
+EXPORT_SYMBOL_GPL(dax_mk_noaccess);
+
 void dax_write_cache(struct dax_device *dax_dev, bool wc)
 {
 	if (wc)
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index e21e29e81bbf..27671300ad50 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -284,7 +284,13 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 		r = -EOPNOTSUPP;
 		goto err2;
 	}
-	if (da != p) {
+
+	/*
+	 * Force the write cache to map the pages directly if the dax device
+	 * mapping is protected or if the number of pages returned was not what
+	 * was requested.
+	 */
+	if (dax_map_protected(wc->ssd_dev->dax_dev) || da != p) {
 		long i;
 		wc->memory_map = NULL;
 		pages = kvmalloc_array(p, sizeof(struct page *), GFP_KERNEL);
diff --git a/fs/dax.c b/fs/dax.c
index 99b4e78d888f..9dfb93b39754 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -728,7 +728,9 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
 		return rc;
 	}
 	vto = kmap_atomic(to);
+	dax_mk_readwrite(dax_dev);
 	copy_user_page(vto, (void __force *)kaddr, vaddr, to);
+	dax_mk_noaccess(dax_dev);
 	kunmap_atomic(vto);
 	dax_read_unlock(id);
 	return 0;
@@ -1096,8 +1098,10 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
 	}
 
 	if (!page_aligned) {
+		dax_mk_readwrite(iomap->dax_dev);
 		memset(kaddr + offset, 0, size);
 		dax_flush(iomap->dax_dev, kaddr + offset, size);
+		dax_mk_noaccess(iomap->dax_dev);
 	}
 	dax_read_unlock(id);
 	return size;
@@ -1169,6 +1173,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		if (map_len > end - pos)
 			map_len = end - pos;
 
+		dax_mk_readwrite(dax_dev);
+
 		/*
 		 * The userspace address for the memory copy has already been
 		 * validated via access_ok() in either vfs_read() or
@@ -1181,6 +1187,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
 
+		dax_mk_noaccess(dax_dev);
+
 		pos += xfer;
 		length -= xfer;
 		done += xfer;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8f52cdaa8445..3dfb053b1c4d 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -776,8 +776,10 @@ static int virtio_fs_zero_page_range(struct dax_device *dax_dev,
 	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
 	if (rc < 0)
 		return rc;
+	dax_mk_readwrite(dax_dev);
 	memset(kaddr, 0, nr_pages << PAGE_SHIFT);
 	dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
+	dax_mk_noaccess(dax_dev);
 	return 0;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..8ad4839705ca 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,6 +36,10 @@ struct dax_operations {
 			struct iov_iter *);
 	/* zero_page_range: required operation. Zero page range   */
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
+
+	bool (*map_protected)(struct dax_device *dax_dev);
+	void (*mk_readwrite)(struct dax_device *dax_dev);
+	void (*mk_noaccess)(struct dax_device *dax_dev);
 };
 
 extern struct attribute_group dax_attribute_group;
@@ -228,6 +232,10 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 			size_t nr_pages);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
+bool dax_map_protected(struct dax_device *dax_dev);
+void dax_mk_readwrite(struct dax_device *dax_dev);
+void dax_mk_noaccess(struct dax_device *dax_dev);
+
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops);
 vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 17/18] nvdimm/pmem: Enable stray access protection
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (15 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 16/18] dax: Stray access protection for dax_direct_access() ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  2021-08-04  4:32 ` [PATCH V7 18/18] devdax: " ira.weiny
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Now that all potential / valid kernel initiated accesses to PMEM have
been annotated with {__}pgmap_mk_{readwrite,noaccess}(), turn on
PGMAP_PROTECTION.

Implement pmem_map_protected() to communicate that this memory has extra
protection.  Also implement pmem_mk_{readwrite,noaccess}() to relax
those protections for valid users.

Internally, the pmem driver uses a cached virtual address,
pmem->virt_addr (pmem_addr).

Call __pgmap_mk_{readwrite,noaccess}() directly when PGMAP_PROTECTION is
active on the device.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7
	Remove global param
	Add internal structure which uses the pmem device and pgmap
		device directly in the *_mk_*() calls.
	Add pmem dax ops callbacks
	Use pgmap_protection_enabled()
	s/PGMAP_PKEY_PROTECT/PGMAP_PROTECTION
---
 drivers/nvdimm/pmem.c | 55 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 1e0615b8565e..6e924b907264 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -138,6 +138,18 @@ static blk_status_t read_pmem(struct page *page, unsigned int off,
 	return BLK_STS_OK;
 }
 
+static void __pmem_mk_readwrite(struct pmem_device *pmem)
+{
+	if (pmem->pgmap.flags & PGMAP_PROTECTION)
+		__pgmap_mk_readwrite(&pmem->pgmap);
+}
+
+static void __pmem_mk_noaccess(struct pmem_device *pmem)
+{
+	if (pmem->pgmap.flags & PGMAP_PROTECTION)
+		__pgmap_mk_noaccess(&pmem->pgmap);
+}
+
 static blk_status_t pmem_do_read(struct pmem_device *pmem,
 			struct page *page, unsigned int page_off,
 			sector_t sector, unsigned int len)
@@ -149,7 +161,10 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
 	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
 		return BLK_STS_IOERR;
 
+	__pmem_mk_readwrite(pmem);
 	rc = read_pmem(page, page_off, pmem_addr, len);
+	__pmem_mk_noaccess(pmem);
+
 	flush_dcache_page(page);
 	return rc;
 }
@@ -181,11 +196,14 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
 	 * after clear poison.
 	 */
 	flush_dcache_page(page);
+
+	__pmem_mk_readwrite(pmem);
 	write_pmem(pmem_addr, page, page_off, len);
 	if (unlikely(bad_pmem)) {
 		rc = pmem_clear_poison(pmem, pmem_off, len);
 		write_pmem(pmem_addr, page, page_off, len);
 	}
+	__pmem_mk_noaccess(pmem);
 
 	return rc;
 }
@@ -320,6 +338,23 @@ static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 	return _copy_mc_to_iter(addr, bytes, i);
 }
 
+static bool pmem_map_protected(struct dax_device *dax_dev)
+{
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+
+	return (pmem->pgmap.flags & PGMAP_PROTECTION);
+}
+
+static void pmem_mk_readwrite(struct dax_device *dax_dev)
+{
+	__pmem_mk_readwrite(dax_get_private(dax_dev));
+}
+
+static void pmem_mk_noaccess(struct dax_device *dax_dev)
+{
+	__pmem_mk_noaccess(dax_get_private(dax_dev));
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
 	.dax_supported = generic_fsdax_supported,
@@ -328,6 +363,17 @@ static const struct dax_operations pmem_dax_ops = {
 	.zero_page_range = pmem_dax_zero_page_range,
 };
 
+static const struct dax_operations pmem_protected_dax_ops = {
+	.direct_access = pmem_dax_direct_access,
+	.dax_supported = generic_fsdax_supported,
+	.copy_from_iter = pmem_copy_from_iter,
+	.copy_to_iter = pmem_copy_to_iter,
+	.zero_page_range = pmem_dax_zero_page_range,
+	.map_protected = pmem_map_protected,
+	.mk_readwrite = pmem_mk_readwrite,
+	.mk_noaccess = pmem_mk_noaccess,
+};
+
 static const struct attribute_group *pmem_attribute_groups[] = {
 	&dax_attribute_group,
 	NULL,
@@ -432,6 +478,8 @@ static int pmem_attach_disk(struct device *dev,
 	if (is_nd_pfn(dev)) {
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
+		if (pgmap_protection_enabled())
+			pmem->pgmap.flags |= PGMAP_PROTECTION;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -446,6 +494,8 @@ static int pmem_attach_disk(struct device *dev,
 		pmem->pgmap.nr_range = 1;
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
+		if (pgmap_protection_enabled())
+			pmem->pgmap.flags |= PGMAP_PROTECTION;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pmem->pfn_flags |= PFN_MAP;
 		bb_range = pmem->pgmap.range;
@@ -483,7 +533,10 @@ static int pmem_attach_disk(struct device *dev,
 
 	if (is_nvdimm_sync(nd_region))
 		flags = DAXDEV_F_SYNC;
-	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);
+	if (pmem->pgmap.flags & PGMAP_PROTECTION)
+		dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_protected_dax_ops, flags);
+	else
+		dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);
 	if (IS_ERR(dax_dev)) {
 		return PTR_ERR(dax_dev);
 	}
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V7 18/18] devdax: Enable stray access protection
  2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (16 preceding siblings ...)
  2021-08-04  4:32 ` [PATCH V7 17/18] nvdimm/pmem: Enable stray access protection ira.weiny
@ 2021-08-04  4:32 ` ira.weiny
  17 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2021-08-04  4:32 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Ira Weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

From: Ira Weiny <ira.weiny@intel.com>

Device dax is primarily accessed through user space.  Kernel access is
controlled through the kmap interfaces.

Now that all valid kernel-initiated access to dax devices has been
accounted for with pgmap_mk_{readwrite,noaccess}() through kmap, turn
on PGMAP_PKEYS_PROTECT for device dax.
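
To illustrate (sketch only; example_copy_to_devmap_page() is a hypothetical
caller, and the pgmap_mk_*() bracketing comes from the kmap patches earlier
in this series):

/* Hypothetical example, not part of this patch. */
static void example_copy_to_devmap_page(struct page *page, const void *src,
					size_t len)
{
	/*
	 * kmap_local_page() opens the access window for devmap pages via
	 * pgmap_mk_readwrite(); kunmap_local() closes it again via
	 * pgmap_mk_noaccess().  The caller needs no PKS awareness of its
	 * own.
	 */
	void *addr = kmap_local_page(page);

	memcpy(addr, src, len);
	kunmap_local(addr);
}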

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7
	Use pgmap_protection_enabled()
	s/PGMAP_PKEYS_PROTECT/PGMAP_PROTECTION/
---
 drivers/dax/device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index dd8222a42808..cdf6ef4c1edb 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	if (pgmap_protection_enabled())
+		pgmap->flags |= PGMAP_PROTECTION;
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode
  2021-08-04  4:32 ` [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode ira.weiny
@ 2021-08-04  4:57   ` Randy Dunlap
  2021-08-07 19:32     ` Ira Weiny
  2021-08-11 19:01   ` Edgecombe, Rick P
  1 sibling, 1 reply; 42+ messages in thread
From: Randy Dunlap @ 2021-08-04  4:57 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, Dan Williams
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, H. Peter Anvin, Fenghua Yu, Rick Edgecombe, x86,
	linux-kernel, nvdimm, linux-mm

On 8/3/21 9:32 PM, ira.weiny@intel.com wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bdb22006f713..7902fce7f1da 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4081,6 +4081,20 @@
>   	pirq=		[SMP,APIC] Manual mp-table setup
>   			See Documentation/x86/i386/IO-APIC.rst.
>   
> +	memremap.pks_fault_mode=	[X86] Control the behavior of page map
> +			protection violations.  Violations may not be an actual
> +			use of the memory but simply an attempt to map it in an
> +			incompatible way.
> +			(depends on CONFIG_DEVMAP_ACCESS_PROTECTION

Missing closing ')' above.

> +
> +			Format: { relaxed | strict }
> +
> +			relaxed - Print a warning, disable the protection and
> +				  continue execution.
> +			strict - Stop kernel execution via BUG_ON or fault
> +
> +			default: relaxed
> +


-- 
~Randy


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode
  2021-08-04  4:57   ` Randy Dunlap
@ 2021-08-07 19:32     ` Ira Weiny
  0 siblings, 0 replies; 42+ messages in thread
From: Ira Weiny @ 2021-08-07 19:32 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Dave Hansen, Dan Williams, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Andy Lutomirski, H. Peter Anvin,
	Fenghua Yu, Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

On Tue, Aug 03, 2021 at 09:57:31PM -0700, Randy Dunlap wrote:
> On 8/3/21 9:32 PM, ira.weiny@intel.com wrote:
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index bdb22006f713..7902fce7f1da 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4081,6 +4081,20 @@
> >   	pirq=		[SMP,APIC] Manual mp-table setup
> >   			See Documentation/x86/i386/IO-APIC.rst.
> > +	memremap.pks_fault_mode=	[X86] Control the behavior of page map
> > +			protection violations.  Violations may not be an actual
> > +			use of the memory but simply an attempt to map it in an
> > +			incompatible way.
> > +			(depends on CONFIG_DEVMAP_ACCESS_PROTECTION
> 
> Missing closing ')' above.

Fixed.  Thank you!
Ira

> 
> > +
> > +			Format: { relaxed | strict }
> > +
> > +			relaxed - Print a warning, disable the protection and
> > +				  continue execution.
> > +			strict - Stop kernel execution via BUG_ON or fault
> > +
> > +			default: relaxed
> > +
> 
> 
> -- 
> ~Randy
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode
  2021-08-04  4:32 ` [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode ira.weiny
  2021-08-04  4:57   ` Randy Dunlap
@ 2021-08-11 19:01   ` Edgecombe, Rick P
  2021-08-17  3:12     ` Ira Weiny
  1 sibling, 1 reply; 42+ messages in thread
From: Edgecombe, Rick P @ 2021-08-11 19:01 UTC (permalink / raw)
  To: Williams, Dan J, Weiny, Ira, dave.hansen
  Cc: linux-kernel, peterz, nvdimm, tglx, linux-mm, Yu, Fenghua, x86,
	hpa, mingo, Lutomirski, Andy, bp

On Tue, 2021-08-03 at 21:32 -0700, ira.weiny@intel.com wrote:
> +static int param_set_pks_fault_mode(const char *val, const struct
> kernel_param *kp)
> +{
> +       int ret = -EINVAL;
> +
> +       if (!sysfs_streq(val, "relaxed")) {
> +               pks_fault_mode = PKS_MODE_RELAXED;
> +               ret = 0;
> +       } else if (!sysfs_streq(val, "strict")) {
> +               pks_fault_mode = PKS_MODE_STRICT;
> +               ret = 0;
> +       }
> +
> +       return ret;
> +}
> +

Looks like !sysfs_streq() should be just sysfs_streq().

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 12/18] x86/pks: Add PKS fault callbacks
  2021-08-04  4:32 ` [PATCH V7 12/18] x86/pks: Add PKS fault callbacks ira.weiny
@ 2021-08-11 21:18   ` Edgecombe, Rick P
  2021-08-17  3:21     ` Ira Weiny
  0 siblings, 1 reply; 42+ messages in thread
From: Edgecombe, Rick P @ 2021-08-11 21:18 UTC (permalink / raw)
  To: Williams, Dan J, Weiny, Ira, dave.hansen
  Cc: linux-kernel, peterz, nvdimm, tglx, linux-mm, Yu, Fenghua, x86,
	hpa, mingo, Lutomirski, Andy, bp

On Tue, 2021-08-03 at 21:32 -0700, ira.weiny@intel.com wrote:
> +static const pks_key_callback
> pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { 0 };
> +
> +bool handle_pks_key_callback(unsigned long address, bool write, u16
> key)
> +{
> +       if (key > PKS_KEY_NR_CONSUMERS)
> +               return false;
Good idea, should be >= though?

> +
> +       if (pks_key_callbacks[key])
> +               return pks_key_callbacks[key](address, write);
> +
> +       return false;
> +}
> +

Otherwise, I've rebased on this series and didn't hit any problems.
Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode
  2021-08-11 19:01   ` Edgecombe, Rick P
@ 2021-08-17  3:12     ` Ira Weiny
  0 siblings, 0 replies; 42+ messages in thread
From: Ira Weiny @ 2021-08-17  3:12 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Williams, Dan J, dave.hansen, linux-kernel, peterz, nvdimm, tglx,
	linux-mm, Yu, Fenghua, x86, hpa, mingo, Lutomirski, Andy, bp

On Wed, Aug 11, 2021 at 12:01:28PM -0700, Edgecombe, Rick P wrote:
> On Tue, 2021-08-03 at 21:32 -0700, ira.weiny@intel.com wrote:
> > +static int param_set_pks_fault_mode(const char *val, const struct
> > kernel_param *kp)
> > +{
> > +       int ret = -EINVAL;
> > +
> > +       if (!sysfs_streq(val, "relaxed")) {
> > +               pks_fault_mode = PKS_MODE_RELAXED;
> > +               ret = 0;
> > +       } else if (!sysfs_streq(val, "strict")) {
> > +               pks_fault_mode = PKS_MODE_STRICT;
> > +               ret = 0;
> > +       }
> > +
> > +       return ret;
> > +}
> > +
> 
> Looks like !sysfs_streq() should be just sysfs_streq().

Indeed. Fixed.
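
For reference, a corrected sketch (same names as the quoted patch, with the
negations dropped) would read:

static int param_set_pks_fault_mode(const char *val,
				    const struct kernel_param *kp)
{
	if (sysfs_streq(val, "relaxed")) {
		pks_fault_mode = PKS_MODE_RELAXED;
		return 0;
	}
	if (sysfs_streq(val, "strict")) {
		pks_fault_mode = PKS_MODE_STRICT;
		return 0;
	}
	return -EINVAL;
}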

Thanks!
Ira


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 12/18] x86/pks: Add PKS fault callbacks
  2021-08-11 21:18   ` Edgecombe, Rick P
@ 2021-08-17  3:21     ` Ira Weiny
  0 siblings, 0 replies; 42+ messages in thread
From: Ira Weiny @ 2021-08-17  3:21 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Williams, Dan J, dave.hansen, linux-kernel, peterz, nvdimm, tglx,
	linux-mm, Yu, Fenghua, x86, hpa, mingo, Lutomirski, Andy, bp

On Wed, Aug 11, 2021 at 02:18:26PM -0700, Edgecombe, Rick P wrote:
> On Tue, 2021-08-03 at 21:32 -0700, ira.weiny@intel.com wrote:
> > +static const pks_key_callback
> > pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { 0 };
> > +
> > +bool handle_pks_key_callback(unsigned long address, bool write, u16
> > key)
> > +{
> > +       if (key > PKS_KEY_NR_CONSUMERS)
> > +               return false;
> Good idea, should be >= though?

Yep.  Fixed thanks.
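
I.e. (sketch of the fixed check, same names as the patch):

bool handle_pks_key_callback(unsigned long address, bool write, u16 key)
{
	/* key indexes pks_key_callbacks[PKS_KEY_NR_CONSUMERS] */
	if (key >= PKS_KEY_NR_CONSUMERS)
		return false;

	if (pks_key_callbacks[key])
		return pks_key_callbacks[key](address, write);

	return false;
}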

> 
> > +
> > +       if (pks_key_callbacks[key])
> > +               return pks_key_callbacks[key](address, write);
> > +
> > +       return false;
> > +}
> > +
> 
> Otherwise, I've rebased on this series and didn't hit any problems.
> Thanks.

Awesome!  I still want Dave and Dan to weigh in prior to me respining with the
changes so far.

Thanks,
Ira

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions
  2021-08-04  4:32 ` [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
@ 2021-11-13  0:50   ` Ira Weiny
  2021-11-25 11:19     ` Thomas Gleixner
  2021-12-03  1:13     ` Andy Lutomirski
  2021-11-25 14:12   ` Thomas Gleixner
  1 sibling, 2 replies; 42+ messages in thread
From: Ira Weiny @ 2021-11-13  0:50 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams, Andy Lutomirski, H. Peter Anvin
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Fenghua Yu, Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

On Tue, Aug 03, 2021 at 09:32:21PM -0700, 'Ira Weiny' wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The PKRS MSR is not managed by XSAVE.  It is preserved through a context
> switch but this support leaves exception handling code open to memory
> accesses during exceptions.
> 
> 2 possible places for preserving this state were considered,
> irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
> was potentially fraught with unintended consequences.[2]  However, Andy
> came up with a way to hide additional values on the stack which could be
> accessed as "extended_pt_regs".[3]

Andy,

I'm preparing to send V8 of this PKS work.  But I have not seen any feedback
since I originally implemented this in V4[1].

Does this meet your expectations?  Are there any issues you can see with this
code?

I would appreciate your feedback.

Thanks,
Ira

[1] https://lore.kernel.org/lkml/20210322053020.2287058-9-ira.weiny@intel.com/

> This method allows for; any place
> which has struct pt_regs can get access to the extra information; no
> extra information is added to irq_state; and pt_regs is left intact for
> compatibility with outside tools like BPF.
> 
> To simplify, the assembly code only adds space on the stack.  The
> setting or use of any needed values are left to the C code.  While some
> entry points may not use this space it is still added where ever pt_regs
> is passed to the C code for consistency.
> 
> Each nested exception gets another copy of this extended space allowing
> for any number of levels of exception handling.
> 
> In the assembly, a macro is defined to allow a central place to add
> space for other uses should the need arise.
> 
> Finally export pkrs_{save|restore}_irq to the common code to allow
> it to preserve the current task's PKRS in the new extended pt_regs if
> enabled.
> 
> Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
> aided in the development of the patch..
> 
> [1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t
> [3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/
> 
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes for V7:
> 	Rebased to 5.14 entry code
> 	declare write_pkrs() in pks.h
> 	s/INIT_PKRS_VALUE/pkrs_init_value
> 	Remove unnecessary INIT_PKRS_VALUE def
> 	s/pkrs_save_set_irq/pkrs_save_irq/
> 		The initial value for exceptions is best managed
> 		completely within the pkey code.
> ---
>  arch/x86/entry/calling.h               | 26 +++++++++++++
>  arch/x86/entry/common.c                | 54 ++++++++++++++++++++++++++
>  arch/x86/entry/entry_64.S              | 22 ++++++-----
>  arch/x86/entry/entry_64_compat.S       |  6 +--
>  arch/x86/include/asm/pks.h             | 18 +++++++++
>  arch/x86/include/asm/processor-flags.h |  2 +
>  arch/x86/kernel/head_64.S              |  7 ++--
>  arch/x86/mm/fault.c                    |  3 ++
>  include/linux/pkeys.h                  | 11 +++++-
>  kernel/entry/common.c                  | 14 ++++++-
>  10 files changed, 143 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index a4c061fb7c6e..a2f94677c3fd 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -63,6 +63,32 @@ For 32-bit we have the following conventions - kernel is built with
>   * for assembly code:
>   */
>  
> +/*
> + * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
> + * @cfunc:		C function to be called
> + *
> + * This will ensure that extended_ptregs is added and removed as needed during
> + * a call into C code.
> + */
> +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* add space for extended_pt_regs */
> +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
> +#endif
> +	.if \annotate_retpoline_safe == 1
> +		ANNOTATE_RETPOLINE_SAFE
> +	.endif
> +	call	\cfunc
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* remove space for extended_pt_regs */
> +	addq    $EXTENDED_PT_REGS_SIZE, %rsp
> +#endif
> +.endm
> +
> +.macro call_ext_ptregs cfunc
> +	__call_ext_ptregs \cfunc, annotate_retpoline_safe=0
> +.endm
> +
>  .macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
>  	.if \save_ret
>  	pushq	%rsi		/* pt_regs->si */
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 6c2826417b33..a0d1d5519dba 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -19,6 +19,7 @@
>  #include <linux/nospec.h>
>  #include <linux/syscalls.h>
>  #include <linux/uaccess.h>
> +#include <linux/pkeys.h>
>  
>  #ifdef CONFIG_XEN_PV
>  #include <xen/xen-ops.h>
> @@ -34,6 +35,7 @@
>  #include <asm/io_bitmap.h>
>  #include <asm/syscall.h>
>  #include <asm/irq_stack.h>
> +#include <asm/pks.h>
>  
>  #ifdef CONFIG_X86_64
>  
> @@ -252,6 +254,56 @@ SYSCALL_DEFINE0(ni_syscall)
>  	return -ENOSYS;
>  }
>  
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +
> +void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code)
> +{
> +	struct extended_pt_regs *ept_regs = extended_pt_regs(regs);
> +
> +	if (cpu_feature_enabled(X86_FEATURE_PKS) && (error_code & X86_PF_PK))
> +		pr_alert("PKRS: 0x%x\n", ept_regs->thread_pkrs);
> +}
> +
> +/*
> + * PKRS is a per-logical-processor MSR which overlays additional protection for
> + * pages which have been mapped with a protection key.
> + *
> + * Context switches save the MSR in the task struct thus taking that value to
> + * other processors if necessary.
> + *
> + * To protect against exceptions having access to this memory save the current
> + * thread value and set the PKRS value to be used during the exception.
> + */
> +void pkrs_save_irq(struct pt_regs *regs)
> +{
> +	struct extended_pt_regs *ept_regs;
> +
> +	BUILD_BUG_ON(sizeof(struct extended_pt_regs)
> +			!= EXTENDED_PT_REGS_SIZE
> +				+ sizeof(struct pt_regs));
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;
> +
> +	ept_regs = extended_pt_regs(regs);
> +	ept_regs->thread_pkrs = current->thread.saved_pkrs;
> +	write_pkrs(pkrs_init_value);
> +}
> +
> +void pkrs_restore_irq(struct pt_regs *regs)
> +{
> +	struct extended_pt_regs *ept_regs;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;
> +
> +	ept_regs = extended_pt_regs(regs);
> +	write_pkrs(ept_regs->thread_pkrs);
> +	current->thread.saved_pkrs = ept_regs->thread_pkrs;
> +}
> +
> +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
> +
>  #ifdef CONFIG_XEN_PV
>  #ifndef CONFIG_PREEMPTION
>  /*
> @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  
>  	inhcall = get_and_clear_inhcall();
>  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> +		/* Normally called by irqentry_exit, restore pkrs here */
> +		pkrs_restore_irq(regs);
>  		irqentry_exit_cond_resched();
>  		instrumentation_end();
>  		restore_inhcall(inhcall);
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index e38a4cf795d9..1c390975a3de 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -332,7 +332,7 @@ SYM_CODE_END(ret_from_fork)
>  		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
>  	.endif
>  
> -	call	\cfunc
> +	call_ext_ptregs \cfunc
>  
>  	jmp	error_return
>  .endm
> @@ -435,7 +435,7 @@ SYM_CODE_START(\asmsym)
>  
>  	movq	%rsp, %rdi		/* pt_regs pointer */
>  
> -	call	\cfunc
> +	call_ext_ptregs \cfunc
>  
>  	jmp	paranoid_exit
>  
> @@ -496,7 +496,7 @@ SYM_CODE_START(\asmsym)
>  	 * stack.
>  	 */
>  	movq	%rsp, %rdi		/* pt_regs pointer */
> -	call	vc_switch_off_ist
> +	call_ext_ptregs vc_switch_off_ist
>  	movq	%rax, %rsp		/* Switch to new stack */
>  
>  	UNWIND_HINT_REGS
> @@ -507,7 +507,7 @@ SYM_CODE_START(\asmsym)
>  
>  	movq	%rsp, %rdi		/* pt_regs pointer */
>  
> -	call	kernel_\cfunc
> +	call_ext_ptregs kernel_\cfunc
>  
>  	/*
>  	 * No need to switch back to the IST stack. The current stack is either
> @@ -542,7 +542,7 @@ SYM_CODE_START(\asmsym)
>  	movq	%rsp, %rdi		/* pt_regs pointer into first argument */
>  	movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
>  	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
> -	call	\cfunc
> +	call_ext_ptregs \cfunc
>  
>  	jmp	paranoid_exit
>  
> @@ -781,7 +781,7 @@ SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
>  	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
>  	UNWIND_HINT_REGS
>  
> -	call	xen_pv_evtchn_do_upcall
> +	call_ext_ptregs xen_pv_evtchn_do_upcall
>  
>  	jmp	error_return
>  SYM_CODE_END(exc_xen_hypervisor_callback)
> @@ -987,7 +987,7 @@ SYM_CODE_START_LOCAL(error_entry)
>  	/* Put us onto the real thread stack. */
>  	popq	%r12				/* save return addr in %12 */
>  	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
> -	call	sync_regs
> +	call_ext_ptregs sync_regs
>  	movq	%rax, %rsp			/* switch stack */
>  	ENCODE_FRAME_POINTER
>  	pushq	%r12
> @@ -1042,7 +1042,7 @@ SYM_CODE_START_LOCAL(error_entry)
>  	 * as if we faulted immediately after IRET.
>  	 */
>  	mov	%rsp, %rdi
> -	call	fixup_bad_iret
> +	call_ext_ptregs fixup_bad_iret
>  	mov	%rax, %rsp
>  	jmp	.Lerror_entry_from_usermode_after_swapgs
>  SYM_CODE_END(error_entry)
> @@ -1148,7 +1148,7 @@ SYM_CODE_START(asm_exc_nmi)
>  
>  	movq	%rsp, %rdi
>  	movq	$-1, %rsi
> -	call	exc_nmi
> +	call_ext_ptregs exc_nmi
>  
>  	/*
>  	 * Return back to user mode.  We must *not* do the normal exit
> @@ -1184,6 +1184,8 @@ SYM_CODE_START(asm_exc_nmi)
>  	 * +---------------------------------------------------------+
>  	 * | pt_regs                                                 |
>  	 * +---------------------------------------------------------+
> +	 * | (Optionally) extended_pt_regs                           |
> +	 * +---------------------------------------------------------+
>  	 *
>  	 * The "original" frame is used by hardware.  Before re-enabling
>  	 * NMIs, we need to be done with it, and we need to leave enough
> @@ -1360,7 +1362,7 @@ end_repeat_nmi:
>  
>  	movq	%rsp, %rdi
>  	movq	$-1, %rsi
> -	call	exc_nmi
> +	call_ext_ptregs exc_nmi
>  
>  	/* Always restore stashed CR3 value (see paranoid_entry) */
>  	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
> diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
> index 0051cf5c792d..53254d29d5c7 100644
> --- a/arch/x86/entry/entry_64_compat.S
> +++ b/arch/x86/entry/entry_64_compat.S
> @@ -136,7 +136,7 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
>  .Lsysenter_flags_fixed:
>  
>  	movq	%rsp, %rdi
> -	call	do_SYSENTER_32
> +	call_ext_ptregs do_SYSENTER_32
>  	/* XEN PV guests always use IRET path */
>  	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
>  		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
> @@ -253,7 +253,7 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
>  	UNWIND_HINT_REGS
>  
>  	movq	%rsp, %rdi
> -	call	do_fast_syscall_32
> +	call_ext_ptregs do_fast_syscall_32
>  	/* XEN PV guests always use IRET path */
>  	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
>  		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
> @@ -410,6 +410,6 @@ SYM_CODE_START(entry_INT80_compat)
>  	cld
>  
>  	movq	%rsp, %rdi
> -	call	do_int80_syscall_32
> +	call_ext_ptregs do_int80_syscall_32
>  	jmp	swapgs_restore_regs_and_return_to_usermode
>  SYM_CODE_END(entry_INT80_compat)
> diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
> index e7727086cec2..76960ec71b4b 100644
> --- a/arch/x86/include/asm/pks.h
> +++ b/arch/x86/include/asm/pks.h
> @@ -4,15 +4,33 @@
>  
>  #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>  
> +struct extended_pt_regs {
> +	u32 thread_pkrs;
> +	/* Keep stack 8 byte aligned */
> +	u32 pad;
> +	struct pt_regs pt_regs;
> +};
> +
>  void setup_pks(void);
>  void pkrs_write_current(void);
>  void pks_init_task(struct task_struct *task);
> +void write_pkrs(u32 new_pkrs);
> +
> +static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs)
> +{
> +	return container_of(regs, struct extended_pt_regs, pt_regs);
> +}
> +
> +void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code);
>  
>  #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
>  
>  static inline void setup_pks(void) { }
>  static inline void pkrs_write_current(void) { }
>  static inline void pks_init_task(struct task_struct *task) { }
> +static inline void write_pkrs(u32 new_pkrs) { }
> +static inline void show_extended_regs_oops(struct pt_regs *regs,
> +					   unsigned long error_code) { }
>  
>  #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
>  
> diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
> index 02c2cbda4a74..4a41fc4cf028 100644
> --- a/arch/x86/include/asm/processor-flags.h
> +++ b/arch/x86/include/asm/processor-flags.h
> @@ -53,4 +53,6 @@
>  # define X86_CR3_PTI_PCID_USER_BIT	11
>  #endif
>  
> +#define EXTENDED_PT_REGS_SIZE 8
> +
>  #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index d8b3ebd2bb85..90e76178b6b4 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -319,8 +319,7 @@ SYM_CODE_START_NOALIGN(vc_boot_ghcb)
>  	movq    %rsp, %rdi
>  	movq	ORIG_RAX(%rsp), %rsi
>  	movq	initial_vc_handler(%rip), %rax
> -	ANNOTATE_RETPOLINE_SAFE
> -	call	*%rax
> +	__call_ext_ptregs *%rax, annotate_retpoline_safe=1
>  
>  	/* Unwind pt_regs */
>  	POP_REGS
> @@ -397,7 +396,7 @@ SYM_CODE_START_LOCAL(early_idt_handler_common)
>  	UNWIND_HINT_REGS
>  
>  	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
> -	call do_early_exception
> +	call_ext_ptregs do_early_exception
>  
>  	decl early_recursion_flag(%rip)
>  	jmp restore_regs_and_return_to_kernel
> @@ -421,7 +420,7 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
>  	/* Call C handler */
>  	movq    %rsp, %rdi
>  	movq	ORIG_RAX(%rsp), %rsi
> -	call    do_vc_no_ghcb
> +	call_ext_ptregs do_vc_no_ghcb
>  
>  	/* Unwind pt_regs */
>  	POP_REGS
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index e133c0ed72a0..a4ce7cef0260 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -32,6 +32,7 @@
>  #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
>  #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
>  #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
> +#include <asm/pks.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <asm/trace/exceptions.h>
> @@ -547,6 +548,8 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
>  		 (error_code & X86_PF_PK)    ? "protection keys violation" :
>  					       "permissions violation");
>  
> +	show_extended_regs_oops(regs, error_code);
> +
>  	if (!(error_code & X86_PF_USER) && user_mode(regs)) {
>  		struct desc_ptr idt, gdt;
>  		u16 ldtr, tr;
> diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> index 580238388f0c..76eb19a37942 100644
> --- a/include/linux/pkeys.h
> +++ b/include/linux/pkeys.h
> @@ -52,6 +52,15 @@ enum pks_pkey_consumers {
>  	PKS_KEY_NR_CONSUMERS
>  };
>  extern u32 pkrs_init_value;
> -#endif
> +
> +void pkrs_save_irq(struct pt_regs *regs);
> +void pkrs_restore_irq(struct pt_regs *regs);
> +
> +#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
> +
> +static inline void pkrs_save_irq(struct pt_regs *regs) { }
> +static inline void pkrs_restore_irq(struct pt_regs *regs) { }
> +
> +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
>  
>  #endif /* _LINUX_PKEYS_H */
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..aa0b1e8dd742 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/audit.h>
>  #include <linux/tick.h>
> +#include <linux/pkeys.h>
>  
>  #include "common.h"
>  
> @@ -364,7 +365,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  		instrumentation_end();
>  
>  		ret.exit_rcu = true;
> -		return ret;
> +		goto done;
>  	}
>  
>  	/*
> @@ -379,6 +380,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  	trace_hardirqs_off_finish();
>  	instrumentation_end();
>  
> +done:
> +	pkrs_save_irq(regs);
>  	return ret;
>  }
>  
> @@ -404,7 +407,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  	/* Check whether this returns to user mode */
>  	if (user_mode(regs)) {
>  		irqentry_exit_to_user_mode(regs);
> -	} else if (!regs_irqs_disabled(regs)) {
> +		return;
> +	}
> +
> +	pkrs_restore_irq(regs);
> +
> +	if (!regs_irqs_disabled(regs)) {
>  		/*
>  		 * If RCU was not watching on entry this needs to be done
>  		 * carefully and needs the same ordering of lockdep/tracing
> @@ -458,11 +466,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
>  	ftrace_nmi_enter();
>  	instrumentation_end();
>  
> +	pkrs_save_irq(regs);
>  	return irq_state;
>  }
>  
>  void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
>  {
> +	pkrs_restore_irq(regs);
>  	instrumentation_begin();
>  	ftrace_nmi_exit();
>  	if (irq_state.lockdep) {
> -- 
> 2.28.0.rc0.12.gb6a658bd00c9
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions
  2021-11-13  0:50   ` Ira Weiny
@ 2021-11-25 11:19     ` Thomas Gleixner
  2021-12-03  1:13     ` Andy Lutomirski
  1 sibling, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-25 11:19 UTC (permalink / raw)
  To: Ira Weiny, Dave Hansen, Dan Williams, Andy Lutomirski, H. Peter Anvin
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

On Fri, Nov 12 2021 at 16:50, Ira Weiny wrote:
> On Tue, Aug 03, 2021 at 09:32:21PM -0700, 'Ira Weiny' wrote:
>> From: Ira Weiny <ira.weiny@intel.com>
>> 
>> The PKRS MSR is not managed by XSAVE.  It is preserved through a context
>> switch but this support leaves exception handling code open to memory
>> accesses during exceptions.
>> 
>> 2 possible places for preserving this state were considered,
>> irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
>> was potentially fraught with unintended consequences.[2]  However, Andy
>> came up with a way to hide additional values on the stack which could be
>> accessed as "extended_pt_regs".[3]
>
> Andy,
>
> I'm preparing to send V8 of this PKS work.  But I have not seen any feed back
> since I originally implemented this in V4[1].
>
> Does this meets your expectations?  Are there any issues you can see with this
> code?
>
> I would appreciate your feedback.

Not Andy here, but I'll respond to the patch in a minute.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions
  2021-08-04  4:32 ` [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
  2021-11-13  0:50   ` Ira Weiny
@ 2021-11-25 14:12   ` Thomas Gleixner
  2021-12-07  1:54     ` Ira Weiny
  1 sibling, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-25 14:12 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, Dan Williams
  Cc: Ira Weiny, Peter Zijlstra, Andy Lutomirski, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Fenghua Yu, Rick Edgecombe, x86,
	linux-kernel, nvdimm, linux-mm

Ira,

On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> +/*
> + * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
> + * @cfunc:		C function to be called
> + *
> + * This will ensure that extended_ptregs is added and removed as needed during
> + * a call into C code.
> + */
> +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* add space for extended_pt_regs */
> +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
> +#endif
> +	.if \annotate_retpoline_safe == 1
> +		ANNOTATE_RETPOLINE_SAFE
> +	.endif

This annotation is new and nowhere mentioned why it is part of this
patch.

Can you please do _ONE_ functional change per patch and not an
unreviewable pile of changes in one go? Same applies for the ASM and the
C code changes. The ASM change has to go first and then the C code can
build upon it.

> +	call	\cfunc
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* remove space for extended_pt_regs */
> +	addq    $EXTENDED_PT_REGS_SIZE, %rsp
> +#endif

I really have to ask the question whether this #ifdeffery has any value
at all. 8 bytes extra stack usage is not going to be the end of the
world and distro kernels will enable that config anyway.

If we really want to save the space then certainly not by sprinkling
CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS all over the place and hiding the
extra sized ptregs in the pkrs header.

You are changing generic architecture code so you better think about
making such a change generic and extensible. Can folks please start
thinking beyond the brim of their teacup and not pretend that the
feature they are working on is the unicorn which requires unique magic
brandnamed after the unicorn of the day.

If the next feature comes around which needs to save something in that
extended area then we are going to change the world again, right?
Certainly not.

This wants to go into asm/ptrace.h:

struct pt_regs_aux {
	u32	pkrs;
};

struct pt_regs_extended {
	struct pt_regs_aux	aux;
        struct pt_regs		regs __attribute__((aligned(8)));
};

and then have in asm-offset:

   DEFINE(PT_REGS_AUX_SIZE, sizeof(struct pt_regs_extended) - sizeof(struct pt_regs));

which does the right thing whatever the size of pt_regs_aux is. So for
the above it will have:

 #define PT_REGS_AUX_SIZE 8 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */

Even, if you do

struct pt_regs_aux {
#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
	u32	pkrs;
#endif        
};

and the config switch is disabled. It's still correct:

 #define PT_REGS_AUX_SIZE 0 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */

See? No magic hardcoded constant, no build time error checking for that
constant. Nothing, it just works.

That's one part, but let me come back to this:

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* add space for extended_pt_regs */
> +	subq    $EXTENDED_PT_REGS_SIZE, %rsp

What guarantees that RSP points to pt_regs at this point?  Nothing at
all. It's just pure luck and a question of time until this explodes in
hard to diagnose ways.

Because between

        movq	%rsp, %rdi
and
        call    ....

can legitimately be other code which causes the stack pointer to
change. It's not the case today, but nothing prevents this in the
future.

The correct thing to do is:

        movq	%rsp, %rdi
        RSP_MAKE_PT_REGS_AUX_SPACE
        call	...
        RSP_REMOVE_PT_REGS_AUX_SPACE

The few extra macro lines in the actual code are way better as they make
it completely obvious what's going on and any misuse can be spotted
easily.

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +/*
> + * PKRS is a per-logical-processor MSR which overlays additional protection for
> + * pages which have been mapped with a protection key.
> + *
> + * Context switches save the MSR in the task struct thus taking that value to
> + * other processors if necessary.
> + *
> + * To protect against exceptions having access to this memory save the current
> + * thread value and set the PKRS value to be used during the exception.
> + */
> +void pkrs_save_irq(struct pt_regs *regs)

That's a misnomer as this is invoked for _any_ exception not just
interrupts.

>  #ifdef CONFIG_XEN_PV
>  #ifndef CONFIG_PREEMPTION
>  /*
> @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  
>  	inhcall = get_and_clear_inhcall();
>  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> +		/* Normally called by irqentry_exit, restore pkrs here */
> +		pkrs_restore_irq(regs);
> 		irqentry_exit_cond_resched();

Sigh. Consistency is overrated....

> +
>  void setup_pks(void);
>  void pkrs_write_current(void);
>  void pks_init_task(struct task_struct *task);
> +void write_pkrs(u32 new_pkrs);

So we have pkrs_write_current() and write_pkrs() now. Can you please
stick to a common prefix, i.e. pkrs_ ?

> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..aa0b1e8dd742 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/audit.h>
>  #include <linux/tick.h>
> +#include <linux/pkeys.h>
>  
>  #include "common.h"
>  
> @@ -364,7 +365,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  		instrumentation_end();
>  
>  		ret.exit_rcu = true;
> -		return ret;
> +		goto done;
>  	}
>  
>  	/*
> @@ -379,6 +380,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  	trace_hardirqs_off_finish();
>  	instrumentation_end();
>  
> +done:
> +	pkrs_save_irq(regs);

This still calls out into instrumentable code. I explained to you before
why this is wrong. Also objtool emits warnings to that effect if you do a
proper verified build.

>  	return ret;
>  }
>  
> @@ -404,7 +407,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  	/* Check whether this returns to user mode */
>  	if (user_mode(regs)) {
>  		irqentry_exit_to_user_mode(regs);
> -	} else if (!regs_irqs_disabled(regs)) {
> +		return;
> +	}
> +
> +	pkrs_restore_irq(regs);

At least you are now putting it consistently at the wrong place
vs. noinstr.

Though, if you look at the xen_pv_evtchn_do_upcall() part where you
added this extra invocation you might figure out that adding
pkrs_restore_irq() to irqentry_exit_cond_resched() and explicitly to
the 'else' path in irqentry_exit() makes it magically consistent for
both use cases.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 02/18] x86/fpu: Refactor arch_set_user_pkey_access()
  2021-08-04  4:32 ` [PATCH V7 02/18] x86/fpu: Refactor arch_set_user_pkey_access() ira.weiny
@ 2021-11-25 14:23   ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-25 14:23 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, Dan Williams
  Cc: Ira Weiny, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, H. Peter Anvin, Fenghua Yu, Rick Edgecombe, x86,
	linux-kernel, nvdimm, linux-mm

On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> +/*
> + * Replace disable bits for @pkey with values from @flags
> + *
> + * Kernel users use the same flags as user space:
> + *     PKEY_DISABLE_ACCESS
> + *     PKEY_DISABLE_WRITE
> + */
> +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)

pkey_.... please.

> +{
> +	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> +
> +	/*  Mask out old bit values */
> +	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> +
> +	/*  Or in new values */
> +	if (flags & PKEY_DISABLE_ACCESS)
> +		pk_reg |= PKR_AD_BIT << pkey_shift;
> +	if (flags & PKEY_DISABLE_WRITE)
> +		pk_reg |= PKR_WD_BIT << pkey_shift;
> +
> +	return pk_reg;

Also this code is silly.

#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)

u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
{
	int shift = pkey * PKR_BITS_PER_PKEY;

        if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
        	accessbits &= PKEY_ACCESS_MASK;

        pkval &= ~(PKEY_ACCESS_MASK << shift);
	return pkval | (accessbits << shift);
}

See? It does not even need comments because it's self explaining and
uses sensible argument names matching the function name.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros
  2021-08-04  4:32 ` [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros ira.weiny
@ 2021-11-25 14:25   ` Thomas Gleixner
  2021-11-25 16:58     ` Thomas Gleixner
  2021-12-08  0:51     ` Ira Weiny
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-25 14:25 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, Dan Williams
  Cc: Ira Weiny, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, H. Peter Anvin, Fenghua Yu, Rick Edgecombe, x86,
	linux-kernel, nvdimm, linux-mm

On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> @@ -200,16 +200,14 @@ __setup("init_pkru=", setup_init_pkru);
>   */
>  u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>  {
> -	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> -
>  	/*  Mask out old bit values */
> -	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> +	pk_reg &= ~PKR_PKEY_MASK(pkey);
>  
>  	/*  Or in new values */
>  	if (flags & PKEY_DISABLE_ACCESS)
> -		pk_reg |= PKR_AD_BIT << pkey_shift;
> +		pk_reg |= PKR_AD_KEY(pkey);
>  	if (flags & PKEY_DISABLE_WRITE)
> -		pk_reg |= PKR_WD_BIT << pkey_shift;
> +		pk_reg |= PKR_WD_KEY(pkey);

I'm not seeing how this is improving that code. Quite the contrary.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 05/18] x86/pks: Add PKS setup code
  2021-08-04  4:32 ` [PATCH V7 05/18] x86/pks: Add PKS setup code ira.weiny
@ 2021-11-25 15:15   ` Thomas Gleixner
  2021-11-26  3:11     ` taoyi.ty
  2021-11-26 11:03     ` Thomas Gleixner
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-25 15:15 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, Dan Williams
  Cc: Ira Weiny, Peter Zijlstra, Fenghua Yu, Hansen, Dave, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, H. Peter Anvin, Rick Edgecombe,
	x86, linux-kernel, nvdimm, linux-mm

On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +
> +void setup_pks(void);

pks_setup()

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +
> +static DEFINE_PER_CPU(u32, pkrs_cache);
> +u32 __read_mostly pkrs_init_value;
> +
> +/*
> + * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can
> + * be checked quickly.
> + *
> + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> + * serializing but still maintains ordering properties similar to WRPKRU.
> + * The current SDM section on PKRS needs updating but should be the same as
> + * that of WRPKRU.  So to quote from the WRPKRU text:
> + *
> + *     WRPKRU will never execute transiently. Memory accesses
> + *     affected by PKRU register will not execute (even transiently)
> + *     until all prior executions of WRPKRU have completed execution
> + *     and updated the PKRU register.
> + */
> +void write_pkrs(u32 new_pkrs)

pkrs_write()

> +{
> +	u32 *pkrs;
> +
> +	if (!static_cpu_has(X86_FEATURE_PKS))
> +		return;

  cpu_feature_enabled() if at all. Why is this function even invoked
  when PKS is off?

> +
> +	pkrs = get_cpu_ptr(&pkrs_cache);

As far as I've seen this is mostly called from non-preemptible
regions. So that get/put pair is pointless. Stick a lockdep assert into
the code and disable preemption at the maybe one callsite which needs
it.
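
A sketch of that shape (pkrs_cache as in the quoted patch, the pkrs_write()
name as suggested above, lockdep_assert_preemption_disabled() assumed
available):

void pkrs_write(u32 new_pkrs)
{
	u32 *pkrs;

	/* Callers are expected to have preemption disabled already */
	lockdep_assert_preemption_disabled();

	pkrs = this_cpu_ptr(&pkrs_cache);
	if (*pkrs != new_pkrs) {
		*pkrs = new_pkrs;
		wrmsrl(MSR_IA32_PKRS, new_pkrs);
	}
}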

> +	if (*pkrs != new_pkrs) {
> +		*pkrs = new_pkrs;
> +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> +	}
> +	put_cpu_ptr(pkrs);
> +}
> +
> +/*
> + * Build a default PKRS value from the array specified by consumers
> + */
> +static int __init create_initial_pkrs_value(void)
> +{
> +	/* All users get Access Disabled unless changed below */
> +	u8 consumer_defaults[PKS_NUM_PKEYS] = {
> +		[0 ... PKS_NUM_PKEYS-1] = PKR_AD_BIT
> +	};
> +	int i;
> +
> +	consumer_defaults[PKS_KEY_DEFAULT] = PKR_RW_BIT;
> +
> +	/* Ensure the number of consumers is less than the number of keys */
> +	BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS);
> +
> +	pkrs_init_value = 0;

This needs to be cleared because the BSS might be non zero?

> +	/* Fill the defaults for the consumers */
> +	for (i = 0; i < PKS_NUM_PKEYS; i++)
> +		pkrs_init_value |= PKR_VALUE(i, consumer_defaults[i]);

Also PKR_RW_BIT is a horrible invention:

> +#define PKR_RW_BIT 0x0
>  #define PKR_AD_BIT 0x1
>  #define PKR_WD_BIT 0x2
>  #define PKR_BITS_PER_PKEY 2

This makes my brain spin. How do you fit 3 bits into 2 bits per key?
That's really non-intuitive.

PKR_RW_ENABLE		0x0
PKR_ACCESS_DISABLE	0x1
PKR_WRITE_DISABLE	0x2

makes it obvious what this is about, no?

Aside of that, the function which sets up the init value is really
bogus. As you explained in the cover letter a kernel user has to:

   1) Claim an index in the enum
   2) Add a default value to the array in that function

Seriously? How is that any better than doing:

#define PKS_KEY0_DEFAULT	PKR_RW_ENABLE
#define PKS_KEY1_DEFAULT	PKR_ACCESS_DISABLE
...
#define PKS_KEY15_DEFAULT	PKR_ACCESS_DISABLE

and let the compiler construct pkrs_init_value?

TBH, it's not and this function has to be ripped out in case that you
need a dynamic allocation of keys some day. So what is this buying us
aside of horrible to read and utterly pointless code?
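
(For illustration only: assuming the PKS_KEY*_DEFAULT defines above and the
PKR_VALUE() helper already in the series, the compile-time construction could
be as simple as

#define PKS_INIT_VALUE	(PKR_VALUE(0,  PKS_KEY0_DEFAULT)  | \
			 PKR_VALUE(1,  PKS_KEY1_DEFAULT)  | \
			 /* ... keys 2-14 ... */	    \
			 PKR_VALUE(15, PKS_KEY15_DEFAULT))

with no early_initcall() and no runtime loop, and a missing default becomes a
build error instead of a silently wrong MSR value.)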

> +	return 0;
> +}
> +early_initcall(create_initial_pkrs_value);
> +
> +/*
> + * PKS is independent of PKU and either or both may be supported on a CPU.
> + * Configure PKS if the CPU supports the feature.
> + */
> +void setup_pks(void)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;
> +
> +	write_pkrs(pkrs_init_value);

Is the init value set up _before_ this function is invoked for the first
time?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 07/18] x86/pks: Preserve the PKRS MSR on context switch
  2021-08-04  4:32 ` [PATCH V7 07/18] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
@ 2021-11-25 15:25   ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-25 15:25 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Rick Edgecombe,
	x86, linux-kernel, nvdimm, linux-mm

On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> @@ -658,6 +659,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>  	/* Load the Intel cache allocation PQR MSR. */
>  	resctrl_sched_in();
>  
> +	pkrs_write_current();

This is invoked from switch_to() and does this extra get/put_cpu_ptr()
dance in the write function for no reason. Can you please stop sticking
overhead into the hotpath just because?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros
  2021-11-25 14:25   ` Thomas Gleixner
@ 2021-11-25 16:58     ` Thomas Gleixner
  2021-12-08  0:51     ` Ira Weiny
  1 sibling, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-25 16:58 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, Dan Williams
  Cc: Ira Weiny, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, H. Peter Anvin, Fenghua Yu, Rick Edgecombe, x86,
	linux-kernel, nvdimm, linux-mm

On Thu, Nov 25 2021 at 15:25, Thomas Gleixner wrote:
> On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
>> @@ -200,16 +200,14 @@ __setup("init_pkru=", setup_init_pkru);
>>   */
>>  u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>>  {
>> -	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
>> -
>>  	/*  Mask out old bit values */
>> -	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
>> +	pk_reg &= ~PKR_PKEY_MASK(pkey);
>>  
>>  	/*  Or in new values */
>>  	if (flags & PKEY_DISABLE_ACCESS)
>> -		pk_reg |= PKR_AD_BIT << pkey_shift;
>> +		pk_reg |= PKR_AD_KEY(pkey);
>>  	if (flags & PKEY_DISABLE_WRITE)
>> -		pk_reg |= PKR_WD_BIT << pkey_shift;
>> +		pk_reg |= PKR_WD_KEY(pkey);
>
> I'm not seeing how this is improving that code. Quite the contrary.

Aside of that why are you ordering it the wrong way around, i.e.

   1) implement turd
   2) polish turd

instead of implementing the required helpers first if they are really
providing value.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 05/18] x86/pks: Add PKS setup code
  2021-11-25 15:15   ` Thomas Gleixner
@ 2021-11-26  3:11     ` taoyi.ty
  2021-11-26  9:57       ` Thomas Gleixner
  2021-11-26 11:03     ` Thomas Gleixner
  1 sibling, 1 reply; 42+ messages in thread
From: taoyi.ty @ 2021-11-26  3:11 UTC (permalink / raw)
  To: Thomas Gleixner, ira.weiny, Dave Hansen, Dan Williams
  Cc: Peter Zijlstra, Fenghua Yu, Hansen, Dave, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, H. Peter Anvin, Rick Edgecombe,
	x86, linux-kernel, nvdimm, linux-mm

On 11/25/21 11:15 PM, Thomas Gleixner wrote:
> On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
>> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> +
>> +void setup_pks(void);
> 
> pks_setup()
> 
>> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> +
>> +static DEFINE_PER_CPU(u32, pkrs_cache);
>> +u32 __read_mostly pkrs_init_value;
>> +
>> +/*
>> + * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can
>> + * be checked quickly.
>> + *
>> + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
>> + * serializing but still maintains ordering properties similar to WRPKRU.
>> + * The current SDM section on PKRS needs updating but should be the same as
>> + * that of WRPKRU.  So to quote from the WRPKRU text:
>> + *
>> + *     WRPKRU will never execute transiently. Memory accesses
>> + *     affected by PKRU register will not execute (even transiently)
>> + *     until all prior executions of WRPKRU have completed execution
>> + *     and updated the PKRU register.
>> + */
>> +void write_pkrs(u32 new_pkrs)
> 
> pkrs_write()
> 
>> +{
>> +	u32 *pkrs;
>> +
>> +	if (!static_cpu_has(X86_FEATURE_PKS))
>> +		return;
> 
>    cpu_feature_enabled() if at all. Why is this function even invoked
>    when PKS is off?
> 
>> +
>> +	pkrs = get_cpu_ptr(&pkrs_cache);
> 
> As far as I've seen this is mostly called from non-preemptible
> regions. So that get/put pair is pointless. Stick a lockdep assert into
> the code and disable preemption at the maybe one callsite which needs
> it.
> 
>> +	if (*pkrs != new_pkrs) {
>> +		*pkrs = new_pkrs;
>> +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
>> +	}
>> +	put_cpu_ptr(pkrs);
>> +}
>> +
>> +/*
>> + * Build a default PKRS value from the array specified by consumers
>> + */
>> +static int __init create_initial_pkrs_value(void)
>> +{
>> +	/* All users get Access Disabled unless changed below */
>> +	u8 consumer_defaults[PKS_NUM_PKEYS] = {
>> +		[0 ... PKS_NUM_PKEYS-1] = PKR_AD_BIT
>> +	};
>> +	int i;
>> +
>> +	consumer_defaults[PKS_KEY_DEFAULT] = PKR_RW_BIT;
>> +
>> +	/* Ensure the number of consumers is less than the number of keys */
>> +	BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS);
>> +
>> +	pkrs_init_value = 0;
> 
> This needs to be cleared because the BSS might be non zero?
> 
>> +	/* Fill the defaults for the consumers */
>> +	for (i = 0; i < PKS_NUM_PKEYS; i++)
>> +		pkrs_init_value |= PKR_VALUE(i, consumer_defaults[i]);
> 
> Also PKR_RW_BIT is a horrible invention:
> 
>> +#define PKR_RW_BIT 0x0
>>   #define PKR_AD_BIT 0x1
>>   #define PKR_WD_BIT 0x2
>>   #define PKR_BITS_PER_PKEY 2
> 
> This makes my brain spin. How do you fit 3 bits into 2 bits per key?
> That's really non-intuitive.
> 
> PKR_RW_ENABLE		0x0
> PKR_ACCESS_DISABLE	0x1
> PKR_WRITE_DISABLE	0x2
> 
> makes it obvious what this is about, no?
> 
> Aside of that, the function which set's up the init value is really
> bogus. As you explained in the cover letter a kernel user has to:
> 
>     1) Claim an index in the enum
>     2) Add a default value to the array in that function
> 
> Seriously? How is that any better than doing:
> 
> #define PKS_KEY0_DEFAULT	PKR_RW_ENABLE
> #define PKS_KEY1_DEFAULT	PKR_ACCESS_DISABLE
> ...
> #define PKS_KEY15_DEFAULT	PKR_ACCESS_DISABLE
> 
> and let the compiler construct pkrs_init_value?
> 
> TBH, it's not and this function has to be ripped out in case that you
> need a dynamic allocation of keys some day. So what is this buying us
> aside of horrible to read and utterly pointless code?
> 
>> +	return 0;
>> +}
>> +early_initcall(create_initial_pkrs_value);
>> +
>> +/*
>> + * PKS is independent of PKU and either or both may be supported on a CPU.
>> + * Configure PKS if the CPU supports the feature.
>> + */
>> +void setup_pks(void)
>> +{
>> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
>> +		return;
>> +
>> +	write_pkrs(pkrs_init_value);
> 
> Is the init value set up _before_ this function is invoked for the first
> time?
> 
> Thanks,
> 
>          tglx
> 
Setting up for cpu0 happens before create_initial_pkrs_value(), therefore the
pkrs value of cpu0 won't be set correctly.

[root@AliYun ~]# rdmsr -a 0x000006E1
0
55555554
55555554
55555554
55555554
55555554
55555554
55555554
55555554
55555554

Here are my test results after applying the patches

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 05/18] x86/pks: Add PKS setup code
  2021-11-26  3:11     ` taoyi.ty
@ 2021-11-26  9:57       ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-26  9:57 UTC (permalink / raw)
  To: taoyi.ty, ira.weiny, Dave Hansen, Dan Williams
  Cc: Peter Zijlstra, Fenghua Yu, Hansen, Dave, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, H. Peter Anvin, Rick Edgecombe,
	x86, linux-kernel, nvdimm, linux-mm

On Fri, Nov 26 2021 at 11:11, taoyi ty wrote:
> On 11/25/21 11:15 PM, Thomas Gleixner wrote:
>>> +void setup_pks(void)
>>> +{
>>> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
>>> +		return;
>>> +
>>> +	write_pkrs(pkrs_init_value);
>> 
>> Is the init value set up _before_ this function is invoked for the first
>> time?
>> 
> Setting up for cpu0 is before create_initial_pkrs_value. therefore pkrs 
> value of cpu0 won't be set correctly.
>
> [root@AliYun ~]# rdmsr -a 0x000006E1
> 0
> 55555554
> 55555554
> 55555554
> 55555554
> 55555554
> 55555554
> 55555554
> 55555554
> 55555554
>
> Here are my test results after applying the patches

Thanks for confirming what I assumed from looking at the patches!

       tglx



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 05/18] x86/pks: Add PKS setup code
  2021-11-25 15:15   ` Thomas Gleixner
  2021-11-26  3:11     ` taoyi.ty
@ 2021-11-26 11:03     ` Thomas Gleixner
  1 sibling, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-26 11:03 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, Dan Williams
  Cc: Ira Weiny, Peter Zijlstra, Fenghua Yu, Hansen, Dave, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, H. Peter Anvin, Rick Edgecombe,
	x86, linux-kernel, nvdimm, linux-mm

On Thu, Nov 25 2021 at 16:15, Thomas Gleixner wrote:
> On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> Aside of that, the function which set's up the init value is really
> bogus. As you explained in the cover letter a kernel user has to:
>
>    1) Claim an index in the enum
>    2) Add a default value to the array in that function
>
> Seriously? How is that any better than doing:
>
> #define PKS_KEY0_DEFAULT	PKR_RW_ENABLE
> #define PKS_KEY1_DEFAULT	PKR_ACCESS_DISABLE
> ...
> #define PKS_KEY15_DEFAULT	PKR_ACCESS_DISABLE
>
> and let the compiler construct pkrs_init_value?
>
> TBH, it's not and this function has to be ripped out in case that you
> need a dynamic allocation of keys some day. So what is this buying us
> aside of horrible to read and utterly pointless code?

And as Taoyi confirmed, it's broken.

It surely takes a reviewer to spot that and an external engineer to run
rdmsr -a to validate that this is not working as expected, right?

Sigh...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions
  2021-11-13  0:50   ` Ira Weiny
  2021-11-25 11:19     ` Thomas Gleixner
@ 2021-12-03  1:13     ` Andy Lutomirski
  1 sibling, 0 replies; 42+ messages in thread
From: Andy Lutomirski @ 2021-12-03  1:13 UTC (permalink / raw)
  To: Ira Weiny, Dave Hansen, Dan Williams, H. Peter Anvin
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Fenghua Yu, Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

On 11/12/21 16:50, Ira Weiny wrote:
> On Tue, Aug 03, 2021 at 09:32:21PM -0700, 'Ira Weiny' wrote:
>> From: Ira Weiny <ira.weiny@intel.com>
>>
>> The PKRS MSR is not managed by XSAVE.  It is preserved through a context
>> switch but this support leaves exception handling code open to memory
>> accesses during exceptions.
>>
>> 2 possible places for preserving this state were considered,
>> irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
>> was potentially fraught with unintended consequences.[2]  However, Andy
>> came up with a way to hide additional values on the stack which could be
>> accessed as "extended_pt_regs".[3]
> 
> Andy,
> 
> I'm preparing to send V8 of this PKS work.  But I have not seen any feed back
> since I originally implemented this in V4[1].
> 
> Does this meets your expectations?  Are there any issues you can see with this
> code?

I think I'm generally okay with the approach to allocating space.  All 
of Thomas' comments still apply, though.  (Sorry, I'm horribly behind.)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions
  2021-11-25 14:12   ` Thomas Gleixner
@ 2021-12-07  1:54     ` Ira Weiny
  2021-12-07  4:45       ` Ira Weiny
  2021-12-08  0:21       ` Thomas Gleixner
  0 siblings, 2 replies; 42+ messages in thread
From: Ira Weiny @ 2021-12-07  1:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Dan Williams, Peter Zijlstra, Andy Lutomirski,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

Thomas,

Thanks for the review.  Sorry for being so late to respond I was sick all last
week and so it took me longer to figure out some of this stuff.

On Thu, Nov 25, 2021 at 03:12:47PM +0100, Thomas Gleixner wrote:
> Ira,
> 
> On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> > +/*
> > + * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
> > + * @cfunc:		C function to be called
> > + *
> > + * This will ensure that extended_ptregs is added and removed as needed during
> > + * a call into C code.
> > + */
> > +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +	/* add space for extended_pt_regs */
> > +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
> > +#endif
> > +	.if \annotate_retpoline_safe == 1
> > +		ANNOTATE_RETPOLINE_SAFE
> > +	.endif
> 
> This annotation is new and nowhere mentioned why it is part of this
> patch.

I don't understand.  ANNOTATE_RETPOLINE_SAFE has been around since:

9e0e3c5130e9 x86/speculation, objtool: Annotate indirect calls/jumps for objtool

> 
> Can you please do _ONE_ functional change per patch and not a
> unreviewable pile of changes in one go? Same applies for the ASM and the
> C code changes. The ASM change has to go first and then the C code can
> build upon it.

I'm sorry for having the ASM and C code together but this all seemed like 1
change to me.

I can split it if you prefer.  How about a patch with just the x86 extended
pt_regs stuff but that would leave a zero size for the extended stuff?  Then
followed by the pks bits?

> 
> > +	call	\cfunc
> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +	/* remove space for extended_pt_regs */
> > +	addq    $EXTENDED_PT_REGS_SIZE, %rsp
> > +#endif
> 
> I really have to ask the question whether this #ifdeffery has any value
> at all. 8 bytes extra stack usage is not going to be the end of the
> world and distro kernels will enable that config anyway.

My goal with this has always been 0 overhead if turned off.  So this seemed
like a logical addition.  Furthermore, ARCH_ENABLE_SUPERVISOR_PKEYS is
predicated on ARCH_HAS_SUPERVISOR_PKEYS which is only available with x86_64.
This removes the space for x86 when not needed.

All the config stuff was introduced in patch 04/18.[0]

> 
> If we really want to save the space then certainly not by sprinkling
> CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS all over the place and hiding the
> extra sized ptregs in the pkrs header.
> 
> You are changing generic architecture code so you better think about
> making such a change generic and extensible.

I agree.  And I tried to do so.  The generic entry code is modified only by the
addition of pkrs_[save|restore]_irq().  These are only defined if the arch
defines ARCH_HAS_SUPERVISOR_PKEYS and furthermore, if something triggers
enabling ARCH_ENABLE_SUPERVISOR_PKEYS.

ARCH_HAS_SUPERVISOR_PKEYS is restricted to x86_64 at the moment.  All other
archs, including x86, should not see any changes in the generic code.

I thought we had agreed, when these changes were discussed, that it was ok for
me to restrict the addition of the extended pt_regs to what was required for
PKS, because at the time I was concerned about my lack of knowledge of all the
other architectures.[1]

>
> Can folks please start
> thinking beyond the brim of their teacup and not pretend that the
> feature they are working on is the unicorn which requires unique magic
> brand-named after the unicorn of the day.
> 
> If the next feature comes around which needs to save something in that
> extended area then we are going to change the world again, right?

I'm not sure what you mean by 'change the world'.  I would anticipate the entry
code to be modified with something similar to pks_[save|restore]_irq() and let
the arch deal with the specifics.

Also in [1] I thought Peter and Andy agreed that placing additional generic
state in the extended pt_regs was not needed and does not buy us anything.  I
specifically asked if that was something we wanted to do in [2].

> Certainly not.
> 
> This wants to go into asm/ptrace.h:
> 
> struct pt_regs_aux {
> 	u32	pkrs;
> };
> 
> struct pt_regs_extended {
> 	struct pt_regs_aux	aux;
>         struct pt_regs		regs __attribute__((aligned(8)));
> };

OK, the aligned attribute does what I was doing much more gracefully.  This is
a good idea, yes; thank you.

> 
> and then have in asm-offset:
> 
>    DEFINE(PT_REGS_AUX_SIZE, sizeof(struct pt_regs_extended) - sizeof(struct pt_regs));
> 
> which does the right thing whatever the size of pt_regs_aux is. So for
> the above it will have:
> 
>  #define PT_REGS_AUX_SIZE 8 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */
> 
> Even, if you do
> 
> struct pt_regs_aux {
> #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> 	u32	pkrs;
> #endif        
> };
> 
> and the config switch is disabled. It's still correct:
> 
>  #define PT_REGS_AUX_SIZE 0 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */
> 
> See? No magic hardcoded constant, no build time error checking for that
> constant. Nothing, it just works.

Yes agreed definitely an improvement.
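
For illustration, a minimal C sketch (not from the posted patches) of how code
could then get at the aux words from a pt_regs pointer, assuming the aux space
sits directly below pt_regs on the stack as laid out by struct pt_regs_extended
above:

	/* Sketch only: recover the aux area allocated directly below pt_regs */
	static inline struct pt_regs_aux *pt_regs_aux(struct pt_regs *regs)
	{
		return &container_of(regs, struct pt_regs_extended, regs)->aux;
	}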

> 
> That's one part, but let me come back to this:
> 
> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +	/* add space for extended_pt_regs */
> > +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
> 
> What guarantees that RSP points to pt_regs at this point?  Nothing at
> all. It's just pure luck and a question of time until this explodes in
> hard to diagnose ways.

It took me a bit to wrap my head around what I think you mean.  My initial
response was that rsp should be the stack pointer for __call_ext_ptregs() just
like it was for call.  But I think I see that it is better to open code this
since others may want to play the same trick without using this code and
therefore we may not be getting the extended pt_regs structure on the stack
like we think.  For example if someone did...

	movq	%rsp, %rdi
	RSP_ADD_OTHER_STACK_STUFF
	__call_ext_ptregs	...
	RSP_REMOVE_OTHER_STACK_STUFF

... it would be broken.

My assumption was that would be illegal after this patch.  But indeed there is
no way to easily see that in the future.

> 
> Because between
> 
>         movq	%rsp, %rdi
> and
>         call    ....
> 
> can legitimately be other code which causes the stack pointer to
> change. It's not the case today, but nothing prevents this in the
> future.
> 
> The correct thing to do is:
> 
>         movq	%rsp, %rdi
>         RSP_MAKE_PT_REGS_AUX_SPACE
>         call	...
>         RSP_REMOVE_PT_REGS_AUX_SPACE
> 
> The few extra macro lines in the actual code are way better as they make
> it completely obvious what's going on and any misuse can be spotted
> easily.

Sure FWIW this is what I had originally but thought it would be cleaner to wrap
the 'call'.  I will convert it back.  Also this removes the
annotate_retpoline_safe stuff above.

> 
> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +/*
> > + * PKRS is a per-logical-processor MSR which overlays additional protection for
> > + * pages which have been mapped with a protection key.
> > + *
> > + * Context switches save the MSR in the task struct thus taking that value to
> > + * other processors if necessary.
> > + *
> > + * To protect against exceptions having access to this memory save the current
> > + * thread value and set the PKRS value to be used during the exception.
> > + */
> > +void pkrs_save_irq(struct pt_regs *regs)
> 
> That's a misnomer as this is invoked for _any_ exception not just
> interrupts.

I'm confused by the naming in kernel/entry/common.c then.  I'm more than
willing to change the name.  But I only see irq* for almost everything in that
file.  And I was trying to follow that convention.
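
For orientation, a rough conceptual sketch of the save/restore pair under
discussion, based on the comment quoted above.  The names current_pkrs_value()
and PKRS_INIT_VALUE are assumptions for illustration, not the posted
implementation; write_pkrs() and the pt_regs_aux() accessor sketched earlier
are used as declared/shown above.

	/*
	 * Sketch only: stash the interrupted context's PKRS value below
	 * pt_regs and install the default value on entry...
	 */
	void pkrs_save_irq(struct pt_regs *regs)
	{
		struct pt_regs_aux *aux = pt_regs_aux(regs);

		aux->pkrs = current_pkrs_value();	/* assumed accessor */
		write_pkrs(PKRS_INIT_VALUE);		/* assumed default */
	}

	/* ...and write the stashed value back on exit. */
	void pkrs_restore_irq(struct pt_regs *regs)
	{
		write_pkrs(pt_regs_aux(regs)->pkrs);
	}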

> 
> >  #ifdef CONFIG_XEN_PV
> >  #ifndef CONFIG_PREEMPTION
> >  /*
> > @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
> >  
> >  	inhcall = get_and_clear_inhcall();
> >  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> > +		/* Normally called by irqentry_exit, restore pkrs here */
> > +		pkrs_restore_irq(regs);
> > 		irqentry_exit_cond_resched();
> 
> Sigh. Consistency is overrated....

I'm not that familiar with the xen code so perhaps I missed something?

> 
> > +
> >  void setup_pks(void);
> >  void pkrs_write_current(void);
> >  void pks_init_task(struct task_struct *task);
> > +void write_pkrs(u32 new_pkrs);
> 
> So we have pkrs_write_current() and write_pkrs() now. Can you please
> stick to a common prefix, i.e. pkrs_ ?

Sorry, yes.

> 
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index bf16395b9e13..aa0b1e8dd742 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/livepatch.h>
> >  #include <linux/audit.h>
> >  #include <linux/tick.h>
> > +#include <linux/pkeys.h>
> >  
> >  #include "common.h"
> >  
> > @@ -364,7 +365,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
> >  		instrumentation_end();
> >  
> >  		ret.exit_rcu = true;
> > -		return ret;
> > +		goto done;
> >  	}
> >  
> >  	/*
> > @@ -379,6 +380,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
> >  	trace_hardirqs_off_finish();
> >  	instrumentation_end();
> >  
> > +done:
> > +	pkrs_save_irq(regs);
> 
> This still calls out into instrumentable code. I explained to you before
> why this is wrong. Also objtool emits warnings to that effect if you do a
> proper verified build.

I was not sure what a 'proper verified build' was and objtool was not throwing
any warnings for me even if I ran it directly.

10:49:27 > ./tools/objtool/objtool check -n vmlinux.o
vmlinux.o: warning: objtool: ftrace_caller()+0x94: call without frame pointer save/setup
vmlinux.o: warning: objtool: ftrace_regs_caller()+0xde: call without frame pointer save/setup
vmlinux.o: warning: objtool: return_to_handler()+0x10: call without frame pointer save/setup
vmlinux.o: warning: objtool: copy_mc_fragile() falls through to next function copy_mc_fragile_handle_tail()
vmlinux.o: warning: objtool: copy_user_enhanced_fast_string() falls through to next function copy_user_generic_unrolled()
vmlinux.o: warning: objtool: __memset() falls through to next function memset_erms()
vmlinux.o: warning: objtool: __memcpy() falls through to next function memcpy_erms()
vmlinux.o: warning: objtool: file already has .static_call_sites section, skipping


After asking around and digging quite a bit I found CONFIG_DEBUG_ENTRY which
enabled the check and the error.  [But only during a build and not with the
above command???  Shouldn't the above command work too?]

What other config options should we be running with to verify the build?

Regardless, reading more about noinstr and looking at the code more carefully I
realize I _completely_ misunderstood what you meant before in [3].  I should
have asked for clarification.

Yes this was originally marked noinstr because it was called from a noinstr
function.  I see now, or at least I think I see, that you were taking exception
to my blindly marking pkrs_save_irq() noinstr without a good reason.

When you said 'there is absolutely no reason to have this marked noinstr.'  I
thought that meant we could simply remove it from noinstr.  But what I think
you meant is that there is no reason to have it _be_ noinstr _and_ I should
also make it called from the instrumentable sections of the irqentry_*() calls.

So something like this patch on top of this series?  [With an equivalent change
for pkrs_restore_irq().]

11:03:18 > git di
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index c7356733632e..1c0a70a17e93 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -360,10 +360,11 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
                rcu_irq_enter();
                instrumentation_begin();
                trace_hardirqs_off_finish();
+               pkrs_save_irq(regs);
                instrumentation_end();
 
                ret.exit_rcu = true;
-               goto done;
+               return ret;
        }
 
        /*
@@ -376,10 +377,9 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
        instrumentation_begin();
        rcu_irq_enter_check_tick();
        trace_hardirqs_off_finish();
+       pkrs_save_irq(regs);
        instrumentation_end();
 
-done:
-       pkrs_save_irq(regs);
        return ret;
 }
 
@@ -462,9 +462,9 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
        instrumentation_begin();
        trace_hardirqs_off_finish();
        ftrace_nmi_enter();
+       pkrs_save_irq(regs);
        instrumentation_end();
 
-       pkrs_save_irq(regs);
        return irq_state;
 }


> 
> >  	return ret;
> >  }
> >  
> > @@ -404,7 +407,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> >  	/* Check whether this returns to user mode */
> >  	if (user_mode(regs)) {
> >  		irqentry_exit_to_user_mode(regs);
> > -	} else if (!regs_irqs_disabled(regs)) {
> > +		return;
> > +	}
> > +
> > +	pkrs_restore_irq(regs);
> 
> At least you are now putting it consistently at the wrong place
> vs. noinstr.

Indeed.  Sorry about not understanding noinstr fully.

> 
> Though, if you look at the xen_pv_evtchn_do_upcall() part where you
> added this extra invocation you might figure out that adding
>> pkrs_restore_irq() to irqentry_exit_cond_resched() and explicitly to
> the 'else' path in irqentry_exit() makes it magically consistent for
> both use cases.
> 

Thank you, yes good catch.  However, I think I need at least 1 more call in the
!regs_irqs_disabled() && state.exit_rcu case right?

11:29:48 > git di
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 717091910ebc..667676ebc129 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -363,8 +363,6 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
        inhcall = get_and_clear_inhcall();
        if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
-               /* Normally called by irqentry_exit, restore pkrs here */
-               pkrs_restore_irq(regs);
                irqentry_exit_cond_resched();
                instrumentation_end();
                restore_inhcall(inhcall);
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 1c0a70a17e93..60ae3d4c9cc0 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -385,6 +385,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 
 void irqentry_exit_cond_resched(void)
 {
+       pkrs_restore_irq(regs);
        if (!preempt_count()) {
                /* Sanity check RCU and thread stack */
                rcu_irq_exit_check_preempt();
@@ -408,8 +409,6 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
                return;
        }
 
-       pkrs_restore_irq(regs);
-
        if (!regs_irqs_disabled(regs)) {
                /*
                 * If RCU was not watching on entry this needs to be done
@@ -421,6 +420,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
                        /* Tell the tracer that IRET will enable interrupts */
                        trace_hardirqs_on_prepare();
                        lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+                       pkrs_restore_irq(regs);
                        instrumentation_end();
                        rcu_irq_exit();
                        lockdep_hardirqs_on(CALLER_ADDR0);
@@ -439,6 +439,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
                trace_hardirqs_on();
                instrumentation_end();
        } else {
+               instrumentation_begin();
+               pkrs_restore_irq(regs);
+               instrumentation_end();
+
                /*
                 * IRQ flags state is correct already. Just tell RCU if it
                 * was not watching on entry.
@@ -470,8 +474,8 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
-       pkrs_restore_irq(regs);
        instrumentation_begin();
+       pkrs_restore_irq(regs);
        ftrace_nmi_exit();
        if (irq_state.lockdep) {
                trace_hardirqs_on_prepare();


Thank you again for the review,
Ira


[0] https://lore.kernel.org/lkml/20210804043231.2655537-5-ira.weiny@intel.com/
[1] https://lore.kernel.org/lkml/20201217131924.GW3040@hirez.programming.kicks-ass.net/
[2] https://lore.kernel.org/lkml/20201216013202.GY1563847@iweiny-DESK2.sc.intel.com/
[3] https://lore.kernel.org/lkml/87y2hwqwng.fsf@nanos.tec.linutronix.de/

> Thanks,
> 
>         tglx

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions
  2021-12-07  1:54     ` Ira Weiny
@ 2021-12-07  4:45       ` Ira Weiny
  2021-12-08  0:21       ` Thomas Gleixner
  1 sibling, 0 replies; 42+ messages in thread
From: Ira Weiny @ 2021-12-07  4:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Dan Williams, Peter Zijlstra, Andy Lutomirski,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

On Mon, Dec 06, 2021 at 05:54:23PM -0800, 'Ira Weiny' wrote:

[snip]

> > 
> > Though, if you look at the xen_pv_evtchn_do_upcall() part where you
> > added this extra invocation you might figure out that adding
> > pkrs_restore_irq() to irqentry_exit_cond_resched() and explicitly to
> > the 'else' path in irqentry_exit() makes it magically consistent for
> > both use cases.
> > 
> 
> Thank you, yes good catch.  However, I think I need at least 1 more call in the
> !regs_irqs_disabled() && state.exit_rcu case right?
> 
> 11:29:48 > git di
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 717091910ebc..667676ebc129 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -363,8 +363,6 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  
>         inhcall = get_and_clear_inhcall();
>         if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> -               /* Normally called by irqentry_exit, restore pkrs here */
> -               pkrs_restore_irq(regs);
>                 irqentry_exit_cond_resched();
>                 instrumentation_end();
>                 restore_inhcall(inhcall);
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 1c0a70a17e93..60ae3d4c9cc0 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -385,6 +385,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  
>  void irqentry_exit_cond_resched(void)

Oops...  Of course regs will need to be passed in here now...

Ira

>  {
> +       pkrs_restore_irq(regs);
>         if (!preempt_count()) {
>                 /* Sanity check RCU and thread stack */
>                 rcu_irq_exit_check_preempt();
> @@ -408,8 +409,6 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>                 return;
>         }
>  
> -       pkrs_restore_irq(regs);
> -
>         if (!regs_irqs_disabled(regs)) {
>                 /*
>                  * If RCU was not watching on entry this needs to be done
> @@ -421,6 +420,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>                         /* Tell the tracer that IRET will enable interrupts */
>                         trace_hardirqs_on_prepare();
>                         lockdep_hardirqs_on_prepare(CALLER_ADDR0);
> +                       pkrs_restore_irq(regs);
>                         instrumentation_end();
>                         rcu_irq_exit();
>                         lockdep_hardirqs_on(CALLER_ADDR0);
> @@ -439,6 +439,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>                 trace_hardirqs_on();
>                 instrumentation_end();
>         } else {
> +               instrumentation_begin();
> +               pkrs_restore_irq(regs);
> +               instrumentation_end();
> +
>                 /*
>                  * IRQ flags state is correct already. Just tell RCU if it
>                  * was not watching on entry.
> @@ -470,8 +474,8 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
>  
>  void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
>  {
> -       pkrs_restore_irq(regs);
>         instrumentation_begin();
> +       pkrs_restore_irq(regs);
>         ftrace_nmi_exit();
>         if (irq_state.lockdep) {
>                 trace_hardirqs_on_prepare();
> 
> 
> Thank you again for the review,
> Ira
> 
> 
> [0] https://lore.kernel.org/lkml/20210804043231.2655537-5-ira.weiny@intel.com/
> [1] https://lore.kernel.org/lkml/20201217131924.GW3040@hirez.programming.kicks-ass.net/
> [2] https://lore.kernel.org/lkml/20201216013202.GY1563847@iweiny-DESK2.sc.intel.com/
> [3] https://lore.kernel.org/lkml/87y2hwqwng.fsf@nanos.tec.linutronix.de/
> 
> > Thanks,
> > 
> >         tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions
  2021-12-07  1:54     ` Ira Weiny
  2021-12-07  4:45       ` Ira Weiny
@ 2021-12-08  0:21       ` Thomas Gleixner
  1 sibling, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-12-08  0:21 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, Dan Williams, Peter Zijlstra, Andy Lutomirski,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

Ira,

On Mon, Dec 06 2021 at 17:54, Ira Weiny wrote:
> On Thu, Nov 25, 2021 at 03:12:47PM +0100, Thomas Gleixner wrote:
>> > +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
>> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> > +	/* add space for extended_pt_regs */
>> > +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
>> > +#endif
>> > +	.if \annotate_retpoline_safe == 1
>> > +		ANNOTATE_RETPOLINE_SAFE
>> > +	.endif
>> 
>> This annotation is new and nowhere mentioned why it is part of this
>> patch.
>
> I don't understand.  ANNOTATE_RETPOLINE_SAFE has been around since:
>
> 9e0e3c5130e9 x86/speculation, objtool: Annotate indirect calls/jumps
> for objtool

Sorry, I misread that macro maze. It's conditional obviously.

> I can split it if you prefer.  How about a patch with just the x86 extended
> pt_regs stuff but that would leave a zero size for the extended stuff?  Then
> followed by the pks bits?

Whatever makes sense and does one thing per patch.

>> I really have to ask the question whether this #ifdeffery has any value
>> at all. 8 bytes extra stack usage is not going to be the end of the
>> world and distro kernels will enable that config anyway.
>
> My goal with this has always been 0 overhead if turned off.  So this seemed
> like a logical addition.  Furthermore, ARCH_ENABLE_SUPERVISOR_PKEYS is
> predicated on ARCH_HAS_SUPERVISOR_PKEYS which is only available with x86_64.
> This removes the space for x86 when not needed.

The question is not about 64 vs. 32bit. The question is whether the
conditional makes sense for 64bit in the first place. Whether this
matters for 32bit has to be determined. It makes some sense, but less
#ifdeffery and less obfuscation makes sense too.

>> If we really want to save the space then certainly not by sprinkling
>> CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS all over the place and hiding the
>> extra sized ptregs in the pkrs header.
>> 
>> You are changing generic architecture code so you better think about
>> making such a change generic and extensible.
>
> I agree.  And I tried to do so.  The generic entry code is modified only by the
> addition of pkrs_[save|restore]_irq().  These are only defined if the arch
> defines ARCH_HAS_SUPERVISOR_PKEYS and furthermore, if something triggers
> enabling ARCH_ENABLE_SUPERVISOR_PKEYS.

I'm talking about generic _architecture_ code, i.e. the code in
arch/x86/ which affects all vendors and systems.

> ARCH_HAS_SUPERVISOR_PKEYS is restricted to x86_64 at the moment.  All other
> archs, including x86, should not see any changes in the generic code.

That was not the question and I'm well aware of that.

>> If the next feature comes around which needs to save something in that
>> extended area then we are going to change the world again, right?
>
> I'm not sure what you mean by 'change the world'.  I would anticipate the entry
> code to be modified with something similar to pks_[save|restore]_irq() and let
> the arch deal with the specifics.

If on X86 the next X86-specific feature comes around which needs extra
reg space, then someone has to change the world in arch/x86 again and replace
all the ARCH_ENABLE_SUPERVISOR_PKEYS #ifdefs with something else, right?

Instead of adding a new field to pt_regs_aux and being done with it.
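
That is, something along these lines (the second field is purely illustrative
and not a real feature):

	struct pt_regs_aux {
	#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
		u32	pkrs;
	#endif
		u64	some_future_state;	/* hypothetical later addition */
	};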

> Also in [1] I thought Peter and Andy agreed that placing additional generic
> state in the extended pt_regs was not needed and does not buy us anything.  I
> specifically asked if that was something we wanted to do in [2].

This was about a generic representation which affects the common entry
code in kernel/entry/... Can you spot the difference?

What I suggested is _solely_ x86 specific and does not trickle into
anything outside of arch/x86.

>> See? No magic hardcoded constant, no build time error checking for that
>> constant. Nothing, it just works.
>
> Yes agreed definitely an improvement.

It's the right thing to do.

>> That's one part, but let me come back to this:
>> 
>> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> > +	/* add space for extended_pt_regs */
>> > +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
>> 
>> What guarantees that RSP points to pt_regs at this point?  Nothing at
>> all. It's just pure luck and a question of time until this explodes in
>> hard to diagnose ways.
>
> It took me a bit to wrap my head around what I think you mean.  My initial
> response was that rsp should be the stack pointer for __call_ext_ptregs() just
> like it was for call.  But I think I see that it is better to open code this
> since others may want to play the same trick without using this code and
> therefore we may not be getting the extended pt_regs structure on the stack
> like we think.  For example if someone did...
>
> 	movq	%rsp, %rdi
> 	RSP_ADD_OTHER_STACK_STUFF
> 	__call_ext_ptregs	...
> 	RSP_REMOVE_OTHER_STACK_STUFF
>
> ... it would be broken.
>
> My assumption was that would be illegal after this patch.  But indeed there is
> no way to easily see that in the future.

There is no law which forbids putting code there. Aside from that, software
development is strictly not based on assumptions, by definition.

>> Because between
>> 
>>         movq	%rsp, %rdi
>> and
>>         call    ....
>> 
>> can legitimately be other code which causes the stack pointer to
>> change. It's not the case today, but nothing prevents this in the
>> future.
>> 
>> The correct thing to do is:
>> 
>>         movq	%rsp, %rdi
>>         RSP_MAKE_PT_REGS_AUX_SPACE
>>         call	...
>>         RSP_REMOVE_PT_REGS_AUX_SPACE
>> 
>> The few extra macro lines in the actual code are way better as they make
>> it completely obvious what's going on and any misuse can be spotted
>> easily.
>
> Sure FWIW this is what I had originally but thought it would be cleaner to wrap
> the 'call'.  I will convert it back.  Also this removes the
> annotate_retpoline_safe stuff above.

It makes the whole ifdeffery more palatable. Hiding everything in
hideous macros is not an improvement at all.

>> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> > +/*
>> > + * PKRS is a per-logical-processor MSR which overlays additional protection for
>> > + * pages which have been mapped with a protection key.
>> > + *
>> > + * Context switches save the MSR in the task struct thus taking that value to
>> > + * other processors if necessary.
>> > + *
>> > + * To protect against exceptions having access to this memory save the current
>> > + * thread value and set the PKRS value to be used during the exception.
>> > + */
>> > +void pkrs_save_irq(struct pt_regs *regs)
>> 
>> That's a misnomer as this is invoked for _any_ exception not just
>> interrupts.
>
> I'm confused by the naming in kernel/entry/common.c then.  I'm more than
> willing to change the name.  But I only see irq* for almost everything in that
> file.  And I was trying to follow that convention.

Do you see anything named irq* in the NMI parts?

>> >  #ifdef CONFIG_XEN_PV
>> >  #ifndef CONFIG_PREEMPTION
>> >  /*
>> > @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>> >  
>> >  	inhcall = get_and_clear_inhcall();
>> >  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
>> > +		/* Normally called by irqentry_exit, restore pkrs here */
>> > +		pkrs_restore_irq(regs);
>> > 		irqentry_exit_cond_resched();
>> 
>> Sigh. Consistency is overrated....
>
> I'm not that familiar with the xen code so perhaps I missed something?

Yes, taste. And that has nothing to do with being familiar with XEN code.

>> > +		/* Normally called by irqentry_exit, restore pkrs here */
>> > +		pkrs_restore_irq(regs);
>> > 		irqentry_exit_cond_resched();

Your comment says: Normally called by irqentry_exit

Writing this comment alone should have made you look into the invoked
function below:

 		irqentry_exit_cond_resched();

And because the generic entry code is pretty consistent about naming
conventions it's not a surprise that you can find an invocation of
irqentry_exit_cond_resched() in irqentry_exit(), right?

So instead of writing 'Normally' which is completely confusing you could
have done a proper analysis and figured out what I suggested:

>> Though, if you look at the xen_pv_evtchn_do_upcall() part where you
>> added this extra invocation you might figure out that adding
>> pkrs_restore_irq() to irqentry_exit_cond_resched() and explicitly to
>> the 'else' path in irqentry_exit() makes it magically consistent for
>> both use cases.

But because your preference seems to be to have random naming conventions,
i.e. choosing the prefix of the day, I'm not that surprised.

> Thank you, yes good catch.  However, I think I need at least 1 more
> call in the !regs_irqs_disabled() && state.exit_rcu case right?

I take this as a rhetorical question.

>> > +done:
>> > +	pkrs_save_irq(regs);
>> 
>> This still calls out into instrumentable code. I explained to you before
>> why this is wrong. Also objtool emits warnings to that effect if you do a
>> proper verified build.
>
> I was not sure what a 'proper verified build' was and objtool was not throwing
> any warnings for me even if I ran it directly.
...
> After asking around and digging quite a bit I found CONFIG_DEBUG_ENTRY which
> enabled the check and the error.

May I ask the obvious question why you did not ask around when I told
you the same thing several months ago?

Seriously, you are changing code which has very obviously placed
'noinstr' annotations on functions and a boat load of very obvious
instrumentation_begin()/end() pairs in it along with a gazillion of
comments, and you just go and add your stuff as you see fit?

Even _after_ I told you that this is wrong?

The word "engineering" has a meaning. Engineering is based on math and
science. Both mandate structured and diligent problem analysis.

Can you pretty please explain to me how ignoring these annotations in the
first place and then ignoring the related review comments is related to
that?

> [But only during a build and not with the above command???  Shouldn't
> the above command work too?]

Did you even try to figure out what CONFIG_DEBUG_ENTRY does?

# git grep -l CONFIG_DEBUG_ENTRY

and looking at the resulting output which has one very obvious named
file in it:

     include/linux/instrumentation.h

might have told you the answer. Also

# git log ...
# git blame ...
# your_favourite_browser https://duckduckgo.com/?q=objtool+noinstr
# your_favourite_browser https://duckduckgo.com/?q=objtool+CONFIG_DEBUG_ENTRY

aside of asking colleagues or replying to my previous review comment
with a sensible question would have solved that, right?

Asking does not make you look stupid. Not asking and making uninformed
assumptions will make you look stupid for sure.

But asking just to avoid homework is not in the book either. The
community does not have the capacity to deal with that.

> Regardless, reading more about noinstr and looking at the code more carefully I
> realize I _completely_ misunderstood what you meant before in [3].  I should
> have asked for clarification.

Bingo!

> Yes this was originally marked noinstr because it was called from a noinstr
> function.  I see now, or at least I think I see, that you were taking exception
> to my blindly marking pkrs_save_irq() noinstr without a good reason.
>
> When you said 'there is absolutely no reason to have this marked noinstr.'  I
> thought that meant we could simply remove it from noinstr.  But what I think
> you meant is that there is no reason to have it _be_ noinstr _and_ I should
> also make it called from the instrumentable sections of the irqentry_*() calls.
>
> So something like this patch on top of this series?  [With an equivalent change
> for pkrs_restore_irq().]

No comment. Why?

   1) This is not a basic engineering course

   2) Because I refuse to comment on hastily cobbled together crap which still
      gets it wrong. Hint:

      I did not even look at the result of this patch applied.  I just
      did a mental inventory based on the patch hunks you provided.
      They simply do not sum up.

      Don't dare to ask me why.

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros
  2021-11-25 14:25   ` Thomas Gleixner
  2021-11-25 16:58     ` Thomas Gleixner
@ 2021-12-08  0:51     ` Ira Weiny
  2021-12-08 15:11       ` Thomas Gleixner
  1 sibling, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2021-12-08  0:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Dan Williams, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

On Thu, Nov 25, 2021 at 03:25:09PM +0100, Thomas Gleixner wrote:
> On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> > @@ -200,16 +200,14 @@ __setup("init_pkru=", setup_init_pkru);
> >   */
> >  u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> >  {
> > -	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> > -
> >  	/*  Mask out old bit values */
> > -	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> > +	pk_reg &= ~PKR_PKEY_MASK(pkey);
> >  
> >  	/*  Or in new values */
> >  	if (flags & PKEY_DISABLE_ACCESS)
> > -		pk_reg |= PKR_AD_BIT << pkey_shift;
> > +		pk_reg |= PKR_AD_KEY(pkey);
> >  	if (flags & PKEY_DISABLE_WRITE)
> > -		pk_reg |= PKR_WD_BIT << pkey_shift;
> > +		pk_reg |= PKR_WD_KEY(pkey);
> 
> I'm not seeing how this is improving that code. Quite the contrary.

Fair enough.  Even more so when using the code you suggested for pkey_update_pkval().

In that case it boils down to:

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index eb6d6b872652..b7127329d115 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -198,7 +198,7 @@ __setup("init_pkru=", setup_init_pkru);
  */
 u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
 {
-        int shift = pkey * PKR_BITS_PER_PKEY;
+        int shift = PKR_PKEY_SHIFT(pkey);
 
         if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
                 accessbits &= PKEY_ACCESS_MASK;


Better?
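
For reference, the helper macros under discussion presumably expand roughly as
follows (reconstructed from the hunks quoted above; the actual patch 03/18
definitions may differ):

	#define PKR_PKEY_SHIFT(pkey)	((pkey) * PKR_BITS_PER_PKEY)
	#define PKR_PKEY_MASK(pkey)	(((1 << PKR_BITS_PER_PKEY) - 1) << PKR_PKEY_SHIFT(pkey))
	#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
	#define PKR_WD_KEY(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))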

As to the reason of why to put this patch after the other one.  Why would I
improve the old pre-refactoring code only to throw it away when moving it to
pkey_update_pkval()?  This reasoning is even stronger when pkey_update_pkval()
is implemented.

I agree with Dan regarding the macros though.  I think they make it easier to
see what is going on without dealing with masks and shifts directly.  But I can
remove this patch if you feel that strongly about it.

Ira

> 
> Thanks,
> 
>         tglx

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros
  2021-12-08  0:51     ` Ira Weiny
@ 2021-12-08 15:11       ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-12-08 15:11 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, Dan Williams, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin, Fenghua Yu,
	Rick Edgecombe, x86, linux-kernel, nvdimm, linux-mm

Ira,

On Tue, Dec 07 2021 at 16:51, Ira Weiny wrote:
> On Thu, Nov 25, 2021 at 03:25:09PM +0100, Thomas Gleixner wrote:
>
>  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
>  {
> -      int shift = pkey * PKR_BITS_PER_PKEY;
> +      int shift = PKR_PKEY_SHIFT(pkey);
>
> 	 if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
>		 accessbits &= PKEY_ACCESS_MASK;
>
> Better?

Let me postpone this question.

> As to the reason of why to put this patch after the other one.  Why would I
> improve the old pre-refactoring code only to throw it away when moving it to
> pkey_update_pkval()?  This reasoning is even stronger when pkey_update_pkval()
> is implemented.

Which refactoring? We seem to have fundamentally different definitions of that
term. Let me illustrate why.

The original version of this was:

  u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val)
  {
  	int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
  	u32 new_pkr_bits = 0;
  
  	/* Set the bits we need in the register:  */
  	if (init_val & PKEY_DISABLE_ACCESS)
  		new_pkr_bits |= PKR_AD_BIT;
  	if (init_val & PKEY_DISABLE_WRITE)
  		new_pkr_bits |= PKR_WD_BIT;
  
  	/* Shift the bits in to the correct place: */
  	new_pkr_bits <<= pkey_shift;
  
  	/* Mask off any old bits in place: */
  	old_pkr &= ~((PKR_AD_BIT | PKR_WD_BIT) << pkey_shift);
  
  	/* Return the old part along with the new part: */
  	return old_pkr | new_pkr_bits;
  }

IOW, mechanical Cut & Paste.

Then PeterZ came along and suggested improving it this way:

  u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
  {
	  int pkey_shift = pkey * PKR_BITS_PER_PKEY;

	  /*  Mask out old bit values */
	  pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);

	  /*  Or in new values */
	  if (flags & PKEY_DISABLE_ACCESS)
		  pk_reg |= PKR_AD_BIT << pkey_shift;
	  if (flags & PKEY_DISABLE_WRITE)
		  pk_reg |= PKR_WD_BIT << pkey_shift;

	  return pk_reg;
  }

which is already better. So you changed your approach from Cut & Paste
to Copy & Paste.

But neither Cut & Paste nor Copy & Paste match what refactoring is
really about. Just throwing the term refactoring at it does not make it
so.

Refactoring is about improving the code in design and implementation.
The keyword is: improving.

There are obviously cases where you can take the code as is and split it
out into a new helper function.

You really have to look at it and answer the question whether it's good
code or not, whether it could be written in better ways and with
improved functionality.

I could have given you this minimalistic one:

  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
  {
	  int shift = pkey * PKR_BITS_PER_PKEY;

	  pkval &= ~(PKEY_ACCESS_MASK << shift);
	  return pkval | (accessbits & PKEY_ACCESS_MASK) << shift;
  }

But I gave you this:

  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
  {
	  int shift = pkey * PKR_BITS_PER_PKEY;

	  if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
		  accessbits &= PKEY_ACCESS_MASK;

	  pkval &= ~(PKEY_ACCESS_MASK << shift);
	  return pkval | accessbits << shift;
  }

This is what refactoring is about. See?

> I agree with Dan regarding the macros though.  I think they make it easier to
> see what is going on without dealing with masks and shifts directly.  But I can
> remove this patch if you feel that strongly about it.

I'm not against macros per se, but not everything is automatically
better when it is hidden behind a macro.

What I'm arguing against is the claim that macros are an improvement by
definition. Especially when they are just blindly thrown into code which
should not exist in the first place.

Also, regarding ordering: what's wrong with doing it this way:

  1) Define the macros first without changing the code

  2) Implement pkey_update_pkval() in a sensible way and use the macros
     where appropriate. Thereby replacing the existing version in the
     other function.

Which would end up in the obviously even simpler code:

  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
  {
	  if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
		  accessbits &= PKEY_ACCESS_MASK;

	  pkval &= ~PKR_PKEY_VALUE(pkey, PKEY_ACCESS_MASK);
	  return pkval | PKR_PKEY_VALUE(pkey, accessbits);
  }
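
where PKR_PKEY_VALUE() would presumably be something along the lines of
(illustrative only, not quoted from the series):

	#define PKR_PKEY_VALUE(pkey, bits)	((bits) << PKR_PKEY_SHIFT(pkey))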

That fits the goal of that macro exercise to make it easy to read and
obvious what's going on, no?

Instead of:

>  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
>  {
> -      int shift = pkey * PKR_BITS_PER_PKEY;
> +      int shift = PKR_PKEY_SHIFT(pkey);
>
> 	 if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
>		 accessbits &= PKEY_ACCESS_MASK;
>
>	  pkval &= ~(PKEY_ACCESS_MASK << shift);
>	  return pkval | accessbits << shift;
>  }
>
> Better?

You surely can answer this question yourself, no?

  "By continuously improving the design of code, we make it easier and
   easier to work with. This is in sharp contrast to what typically
   happens: little refactoring and a great deal of attention paid to
   expediently adding new features. If you get into the hygienic habit
   of refactoring continuously, you'll find that it is easier to extend
   and maintain code." -- Joshua Kerievsky

If you study that quote carefully, you surely can find our diverging
approach to refactoring in it, no?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2021-12-08 15:11 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-04  4:32 [PATCH V7 00/18] PKS/PMEM: Add Stray Write Protection ira.weiny
2021-08-04  4:32 ` [PATCH V7 01/18] x86/pkeys: Create pkeys_common.h ira.weiny
2021-08-04  4:32 ` [PATCH V7 02/18] x86/fpu: Refactor arch_set_user_pkey_access() ira.weiny
2021-11-25 14:23   ` Thomas Gleixner
2021-08-04  4:32 ` [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros ira.weiny
2021-11-25 14:25   ` Thomas Gleixner
2021-11-25 16:58     ` Thomas Gleixner
2021-12-08  0:51     ` Ira Weiny
2021-12-08 15:11       ` Thomas Gleixner
2021-08-04  4:32 ` [PATCH V7 04/18] x86/pks: Add PKS defines and Kconfig options ira.weiny
2021-08-04  4:32 ` [PATCH V7 05/18] x86/pks: Add PKS setup code ira.weiny
2021-11-25 15:15   ` Thomas Gleixner
2021-11-26  3:11     ` taoyi.ty
2021-11-26  9:57       ` Thomas Gleixner
2021-11-26 11:03     ` Thomas Gleixner
2021-08-04  4:32 ` [PATCH V7 06/18] x86/fault: Adjust WARN_ON for PKey fault ira.weiny
2021-08-04  4:32 ` [PATCH V7 07/18] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
2021-11-25 15:25   ` Thomas Gleixner
2021-08-04  4:32 ` [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
2021-11-13  0:50   ` Ira Weiny
2021-11-25 11:19     ` Thomas Gleixner
2021-12-03  1:13     ` Andy Lutomirski
2021-11-25 14:12   ` Thomas Gleixner
2021-12-07  1:54     ` Ira Weiny
2021-12-07  4:45       ` Ira Weiny
2021-12-08  0:21       ` Thomas Gleixner
2021-08-04  4:32 ` [PATCH V7 09/18] x86/pks: Add PKS kernel API ira.weiny
2021-08-04  4:32 ` [PATCH V7 10/18] x86/pks: Introduce pks_abandon_protections() ira.weiny
2021-08-04  4:32 ` [PATCH V7 11/18] x86/pks: Add PKS Test code ira.weiny
2021-08-04  4:32 ` [PATCH V7 12/18] x86/pks: Add PKS fault callbacks ira.weiny
2021-08-11 21:18   ` Edgecombe, Rick P
2021-08-17  3:21     ` Ira Weiny
2021-08-04  4:32 ` [PATCH V7 13/18] memremap_pages: Add access protection via supervisor Protection Keys (PKS) ira.weiny
2021-08-04  4:32 ` [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode ira.weiny
2021-08-04  4:57   ` Randy Dunlap
2021-08-07 19:32     ` Ira Weiny
2021-08-11 19:01   ` Edgecombe, Rick P
2021-08-17  3:12     ` Ira Weiny
2021-08-04  4:32 ` [PATCH V7 15/18] kmap: Add stray access protection for devmap pages ira.weiny
2021-08-04  4:32 ` [PATCH V7 16/18] dax: Stray access protection for dax_direct_access() ira.weiny
2021-08-04  4:32 ` [PATCH V7 17/18] nvdimm/pmem: Enable stray access protection ira.weiny
2021-08-04  4:32 ` [PATCH V7 18/18] devdax: " ira.weiny
