Linux-kselftest Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3
@ 2020-10-09 19:42 ira.weiny
  2020-10-09 19:42 ` [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h ira.weiny
                   ` (9 more replies)
  0 siblings, 10 replies; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

This RFC series has been reviewed by Dave Hansen.

Introduce a new page protection mechanism for supervisor pages, Protection Key
Supervisor (PKS).

2 use cases for PKS are being developed, trusted keys and PMEM.  Trusted keys
is a newer use case which is still being explored.  PMEM was submitted as part
of the RFC (v2) series[1].  However, since then it was found that some callers
of kmap() require a global implementation of PKS.  Specifically some users of
kmap() expect mappings to be available to all kernel threads.  While global use
of PKS is rare it needs to be included for correctness.  Unfortunately the
kmap() updates required a large patch series to make the needed changes at the
various kmap() call sites so that patch set has been split out.  Because the
global PKS feature is only required for that use case it will be deferred to
that set as well.[2]  This patch set is being submitted as a precursor to both
of the use cases.

For an overview of the entire PKS ecosystem, a git tree including this series
and the 2 use cases can be found here:

	https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3


PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections.  PKS works in
a similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
checked in addition to normal paging protections and Access or Writes can be
disabled via a MSR update without TLB flushes when permissions change.  Also
like PKU, a page mapping is assigned to a domain by setting pkey bits in the
page table entry for that mapping.

Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.

XSAVE is not supported for the PKRS MSR.  Therefore the implementation
saves/restores the MSR across context switches and during exceptions.  Nested
exceptions are supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

Other keys, (1-15) are allocated by an allocator which prepares us for key
contention from day one.  Kernel users should be prepared for the allocator to
fail either because of key exhaustion or due to PKS not being supported on the
arch and/or CPU instance.

The following are key attributes of PKS.

   1) Fast switching of permissions
	1a) Prevents access without page table manipulations
	1b) No TLB flushes required
   2) Works on a per thread basis

PKS is available with 4 and 5 level paging.  Like PKRU it consumes 4 bits from
the PTE to store the pkey within the entry.


[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/
[2] https://github.com/weiny2/linux-kernel/commit/f10abb0f0d7b4e14f03fc8890313a5830cde1e49
	and a testing patch
    https://github.com/weiny2/linux-kernel/commit/2a8e0fc7654a7c69b243d628f63b01ff26a5a797


Fenghua Yu (3):
  x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  x86/pks: Enable Protection Keys Supervisor (PKS)
  x86/pks: Add PKS kernel API

Ira Weiny (6):
  x86/pkeys: Create pkeys_common.h
  x86/pks: Preserve the PKRS MSR on context switch
  x86/entry: Pass irqentry_state_t by reference
  x86/entry: Preserve PKRS MSR across exceptions
  x86/fault: Report the PKRS state on fault
  x86/pks: Add PKS test code

 Documentation/core-api/protection-keys.rst  | 102 ++-
 arch/x86/Kconfig                            |   1 +
 arch/x86/entry/common.c                     |  57 +-
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/idtentry.h             |  29 +-
 arch/x86/include/asm/msr-index.h            |   1 +
 arch/x86/include/asm/pgtable.h              |  13 +-
 arch/x86/include/asm/pgtable_types.h        |  12 +
 arch/x86/include/asm/pkeys.h                |  15 +
 arch/x86/include/asm/pkeys_common.h         |  36 +
 arch/x86/include/asm/processor.h            |  13 +
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/cpu/common.c                |  17 +
 arch/x86/kernel/cpu/mce/core.c              |   4 +
 arch/x86/kernel/fpu/xstate.c                |  22 +-
 arch/x86/kernel/kvm.c                       |   4 +-
 arch/x86/kernel/nmi.c                       |   7 +-
 arch/x86/kernel/process.c                   |  21 +
 arch/x86/kernel/traps.c                     |  21 +-
 arch/x86/mm/fault.c                         |  86 ++-
 arch/x86/mm/pkeys.c                         | 188 +++++-
 include/linux/entry-common.h                |  19 +-
 include/linux/pgtable.h                     |   4 +
 include/linux/pkeys.h                       |  23 +-
 kernel/entry/common.c                       |  28 +-
 lib/Kconfig.debug                           |  12 +
 lib/Makefile                                |   3 +
 lib/pks/Makefile                            |   3 +
 lib/pks/pks_test.c                          | 690 ++++++++++++++++++++
 mm/Kconfig                                  |   2 +
 tools/testing/selftests/x86/Makefile        |   3 +-
 tools/testing/selftests/x86/test_pks.c      |  65 ++
 32 files changed, 1376 insertions(+), 128 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
@ 2020-10-09 19:42 ` ira.weiny
  2020-10-13 17:46   ` Dave Hansen
  2020-10-09 19:42 ` [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support ira.weiny
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions and can share common defines.  Normally, these defines
would be put in asm/pkeys.h to be used internally and externally to the
arch code.  However, the defines are required in pgtable.h and inclusion
of pkeys.h in that header creates complex dependencies which are best
resolved in a separate header.

Share these defines by moving those them into a new header, change their
names to reflect the common use, and include the header where needed.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
NOTE: The initialization of init_pkru_value cause checkpatch errors
because of the space after the '(' in the macros.  We leave this as is
because it is more readable in this format.  And it was existing code.
---
 arch/x86/include/asm/pgtable.h      | 13 ++++++-------
 arch/x86/include/asm/pkeys.h        |  2 ++
 arch/x86/include/asm/pkeys_common.h | 11 +++++++++++
 arch/x86/kernel/fpu/xstate.c        |  8 ++++----
 arch/x86/mm/pkeys.c                 | 14 ++++++--------
 5 files changed, 29 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b836138ce852..2576154be6cf 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1361,9 +1361,7 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
-#define PKRU_AD_BIT 0x1
-#define PKRU_WD_BIT 0x2
-#define PKRU_BITS_PER_PKEY 2
+#include <asm/pkeys_common.h>
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 extern u32 init_pkru_value;
@@ -1373,18 +1371,19 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+	return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
 	/*
 	 * Access-disable disables writes too so we need to check
 	 * both bits here.
 	 */
-	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+	return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
 }
 
 static inline u16 pte_flags_pkey(unsigned long pte_flags)
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 2ff9b98812b7..f9feba80894b 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
+#include <asm/pkeys_common.h>
+
 #define ARCH_DEFAULT_PKEY	0
 
 /*
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
new file mode 100644
index 000000000000..a9f086f1e4b4
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_INTERNAL_H
+#define _ASM_X86_PKEYS_INTERNAL_H
+
+#define PKR_AD_BIT 0x1
+#define PKR_WD_BIT 0x2
+#define PKR_BITS_PER_PKEY 2
+
+#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 038e19c0019e..b55cf3acd82a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -991,7 +991,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val)
 {
 	u32 old_pkru;
-	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+	int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
 	u32 new_pkru_bits = 0;
 
 	/*
@@ -1010,16 +1010,16 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 
 	/* Set the bits we need in PKRU:  */
 	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKRU_AD_BIT;
+		new_pkru_bits |= PKR_AD_BIT;
 	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKRU_WD_BIT;
+		new_pkru_bits |= PKR_WD_BIT;
 
 	/* Shift the bits in to the correct place in PKRU for pkey: */
 	new_pkru_bits <<= pkey_shift;
 
 	/* Get old PKRU and mask off any old bits in place: */
 	old_pkru = read_pkru();
-	old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
 
 	/* Write old part along with new part: */
 	write_pkru(old_pkru | new_pkru_bits);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 8873ed1438a9..f5efb4007e74 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -111,19 +111,17 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey
 	return vma_pkey(vma);
 }
 
-#define PKRU_AD_KEY(pkey)	(PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY))
-
 /*
  * Make the default PKRU value (at execve() time) as restrictive
  * as possible.  This ensures that any threads clone()'d early
  * in the process's lifetime will not accidentally get access
  * to data which is pkey-protected later on.
  */
-u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) |
-		      PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) |
-		      PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) |
-		      PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) |
-		      PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15);
+u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) |
+		      PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) |
+		      PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) |
+		      PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) |
+		      PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15);
 
 /*
  * Called from the FPU code when creating a fresh set of FPU
@@ -173,7 +171,7 @@ static ssize_t init_pkru_write_file(struct file *file,
 	 * up immediately if someone attempts to disable access
 	 * or writes to pkey 0.
 	 */
-	if (new_init_pkru & (PKRU_AD_BIT|PKRU_WD_BIT))
+	if (new_init_pkru & (PKR_AD_BIT|PKR_WD_BIT))
 		return -EINVAL;
 
 	WRITE_ONCE(init_pkru_value, new_init_pkru);
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
  2020-10-09 19:42 ` [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h ira.weiny
@ 2020-10-09 19:42 ` ira.weiny
  2020-10-13 17:50   ` Dave Hansen
  2020-10-16 10:57   ` Peter Zijlstra
  2020-10-09 19:42 ` [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS) ira.weiny
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Fenghua Yu, Ira Weiny, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Fenghua Yu <fenghua.yu@intel.com>

Define a helper, update_pkey_val(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/pkeys.h |  2 ++
 arch/x86/kernel/fpu/xstate.c | 22 ++++------------------
 arch/x86/mm/pkeys.c          | 21 +++++++++++++++++++++
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index f9feba80894b..4526245b03e5 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -136,4 +136,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 	return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index b55cf3acd82a..725f10670d0a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -990,9 +990,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val)
 {
-	u32 old_pkru;
-	int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
-	u32 new_pkru_bits = 0;
+	u32 pkru;
 
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
@@ -1008,21 +1006,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	 */
 	WARN_ON_ONCE(pkey >= arch_max_pkey());
 
-	/* Set the bits we need in PKRU:  */
-	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKR_AD_BIT;
-	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKR_WD_BIT;
-
-	/* Shift the bits in to the correct place in PKRU for pkey: */
-	new_pkru_bits <<= pkey_shift;
-
-	/* Get old PKRU and mask off any old bits in place: */
-	old_pkru = read_pkru();
-	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
-	/* Write old part along with new part: */
-	write_pkru(old_pkru | new_pkru_bits);
+	pkru = read_pkru();
+	pkru = update_pkey_val(pkru, pkey, init_val);
+	write_pkru(pkru);
 
 	return 0;
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f5efb4007e74..3cf8f775f36d 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -208,3 +208,24 @@ static __init int setup_init_pkru(char *opt)
 	return 1;
 }
 __setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Update the pk_reg value and return it.
+ *
+ * Kernel users use the same flags as user space:
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ */
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
+{
+	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
+
+	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
+
+	if (flags & PKEY_DISABLE_ACCESS)
+		pk_reg |= PKR_AD_BIT << pkey_shift;
+	if (flags & PKEY_DISABLE_WRITE)
+		pk_reg |= PKR_WD_BIT << pkey_shift;
+
+	return pk_reg;
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS)
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
  2020-10-09 19:42 ` [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h ira.weiny
  2020-10-09 19:42 ` [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support ira.weiny
@ 2020-10-09 19:42 ` ira.weiny
  2020-10-13 18:23   ` Dave Hansen
  2020-10-09 19:42 ` [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Fenghua Yu, Dan Williams, Ira Weiny, x86, Dave Hansen,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Fenghua Yu <fenghua.yu@intel.com>

Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
specific, manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Kernel users can thus defines 'domains' of page mappings which have an
extra level of protection beyond those specified in the supervisor page
table entries.

Define ARCH_HAS_SUPERVISOR_PKEYS to distinguish this functionality from
the existing ARCH_HAS_PKEYS and then enable PKS when configured and
indicated by the CPU instance.  While not strictly necessary in this
patch, ARCH_HAS_SUPERVISOR_PKEYS separates this functionality through
the patch series so it is introduced here.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/Kconfig                            |  1 +
 arch/x86/include/asm/cpufeatures.h          |  1 +
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/cpu/common.c                | 15 +++++++++++++++
 mm/Kconfig                                  |  2 ++
 5 files changed, 21 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..1bfb912342a3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1873,6 +1873,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 	depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select ARCH_HAS_PKEYS
+	select ARCH_HAS_SUPERVISOR_PKEYS
 	help
 	  Memory Protection Keys provides a mechanism for enforcing
 	  page-based protections, but without requiring modification of the
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2901d5df4366..7646b03fc3f4 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -353,6 +353,7 @@
 #define X86_FEATURE_CLDEMOTE		(16*32+25) /* CLDEMOTE instruction */
 #define X86_FEATURE_MOVDIRI		(16*32+27) /* MOVDIRI instruction */
 #define X86_FEATURE_MOVDIR64B		(16*32+28) /* MOVDIR64B instruction */
+#define X86_FEATURE_PKS			(16*32+31) /* Protection Keys for Supervisor pages */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV	(17*32+ 0) /* MCA overflow recovery support */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT		24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS		_BITUL(X86_CR4_PKS_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c5d6f17d9b9d..a129d5e4afab 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1447,6 +1447,20 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c)
 #endif
 }
 
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ * Configure PKS if the cpu supports the feature.
+ */
+static void setup_pks(void)
+{
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
+		return;
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	cr4_set_bits(X86_CR4_PKS);
+}
+
 /*
  * This does the hard work of actually picking apart the CPU stuff...
  */
@@ -1544,6 +1558,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
 	x86_init_rdrand(c);
 	setup_pku(c);
+	setup_pks();
 
 	/*
 	 * Clear/Set all flags overridden by options, need do it
diff --git a/mm/Kconfig b/mm/Kconfig
index 6c974888f86f..1b9bc004d9bc 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -822,6 +822,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config ARCH_HAS_SUPERVISOR_PKEYS
+	bool
 
 config PERCPU_STATS
 	bool "Collect percpu memory statistics"
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
                   ` (2 preceding siblings ...)
  2020-10-09 19:42 ` [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS) ira.weiny
@ 2020-10-09 19:42 ` ira.weiny
  2020-10-13 18:31   ` Dave Hansen
  2020-10-16 11:06   ` Peter Zijlstra
  2020-10-09 19:42 ` [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API ira.weiny
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is defined as a per-logical-processor register.  This
isolates memory access by logical CPU.  Unfortunately, the MSR is not
managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
context switch.

Define a saved PKRS value in the task struct, as well as a cached
per-logical-processor MSR value which mirrors the MSR value of the
current CPU.  Initialize all tasks with the default MSR value.  Then, on
schedule in, check the saved task MSR vs the per-cpu value.  If
different proceed to write the MSR.  If not avoid the overhead of the
MSR write and continue.

Follow on patches will update the saved PKRS as well as the MSR if
needed.

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/include/asm/msr-index.h    |  1 +
 arch/x86/include/asm/pkeys_common.h | 20 ++++++++++++++++++++
 arch/x86/include/asm/processor.h    | 13 +++++++++++++
 arch/x86/kernel/cpu/common.c        |  2 ++
 arch/x86/kernel/process.c           | 21 +++++++++++++++++++++
 arch/x86/mm/pkeys.c                 | 28 ++++++++++++++++++++++++++++
 6 files changed, 85 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 2859ee4f39a8..e467e087f1b3 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -747,6 +747,7 @@
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 
+#define MSR_IA32_PKRS			0x000006E1
 
 #define MSR_TSX_FORCE_ABORT		0x0000010F
 
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index a9f086f1e4b4..05781be33c14 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -8,4 +8,24 @@
 
 #define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
 
+/*
+ * Define a default PKRS value for each task.
+ *
+ * Key 0 has no restriction.  All other keys are set to the most restrictive
+ * value which is access disabled (AD=1).
+ *
+ * NOTE: This needs to be a macro to be used as part of the INIT_THREAD macro.
+ */
+#define INIT_PKRS_VALUE (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \
+			 PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \
+			 PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \
+			 PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
+			 PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+void write_pkrs(u32 new_pkrs);
+#else
+static inline void write_pkrs(u32 new_pkrs) { }
+#endif
+
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 97143d87994c..da2381136b2d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -18,6 +18,7 @@ struct vm86;
 #include <asm/cpufeatures.h>
 #include <asm/page.h>
 #include <asm/pgtable_types.h>
+#include <asm/pkeys_common.h>
 #include <asm/percpu.h>
 #include <asm/msr.h>
 #include <asm/desc_defs.h>
@@ -542,6 +543,11 @@ struct thread_struct {
 
 	unsigned int		sig_on_uaccess_err:1;
 
+#ifdef	CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+	/* Saved Protection key register for supervisor mappings */
+	u32			saved_pkrs;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
@@ -840,8 +846,15 @@ static inline void spin_lock_prefetch(const void *x)
 #define STACK_TOP		TASK_SIZE_LOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#define INIT_THREAD_PKRS	.saved_pkrs = INIT_PKRS_VALUE
+#else
+#define INIT_THREAD_PKRS	0
+#endif
+
 #define INIT_THREAD  {						\
 	.addr_limit		= KERNEL_DS,			\
+	INIT_THREAD_PKRS,					\
 }
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a129d5e4afab..968863d59b6c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -57,6 +57,7 @@
 #include <asm/intel-family.h>
 #include <asm/cpu_device_id.h>
 #include <asm/uv/uv.h>
+#include <linux/pkeys.h>
 
 #include "cpu.h"
 
@@ -1458,6 +1459,7 @@ static void setup_pks(void)
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+	write_pkrs(INIT_PKRS_VALUE);
 	cr4_set_bits(X86_CR4_PKS);
 }
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ba4593a913fa..eb3a95a69392 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -43,6 +43,7 @@
 #include <asm/io_bitmap.h>
 #include <asm/proto.h>
 #include <asm/frame.h>
+#include <asm/pkeys_common.h>
 
 #include "process.h"
 
@@ -187,6 +188,22 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	return ret;
 }
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+DECLARE_PER_CPU(u32, pkrs_cache);
+static inline void pks_init_task(struct task_struct *tsk)
+{
+	/* New tasks get the most restrictive PKRS value */
+	tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
+}
+static inline void pks_sched_in(void)
+{
+	write_pkrs(current->thread.saved_pkrs);
+}
+#else
+static inline void pks_init_task(struct task_struct *tsk) { }
+static inline void pks_sched_in(void) { }
+#endif
+
 void flush_thread(void)
 {
 	struct task_struct *tsk = current;
@@ -195,6 +212,8 @@ void flush_thread(void)
 	memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
 
 	fpu__clear_all(&tsk->thread.fpu);
+
+	pks_init_task(tsk);
 }
 
 void disable_TSC(void)
@@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
 
 	if ((tifp ^ tifn) & _TIF_SLD)
 		switch_to_sld(tifn);
+
+	pks_sched_in();
 }
 
 /*
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 3cf8f775f36d..30f65dd3d0c5 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
 
 	return pk_reg;
 }
+
+DEFINE_PER_CPU(u32, pkrs_cache);
+
+/**
+ * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
+ * serializing but still maintains ordering properties similar to WRPKRU.
+ * The current SDM section on PKRS needs updating but should be the same as
+ * that of WRPKRU.  So to quote from the WRPKRU text:
+ *
+ * 	WRPKRU will never execute transiently. Memory accesses
+ * 	affected by PKRU register will not execute (even transiently)
+ * 	until all prior executions of WRPKRU have completed execution
+ * 	and updated the PKRU register.
+ */
+void write_pkrs(u32 new_pkrs)
+{
+	u32 *pkrs;
+
+	if (!static_cpu_has(X86_FEATURE_PKS))
+		return;
+
+	pkrs = get_cpu_ptr(&pkrs_cache);
+	if (*pkrs != new_pkrs) {
+		*pkrs = new_pkrs;
+		wrmsrl(MSR_IA32_PKRS, new_pkrs);
+	}
+	put_cpu_ptr(pkrs);
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
                   ` (3 preceding siblings ...)
  2020-10-09 19:42 ` [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
@ 2020-10-09 19:42 ` ira.weiny
  2020-10-13 18:43   ` Dave Hansen
  2020-10-16 11:07   ` Peter Zijlstra
  2020-10-09 19:42 ` [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference ira.weiny
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Fenghua Yu, Ira Weiny, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Fenghua Yu <fenghua.yu@intel.com>

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.

Add an API to allocate, use, and free a protection key which identifies
such a domain.  Export 5 new symbols pks_key_alloc(), pks_mknoaccess(),
pks_mkread(), pks_mkrdwr(), and pks_key_free().  Add 2 new macros;
PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).

Update the protection key documentation to cover pkeys on supervisor
pages.

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 Documentation/core-api/protection-keys.rst | 101 ++++++++++++++---
 arch/x86/include/asm/pgtable_types.h       |  12 ++
 arch/x86/include/asm/pkeys.h               |  11 ++
 arch/x86/include/asm/pkeys_common.h        |   4 +
 arch/x86/mm/pkeys.c                        | 125 +++++++++++++++++++++
 include/linux/pgtable.h                    |   4 +
 include/linux/pkeys.h                      |  22 ++++
 7 files changed, 261 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..00a046a913e4 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,33 @@
 Memory Protection Keys
 ======================
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
 Memory Protection Keys provides a mechanism for enforcing page-based
 protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+when an application changes protection domains.
+
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later.  And It will be available in future
+non-server Intel parts and future AMD processors.
+
+Future Intel processors will support Protection Keys for Supervisor pages
+(PKS).
+
+For anyone wishing to test or use user space pkeys, it is available in Amazon's
+EC2 C5 instances and is known to work there using an Ubuntu 17.04 image.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.  User and Supervisor pages are
+treated separately.
+
+Protections for each page are controlled with per CPU registers for each type
+of page User and Supervisor.  Each of these 32 bit register stores two separate
+bits (Access Disable and Write Disable) for each key.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+For Userspace the register is user-accessible (rdpkru/wrpkru).  For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being a CPU register, pkeys are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register.  The feature is only available in 64-bit mode,
@@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs.  These
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-========
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+============================
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +109,57 @@ with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==========================
+
+The following interface is used to allocate, use, and free a pkey which defines
+a 'protection domain' within the kernel.  Setting a pkey value in a supervisor
+mapping adds that mapping to the protection domain.
+
+        int pks_key_alloc(const char * const pkey_user);
+        #define PAGE_KERNEL_PKEY(pkey)
+        #define _PAGE_KEY(pkey)
+        void pks_mknoaccess(int pkey);
+        void pks_mkread(int pkey);
+        void pks_mkrdwr(int pkey);
+        void pks_key_free(int pkey);
+
+pks_key_alloc() allocates keys dynamically to allow better use of the limited
+key space.
+
+Callers of pks_key_alloc() _must_ be prepared for it to fail and take
+appropriate action.  This is due mainly to the fact that PKS may not be
+available on all arch's.  Failure to check the return of pks_key_alloc() and
+using any of the rest of the API is undefined.
+
+Kernel users must set the PTE permissions in the page table entries for the
+mappings they want to protect.  This can be done with PAGE_KERNEL_PKEY() or
+_PAGE_KEY().
+
+The pks_mk*() family of calls allows kernel users the ability to change the
+protections for the domain identified by the pkey specified.  3 states are
+available pks_mknoaccess(), pks_mkread(), and pks_mkrdwr() which set the access
+to none, read, and read/write respectively.
+
+Finally, pks_key_free() allows a user to return the key to the allocator for
+use by others.
+
+The interface maintains pks_mknoaccess() (Access Disabled (AD=1)) for all keys
+not currently allocated.  Therefore, the user can depend on access being
+disabled when pks_key_alloc() returns a key and the user should remove mappings
+from the domain (remove the pkey from the PTE) prior to calling pks_key_free().
+
+It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing
+but still maintains ordering properties similar to WRPKRU.  Thus it is safe to
+immediately use a mapping when the pks_mk*() functions returns.
+
+The current SDM section on PKRS needs updating but should be the same as that
+of WRPKRU.  So to quote from the WRPKRU text:
+
+	WRPKRU will never execute transiently. Memory accesses
+	affected by PKRU register will not execute (even transiently)
+	until all prior executions of WRPKRU have completed execution
+	and updated the PKRU register.
+
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 816b31c68550..c9fdfbdcbbfb 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -73,6 +73,12 @@
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0)
+#else
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
 #else
@@ -229,6 +235,12 @@ enum page_cache_mode {
 #define PAGE_KERNEL_IO		__pgprot_mask(__PAGE_KERNEL_IO)
 #define PAGE_KERNEL_IO_NOCACHE	__pgprot_mask(__PAGE_KERNEL_IO_NOCACHE)
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#define PAGE_KERNEL_PKEY(pkey)	__pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey))
+#else
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 #endif	/* __ASSEMBLY__ */
 
 /*         xwr */
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 4526245b03e5..79952216474e 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_PKEYS_H
 
 #include <asm/pkeys_common.h>
+#include <asm-generic/mman-common.h>
 
 #define ARCH_DEFAULT_PKEY	0
 
@@ -138,4 +139,14 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 
 u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+int pks_key_alloc(const char *const pkey_user);
+void pks_key_free(int pkey);
+
+void pks_mknoaccess(int pkey);
+void pks_mkread(int pkey);
+void pks_mkrdwr(int pkey);
+
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 05781be33c14..40464c170522 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -22,6 +22,10 @@
 			 PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
 			 PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
 
+/*  PKS supports 16 keys. Key 0 is reserved for the kernel. */
+#define        PKS_KERN_DEFAULT_KEY    0
+#define        PKS_NUM_KEYS            16
+
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
 void write_pkrs(u32 new_pkrs);
 #else
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 30f65dd3d0c5..1d9f451b4b78 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -3,6 +3,9 @@
  * Intel Memory Protection Keys management
  * Copyright (c) 2015, Intel Corporation.
  */
+#undef pr_fmt
+#define pr_fmt(fmt) "x86/pkeys: " fmt
+
 #include <linux/debugfs.h>		/* debugfs_create_u32()		*/
 #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
 #include <linux/pkeys.h>                /* PKEY_*                       */
@@ -229,6 +232,7 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
 
 	return pk_reg;
 }
+EXPORT_SYMBOL_GPL(update_pkey_val);
 
 DEFINE_PER_CPU(u32, pkrs_cache);
 
@@ -257,3 +261,124 @@ void write_pkrs(u32 new_pkrs)
 	}
 	put_cpu_ptr(pkrs);
 }
+EXPORT_SYMBOL_GPL(write_pkrs);
+
+/**
+ * Do not call this directly, see pks_mk*() below.
+ *
+ * @pkey: Key for the domain to change
+ * @protection: protection bits to be used
+ *
+ * Protection utilizes the same protection bits specified for User pkeys
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ *
+ */
+static inline void pks_update_protection(int pkey, unsigned long protection)
+{
+	current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
+						     pkey, protection);
+	preempt_disable();
+	write_pkrs(current->thread.saved_pkrs);
+	preempt_enable();
+}
+
+/**
+ * PKS access control functions
+ *
+ * Change the access of the domain specified by the pkey.  These are global
+ * updates.  They only affects the current running thread.  It is undefined and
+ * a bug for users to call this without having allocated a pkey and using it as
+ * pkey here.
+ *
+ * pks_mknoaccess()
+ *     Disable all access to the domain
+ * pks_mkread()
+ *     Make the domain Read only
+ * pks_mkrdwr()
+ *     Make the domain Read/Write
+ *
+ * @pkey the pkey for which the access should change.
+ *
+ */
+void pks_mknoaccess(int pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
+}
+EXPORT_SYMBOL_GPL(pks_mknoaccess);
+
+void pks_mkread(int pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_WRITE);
+}
+EXPORT_SYMBOL_GPL(pks_mkread);
+
+void pks_mkrdwr(int pkey)
+{
+	pks_update_protection(pkey, 0);
+}
+EXPORT_SYMBOL_GPL(pks_mkrdwr);
+
+static const char pks_key_user0[] = "kernel";
+
+/* Store names of allocated keys for debug.  Key 0 is reserved for the kernel.  */
+static const char *pks_key_users[PKS_NUM_KEYS] = {
+	pks_key_user0
+};
+
+/*
+ * Each key is represented by a bit.  Bit 0 is set for key 0 and reserved for
+ * its use.  We use ulong for the bit operations but only 16 bits are used.
+ */
+static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY;
+
+/*
+ * pks_key_alloc - Allocate a PKS key
+ *
+ * @pkey_user: String stored for debugging of key exhaustion.  The caller is
+ * responsible to maintain this memory until pks_key_free().
+ */
+int pks_key_alloc(const char * const pkey_user)
+{
+	int nr;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return -EINVAL;
+
+	while (1) {
+		nr = find_first_zero_bit(&pks_key_allocation_map, PKS_NUM_KEYS);
+		if (nr >= PKS_NUM_KEYS) {
+			pr_info("Cannot allocate supervisor key for %s.\n",
+				pkey_user);
+			return -ENOSPC;
+		}
+		if (!test_and_set_bit_lock(nr, &pks_key_allocation_map))
+			break;
+	}
+
+	/* for debugging key exhaustion */
+	pks_key_users[nr] = pkey_user;
+
+	return nr;
+}
+EXPORT_SYMBOL_GPL(pks_key_alloc);
+
+/*
+ * pks_key_free - Free a previously allocate PKS key
+ *
+ * @pkey: Key to be free'ed
+ */
+void pks_key_free(int pkey)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	if (pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY)
+		return;
+
+	/* Restore to default of no access */
+	pks_mknoaccess(pkey);
+	pks_key_users[pkey] = NULL;
+	__clear_bit(pkey, &pks_key_allocation_map);
+}
+EXPORT_SYMBOL_GPL(pks_key_free);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 90654cb63e9e..6900182d53ee 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1374,6 +1374,10 @@ static inline bool arch_has_pfn_modify_check(void)
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif
 
+#ifndef PAGE_KERNEL_PKEY
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 /*
  * Page Table Modification bits for pgtbl_mod_mask.
  *
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 2955ba976048..cc3510cde64e 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -50,4 +50,26 @@ static inline void copy_init_pkru_to_fpregs(void)
 
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
+#ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+static inline int pks_key_alloc(const char * const pkey_user)
+{
+	return -EINVAL;
+}
+static inline void pks_key_free(int pkey)
+{
+}
+static inline void pks_mknoaccess(int pkey)
+{
+	WARN_ON_ONCE(1);
+}
+static inline void pks_mkread(int pkey)
+{
+	WARN_ON_ONCE(1);
+}
+static inline void pks_mkrdwr(int pkey)
+{
+	WARN_ON_ONCE(1);
+}
+#endif /* ! CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #endif /* _LINUX_PKEYS_H */
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
                   ` (4 preceding siblings ...)
  2020-10-09 19:42 ` [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API ira.weiny
@ 2020-10-09 19:42 ` ira.weiny
  2020-10-16 11:45   ` Peter Zijlstra
  2020-10-09 19:42 ` [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

In preparation for adding PKS information to struct irqentry_state_t
change all call sites and usages to pass the struct by reference
instead of by value.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/entry/common.c         | 16 +++++++---------
 arch/x86/include/asm/idtentry.h | 29 +++++++++++++++++------------
 arch/x86/kernel/kvm.c           |  4 ++--
 arch/x86/kernel/nmi.c           |  7 ++++---
 arch/x86/kernel/traps.c         | 21 +++++++++++++--------
 arch/x86/mm/fault.c             |  4 ++--
 include/linux/entry-common.h    |  7 ++++---
 kernel/entry/common.c           | 20 ++++++++------------
 8 files changed, 57 insertions(+), 51 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..305da13770b6 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,9 +209,9 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
-noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
+noinstr void idtentry_enter_nmi(struct pt_regs *regs, irqentry_state_t *irq_state)
 {
-	bool irq_state = lockdep_hardirqs_enabled();
+	irq_state->exit_rcu = lockdep_hardirqs_enabled();
 
 	__nmi_enter();
 	lockdep_hardirqs_off(CALLER_ADDR0);
@@ -222,15 +222,13 @@ noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
 	trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
 	instrumentation_end();
-
-	return irq_state;
 }
 
-noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
+noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state)
 {
 	instrumentation_begin();
 	ftrace_nmi_exit();
-	if (restore) {
+	if (irq_state->exit_rcu) {
 		trace_hardirqs_on_prepare();
 		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	}
@@ -238,7 +236,7 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
 
 	rcu_nmi_exit();
 	lockdep_hardirq_exit();
-	if (restore)
+	if (irq_state->exit_rcu)
 		lockdep_hardirqs_on(CALLER_ADDR0);
 	__nmi_exit();
 }
@@ -295,7 +293,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 	bool inhcall;
 	irqentry_state_t state;
 
-	state = irqentry_enter(regs);
+	irqentry_enter(regs, &state);
 	old_regs = set_irq_regs(regs);
 
 	instrumentation_begin();
@@ -311,7 +309,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 		instrumentation_end();
 		restore_inhcall(inhcall);
 	} else {
-		irqentry_exit(regs, state);
+		irqentry_exit(regs, &state);
 	}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index a0638640f1ed..622889ba21d0 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,8 +11,8 @@
 
 #include <asm/irq_stack.h>
 
-bool idtentry_enter_nmi(struct pt_regs *regs);
-void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
+void idtentry_enter_nmi(struct pt_regs *regs, irqentry_state_t *irq_state);
+void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state);
 
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
@@ -52,12 +52,13 @@ static __always_inline void __##func(struct pt_regs *regs);		\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	irqentry_enter(regs, &state);					\
 	instrumentation_begin();					\
 	__##func (regs);						\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	irqentry_exit(regs, &state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -99,12 +100,13 @@ static __always_inline void __##func(struct pt_regs *regs,		\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	irqentry_enter(regs, &state);					\
 	instrumentation_begin();					\
 	__##func (regs, error_code);					\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	irqentry_exit(regs, &state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs,		\
@@ -195,15 +197,16 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector);	\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	irqentry_enter(regs, &state);					\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
 	__##func (regs, (u8)error_code);				\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	irqentry_exit(regs, &state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -237,15 +240,16 @@ static void __##func(struct pt_regs *regs);				\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	irqentry_enter(regs, &state);					\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
 	run_sysvec_on_irqstack_cond(__##func, regs);			\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	irqentry_exit(regs, &state);					\
 }									\
 									\
 static noinline void __##func(struct pt_regs *regs)
@@ -266,15 +270,16 @@ static __always_inline void __##func(struct pt_regs *regs);		\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	irqentry_enter(regs, &state);					\
 	instrumentation_begin();					\
 	__irq_enter_raw();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
 	__##func (regs);						\
 	__irq_exit_raw();						\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	irqentry_exit(regs, &state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 9663ba31347c..c6be0a54236f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -241,7 +241,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 	if (!flags)
 		return false;
 
-	state = irqentry_enter(regs);
+	irqentry_enter(regs, &state);
 	instrumentation_begin();
 
 	/*
@@ -262,7 +262,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 	}
 
 	instrumentation_end();
-	irqentry_exit(regs, state);
+	irqentry_exit(regs, &state);
 	return true;
 }
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4fc9954a9560..68c07cad0150 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -33,6 +33,7 @@
 #include <asm/reboot.h>
 #include <asm/cache.h>
 #include <asm/nospec-branch.h>
+#include <asm/idtentry.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/nmi.h>
@@ -475,7 +476,7 @@ static DEFINE_PER_CPU(unsigned long, nmi_dr7);
 
 DEFINE_IDTENTRY_RAW(exc_nmi)
 {
-	bool irq_state;
+	irqentry_state_t irq_state = { };
 
 	if (IS_ENABLED(CONFIG_SMP) && arch_cpu_is_offline(smp_processor_id()))
 		return;
@@ -490,14 +491,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
 	this_cpu_write(nmi_dr7, local_db_save());
 
-	irq_state = idtentry_enter_nmi(regs);
+	idtentry_enter_nmi(regs, &irq_state);
 
 	inc_irq_stat(__nmi_count);
 
 	if (!ignore_nmis)
 		default_do_nmi(regs);
 
-	idtentry_exit_nmi(regs, irq_state);
+	idtentry_exit_nmi(regs, &irq_state);
 
 	local_db_restore(this_cpu_read(nmi_dr7));
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 81a2fb711091..daf7bc02fc99 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -254,11 +254,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
 	if (!user_mode(regs) && handle_bug(regs))
 		return;
 
-	state = irqentry_enter(regs);
+	irqentry_enter(regs, &state);
 	instrumentation_begin();
 	handle_invalid_op(regs);
 	instrumentation_end();
-	irqentry_exit(regs, state);
+	irqentry_exit(regs, &state);
 }
 
 DEFINE_IDTENTRY(exc_coproc_segment_overrun)
@@ -342,6 +342,7 @@ __visible void __noreturn handle_stack_overflow(const char *message,
  */
 DEFINE_IDTENTRY_DF(exc_double_fault)
 {
+	irqentry_state_t irq_state;
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
 
@@ -404,7 +405,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 	}
 #endif
 
-	idtentry_enter_nmi(regs);
+	idtentry_enter_nmi(regs, &irq_state);
 	instrumentation_begin();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
@@ -650,12 +651,15 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		instrumentation_end();
 		irqentry_exit_to_user_mode(regs);
 	} else {
-		bool irq_state = idtentry_enter_nmi(regs);
+		irqentry_state_t irq_state;
+
+		idtentry_enter_nmi(regs, &irq_state);
+
 		instrumentation_begin();
 		if (!do_int3(regs))
 			die("int3", regs, 0);
 		instrumentation_end();
-		idtentry_exit_nmi(regs, irq_state);
+		idtentry_exit_nmi(regs, &irq_state);
 	}
 }
 
@@ -861,7 +865,9 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 	 * includes the entry stack is excluded for everything.
 	 */
 	unsigned long dr7 = local_db_save();
-	bool irq_state = idtentry_enter_nmi(regs);
+	irqentry_state_t irq_state;
+
+	idtentry_enter_nmi(regs, &irq_state);
 	instrumentation_begin();
 
 	/*
@@ -880,8 +886,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 	handle_debug(regs, dr6, false);
 
 	instrumentation_end();
-	idtentry_exit_nmi(regs, irq_state);
-
+	idtentry_exit_nmi(regs, &irq_state);
 	local_db_restore(dr7);
 }
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6e3e8a124903..e55bc4bff389 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1476,11 +1476,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 	 * code reenabled RCU to avoid subsequent wreckage which helps
 	 * debugability.
 	 */
-	state = irqentry_enter(regs);
+	irqentry_enter(regs, &state);
 
 	instrumentation_begin();
 	handle_page_fault(regs, error_code, address);
 	instrumentation_end();
 
-	irqentry_exit(regs, state);
+	irqentry_exit(regs, &state);
 }
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 159c7476b11b..de4f24c554ee 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -349,6 +349,7 @@ typedef struct irqentry_state {
 /**
  * irqentry_enter - Handle state tracking on ordinary interrupt entries
  * @regs:	Pointer to pt_regs of interrupted context
+ * @state:	Pointer to an object to store the irq state
  *
  * Invokes:
  *  - lockdep irqflag state tracking as low level ASM entry disabled
@@ -377,7 +378,7 @@ typedef struct irqentry_state {
  *
  * Returns: An opaque object that must be passed to idtentry_exit()
  */
-irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
+void noinstr irqentry_enter(struct pt_regs *regs, irqentry_state_t *state);
 
 /**
  * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
@@ -389,7 +390,7 @@ void irqentry_exit_cond_resched(void);
 /**
  * irqentry_exit - Handle return from exception that used irqentry_enter()
  * @regs:	Pointer to pt_regs (exception entry regs)
- * @state:	Return value from matching call to irqentry_enter()
+ * @state:	Reference to the value saved in irqentry_enter()
  *
  * Depending on the return target (kernel/user) this runs the necessary
  * preemption and work checks if possible and reguired and returns to
@@ -400,6 +401,6 @@ void irqentry_exit_cond_resched(void);
  *
  * Counterpart to irqentry_enter().
  */
-void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
+void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t *state);
 
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6fdb6105e6d6..21601993ad1b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -282,15 +282,13 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 	exit_to_user_mode();
 }
 
-noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
+noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
 {
-	irqentry_state_t ret = {
-		.exit_rcu = false,
-	};
+	state->exit_rcu = false;
 
 	if (user_mode(regs)) {
 		irqentry_enter_from_user_mode(regs);
-		return ret;
+		return;
 	}
 
 	/*
@@ -328,8 +326,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 		trace_hardirqs_off_finish();
 		instrumentation_end();
 
-		ret.exit_rcu = true;
-		return ret;
+		state->exit_rcu = true;
+		return;
 	}
 
 	/*
@@ -343,8 +341,6 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	/* Use the combo lockdep/tracing function */
 	trace_hardirqs_off();
 	instrumentation_end();
-
-	return ret;
 }
 
 void irqentry_exit_cond_resched(void)
@@ -359,7 +355,7 @@ void irqentry_exit_cond_resched(void)
 	}
 }
 
-noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
+noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *state)
 {
 	lockdep_assert_irqs_disabled();
 
@@ -372,7 +368,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		 * carefully and needs the same ordering of lockdep/tracing
 		 * and RCU as the return to user mode path.
 		 */
-		if (state.exit_rcu) {
+		if (state->exit_rcu) {
 			instrumentation_begin();
 			/* Tell the tracer that IRET will enable interrupts */
 			trace_hardirqs_on_prepare();
@@ -394,7 +390,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		 * IRQ flags state is correct already. Just tell RCU if it
 		 * was not watching on entry.
 		 */
-		if (state.exit_rcu)
+		if (state->exit_rcu)
 			rcu_irq_exit();
 	}
 }
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
                   ` (5 preceding siblings ...)
  2020-10-09 19:42 ` [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference ira.weiny
@ 2020-10-09 19:42 ` ira.weiny
  2020-10-13 18:52   ` Dave Hansen
  2020-10-09 19:42 ` [PATCH RFC V3 8/9] x86/fault: Report the PKRS state on fault ira.weiny
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, Dave Hansen, x86, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is not managed by XSAVE.  It is preserved through a context
switch but this support leaves exception handling code open to memory
accesses during exceptions.

2 possible places for preserving this state were considered,
irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]
irqentry_state_t was already an object being used in the exception
handling and is straightforward.  It is also easy for any number of
nested states to be tracked and eventually can be enhanced to store the
reference counting required to support PKS through kmap reentry

Preserve the current task's PKRS values in irqentry_state_t on exception
entry and restoring them on exception exit.

Each nested exception is further saved allowing for any number of levels
of exception handling.

Peter and Thomas both suggested parts of the patch, IDT and NMI respectively.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/entry/common.c             | 43 +++++++++++++++++++++++++++++
 arch/x86/include/asm/pkeys_common.h |  5 ++--
 arch/x86/kernel/cpu/mce/core.c      |  4 +++
 arch/x86/mm/pkeys.c                 |  2 +-
 include/linux/entry-common.h        | 12 ++++++++
 kernel/entry/common.c               | 12 ++++++--
 6 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 305da13770b6..324a8fd5ac10 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
 #include <linux/nospec.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
+#include <linux/pkeys.h>
 
 #ifdef CONFIG_XEN_PV
 #include <xen/xen-ops.h>
@@ -222,6 +223,8 @@ noinstr void idtentry_enter_nmi(struct pt_regs *regs, irqentry_state_t *irq_stat
 	trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
 	instrumentation_end();
+
+	irq_save_pkrs(irq_state);
 }
 
 noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state)
@@ -238,9 +241,47 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state
 	lockdep_hardirq_exit();
 	if (irq_state->exit_rcu)
 		lockdep_hardirqs_on(CALLER_ADDR0);
+
+	irq_restore_pkrs(irq_state);
 	__nmi_exit();
 }
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not maintained with XSAVE so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory we save the
+ * current running value and set the default PKRS value for the duration of the
+ * exception.  Thus preventing exception handlers from having the elevated
+ * access of the interrupted task.
+ */
+noinstr void irq_save_pkrs(irqentry_state_t *state)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	state->thread_pkrs = current->thread.saved_pkrs;
+	state->pkrs = this_cpu_read(pkrs_cache);
+	write_pkrs(INIT_PKRS_VALUE);
+}
+
+noinstr void irq_restore_pkrs(irqentry_state_t *state)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	write_pkrs(state->pkrs);
+	current->thread.saved_pkrs = state->thread_pkrs;
+}
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -304,6 +345,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+		/* Normally called by irqentry_exit, we must restore pkrs here */
+		irq_restore_pkrs(&state);
 		instrumentation_begin();
 		irqentry_exit_cond_resched();
 		instrumentation_end();
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 40464c170522..8961e2ddd6ff 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -27,9 +27,10 @@
 #define        PKS_NUM_KEYS            16
 
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 new_pkrs);
+DECLARE_PER_CPU(u32, pkrs_cache);
+noinstr void write_pkrs(u32 new_pkrs);
 #else
-static inline void write_pkrs(u32 new_pkrs) { }
+static __always_inline void write_pkrs(u32 new_pkrs) { }
 #endif
 
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f43a78bde670..abcd41f19669 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1904,6 +1904,8 @@ void (*machine_check_vector)(struct pt_regs *) = unexpected_machine_check;
 
 static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 {
+	irqentry_state_t irq_state = { };
+
 	WARN_ON_ONCE(user_mode(regs));
 
 	/*
@@ -1915,6 +1917,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 		return;
 
 	nmi_enter();
+	irq_save_pkrs(&irq_state);
 	/*
 	 * The call targets are marked noinstr, but objtool can't figure
 	 * that out because it's an indirect call. Annotate it.
@@ -1925,6 +1928,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 	if (regs->flags & X86_EFLAGS_IF)
 		trace_hardirqs_on_prepare();
 	instrumentation_end();
+	irq_restore_pkrs(&irq_state);
 	nmi_exit();
 }
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 1d9f451b4b78..2431c68ef752 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -247,7 +247,7 @@ DEFINE_PER_CPU(u32, pkrs_cache);
  * 	until all prior executions of WRPKRU have completed execution
  * 	and updated the PKRU register.
  */
-void write_pkrs(u32 new_pkrs)
+noinstr void write_pkrs(u32 new_pkrs)
 {
 	u32 *pkrs;
 
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index de4f24c554ee..c3b361ffa059 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -342,10 +342,22 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
 
 #ifndef irqentry_state
 typedef struct irqentry_state {
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+	u32 pkrs;
+	u32 thread_pkrs;
+#endif
 	bool	exit_rcu;
 } irqentry_state_t;
 #endif
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+noinstr void irq_save_pkrs(irqentry_state_t *state);
+noinstr void irq_restore_pkrs(irqentry_state_t *state);
+#else
+static __always_inline void irq_save_pkrs(irqentry_state_t *state) { }
+static __always_inline void irq_restore_pkrs(irqentry_state_t *state) { }
+#endif
+
 /**
  * irqentry_enter - Handle state tracking on ordinary interrupt entries
  * @regs:	Pointer to pt_regs of interrupted context
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 21601993ad1b..b6fb3f580673 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -327,7 +327,7 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
 		instrumentation_end();
 
 		state->exit_rcu = true;
-		return;
+		goto done;
 	}
 
 	/*
@@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
 	/* Use the combo lockdep/tracing function */
 	trace_hardirqs_off();
 	instrumentation_end();
+
+done:
+	irq_save_pkrs(state);
 }
 
 void irqentry_exit_cond_resched(void)
@@ -362,7 +365,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *state)
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
 		irqentry_exit_to_user_mode(regs);
-	} else if (!regs_irqs_disabled(regs)) {
+		return;
+	}
+
+	irq_restore_pkrs(state);
+
+	if (!regs_irqs_disabled(regs)) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
 		 * carefully and needs the same ordering of lockdep/tracing
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC V3 8/9] x86/fault: Report the PKRS state on fault
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
                   ` (6 preceding siblings ...)
  2020-10-09 19:42 ` [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
@ 2020-10-09 19:42 ` ira.weiny
  2020-10-13 18:56   ` Dave Hansen
  2020-10-09 19:42 ` [PATCH RFC V3 9/9] x86/pks: Add PKS test code ira.weiny
  2020-10-09 20:18 ` [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 Ira Weiny
  9 siblings, 1 reply; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

When only user space pkeys are enabled faulting within the kernel was an
unexpected condition which should never happen, therefore a WARN_ON was
added to the kernel fault handler to detect if it ever did.  Now that
PKS can be enabled this is no longer the case.

Report a Pkey fault with a normal splat and add the PKRS state to the
fault splat text.  Note the PKS register is reset during an exception
therefore the saved PKRS value from before the beginning of the
exception is passed down.

If PKS is not enabled, or not active, maintain the WARN_ON_ONCE() from
before.

Because each fault has its own state the pkrs information will be
correctly reported even if a fault 'faults'.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/mm/fault.c | 59 ++++++++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e55bc4bff389..ee761c993f58 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -504,7 +504,8 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index)
 }
 
 static void
-show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+		irqentry_state_t *irq_state)
 {
 	if (!oops_may_print())
 		return;
@@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 		 (error_code & X86_PF_PK)    ? "protection keys violation" :
 					       "permissions violation");
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+	if (irq_state && (error_code & X86_PF_PK))
+		pr_alert("PKRS: 0x%x\n", irq_state->pkrs);
+#endif
+
 	if (!(error_code & X86_PF_USER) && user_mode(regs)) {
 		struct desc_ptr idt, gdt;
 		u16 ldtr, tr;
@@ -626,7 +632,8 @@ static void set_signal_archinfo(unsigned long address,
 
 static noinline void
 no_context(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int signal, int si_code)
+	   unsigned long address, int signal, int si_code,
+	   irqentry_state_t *irq_state)
 {
 	struct task_struct *tsk = current;
 	unsigned long flags;
@@ -732,7 +739,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	 */
 	flags = oops_begin();
 
-	show_fault_oops(regs, error_code, address);
+	show_fault_oops(regs, error_code, address, irq_state);
 
 	if (task_stack_end_corrupted(tsk))
 		printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
@@ -785,7 +792,8 @@ static bool is_vsyscall_vaddr(unsigned long vaddr)
 
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		       unsigned long address, u32 pkey, int si_code)
+		       unsigned long address, u32 pkey, int si_code,
+		       irqentry_state_t *state)
 {
 	struct task_struct *tsk = current;
 
@@ -832,14 +840,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	if (is_f00f_bug(regs, address))
 		return;
 
-	no_context(regs, error_code, address, SIGSEGV, si_code);
+	no_context(regs, error_code, address, SIGSEGV, si_code, state);
 }
 
 static noinline void
 bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		     unsigned long address)
+		     unsigned long address, irqentry_state_t *state)
 {
-	__bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR);
+	__bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR, state);
 }
 
 static void
@@ -853,7 +861,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code,
 	 */
 	mmap_read_unlock(mm);
 
-	__bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
+	__bad_area_nosemaphore(regs, error_code, address, pkey, si_code, NULL);
 }
 
 static noinline void
@@ -923,7 +931,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 {
 	/* Kernel mode? Handle exceptions or die: */
 	if (!(error_code & X86_PF_USER)) {
-		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, NULL);
 		return;
 	}
 
@@ -957,7 +965,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, vm_fault_t fault)
 {
 	if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
-		no_context(regs, error_code, address, 0, 0);
+		no_context(regs, error_code, address, 0, 0, NULL);
 		return;
 	}
 
@@ -965,7 +973,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		/* Kernel mode? Handle exceptions or die: */
 		if (!(error_code & X86_PF_USER)) {
 			no_context(regs, error_code, address,
-				   SIGSEGV, SEGV_MAPERR);
+				   SIGSEGV, SEGV_MAPERR, NULL);
 			return;
 		}
 
@@ -980,7 +988,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 			     VM_FAULT_HWPOISON_LARGE))
 			do_sigbus(regs, error_code, address, fault);
 		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
+			bad_area_nosemaphore(regs, error_code, address, NULL);
 		else
 			BUG();
 	}
@@ -1148,14 +1156,15 @@ static int fault_in_kernel_space(unsigned long address)
  */
 static void
 do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
-		   unsigned long address)
+		   unsigned long address, irqentry_state_t *irq_state)
 {
 	/*
-	 * Protection keys exceptions only happen on user pages.  We
-	 * have no user pages in the kernel portion of the address
-	 * space, so do not expect them here.
+	 * If protection keys are not enabled for kernel space
+	 * do not expect Pkey errors here.
 	 */
-	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS) ||
+	    !cpu_feature_enabled(X86_FEATURE_PKS))
+		WARN_ON_ONCE(hw_error_code & X86_PF_PK);
 
 #ifdef CONFIG_X86_32
 	/*
@@ -1204,7 +1213,7 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	 * Don't take the mm semaphore here. If we fixup a prefetch
 	 * fault we could otherwise deadlock:
 	 */
-	bad_area_nosemaphore(regs, hw_error_code, address);
+	bad_area_nosemaphore(regs, hw_error_code, address, irq_state);
 }
 NOKPROBE_SYMBOL(do_kern_addr_fault);
 
@@ -1245,7 +1254,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 		     !(hw_error_code & X86_PF_USER) &&
 		     !(regs->flags & X86_EFLAGS_AC)))
 	{
-		bad_area_nosemaphore(regs, hw_error_code, address);
+		bad_area_nosemaphore(regs, hw_error_code, address, NULL);
 		return;
 	}
 
@@ -1254,7 +1263,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * in a region with pagefaults disabled then we must not take the fault
 	 */
 	if (unlikely(faulthandler_disabled() || !mm)) {
-		bad_area_nosemaphore(regs, hw_error_code, address);
+		bad_area_nosemaphore(regs, hw_error_code, address, NULL);
 		return;
 	}
 
@@ -1316,7 +1325,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 			 * Fault from code in kernel from
 			 * which we do not expect faults.
 			 */
-			bad_area_nosemaphore(regs, hw_error_code, address);
+			bad_area_nosemaphore(regs, hw_error_code, address, NULL);
 			return;
 		}
 retry:
@@ -1375,7 +1384,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (fault_signal_pending(fault, regs)) {
 		if (!user_mode(regs))
 			no_context(regs, hw_error_code, address, SIGBUS,
-				   BUS_ADRERR);
+				   BUS_ADRERR, NULL);
 		return;
 	}
 
@@ -1415,7 +1424,7 @@ trace_page_fault_entries(struct pt_regs *regs, unsigned long error_code,
 
 static __always_inline void
 handle_page_fault(struct pt_regs *regs, unsigned long error_code,
-			      unsigned long address)
+		  unsigned long address, irqentry_state_t *irq_state)
 {
 	trace_page_fault_entries(regs, error_code, address);
 
@@ -1424,7 +1433,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 
 	/* Was the fault on kernel-controlled part of the address space? */
 	if (unlikely(fault_in_kernel_space(address))) {
-		do_kern_addr_fault(regs, error_code, address);
+		do_kern_addr_fault(regs, error_code, address, irq_state);
 	} else {
 		do_user_addr_fault(regs, error_code, address);
 		/*
@@ -1479,7 +1488,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 	irqentry_enter(regs, &state);
 
 	instrumentation_begin();
-	handle_page_fault(regs, error_code, address);
+	handle_page_fault(regs, error_code, address, &state);
 	instrumentation_end();
 
 	irqentry_exit(regs, &state);
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC V3 9/9] x86/pks: Add PKS test code
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
                   ` (7 preceding siblings ...)
  2020-10-09 19:42 ` [PATCH RFC V3 8/9] x86/fault: Report the PKRS state on fault ira.weiny
@ 2020-10-09 19:42 ` ira.weiny
  2020-10-13 19:02   ` Dave Hansen
  2020-10-09 20:18 ` [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 Ira Weiny
  9 siblings, 1 reply; 47+ messages in thread
From: ira.weiny @ 2020-10-09 19:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

The core PKS functionality provides an interface for kernel users to
reserve keys to their domains set up the page tables with those keys and
control access to those domains when needed.

Define test code which exercises the core functionality of PKS via a
debugfs entry.  Basic checks can be triggered on boot with a kernel
command line option while both basic and preemption checks can be
triggered with separate debugfs values.

debugfs controls are:

'0' -- Run access tests with a single pkey
'1' -- Set up the pkey register with no access for the pkey allocated to
       this fd
'2' -- Check that the pkey register updated in '1' is still the same.
       (To be used after a forced context switch.)
'3' -- Allocate all pkeys possible and run tests on each pkey allocated.
       DEFAULT when run at boot.

Closing the fd will cleanup and release the pkey, therefore to exercise
context switch testing a user space program is provided in:

	.../tools/testing/selftests/x86/test_pks.c

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/core-api/protection-keys.rst |   1 +
 arch/x86/mm/fault.c                        |  23 +
 include/linux/pkeys.h                      |   1 -
 lib/Kconfig.debug                          |  12 +
 lib/Makefile                               |   3 +
 lib/pks/Makefile                           |   3 +
 lib/pks/pks_test.c                         | 690 +++++++++++++++++++++
 tools/testing/selftests/x86/Makefile       |   3 +-
 tools/testing/selftests/x86/test_pks.c     |  65 ++
 9 files changed, 799 insertions(+), 2 deletions(-)
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 00a046a913e4..c60366921d60 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -163,3 +163,4 @@ of WRPKRU.  So to quote from the WRPKRU text:
 	until all prior executions of WRPKRU have completed execution
 	and updated the PKRU register.
 
+Example code can be found in lib/pks/pks_test.c
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ee761c993f58..dd5af9399131 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/pkeys.h>
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1149,6 +1150,25 @@ static int fault_in_kernel_space(unsigned long address)
 	return address >= TASK_SIZE_MAX;
 }
 
+#ifdef CONFIG_PKS_TESTING
+bool pks_test_callback(irqentry_state_t *irq_state);
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state)
+{
+	/*
+	 * If we get a protection key exception it could be because we
+	 * are running the PKS test.  If so, pks_test_callback() will
+	 * clear the protection mechanism and return true to indicate
+	 * the fault was handled.
+	 */
+	return (hw_error_code & X86_PF_PK) && pks_test_callback(irq_state);
+}
+#else
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state)
+{
+	return false;
+}
+#endif
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1166,6 +1186,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	    !cpu_feature_enabled(X86_FEATURE_PKS))
 		WARN_ON_ONCE(hw_error_code & X86_PF_PK);
 
+	if (handle_pks_testing(hw_error_code, irq_state))
+		return;
+
 #ifdef CONFIG_X86_32
 	/*
 	 * We can fault-in kernel-space virtual memory on-demand. The
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index cc3510cde64e..f9552bd9341f 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -47,7 +47,6 @@ static inline bool arch_pkeys_enabled(void)
 static inline void copy_init_pkru_to_fpregs(void)
 {
 }
-
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 #ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 0c781f912f9f..f015c09ba5a1 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2400,6 +2400,18 @@ config HYPERV_TESTING
 	help
 	  Select this option to enable Hyper-V vmbus testing.
 
+config PKS_TESTING
+	bool "PKey(S)upervisor testing"
+	default n
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+	help
+	  Select this option to enable testing of PKS core software and
+	  hardware.  The PKS core provides a mechanism to allocate keys as well
+	  as maintain the protection settings across context switches.
+	  Answer N if you don't know what supervisor keys are.
+
+	  If unsure, say N.
+
 endmenu # "Kernel Testing and Coverage"
 
 endmenu # Kernel hacking
diff --git a/lib/Makefile b/lib/Makefile
index a4a4c6864f51..81fcba41f256 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -348,3 +348,6 @@ obj-$(CONFIG_PLDMFW) += pldmfw/
 obj-$(CONFIG_LIST_KUNIT_TEST) += list-test.o
 obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o
 obj-$(CONFIG_BITS_TEST) += test_bits.o
+
+# PKS test
+obj-y += pks/
diff --git a/lib/pks/Makefile b/lib/pks/Makefile
new file mode 100644
index 000000000000..7d1df7563db9
--- /dev/null
+++ b/lib/pks/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_PKS_TESTING) += pks_test.o
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
new file mode 100644
index 000000000000..d7dbf92527bd
--- /dev/null
+++ b/lib/pks/pks_test.c
@@ -0,0 +1,690 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2020 Intel Corporation. All rights reserved.
+ *
+ * Implement PKS testing
+ * Access to run this test can be with a command line parameter
+ * ("pks-test-on-boot") or more detailed tests can be triggered through:
+ *
+ *    /sys/kernel/debug/x86/run_pks
+ *
+ *  debugfs controls are:
+ *
+ *  '0' -- Run access tests with a single pkey
+ *
+ *  '1' -- Set up the pkey register with no access for the pkey allocated to
+ *         this fd
+ *  '2' -- Check that the pkey register updated in '1' is still the same.  (To
+ *         be used after a forced context switch.)
+ *
+ *  '3' -- Allocate all pkeys possible and run tests on each pkey allocated.
+ *         DEFAULT when run at boot.
+ *
+ *  Closing the fd will cleanup and release the pkey.
+ *
+ *  A companion user space program is provided in:
+ *
+ *          .../tools/testing/selftests/x86/test_pks.c
+ *
+ *  which will better test the context switching.
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/debugfs.h>
+#include <linux/delay.h>
+#include <linux/entry-common.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/percpu-defs.h>
+#include <linux/pgtable.h>
+#include <linux/pkeys.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+
+#define PKS_TEST_MEM_SIZE (PAGE_SIZE)
+
+/*
+ * run_on_boot default '= false' which checkpatch complains about initializing;
+ * so we don't
+ */
+static bool run_on_boot;
+static struct dentry *pks_test_dentry;
+static bool run_9;
+
+/*
+ * We must lock the following globals for brief periods while the fault handler
+ * checks/updates them.
+ */
+static DEFINE_SPINLOCK(test_lock);
+static int test_armed_key;
+static unsigned long prev_cnt;
+static unsigned long fault_cnt;
+
+struct pks_test_ctx {
+	bool pass;
+	bool pks_cpu_enabled;
+	int pkey;
+	char data[64];
+};
+static struct pks_test_ctx *test_exception_ctx;
+
+static pte_t *walk_table(void *ptr)
+{
+	struct page *page = NULL;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ret = NULL;
+
+	pgdp = pgd_offset_k((unsigned long)ptr);
+	if (pgd_none(*pgdp) || pgd_bad(*pgdp))
+		goto error;
+
+	p4dp = p4d_offset(pgdp, (unsigned long)ptr);
+	if (p4d_none(*p4dp) || p4d_bad(*p4dp))
+		goto error;
+
+	pudp = pud_offset(p4dp, (unsigned long)ptr);
+	if (pud_none(*pudp) || pud_bad(*pudp))
+		goto error;
+
+	pmdp = pmd_offset(pudp, (unsigned long)ptr);
+	if (pmd_none(*pmdp) || pmd_bad(*pmdp))
+		goto error;
+
+	ret = pte_offset_map(pmdp, (unsigned long)ptr);
+	if (pte_present(*ret)) {
+		page = pte_page(*ret);
+		if (!page) {
+			pte_unmap(ret);
+			goto error;
+		}
+		pr_info("page 0x%lx; flags 0x%lx\n",
+		       (unsigned long)page, page->flags);
+	}
+
+error:
+	return ret;
+}
+
+static bool check_pkey_val(u32 pk_reg, int pkey, u32 expected)
+{
+	u32 pkey_shift = pkey * PKR_BITS_PER_PKEY;
+	u32 pkey_mask = ((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift;
+
+	pk_reg = (pk_reg & pkey_mask) >> pkey_shift;
+	return (pk_reg == expected);
+}
+
+/*
+ * Check if the register @pkey value matches @expected value
+ *
+ * Both the cached and actual MSR must match.
+ */
+static bool check_pkrs(int pkey, u32 expected)
+{
+	bool ret = true;
+	u64 pkrs;
+	u32 *tmp_cache;
+
+	tmp_cache = get_cpu_ptr(&pkrs_cache);
+	if (!check_pkey_val(*tmp_cache, pkey, expected))
+		ret = false;
+	put_cpu_ptr(tmp_cache);
+
+	rdmsrl(MSR_IA32_PKRS, pkrs);
+	if (!check_pkey_val(pkrs, pkey, expected))
+		ret = false;
+
+	return ret;
+}
+
+static void check_exception(irqentry_state_t *irq_state)
+{
+	/* Check the thread saved state */
+	if (!check_pkey_val(irq_state->pkrs, test_armed_key, PKEY_DISABLE_WRITE)) {
+		pr_err("     FAIL: checking irq_state->pkrs\n");
+		test_exception_ctx->pass = false;
+	}
+
+	/* Check the exception state */
+	if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) {
+		pr_err("     FAIL: PKRS cache and MSR\n");
+		test_exception_ctx->pass = false;
+	}
+
+	/*
+	 * Check we can update the value during exception without affecting the
+	 * calling thread.  The calling thread is checked after exception...
+	 */
+	pks_mkrdwr(test_armed_key);
+	if (!check_pkrs(test_armed_key, 0)) {
+		pr_err("     FAIL: exception did not change register to 0\n");
+		test_exception_ctx->pass = false;
+	}
+	pks_mknoaccess(test_armed_key);
+	if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)) {
+		pr_err("     FAIL: exception did not change register to 0x3\n");
+		test_exception_ctx->pass = false;
+	}
+}
+
+/* Silence prototype warning */
+bool pks_test_callback(irqentry_state_t *irq_state);
+
+/**
+ * pks_test_callback() is exported so that the fault handler can detect
+ * and report back status of intentional faults.
+ *
+ * NOTE: It clears the protection key from the page such that the fault handler
+ * will not re-trigger.
+ */
+bool pks_test_callback(irqentry_state_t *irq_state)
+{
+	bool armed = (test_armed_key != 0);
+
+	if (test_exception_ctx) {
+		check_exception(irq_state);
+		/*
+		 * We stop this check within the exception because the
+		 * fault handler clean up code will call us 2x while checking
+		 * the PMD entry and we don't need to check this again
+		 */
+		test_exception_ctx = NULL;
+	}
+
+	if (armed) {
+		/* Enable read and write to stop faults */
+		irq_state->pkrs = update_pkey_val(irq_state->pkrs, test_armed_key, 0);
+		fault_cnt++;
+	}
+
+	return armed;
+}
+EXPORT_SYMBOL(pks_test_callback);
+
+static bool exception_caught(void)
+{
+	bool ret = (fault_cnt != prev_cnt);
+
+	prev_cnt = fault_cnt;
+	return ret;
+}
+
+static void report_pkey_settings(void *unused)
+{
+	u8 pkey;
+	unsigned long long msr = 0;
+	unsigned int cpu = smp_processor_id();
+
+	rdmsrl(MSR_IA32_PKRS, msr);
+
+	pr_info("for CPU %d : 0x%llx\n", cpu, msr);
+	for (pkey = 0; pkey < PKS_NUM_KEYS; pkey++) {
+		int ad, wd;
+
+		ad = (msr >> (pkey * PKR_BITS_PER_PKEY)) & PKEY_DISABLE_ACCESS;
+		wd = (msr >> (pkey * PKR_BITS_PER_PKEY)) & PKEY_DISABLE_WRITE;
+		pr_info("   %u: A:%d W:%d\n", pkey, ad, wd);
+	}
+}
+
+enum pks_access_mode {
+	PKS_TEST_NO_ACCESS,
+	PKS_TEST_RDWR,
+	PKS_TEST_RDONLY
+};
+
+static char *get_mode_str(enum pks_access_mode mode)
+{
+	switch (mode) {
+		case PKS_TEST_NO_ACCESS:
+			return "No Access";
+		case PKS_TEST_RDWR:
+			return "Read Write";
+		case PKS_TEST_RDONLY:
+			return "Read Only";
+		default:
+			pr_err("BUG in test invalid mode\n");
+			break;
+	}
+
+	return "";
+}
+
+struct pks_access_test {
+	enum pks_access_mode mode;
+	bool write;
+	bool exception;
+};
+
+static struct pks_access_test pkey_test_ary[] = {
+	/*  disable both */
+	{ PKS_TEST_NO_ACCESS, true,  true },
+	{ PKS_TEST_NO_ACCESS, false, true },
+
+	/*  enable both */
+	{ PKS_TEST_RDWR, true,  false },
+	{ PKS_TEST_RDWR, false, false },
+
+	/*  enable read only */
+	{ PKS_TEST_RDONLY, true,  true },
+	{ PKS_TEST_RDONLY, false, false },
+};
+
+static int test_it(struct pks_test_ctx *ctx, struct pks_access_test *test, void *ptr)
+{
+	bool exception;
+	int ret = 0;
+
+	spin_lock(&test_lock);
+	WRITE_ONCE(test_armed_key, ctx->pkey);
+
+	if (test->write)
+		memcpy(ptr, ctx->data, 8);
+	else
+		memcpy(ctx->data, ptr, 8);
+
+	exception = exception_caught();
+
+	WRITE_ONCE(test_armed_key, 0);
+	spin_unlock(&test_lock);
+
+	if (test->exception != exception) {
+		pr_err("pkey test FAILED: mode %s; write %s; exception %s != %s\n",
+			get_mode_str(test->mode),
+			test->write ? "TRUE" : "FALSE",
+			test->exception ? "TRUE" : "FALSE",
+			exception ? "TRUE" : "FALSE");
+		ret = -EFAULT;
+	}
+
+	return ret;
+}
+
+static int run_access_test(struct pks_test_ctx *ctx,
+			   struct pks_access_test *test,
+			   void *ptr)
+{
+	switch (test->mode) {
+		case PKS_TEST_NO_ACCESS:
+			pks_mknoaccess(ctx->pkey);
+			break;
+		case PKS_TEST_RDWR:
+			pks_mkrdwr(ctx->pkey);
+			break;
+		case PKS_TEST_RDONLY:
+			pks_mkread(ctx->pkey);
+			break;
+		default:
+			pr_err("BUG in test invalid mode\n");
+			break;
+	}
+
+	return test_it(ctx, test, ptr);
+}
+
+static void *alloc_test_page(int pkey)
+{
+	return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START, VMALLOC_END,
+				    GFP_KERNEL, PAGE_KERNEL_PKEY(pkey), 0,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+
+static void test_mem_access(struct pks_test_ctx *ctx)
+{
+	int i, rc;
+	u8 pkey;
+	void *ptr = NULL;
+	pte_t *ptep;
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("Failed to vmalloc page???\n");
+		ctx->pass = false;
+		return;
+	}
+
+	ptep = walk_table(ptr);
+	if (!ptep) {
+		pr_err("Failed to walk table???\n");
+		ctx->pass = false;
+		goto done;
+	}
+
+	pkey = pte_flags_pkey(ptep->pte);
+	pr_info("ptep flags 0x%lx pkey %u\n",
+	       (unsigned long)ptep->pte, pkey);
+
+	if (pkey != ctx->pkey) {
+		pr_err("invalid pkey found: %u, test_pkey: %u\n",
+			pkey, ctx->pkey);
+		ctx->pass = false;
+		goto unmap;
+	}
+
+	if (!ctx->pks_cpu_enabled) {
+		pr_err("not CPU enabled; skipping access tests...\n");
+		ctx->pass = true;
+		goto unmap;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(pkey_test_ary); i++) {
+		rc = run_access_test(ctx, &pkey_test_ary[i], ptr);
+
+		/*  only save last error is fine */
+		if (rc)
+			ctx->pass = false;
+	}
+
+unmap:
+	pte_unmap(ptep);
+done:
+	vfree(ptr);
+}
+
+static void pks_run_test(struct pks_test_ctx *ctx)
+{
+	ctx->pass = true;
+
+	pr_info("\n");
+	pr_info("\n");
+	pr_info("     ***** BEGIN: Testing (CPU enabled : %s) *****\n",
+		ctx->pks_cpu_enabled ? "TRUE" : "FALSE");
+
+	if (ctx->pks_cpu_enabled)
+		on_each_cpu(report_pkey_settings, NULL, 1);
+
+	pr_info("           BEGIN: pkey %d Testing\n", ctx->pkey);
+	test_mem_access(ctx);
+	pr_info("           END: PAGE_KERNEL_PKEY Testing : %s\n",
+		ctx->pass ? "PASS" : "FAIL");
+
+	pr_info("     ***** END: Testing *****\n");
+	pr_info("\n");
+	pr_info("\n");
+}
+
+static ssize_t pks_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	struct pks_test_ctx *ctx = file->private_data;
+	char buf[32];
+	unsigned int len;
+
+	if (!ctx)
+		len = sprintf(buf, "not run\n");
+	else
+		len = sprintf(buf, "%s\n", ctx->pass ? "PASS" : "FAIL");
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static struct pks_test_ctx *alloc_ctx(const char *name)
+{
+	struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+
+	if (!ctx) {
+		pr_err("Failed to allocate memory for test context\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pkey = pks_key_alloc(name);
+	if (ctx->pkey <= 0) {
+		pr_err("Failed to allocate memory for test context\n");
+		kfree(ctx);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pks_cpu_enabled = cpu_feature_enabled(X86_FEATURE_PKS);
+	sprintf(ctx->data, "%s", "DEADBEEF");
+	return ctx;
+}
+
+static void free_ctx(struct pks_test_ctx *ctx)
+{
+	pks_key_free(ctx->pkey);
+	kfree(ctx);
+}
+
+static void run_exception_test(void)
+{
+	void *ptr = NULL;
+	bool pass = true;
+	struct pks_test_ctx *ctx;
+
+	pr_info("     ***** BEGIN: exception checking\n");
+
+	ctx = alloc_ctx("Exception test");
+	if (IS_ERR(ctx)) {
+		pr_err("     FAIL: no context\n");
+		pass = false;
+		goto result;
+	}
+	ctx->pass = true;
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("     FAIL: no vmalloc page\n");
+		pass = false;
+		goto free_context;
+	}
+
+	pks_mkread(ctx->pkey);
+
+	spin_lock(&test_lock);
+	WRITE_ONCE(test_exception_ctx, ctx);
+	WRITE_ONCE(test_armed_key, ctx->pkey);
+
+	memcpy(ptr, ctx->data, 8);
+
+	if (!exception_caught()) {
+		pr_err("     FAIL: did not get an exception\n");
+		pass = false;
+	}
+
+	/*
+	 * NOTE The exception code has to enable access (b00) to keep the
+	 * fault from looping forever.  So we don't see the write disabled
+	 * restored but rather full access restored.  Also note that as part
+	 * of this test the exception callback attempted to disable access
+	 * completely (b11) and so we ensure that we are seeing the proper
+	 * thread value restored here.
+	 */
+	if (!check_pkrs(test_armed_key, 0)) {
+		pr_err("     FAIL: PKRS not restored\n");
+		pass = false;
+	}
+
+	if (!ctx->pass)
+		pass = false;
+
+	WRITE_ONCE(test_armed_key, 0);
+	spin_unlock(&test_lock);
+
+	vfree(ptr);
+free_context:
+	free_ctx(ctx);
+result:
+	pr_info("     ***** END: exception checking : %s\n",
+		 pass ? "PASS" : "FAIL");
+}
+
+static void run_all(void)
+{
+	struct pks_test_ctx *ctx[PKS_NUM_KEYS];
+	static char name[PKS_NUM_KEYS][64];
+	int i;
+
+	for (i = 1; i < PKS_NUM_KEYS; i++) {
+		sprintf(name[i], "pks ctx %d", i);
+		ctx[i] = alloc_ctx((const char *)name[i]);
+	}
+
+	for (i = 1; i < PKS_NUM_KEYS; i++) {
+		if (!IS_ERR(ctx[i]))
+			pks_run_test(ctx[i]);
+	}
+
+	for (i = 1; i < PKS_NUM_KEYS; i++) {
+		if (!IS_ERR(ctx[i]))
+			free_ctx(ctx[i]);
+	}
+
+	run_exception_test();
+}
+
+static void crash_it(void)
+{
+	struct pks_test_ctx *ctx;
+	void *ptr;
+
+	pr_warn("     ***** BEGIN: Unhandled fault test *****\n");
+
+	ctx = alloc_ctx("crashing kernel\n");
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("Failed to vmalloc page???\n");
+		ctx->pass = false;
+		return;
+	}
+
+	pks_mknoaccess(ctx->pkey);
+
+	spin_lock(&test_lock);
+	WRITE_ONCE(test_armed_key, 0);
+	/* This purposely faults */
+	memcpy(ptr, ctx->data, 8);
+	spin_unlock(&test_lock);
+
+	vfree(ptr);
+	free_ctx(ctx);
+}
+
+static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
+			      size_t count, loff_t *ppos)
+{
+	char buf[2];
+	struct pks_test_ctx *ctx = file->private_data;
+
+	if (copy_from_user(buf, user_buf, 1))
+		return -EFAULT;
+	buf[1] = '\0';
+
+	/*
+	 * WARNING: Test "9" will crash the kernel.
+	 *
+	 * So we arm the test and print a warning.  A second "9" will run the
+	 * test.
+	 */
+	if (!strcmp(buf, "9")) {
+		if (run_9) {
+			crash_it();
+			run_9 = false;
+		} else {
+			pr_warn("CAUTION: Test 9 will crash in the kernel.\n");
+			pr_warn("         Specify 9 a second time to run\n");
+			pr_warn("         run any other test to clear\n");
+			run_9 = true;
+		}
+	} else {
+		run_9 = false;
+	}
+
+	/*
+	 * Test "3" will test allocating all keys. Do it first without
+	 * using "ctx".
+	 */
+	if (!strcmp(buf, "3"))
+		run_all();
+
+	if (!ctx) {
+		ctx = alloc_ctx("pks test");
+		if (IS_ERR(ctx))
+			return -ENOMEM;
+		file->private_data = ctx;
+	}
+
+	if (!strcmp(buf, "0"))
+		pks_run_test(ctx);
+
+	/* start of context switch test */
+	if (!strcmp(buf, "1")) {
+		/* Ensure a known state to test context switch */
+		pks_mknoaccess(ctx->pkey);
+	}
+
+	/* After context switch msr should be restored */
+	if (!strcmp(buf, "2") && ctx->pks_cpu_enabled) {
+		unsigned long reg_pkrs;
+		int access;
+
+		rdmsrl(MSR_IA32_PKRS, reg_pkrs);
+
+		access = (reg_pkrs >> (ctx->pkey * PKR_BITS_PER_PKEY)) &
+			  PKEY_ACCESS_MASK;
+		if (access != (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)) {
+			ctx->pass = false;
+			pr_err("Context switch check failed\n");
+		}
+	}
+
+	return count;
+}
+
+static int pks_release_file(struct inode *inode, struct file *file)
+{
+	struct pks_test_ctx *ctx = file->private_data;
+
+	if (!ctx)
+		return 0;
+
+	free_ctx(ctx);
+	return 0;
+}
+
+static const struct file_operations fops_init_pks = {
+	.read = pks_read_file,
+	.write = pks_write_file,
+	.llseek = default_llseek,
+	.release = pks_release_file,
+};
+
+static int __init parse_pks_test_options(char *str)
+{
+	run_on_boot = true;
+
+	return 0;
+}
+early_param("pks-test-on-boot", parse_pks_test_options);
+
+static int __init pks_test_init(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_PKS)) {
+		if (run_on_boot)
+			run_all();
+
+		pks_test_dentry = debugfs_create_file("run_pks", 0600, arch_debugfs_dir,
+						      NULL, &fops_init_pks);
+	}
+
+	return 0;
+}
+late_initcall(pks_test_init);
+
+static void __exit pks_test_exit(void)
+{
+	debugfs_remove(pks_test_dentry);
+	pr_info("test exit\n");
+}
+module_exit(pks_test_exit);
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 6703c7906b71..f5c80f952eab 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -13,7 +13,8 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
 			check_initial_reg_state sigreturn iopl ioperm \
 			test_vdso test_vsyscall mov_ss_trap \
-			syscall_arg_fault fsgsbase_restore
+			syscall_arg_fault fsgsbase_restore test_pks
+
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
new file mode 100644
index 000000000000..8037a2a9ff5f
--- /dev/null
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <assert.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+int main(void)
+{
+	cpu_set_t cpuset;
+	char result[32];
+	pid_t pid;
+	int fd;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	/* Two processes run on CPU 0 so that they go through context switch. */
+	sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
+
+	pid = fork();
+	if (pid == 0) {
+		fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
+		if (fd < 0) {
+			printf("cannot open file\n");
+			return -1;
+		}
+
+		/* Allocate test_pkey1 and run test. */
+		write(fd, "0", 1);
+
+		/* Arm for context switch test */
+		write(fd, "1", 1);
+
+		/* Context switch out... */
+		sleep(4);
+
+		/* Check msr restored */
+		write(fd, "2", 1);
+	} else {
+		sleep(2);
+
+		fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
+		if (fd < 0) {
+			printf("cannot open file\n");
+			return -1;
+		}
+
+		/* run test with alternate pkey */
+		write(fd, "0", 1);
+	}
+
+	read(fd, result, 10);
+	printf("#PF, context switch, pkey allocation and free tests: %s\n",
+	       result);
+
+	close(fd);
+
+	return 0;
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3
  2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
                   ` (8 preceding siblings ...)
  2020-10-09 19:42 ` [PATCH RFC V3 9/9] x86/pks: Add PKS test code ira.weiny
@ 2020-10-09 20:18 ` Ira Weiny
  9 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-09 20:18 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: x86, Dave Hansen, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 09, 2020 at 12:42:49PM -0700, 'Ira Weiny' wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> This RFC series has been reviewed by Dave Hansen.
> 
> Introduce a new page protection mechanism for supervisor pages, Protection Key
> Supervisor (PKS).
> 
> 2 use cases for PKS are being developed, trusted keys and PMEM.

RFC patch sets for these use cases have also been posted:

PMEM:
https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/

Trusted Keys:
https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@intel.com/

Ira

> Trusted keys
> is a newer use case which is still being explored.  PMEM was submitted as part
> of the RFC (v2) series[1].  However, since then it was found that some callers
> of kmap() require a global implementation of PKS.  Specifically some users of
> kmap() expect mappings to be available to all kernel threads.  While global use
> of PKS is rare it needs to be included for correctness.  Unfortunately the
> kmap() updates required a large patch series to make the needed changes at the
> various kmap() call sites so that patch set has been split out.  Because the
> global PKS feature is only required for that use case it will be deferred to
> that set as well.[2]  This patch set is being submitted as a precursor to both
> of the use cases.
> 
> For an overview of the entire PKS ecosystem, a git tree including this series
> and the 2 use cases can be found here:
> 
> 	https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
> 
> 
> PKS enables protections on 'domains' of supervisor pages to limit supervisor
> mode access to those pages beyond the normal paging protections.  PKS works in
> a similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
> checked in addition to normal paging protections and Access or Writes can be
> disabled via a MSR update without TLB flushes when permissions change.  Also
> like PKU, a page mapping is assigned to a domain by setting pkey bits in the
> page table entry for that mapping.
> 
> Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
> 
> XSAVE is not supported for the PKRS MSR.  Therefore the implementation
> saves/restores the MSR across context switches and during exceptions.  Nested
> exceptions are supported by each exception getting a new PKS state.
> 
> For consistent behavior with current paging protections, pkey 0 is reserved and
> configured to allow full access via the pkey mechanism, thus preserving the
> default paging protections on mappings with the default pkey value of 0.
> 
> Other keys, (1-15) are allocated by an allocator which prepares us for key
> contention from day one.  Kernel users should be prepared for the allocator to
> fail either because of key exhaustion or due to PKS not being supported on the
> arch and/or CPU instance.
> 
> The following are key attributes of PKS.
> 
>    1) Fast switching of permissions
> 	1a) Prevents access without page table manipulations
> 	1b) No TLB flushes required
>    2) Works on a per thread basis
> 
> PKS is available with 4 and 5 level paging.  Like PKRU it consumes 4 bits from
> the PTE to store the pkey within the entry.
> 
> 
> [1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/
> [2] https://github.com/weiny2/linux-kernel/commit/f10abb0f0d7b4e14f03fc8890313a5830cde1e49
> 	and a testing patch
>     https://github.com/weiny2/linux-kernel/commit/2a8e0fc7654a7c69b243d628f63b01ff26a5a797
> 
> 
> Fenghua Yu (3):
>   x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
>   x86/pks: Enable Protection Keys Supervisor (PKS)
>   x86/pks: Add PKS kernel API
> 
> Ira Weiny (6):
>   x86/pkeys: Create pkeys_common.h
>   x86/pks: Preserve the PKRS MSR on context switch
>   x86/entry: Pass irqentry_state_t by reference
>   x86/entry: Preserve PKRS MSR across exceptions
>   x86/fault: Report the PKRS state on fault
>   x86/pks: Add PKS test code
> 
>  Documentation/core-api/protection-keys.rst  | 102 ++-
>  arch/x86/Kconfig                            |   1 +
>  arch/x86/entry/common.c                     |  57 +-
>  arch/x86/include/asm/cpufeatures.h          |   1 +
>  arch/x86/include/asm/idtentry.h             |  29 +-
>  arch/x86/include/asm/msr-index.h            |   1 +
>  arch/x86/include/asm/pgtable.h              |  13 +-
>  arch/x86/include/asm/pgtable_types.h        |  12 +
>  arch/x86/include/asm/pkeys.h                |  15 +
>  arch/x86/include/asm/pkeys_common.h         |  36 +
>  arch/x86/include/asm/processor.h            |  13 +
>  arch/x86/include/uapi/asm/processor-flags.h |   2 +
>  arch/x86/kernel/cpu/common.c                |  17 +
>  arch/x86/kernel/cpu/mce/core.c              |   4 +
>  arch/x86/kernel/fpu/xstate.c                |  22 +-
>  arch/x86/kernel/kvm.c                       |   4 +-
>  arch/x86/kernel/nmi.c                       |   7 +-
>  arch/x86/kernel/process.c                   |  21 +
>  arch/x86/kernel/traps.c                     |  21 +-
>  arch/x86/mm/fault.c                         |  86 ++-
>  arch/x86/mm/pkeys.c                         | 188 +++++-
>  include/linux/entry-common.h                |  19 +-
>  include/linux/pgtable.h                     |   4 +
>  include/linux/pkeys.h                       |  23 +-
>  kernel/entry/common.c                       |  28 +-
>  lib/Kconfig.debug                           |  12 +
>  lib/Makefile                                |   3 +
>  lib/pks/Makefile                            |   3 +
>  lib/pks/pks_test.c                          | 690 ++++++++++++++++++++
>  mm/Kconfig                                  |   2 +
>  tools/testing/selftests/x86/Makefile        |   3 +-
>  tools/testing/selftests/x86/test_pks.c      |  65 ++
>  32 files changed, 1376 insertions(+), 128 deletions(-)
>  create mode 100644 arch/x86/include/asm/pkeys_common.h
>  create mode 100644 lib/pks/Makefile
>  create mode 100644 lib/pks/pks_test.c
>  create mode 100644 tools/testing/selftests/x86/test_pks.c
> 
> -- 
> 2.28.0.rc0.12.gb6a658bd00c9
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h
  2020-10-09 19:42 ` [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h ira.weiny
@ 2020-10-13 17:46   ` Dave Hansen
  2020-10-13 19:44     ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2020-10-13 17:46 UTC (permalink / raw)
  To: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: x86, Dave Hansen, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work
> in similar fashions and can share common defines.

Could we be a bit less abstract?  PKS and PKU each have:
1. A single control register
2. The same number of keys
3. The same number of bits in the register per key
4. Access and Write disable in the same bit locations

That means that we can share all the macros that synthesize and
manipulate register values between the two features.

> +++ b/arch/x86/include/asm/pkeys_common.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_PKEYS_INTERNAL_H
> +#define _ASM_X86_PKEYS_INTERNAL_H
> +
> +#define PKR_AD_BIT 0x1
> +#define PKR_WD_BIT 0x2
> +#define PKR_BITS_PER_PKEY 2
> +
> +#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))

Now that this has moved away from its use-site, it's a bit less
self-documenting.  Let's add a comment:

/*
 * Generate an Access-Disable mask for the given pkey.  Several of these
 * can be OR'd together to generate pkey register values.
 */

Once that's in place, along with the updated changelog:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-10-09 19:42 ` [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support ira.weiny
@ 2020-10-13 17:50   ` Dave Hansen
  2020-10-13 23:56     ` Ira Weiny
  2020-10-16 10:57   ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2020-10-13 17:50 UTC (permalink / raw)
  To: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> +/*
> + * Update the pk_reg value and return it.

How about:

	Replace disable bits for @pkey with values from @flags.

> + * Kernel users use the same flags as user space:
> + *     PKEY_DISABLE_ACCESS
> + *     PKEY_DISABLE_WRITE
> + */
> +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> +{
> +	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> +
> +	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> +
> +	if (flags & PKEY_DISABLE_ACCESS)
> +		pk_reg |= PKR_AD_BIT << pkey_shift;
> +	if (flags & PKEY_DISABLE_WRITE)
> +		pk_reg |= PKR_WD_BIT << pkey_shift;

I still think this deserves two lines of comments:

	/* Mask out old bit values */

	/* Or in new values */

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS)
  2020-10-09 19:42 ` [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS) ira.weiny
@ 2020-10-13 18:23   ` Dave Hansen
  2020-10-14  2:08     ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2020-10-13 18:23 UTC (permalink / raw)
  To: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: Fenghua Yu, Dan Williams, x86, Dave Hansen, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> +/*
> + * PKS is independent of PKU and either or both may be supported on a CPU.
> + * Configure PKS if the cpu supports the feature.
> + */

Let's at least be consistent about CPU vs. cpu in a single comment. :)

> +static void setup_pks(void)
> +{
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
> +		return;
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;

If you put X86_FEATURE_PKS in disabled-features.h, you can get rid of
the explicit CONFIG_ check.

> +	cr4_set_bits(X86_CR4_PKS);
> +}
> +
>  /*
>   * This does the hard work of actually picking apart the CPU stuff...
>   */
> @@ -1544,6 +1558,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  
>  	x86_init_rdrand(c);
>  	setup_pku(c);
> +	setup_pks();
>  
>  	/*
>  	 * Clear/Set all flags overridden by options, need do it
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 6c974888f86f..1b9bc004d9bc 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -822,6 +822,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
>  	bool
>  config ARCH_HAS_PKEYS
>  	bool
> +config ARCH_HAS_SUPERVISOR_PKEYS
> +	bool
>  
>  config PERCPU_STATS
>  	bool "Collect percpu memory statistics"
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch
  2020-10-09 19:42 ` [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
@ 2020-10-13 18:31   ` Dave Hansen
  2020-10-14 22:36     ` Ira Weiny
  2020-10-16 11:12     ` Peter Zijlstra
  2020-10-16 11:06   ` Peter Zijlstra
  1 sibling, 2 replies; 47+ messages in thread
From: Dave Hansen @ 2020-10-13 18:31 UTC (permalink / raw)
  To: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The PKRS MSR is defined as a per-logical-processor register.  This
> isolates memory access by logical CPU.  Unfortunately, the MSR is not
> managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
> context switch.
> 
> Define a saved PKRS value in the task struct, as well as a cached
> per-logical-processor MSR value which mirrors the MSR value of the
> current CPU.  Initialize all tasks with the default MSR value.  Then, on
> schedule in, check the saved task MSR vs the per-cpu value.  If
> different proceed to write the MSR.  If not avoid the overhead of the
> MSR write and continue.

It's probably nice to note how the WRMSR is special here, in addition to
the comments below.

>  #endif /*_ASM_X86_PKEYS_INTERNAL_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 97143d87994c..da2381136b2d 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -18,6 +18,7 @@ struct vm86;
>  #include <asm/cpufeatures.h>
>  #include <asm/page.h>
>  #include <asm/pgtable_types.h>
> +#include <asm/pkeys_common.h>
>  #include <asm/percpu.h>
>  #include <asm/msr.h>
>  #include <asm/desc_defs.h>
> @@ -542,6 +543,11 @@ struct thread_struct {
>  
>  	unsigned int		sig_on_uaccess_err:1;
>  
> +#ifdef	CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> +	/* Saved Protection key register for supervisor mappings */
> +	u32			saved_pkrs;
> +#endif

Could you take a look around thread_struct and see if there are some
other MSRs near which you can stash this?  This seems like a bit of a
lonely place.

...
>  void flush_thread(void)
>  {
>  	struct task_struct *tsk = current;
> @@ -195,6 +212,8 @@ void flush_thread(void)
>  	memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
>  
>  	fpu__clear_all(&tsk->thread.fpu);
> +
> +	pks_init_task(tsk);
>  }
>  
>  void disable_TSC(void)
> @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
>  
>  	if ((tifp ^ tifn) & _TIF_SLD)
>  		switch_to_sld(tifn);
> +
> +	pks_sched_in();
>  }
>  
>  /*
> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index 3cf8f775f36d..30f65dd3d0c5 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>  
>  	return pk_reg;
>  }
> +
> +DEFINE_PER_CPU(u32, pkrs_cache);
> +
> +/**
> + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> + * serializing but still maintains ordering properties similar to WRPKRU.
> + * The current SDM section on PKRS needs updating but should be the same as
> + * that of WRPKRU.  So to quote from the WRPKRU text:
> + *
> + * 	WRPKRU will never execute transiently. Memory accesses
> + * 	affected by PKRU register will not execute (even transiently)
> + * 	until all prior executions of WRPKRU have completed execution
> + * 	and updated the PKRU register.
> + */
> +void write_pkrs(u32 new_pkrs)
> +{
> +	u32 *pkrs;
> +
> +	if (!static_cpu_has(X86_FEATURE_PKS))
> +		return;
> +
> +	pkrs = get_cpu_ptr(&pkrs_cache);
> +	if (*pkrs != new_pkrs) {
> +		*pkrs = new_pkrs;
> +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> +	}
> +	put_cpu_ptr(pkrs);
> +}
> 

It bugs me a *bit* that this is being called in a preempt-disabled
region, but we still bother with the get/put_cpu jazz.  Are there other
future call-sites for this that aren't in preempt-disabled regions?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API
  2020-10-09 19:42 ` [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API ira.weiny
@ 2020-10-13 18:43   ` Dave Hansen
  2020-10-15  1:08     ` Ira Weiny
  2020-10-16 11:07   ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2020-10-13 18:43 UTC (permalink / raw)
  To: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

> +static inline void pks_update_protection(int pkey, unsigned long protection)
> +{
> +	current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
> +						     pkey, protection);
> +	preempt_disable();
> +	write_pkrs(current->thread.saved_pkrs);
> +	preempt_enable();
> +}

Why does this need preempt count manipulation in addition to the
get/put_cpu_var() inside of write_pkrs()?

> +/**
> + * PKS access control functions
> + *
> + * Change the access of the domain specified by the pkey.  These are global
> + * updates.  They only affects the current running thread.  It is undefined and
> + * a bug for users to call this without having allocated a pkey and using it as
> + * pkey here.
> + *
> + * pks_mknoaccess()
> + *     Disable all access to the domain
> + * pks_mkread()
> + *     Make the domain Read only
> + * pks_mkrdwr()
> + *     Make the domain Read/Write
> + *
> + * @pkey the pkey for which the access should change.
> + *
> + */
> +void pks_mknoaccess(int pkey)
> +{
> +	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
> +}
> +EXPORT_SYMBOL_GPL(pks_mknoaccess);

These are named like PTE manipulation functions, which is kinda weird.

What's wrong with: pks_disable_access(pkey) ?

> +void pks_mkread(int pkey)
> +{
> +	pks_update_protection(pkey, PKEY_DISABLE_WRITE);
> +}
> +EXPORT_SYMBOL_GPL(pks_mkread);

I really don't like this name.  It doesn't make readable, or even
read-only, *especially* if it was already access-disabled.

> +static const char pks_key_user0[] = "kernel";
> +
> +/* Store names of allocated keys for debug.  Key 0 is reserved for the kernel.  */
> +static const char *pks_key_users[PKS_NUM_KEYS] = {
> +	pks_key_user0
> +};
> +
> +/*
> + * Each key is represented by a bit.  Bit 0 is set for key 0 and reserved for
> + * its use.  We use ulong for the bit operations but only 16 bits are used.
> + */
> +static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY;
> +
> +/*
> + * pks_key_alloc - Allocate a PKS key
> + *
> + * @pkey_user: String stored for debugging of key exhaustion.  The caller is
> + * responsible to maintain this memory until pks_key_free().
> + */
> +int pks_key_alloc(const char * const pkey_user)
> +{
> +	int nr;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return -EINVAL;

I'm not sure I like -EINVAL for this.  I thought we returned -ENOSPC for
this case for user pkeys.

> +	while (1) {
> +		nr = find_first_zero_bit(&pks_key_allocation_map, PKS_NUM_KEYS);
> +		if (nr >= PKS_NUM_KEYS) {
> +			pr_info("Cannot allocate supervisor key for %s.\n",
> +				pkey_user);
> +			return -ENOSPC;
> +		}
> +		if (!test_and_set_bit_lock(nr, &pks_key_allocation_map))
> +			break;
> +	}
> +
> +	/* for debugging key exhaustion */
> +	pks_key_users[nr] = pkey_user;
> +
> +	return nr;
> +}
> +EXPORT_SYMBOL_GPL(pks_key_alloc);
> +
> +/*
> + * pks_key_free - Free a previously allocate PKS key
> + *
> + * @pkey: Key to be free'ed
> + */
> +void pks_key_free(int pkey)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;
> +
> +	if (pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY)
> +		return;

This seems worthy of a WARN_ON_ONCE() at least.  It's essentially
corrupt data coming into a kernel API.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions
  2020-10-09 19:42 ` [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
@ 2020-10-13 18:52   ` Dave Hansen
  2020-10-15  3:46     ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2020-10-13 18:52 UTC (permalink / raw)
  To: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: Dave Hansen, x86, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
>  	/* Use the combo lockdep/tracing function */
>  	trace_hardirqs_off();
>  	instrumentation_end();
> +
> +done:
> +	irq_save_pkrs(state);
>  }

One nit: This saves *and* sets PKRS.  It's not obvious from the call
here that PKRS is altered at this site.  Seems like there could be a
better name.

Even if we did:

	irq_save_set_pkrs(state, INIT_VAL);

It would probably compile down to the same thing, but be *really*
obvious what's going on.

>  void irqentry_exit_cond_resched(void)
> @@ -362,7 +365,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *state)
>  	/* Check whether this returns to user mode */
>  	if (user_mode(regs)) {
>  		irqentry_exit_to_user_mode(regs);
> -	} else if (!regs_irqs_disabled(regs)) {
> +		return;
> +	}
> +
> +	irq_restore_pkrs(state);
> +
> +	if (!regs_irqs_disabled(regs)) {
>  		/*
>  		 * If RCU was not watching on entry this needs to be done
>  		 * carefully and needs the same ordering of lockdep/tracing
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 8/9] x86/fault: Report the PKRS state on fault
  2020-10-09 19:42 ` [PATCH RFC V3 8/9] x86/fault: Report the PKRS state on fault ira.weiny
@ 2020-10-13 18:56   ` Dave Hansen
  2020-10-15  4:13     ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2020-10-13 18:56 UTC (permalink / raw)
  To: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: x86, Dave Hansen, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

> @@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
>  		 (error_code & X86_PF_PK)    ? "protection keys violation" :
>  					       "permissions violation");
>  
> +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> +	if (irq_state && (error_code & X86_PF_PK))
> +		pr_alert("PKRS: 0x%x\n", irq_state->pkrs);
> +#endif

This means everyone will see 'PKRS: 0x0', even if they're on non-PKS
hardware.  I think I'd rather have this only show PKRS when we're on
cpu_feature_enabled(PKS) hardware.

...
> @@ -1148,14 +1156,15 @@ static int fault_in_kernel_space(unsigned long address)
>   */
>  static void
>  do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> -		   unsigned long address)
> +		   unsigned long address, irqentry_state_t *irq_state)
>  {
>  	/*
> -	 * Protection keys exceptions only happen on user pages.  We
> -	 * have no user pages in the kernel portion of the address
> -	 * space, so do not expect them here.
> +	 * If protection keys are not enabled for kernel space
> +	 * do not expect Pkey errors here.
>  	 */

Let's fix the double-negative:

	/*
	 * PF_PK is only expected on kernel addresses whenn
	 * supervisor pkeys are enabled:
	 */

> -	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS) ||
> +	    !cpu_feature_enabled(X86_FEATURE_PKS))
> +		WARN_ON_ONCE(hw_error_code & X86_PF_PK);

Yeah, please stick X86_FEATURE_PKS in disabled-features so you can use
cpu_feature_enabled(X86_FEATURE_PKS) by itself here..

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 9/9] x86/pks: Add PKS test code
  2020-10-09 19:42 ` [PATCH RFC V3 9/9] x86/pks: Add PKS test code ira.weiny
@ 2020-10-13 19:02   ` Dave Hansen
  2020-10-15  4:46     ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2020-10-13 19:02 UTC (permalink / raw)
  To: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
>  #ifdef CONFIG_X86_32
>  	/*
>  	 * We can fault-in kernel-space virtual memory on-demand. The
> diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> index cc3510cde64e..f9552bd9341f 100644
> --- a/include/linux/pkeys.h
> +++ b/include/linux/pkeys.h
> @@ -47,7 +47,6 @@ static inline bool arch_pkeys_enabled(void)
>  static inline void copy_init_pkru_to_fpregs(void)
>  {
>  }
> -
>  #endif /* ! CONFIG_ARCH_HAS_PKEYS */

^ Whitespace damage

>  #ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 0c781f912f9f..f015c09ba5a1 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2400,6 +2400,18 @@ config HYPERV_TESTING
>  	help
>  	  Select this option to enable Hyper-V vmbus testing.
>  
> +config PKS_TESTING
> +	bool "PKey(S)upervisor testing"

Seems like we need a space in there somewhere.

> +	pid = fork();
> +	if (pid == 0) {
> +		fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
> +		if (fd < 0) {
> +			printf("cannot open file\n");
> +			return -1;
> +		}
> +

Will this return code make anybody mad?  Should we have a nicer return
code for when this is running on non-PKS hardware?

I'm not going to be too picky about this.  I'll just ask one question:
Has this found real bugs for you?

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h
  2020-10-13 17:46   ` Dave Hansen
@ 2020-10-13 19:44     ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-13 19:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, x86, Dave Hansen, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Tue, Oct 13, 2020 at 10:46:16AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> > Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work
> > in similar fashions and can share common defines.
> 
> Could we be a bit less abstract?  PKS and PKU each have:
> 1. A single control register
> 2. The same number of keys
> 3. The same number of bits in the register per key
> 4. Access and Write disable in the same bit locations
> 
> That means that we can share all the macros that synthesize and
> manipulate register values between the two features.

Sure.  Done.

> 
> > +++ b/arch/x86/include/asm/pkeys_common.h
> > @@ -0,0 +1,11 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_PKEYS_INTERNAL_H
> > +#define _ASM_X86_PKEYS_INTERNAL_H
> > +
> > +#define PKR_AD_BIT 0x1
> > +#define PKR_WD_BIT 0x2
> > +#define PKR_BITS_PER_PKEY 2
> > +
> > +#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
> 
> Now that this has moved away from its use-site, it's a bit less
> self-documenting.  Let's add a comment:
> 
> /*
>  * Generate an Access-Disable mask for the given pkey.  Several of these
>  * can be OR'd together to generate pkey register values.
>  */

Fair enough. done.

> 
> Once that's in place, along with the updated changelog:
> 
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

Thanks,
Ira


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-10-13 17:50   ` Dave Hansen
@ 2020-10-13 23:56     ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-13 23:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Oct 13, 2020 at 10:50:05AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> > +/*
> > + * Update the pk_reg value and return it.
> 
> How about:
> 
> 	Replace disable bits for @pkey with values from @flags.

Done.

> 
> > + * Kernel users use the same flags as user space:
> > + *     PKEY_DISABLE_ACCESS
> > + *     PKEY_DISABLE_WRITE
> > + */
> > +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> > +{
> > +	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> > +
> > +	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> > +
> > +	if (flags & PKEY_DISABLE_ACCESS)
> > +		pk_reg |= PKR_AD_BIT << pkey_shift;
> > +	if (flags & PKEY_DISABLE_WRITE)
> > +		pk_reg |= PKR_WD_BIT << pkey_shift;
> 
> I still think this deserves two lines of comments:
> 
> 	/* Mask out old bit values */
> 
> 	/* Or in new values */

Sure, done.
Ira


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS)
  2020-10-13 18:23   ` Dave Hansen
@ 2020-10-14  2:08     ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-14  2:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Fenghua Yu, Dan Williams, x86, Dave Hansen,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Oct 13, 2020 at 11:23:08AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> > +/*
> > + * PKS is independent of PKU and either or both may be supported on a CPU.
> > + * Configure PKS if the cpu supports the feature.
> > + */
> 
> Let's at least be consistent about CPU vs. cpu in a single comment. :)

Sorry, done.

> 
> > +static void setup_pks(void)
> > +{
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
> > +		return;
> > +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> > +		return;
> 
> If you put X86_FEATURE_PKS in disabled-features.h, you can get rid of
> the explicit CONFIG_ check.

Done.

> 
> > +	cr4_set_bits(X86_CR4_PKS);
> > +}
> > +
> >  /*
> >   * This does the hard work of actually picking apart the CPU stuff...
> >   */
> > @@ -1544,6 +1558,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
> >  
> >  	x86_init_rdrand(c);
> >  	setup_pku(c);
> > +	setup_pks();
> >  
> >  	/*
> >  	 * Clear/Set all flags overridden by options, need do it
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 6c974888f86f..1b9bc004d9bc 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -822,6 +822,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
> >  	bool
> >  config ARCH_HAS_PKEYS
> >  	bool
> > +config ARCH_HAS_SUPERVISOR_PKEYS
> > +	bool
> >  
> >  config PERCPU_STATS
> >  	bool "Collect percpu memory statistics"
> > 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch
  2020-10-13 18:31   ` Dave Hansen
@ 2020-10-14 22:36     ` Ira Weiny
  2020-10-16 11:12     ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-14 22:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Oct 13, 2020 at 11:31:45AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The PKRS MSR is defined as a per-logical-processor register.  This
> > isolates memory access by logical CPU.  Unfortunately, the MSR is not
> > managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
> > context switch.
> > 
> > Define a saved PKRS value in the task struct, as well as a cached
> > per-logical-processor MSR value which mirrors the MSR value of the
> > current CPU.  Initialize all tasks with the default MSR value.  Then, on
> > schedule in, check the saved task MSR vs the per-cpu value.  If
> > different proceed to write the MSR.  If not avoid the overhead of the
> > MSR write and continue.
> 
> It's probably nice to note how the WRMSR is special here, in addition to
> the comments below.

Sure,

> 
> >  #endif /*_ASM_X86_PKEYS_INTERNAL_H */
> > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> > index 97143d87994c..da2381136b2d 100644
> > --- a/arch/x86/include/asm/processor.h
> > +++ b/arch/x86/include/asm/processor.h
> > @@ -18,6 +18,7 @@ struct vm86;
> >  #include <asm/cpufeatures.h>
> >  #include <asm/page.h>
> >  #include <asm/pgtable_types.h>
> > +#include <asm/pkeys_common.h>
> >  #include <asm/percpu.h>
> >  #include <asm/msr.h>
> >  #include <asm/desc_defs.h>
> > @@ -542,6 +543,11 @@ struct thread_struct {
> >  
> >  	unsigned int		sig_on_uaccess_err:1;
> >  
> > +#ifdef	CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +	/* Saved Protection key register for supervisor mappings */
> > +	u32			saved_pkrs;
> > +#endif
> 
> Could you take a look around thread_struct and see if there are some
> other MSRs near which you can stash this?  This seems like a bit of a
> lonely place.

Are you more concerned with aesthetics or the in memory struct layout?

How about I put it after error_code?

	unsigned long           error_code;                                     
+                                                                               
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS                                        
+       /* Saved Protection key register for supervisor mappings */             
+       u32                     saved_pkrs;                                     
+#endif                                                                         
+                                                                               

?

> 
> ...
> >  void flush_thread(void)
> >  {
> >  	struct task_struct *tsk = current;
> > @@ -195,6 +212,8 @@ void flush_thread(void)
> >  	memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
> >  
> >  	fpu__clear_all(&tsk->thread.fpu);
> > +
> > +	pks_init_task(tsk);
> >  }
> >  
> >  void disable_TSC(void)
> > @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >  
> >  	if ((tifp ^ tifn) & _TIF_SLD)
> >  		switch_to_sld(tifn);
> > +
> > +	pks_sched_in();
> >  }
> >  
> >  /*
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index 3cf8f775f36d..30f65dd3d0c5 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> >  
> >  	return pk_reg;
> >  }
> > +
> > +DEFINE_PER_CPU(u32, pkrs_cache);
> > +
> > +/**
> > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > + * serializing but still maintains ordering properties similar to WRPKRU.
> > + * The current SDM section on PKRS needs updating but should be the same as
> > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > + *
> > + * 	WRPKRU will never execute transiently. Memory accesses
> > + * 	affected by PKRU register will not execute (even transiently)
> > + * 	until all prior executions of WRPKRU have completed execution
> > + * 	and updated the PKRU register.
> > + */
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +	u32 *pkrs;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_PKS))
> > +		return;
> > +
> > +	pkrs = get_cpu_ptr(&pkrs_cache);
> > +	if (*pkrs != new_pkrs) {
> > +		*pkrs = new_pkrs;
> > +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +	}
> > +	put_cpu_ptr(pkrs);
> > +}
> > 
> 
> It bugs me a *bit* that this is being called in a preempt-disabled
> region, but we still bother with the get/put_cpu jazz.  Are there other
> future call-sites for this that aren't in preempt-disabled regions?

I'm not specifically disabling preempt before calling write_pkrs except in the
next patch (which is buggy because I meant to have it around the modification
of thread.saved_pkrs as well).  But that was to protect the thread variable not
the percpu cache vs MSR.

My thought above was it is safer for this call to ensure the per-cpu variable
is consistent with the register.  The other calls to write_pkrs() may require
preemption disable but for reasons unrelated to write_pkrs' state.

After some research I've now fully confused myself if this is needed in patch
7/9 where write_pkrs() is called from the exception handing code.  But I think
it is needed there.  Isn't it?

Since preempt_disable() is nestable I think this is ok correct?

Ira

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API
  2020-10-13 18:43   ` Dave Hansen
@ 2020-10-15  1:08     ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-15  1:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Oct 13, 2020 at 11:43:57AM -0700, Dave Hansen wrote:
> > +static inline void pks_update_protection(int pkey, unsigned long protection)
> > +{
> > +	current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
> > +						     pkey, protection);
> > +	preempt_disable();
> > +	write_pkrs(current->thread.saved_pkrs);
> > +	preempt_enable();
> > +}
> 
> Why does this need preempt count manipulation in addition to the
> get/put_cpu_var() inside of write_pkrs()?

This is a bug.  The disable should be around the update_pkey_val().

> 
> > +/**
> > + * PKS access control functions
> > + *
> > + * Change the access of the domain specified by the pkey.  These are global
> > + * updates.  They only affects the current running thread.  It is undefined and
> > + * a bug for users to call this without having allocated a pkey and using it as
> > + * pkey here.
> > + *
> > + * pks_mknoaccess()
> > + *     Disable all access to the domain
> > + * pks_mkread()
> > + *     Make the domain Read only
> > + * pks_mkrdwr()
> > + *     Make the domain Read/Write
> > + *
> > + * @pkey the pkey for which the access should change.
> > + *
> > + */
> > +void pks_mknoaccess(int pkey)
> > +{
> > +	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
> > +}
> > +EXPORT_SYMBOL_GPL(pks_mknoaccess);
> 
> These are named like PTE manipulation functions, which is kinda weird.
> 
> What's wrong with: pks_disable_access(pkey) ?

Internal review suggested these names.  I'm not dead set on them.

FWIW I would rather they not get to wordy.

I was trying to get some consistency with pks_mk*() as meaning PKS 'make' X.

Do me 'disable' implies a state transition where 'make' implies we are
'setting' an absolute value.  I think the later is a better name.  And 'make'
made more sense because 'set' is so overloaded IHO.

> 
> > +void pks_mkread(int pkey)
> > +{
> > +	pks_update_protection(pkey, PKEY_DISABLE_WRITE);
> > +}
> > +EXPORT_SYMBOL_GPL(pks_mkread);
> 
> I really don't like this name.  It doesn't make readable, or even
> read-only, *especially* if it was already access-disabled.

Ok.

But it does sense if going from access-disable to read, correct?.  I could see
this being better named pks_mkreadonly() so that going from RW to this would
make more sense.  Especially after thinking about it above 'read only' needs to
be in the name.

Before I change anything I'd like to get consensus on naming.

How about the following?

pks_mk_noaccess()
pks_mk_readonly()
pks_mk_readwrite()

?

> 
> > +static const char pks_key_user0[] = "kernel";
> > +
> > +/* Store names of allocated keys for debug.  Key 0 is reserved for the kernel.  */
> > +static const char *pks_key_users[PKS_NUM_KEYS] = {
> > +	pks_key_user0
> > +};
> > +
> > +/*
> > + * Each key is represented by a bit.  Bit 0 is set for key 0 and reserved for
> > + * its use.  We use ulong for the bit operations but only 16 bits are used.
> > + */
> > +static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY;
> > +
> > +/*
> > + * pks_key_alloc - Allocate a PKS key
> > + *
> > + * @pkey_user: String stored for debugging of key exhaustion.  The caller is
> > + * responsible to maintain this memory until pks_key_free().
> > + */
> > +int pks_key_alloc(const char * const pkey_user)
> > +{
> > +	int nr;
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> > +		return -EINVAL;
> 
> I'm not sure I like -EINVAL for this.  I thought we returned -ENOSPC for
> this case for user pkeys.

-ENOTSUP?

I'm not really sure anyone will need to know the difference between the
platform not supporting the key vs running out of them.  But they are 2
different error conditions.

> 
> > +	while (1) {
> > +		nr = find_first_zero_bit(&pks_key_allocation_map, PKS_NUM_KEYS);
> > +		if (nr >= PKS_NUM_KEYS) {
> > +			pr_info("Cannot allocate supervisor key for %s.\n",
> > +				pkey_user);
> > +			return -ENOSPC;

We return -ENOSPC here when running out of keys.

> > +		}
> > +		if (!test_and_set_bit_lock(nr, &pks_key_allocation_map))
> > +			break;
> > +	}
> > +
> > +	/* for debugging key exhaustion */
> > +	pks_key_users[nr] = pkey_user;
> > +
> > +	return nr;
> > +}
> > +EXPORT_SYMBOL_GPL(pks_key_alloc);
> > +
> > +/*
> > + * pks_key_free - Free a previously allocate PKS key
> > + *
> > + * @pkey: Key to be free'ed
> > + */
> > +void pks_key_free(int pkey)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> > +		return;
> > +
> > +	if (pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY)
> > +		return;
> 
> This seems worthy of a WARN_ON_ONCE() at least.  It's essentially
> corrupt data coming into a kernel API.

Ok, Done,
Ira


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions
  2020-10-13 18:52   ` Dave Hansen
@ 2020-10-15  3:46     ` Ira Weiny
  2020-10-15  4:06       ` Dave Hansen
  0 siblings, 1 reply; 47+ messages in thread
From: Ira Weiny @ 2020-10-15  3:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, x86, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Tue, Oct 13, 2020 at 11:52:32AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> > @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
> >  	/* Use the combo lockdep/tracing function */
> >  	trace_hardirqs_off();
> >  	instrumentation_end();
> > +
> > +done:
> > +	irq_save_pkrs(state);
> >  }
> 
> One nit: This saves *and* sets PKRS.  It's not obvious from the call
> here that PKRS is altered at this site.  Seems like there could be a
> better name.
> 
> Even if we did:
> 
> 	irq_save_set_pkrs(state, INIT_VAL);
> 
> It would probably compile down to the same thing, but be *really*
> obvious what's going on.

I suppose that is true.  But I think it is odd having a parameter which is the
same for every call site.

But I'm not going to quibble over something like this.

Changed,
Ira

> 
> >  void irqentry_exit_cond_resched(void)
> > @@ -362,7 +365,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *state)
> >  	/* Check whether this returns to user mode */
> >  	if (user_mode(regs)) {
> >  		irqentry_exit_to_user_mode(regs);
> > -	} else if (!regs_irqs_disabled(regs)) {
> > +		return;
> > +	}
> > +
> > +	irq_restore_pkrs(state);
> > +
> > +	if (!regs_irqs_disabled(regs)) {
> >  		/*
> >  		 * If RCU was not watching on entry this needs to be done
> >  		 * carefully and needs the same ordering of lockdep/tracing
> > 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions
  2020-10-15  3:46     ` Ira Weiny
@ 2020-10-15  4:06       ` Dave Hansen
  2020-10-15  4:18         ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2020-10-15  4:06 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, x86, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On 10/14/20 8:46 PM, Ira Weiny wrote:
> On Tue, Oct 13, 2020 at 11:52:32AM -0700, Dave Hansen wrote:
>> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
>>> @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
>>>  	/* Use the combo lockdep/tracing function */
>>>  	trace_hardirqs_off();
>>>  	instrumentation_end();
>>> +
>>> +done:
>>> +	irq_save_pkrs(state);
>>>  }
>> One nit: This saves *and* sets PKRS.  It's not obvious from the call
>> here that PKRS is altered at this site.  Seems like there could be a
>> better name.
>>
>> Even if we did:
>>
>> 	irq_save_set_pkrs(state, INIT_VAL);
>>
>> It would probably compile down to the same thing, but be *really*
>> obvious what's going on.
> I suppose that is true.  But I think it is odd having a parameter which is the
> same for every call site.

Well, it depends on what you optimize for.  I'm trying to optimize for
the code being understood quickly the first time someone reads it.  To
me, that's more important than minimizing the number of function
parameters (which are essentially free).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 8/9] x86/fault: Report the PKRS state on fault
  2020-10-13 18:56   ` Dave Hansen
@ 2020-10-15  4:13     ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-15  4:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, x86, Dave Hansen, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Tue, Oct 13, 2020 at 11:56:53AM -0700, Dave Hansen wrote:
> > @@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
> >  		 (error_code & X86_PF_PK)    ? "protection keys violation" :
> >  					       "permissions violation");
> >  
> > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +	if (irq_state && (error_code & X86_PF_PK))
> > +		pr_alert("PKRS: 0x%x\n", irq_state->pkrs);
> > +#endif
> 
> This means everyone will see 'PKRS: 0x0', even if they're on non-PKS
> hardware.  I think I'd rather have this only show PKRS when we're on
> cpu_feature_enabled(PKS) hardware.

Good catch, thanks.

> 
> ...
> > @@ -1148,14 +1156,15 @@ static int fault_in_kernel_space(unsigned long address)
> >   */
> >  static void
> >  do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> > -		   unsigned long address)
> > +		   unsigned long address, irqentry_state_t *irq_state)
> >  {
> >  	/*
> > -	 * Protection keys exceptions only happen on user pages.  We
> > -	 * have no user pages in the kernel portion of the address
> > -	 * space, so do not expect them here.
> > +	 * If protection keys are not enabled for kernel space
> > +	 * do not expect Pkey errors here.
> >  	 */
> 
> Let's fix the double-negative:
> 
> 	/*
> 	 * PF_PK is only expected on kernel addresses whenn
> 	 * supervisor pkeys are enabled:
> 	 */

done. thanks.

> 
> > -	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS) ||
> > +	    !cpu_feature_enabled(X86_FEATURE_PKS))
> > +		WARN_ON_ONCE(hw_error_code & X86_PF_PK);
> 
> Yeah, please stick X86_FEATURE_PKS in disabled-features so you can use
> cpu_feature_enabled(X86_FEATURE_PKS) by itself here..

done.

thanks,
Ira


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions
  2020-10-15  4:06       ` Dave Hansen
@ 2020-10-15  4:18         ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-15  4:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, x86, Dan Williams, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Wed, Oct 14, 2020 at 09:06:44PM -0700, Dave Hansen wrote:
> On 10/14/20 8:46 PM, Ira Weiny wrote:
> > On Tue, Oct 13, 2020 at 11:52:32AM -0700, Dave Hansen wrote:
> >> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> >>> @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
> >>>  	/* Use the combo lockdep/tracing function */
> >>>  	trace_hardirqs_off();
> >>>  	instrumentation_end();
> >>> +
> >>> +done:
> >>> +	irq_save_pkrs(state);
> >>>  }
> >> One nit: This saves *and* sets PKRS.  It's not obvious from the call
> >> here that PKRS is altered at this site.  Seems like there could be a
> >> better name.
> >>
> >> Even if we did:
> >>
> >> 	irq_save_set_pkrs(state, INIT_VAL);
> >>
> >> It would probably compile down to the same thing, but be *really*
> >> obvious what's going on.
> > I suppose that is true.  But I think it is odd having a parameter which is the
> > same for every call site.
> 
> Well, it depends on what you optimize for.  I'm trying to optimize for
> the code being understood quickly the first time someone reads it.  To
> me, that's more important than minimizing the number of function
> parameters (which are essentially free).
>

Agreed.  Sorry I was not trying to be confrontational.  There is just enough
other things which are going to take me time to get right I need to focus on
them...  :-D

Sorry,
Ira

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 9/9] x86/pks: Add PKS test code
  2020-10-13 19:02   ` Dave Hansen
@ 2020-10-15  4:46     ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-15  4:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Oct 13, 2020 at 12:02:07PM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> >  #ifdef CONFIG_X86_32
> >  	/*
> >  	 * We can fault-in kernel-space virtual memory on-demand. The
> > diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> > index cc3510cde64e..f9552bd9341f 100644
> > --- a/include/linux/pkeys.h
> > +++ b/include/linux/pkeys.h
> > @@ -47,7 +47,6 @@ static inline bool arch_pkeys_enabled(void)
> >  static inline void copy_init_pkru_to_fpregs(void)
> >  {
> >  }
> > -
> >  #endif /* ! CONFIG_ARCH_HAS_PKEYS */
> 
> ^ Whitespace damage

Done.

> 
> >  #ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 0c781f912f9f..f015c09ba5a1 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -2400,6 +2400,18 @@ config HYPERV_TESTING
> >  	help
> >  	  Select this option to enable Hyper-V vmbus testing.
> >  
> > +config PKS_TESTING
> > +	bool "PKey(S)upervisor testing"
> 
> Seems like we need a space in there somewhere.

heheh...  yea...

> 
> > +	pid = fork();
> > +	if (pid == 0) {
> > +		fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
> > +		if (fd < 0) {
> > +			printf("cannot open file\n");
> > +			return -1;
> > +		}
> > +
> 
> Will this return code make anybody mad?  Should we have a nicer return
> code for when this is running on non-PKS hardware?

I'm not sure it will matter much but I think it is better to report the missing
file.[1]

> 
> I'm not going to be too picky about this.  I'll just ask one question:
> Has this found real bugs for you?

Many, especially regressions as things have changed.

> 
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> 

Thanks,
Ira

[1]

diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
index 8037a2a9ff5f..11be4e212d54 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -11,6 +11,8 @@
 #include <sys/stat.h>
 #include <fcntl.h>
 
+#define PKS_TEST_FILE "/sys/kernel/debug/x86/run_pks"
+
 int main(void)
 {
        cpu_set_t cpuset;
@@ -25,9 +27,9 @@ int main(void)
 
        pid = fork();
        if (pid == 0) {
-               fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
+               fd = open(PKS_TEST_FILE, O_RDWR);
                if (fd < 0) {
-                       printf("cannot open file\n");
+                       printf("cannot open %s\n", PKS_TEST_FILE);
                        return -1;
                }
 
@@ -45,9 +47,9 @@ int main(void)
        } else {
                sleep(2);
 
-               fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
+               fd = open(PKS_TEST_FILE, O_RDWR);
                if (fd < 0) {
-                       printf("cannot open file\n");
+                       printf("cannot open %s\n", PKS_TEST_FILE);
                        return -1;
                }
 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-10-09 19:42 ` [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support ira.weiny
  2020-10-13 17:50   ` Dave Hansen
@ 2020-10-16 10:57   ` Peter Zijlstra
  2020-10-17  3:32     ` Ira Weiny
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2020-10-16 10:57 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 09, 2020 at 12:42:51PM -0700, ira.weiny@intel.com wrote:
> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> Define a helper, update_pkey_val(), which will be used to support both
> Protection Key User (PKU) and the new Protection Key for Supervisor
> (PKS) in subsequent patches.
> 
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> ---
>  arch/x86/include/asm/pkeys.h |  2 ++
>  arch/x86/kernel/fpu/xstate.c | 22 ++++------------------
>  arch/x86/mm/pkeys.c          | 21 +++++++++++++++++++++
>  3 files changed, 27 insertions(+), 18 deletions(-)

This is not from Fenghua.

  https://lkml.kernel.org/r/20200717085442.GX10769@hirez.programming.kicks-ass.net

This is your patch based on the code I wrote.

> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index f5efb4007e74..3cf8f775f36d 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -208,3 +208,24 @@ static __init int setup_init_pkru(char *opt)
>  	return 1;
>  }
>  __setup("init_pkru=", setup_init_pkru);
> +
> +/*
> + * Update the pk_reg value and return it.
> + *
> + * Kernel users use the same flags as user space:
> + *     PKEY_DISABLE_ACCESS
> + *     PKEY_DISABLE_WRITE
> + */
> +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> +{
> +	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> +
> +	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> +
> +	if (flags & PKEY_DISABLE_ACCESS)
> +		pk_reg |= PKR_AD_BIT << pkey_shift;
> +	if (flags & PKEY_DISABLE_WRITE)
> +		pk_reg |= PKR_WD_BIT << pkey_shift;
> +
> +	return pk_reg;
> +}
> -- 
> 2.28.0.rc0.12.gb6a658bd00c9
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch
  2020-10-09 19:42 ` [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
  2020-10-13 18:31   ` Dave Hansen
@ 2020-10-16 11:06   ` Peter Zijlstra
  2020-10-17  5:37     ` Ira Weiny
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2020-10-16 11:06 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 09, 2020 at 12:42:53PM -0700, ira.weiny@intel.com wrote:

> @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
>  
>  	if ((tifp ^ tifn) & _TIF_SLD)
>  		switch_to_sld(tifn);
> +
> +	pks_sched_in();
>  }
>  

You seem to have lost the comment proposed here:

  https://lkml.kernel.org/r/20200717083140.GW10769@hirez.programming.kicks-ass.net

It is useful and important information that the wrmsr normally doesn't
happen.

> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index 3cf8f775f36d..30f65dd3d0c5 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>  
>  	return pk_reg;
>  }
> +
> +DEFINE_PER_CPU(u32, pkrs_cache);
> +
> +/**
> + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> + * serializing but still maintains ordering properties similar to WRPKRU.
> + * The current SDM section on PKRS needs updating but should be the same as
> + * that of WRPKRU.  So to quote from the WRPKRU text:
> + *
> + * 	WRPKRU will never execute transiently. Memory accesses
> + * 	affected by PKRU register will not execute (even transiently)
> + * 	until all prior executions of WRPKRU have completed execution
> + * 	and updated the PKRU register.

(whitespace damage; space followed by tabstop)

> + */
> +void write_pkrs(u32 new_pkrs)
> +{
> +	u32 *pkrs;
> +
> +	if (!static_cpu_has(X86_FEATURE_PKS))
> +		return;
> +
> +	pkrs = get_cpu_ptr(&pkrs_cache);
> +	if (*pkrs != new_pkrs) {
> +		*pkrs = new_pkrs;
> +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> +	}
> +	put_cpu_ptr(pkrs);
> +}

looks familiar that... :-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API
  2020-10-09 19:42 ` [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API ira.weiny
  2020-10-13 18:43   ` Dave Hansen
@ 2020-10-16 11:07   ` Peter Zijlstra
  2020-10-17  5:42     ` Ira Weiny
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2020-10-16 11:07 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 09, 2020 at 12:42:54PM -0700, ira.weiny@intel.com wrote:
> +static inline void pks_update_protection(int pkey, unsigned long protection)
> +{
> +	current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
> +						     pkey, protection);
> +	preempt_disable();
> +	write_pkrs(current->thread.saved_pkrs);
> +	preempt_enable();
> +}

write_pkrs() already disables preemption itself. Wrapping it in yet
another layer is useless.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch
  2020-10-13 18:31   ` Dave Hansen
  2020-10-14 22:36     ` Ira Weiny
@ 2020-10-16 11:12     ` Peter Zijlstra
  2020-10-17  5:14       ` Ira Weiny
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2020-10-16 11:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Oct 13, 2020 at 11:31:45AM -0700, Dave Hansen wrote:
> > +/**
> > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > + * serializing but still maintains ordering properties similar to WRPKRU.
> > + * The current SDM section on PKRS needs updating but should be the same as
> > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > + *
> > + * 	WRPKRU will never execute transiently. Memory accesses
> > + * 	affected by PKRU register will not execute (even transiently)
> > + * 	until all prior executions of WRPKRU have completed execution
> > + * 	and updated the PKRU register.
> > + */
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +	u32 *pkrs;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_PKS))
> > +		return;
> > +
> > +	pkrs = get_cpu_ptr(&pkrs_cache);
> > +	if (*pkrs != new_pkrs) {
> > +		*pkrs = new_pkrs;
> > +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +	}
> > +	put_cpu_ptr(pkrs);
> > +}
> > 
> 
> It bugs me a *bit* that this is being called in a preempt-disabled
> region, but we still bother with the get/put_cpu jazz.  Are there other
> future call-sites for this that aren't in preempt-disabled regions?

So the previous version had a useful comment that got lost. This stuff
needs to fundamentally be preempt disabled, so it either needs to
explicitly do so, or have an assertion that preemption is indeed
disabled.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference
  2020-10-09 19:42 ` [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference ira.weiny
@ 2020-10-16 11:45   ` Peter Zijlstra
  2020-10-16 12:55     ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2020-10-16 11:45 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 09, 2020 at 12:42:55PM -0700, ira.weiny@intel.com wrote:
> -noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
> +noinstr void idtentry_enter_nmi(struct pt_regs *regs, irqentry_state_t *irq_state)
>  {
> -	bool irq_state = lockdep_hardirqs_enabled();
> +	irq_state->exit_rcu = lockdep_hardirqs_enabled();
>  
>  	__nmi_enter();
>  	lockdep_hardirqs_off(CALLER_ADDR0);
> @@ -222,15 +222,13 @@ noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
>  	trace_hardirqs_off_finish();
>  	ftrace_nmi_enter();
>  	instrumentation_end();
> -
> -	return irq_state;
>  }
>  
> -noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
> +noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state)
>  {
>  	instrumentation_begin();
>  	ftrace_nmi_exit();
> -	if (restore) {
> +	if (irq_state->exit_rcu) {
>  		trace_hardirqs_on_prepare();
>  		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
>  	}
> @@ -238,7 +236,7 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
>  
>  	rcu_nmi_exit();
>  	lockdep_hardirq_exit();
> -	if (restore)
> +	if (irq_state->exit_rcu)
>  		lockdep_hardirqs_on(CALLER_ADDR0);
>  	__nmi_exit();
>  }

That's not nice.. The NMI path is different from the IRQ path and has a
different variable. Yes, this works, but *groan*.

Maybe union them if you want to avoid bloating the structure, but the
above makes it really hard to read.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference
  2020-10-16 11:45   ` Peter Zijlstra
@ 2020-10-16 12:55     ` Thomas Gleixner
  2020-10-19  5:37       ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2020-10-16 12:55 UTC (permalink / raw)
  To: Peter Zijlstra, ira.weiny
  Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, x86, Dave Hansen,
	Dan Williams, Andrew Morton, Fenghua Yu, linux-doc, linux-kernel,
	linux-nvdimm, linux-fsdevel, linux-mm, linux-kselftest

On Fri, Oct 16 2020 at 13:45, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 12:42:55PM -0700, ira.weiny@intel.com wrote:
>> @@ -238,7 +236,7 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
>>  
>>  	rcu_nmi_exit();
>>  	lockdep_hardirq_exit();
>> -	if (restore)
>> +	if (irq_state->exit_rcu)
>>  		lockdep_hardirqs_on(CALLER_ADDR0);
>>  	__nmi_exit();
>>  }
>
> That's not nice.. The NMI path is different from the IRQ path and has a
> different variable. Yes, this works, but *groan*.
>
> Maybe union them if you want to avoid bloating the structure, but the
> above makes it really hard to read.

Right, and also that nmi entry thing should not be in x86. Something
like the untested below as first cleanup.

Thanks,

        tglx
----
Subject: x86/entry: Move nmi entry/exit into common code
From: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 11 Sep 2020 10:09:56 +0200

Add blurb here.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/common.c         |   34 ----------------------------------
 arch/x86/include/asm/idtentry.h |    3 ---
 arch/x86/kernel/cpu/mce/core.c  |    6 +++---
 arch/x86/kernel/nmi.c           |    6 +++---
 arch/x86/kernel/traps.c         |   13 +++++++------
 include/linux/entry-common.h    |   20 ++++++++++++++++++++
 kernel/entry/common.c           |   36 ++++++++++++++++++++++++++++++++++++
 7 files changed, 69 insertions(+), 49 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,40 +209,6 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
-noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
-{
-	bool irq_state = lockdep_hardirqs_enabled();
-
-	__nmi_enter();
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	lockdep_hardirq_enter();
-	rcu_nmi_enter();
-
-	instrumentation_begin();
-	trace_hardirqs_off_finish();
-	ftrace_nmi_enter();
-	instrumentation_end();
-
-	return irq_state;
-}
-
-noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
-{
-	instrumentation_begin();
-	ftrace_nmi_exit();
-	if (restore) {
-		trace_hardirqs_on_prepare();
-		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-	}
-	instrumentation_end();
-
-	rcu_nmi_exit();
-	lockdep_hardirq_exit();
-	if (restore)
-		lockdep_hardirqs_on(CALLER_ADDR0);
-	__nmi_exit();
-}
-
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,9 +11,6 @@
 
 #include <asm/irq_stack.h>
 
-bool idtentry_enter_nmi(struct pt_regs *regs);
-void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
-
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1983,7 +1983,7 @@ void (*machine_check_vector)(struct pt_r
 
 static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 {
-	bool irq_state;
+	irqentry_state_t irq_state;
 
 	WARN_ON_ONCE(user_mode(regs));
 
@@ -1995,7 +1995,7 @@ static __always_inline void exc_machine_
 	    mce_check_crashing_cpu())
 		return;
 
-	irq_state = idtentry_enter_nmi(regs);
+	irq_state = irqentry_nmi_enter(regs);
 	/*
 	 * The call targets are marked noinstr, but objtool can't figure
 	 * that out because it's an indirect call. Annotate it.
@@ -2006,7 +2006,7 @@ static __always_inline void exc_machine_
 	if (regs->flags & X86_EFLAGS_IF)
 		trace_hardirqs_on_prepare();
 	instrumentation_end();
-	idtentry_exit_nmi(regs, irq_state);
+	irqentry_nmi_exit(regs, irq_state);
 }
 
 static __always_inline void exc_machine_check_user(struct pt_regs *regs)
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi
 
 DEFINE_IDTENTRY_RAW(exc_nmi)
 {
-	bool irq_state;
+	irqentry_state_t irq_state;
 
 	/*
 	 * Re-enable NMIs right here when running as an SEV-ES guest. This might
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
 	this_cpu_write(nmi_dr7, local_db_save());
 
-	irq_state = idtentry_enter_nmi(regs);
+	irq_state = irqentry_nmi_enter(regs);
 
 	inc_irq_stat(__nmi_count);
 
 	if (!ignore_nmis)
 		default_do_nmi(regs);
 
-	idtentry_exit_nmi(regs, irq_state);
+	irqentry_nmi_exit(regs, irq_state);
 
 	local_db_restore(this_cpu_read(nmi_dr7));
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -405,7 +405,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 	}
 #endif
 
-	idtentry_enter_nmi(regs);
+	irqentry_nmi_enter(regs);
 	instrumentation_begin();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
@@ -651,12 +651,13 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		instrumentation_end();
 		irqentry_exit_to_user_mode(regs);
 	} else {
-		bool irq_state = idtentry_enter_nmi(regs);
+		irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
 		instrumentation_begin();
 		if (!do_int3(regs))
 			die("int3", regs, 0);
 		instrumentation_end();
-		idtentry_exit_nmi(regs, irq_state);
+		irqentry_nmi_exit(regs, irq_state);
 	}
 }
 
@@ -864,7 +865,7 @@ static __always_inline void exc_debug_ke
 	 * includes the entry stack is excluded for everything.
 	 */
 	unsigned long dr7 = local_db_save();
-	bool irq_state = idtentry_enter_nmi(regs);
+	irqentry_state_t irq_state = irqentry_nmi_enter(regs);
 	instrumentation_begin();
 
 	/*
@@ -907,7 +908,7 @@ static __always_inline void exc_debug_ke
 		regs->flags &= ~X86_EFLAGS_TF;
 out:
 	instrumentation_end();
-	idtentry_exit_nmi(regs, irq_state);
+	irqentry_nmi_exit(regs, irq_state);
 
 	local_db_restore(dr7);
 }
@@ -925,7 +926,7 @@ static __always_inline void exc_debug_us
 
 	/*
 	 * NB: We can't easily clear DR7 here because
-	 * idtentry_exit_to_usermode() can invoke ptrace, schedule, access
+	 * irqentry_exit_to_usermode() can invoke ptrace, schedule, access
 	 * user memory, etc.  This means that a recursive #DB is possible.  If
 	 * this happens, that #DB will hit exc_debug_kernel() and clear DR7.
 	 * Since we're not on the IST stack right now, everything will be
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
 #ifndef irqentry_state
 typedef struct irqentry_state {
 	bool	exit_rcu;
+	bool	lockdep;
 } irqentry_state_t;
 #endif
 
@@ -402,4 +403,23 @@ void irqentry_exit_cond_resched(void);
  */
 void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
 
+/**
+ * irqentry_nmi_enter - Handle NMI entry
+ * @regs:	Pointer to currents pt_regs
+ *
+ * Similar to irqentry_enter() but taking care of the NMI constraints.
+ */
+irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
+
+/**
+ * irqentry_nmi_exit - Handle return from NMI handling
+ * @regs:	Pointer to pt_regs (NMI entry regs)
+ * @state:	Return value from matching call to irqentry_nmi_enter()
+ *
+ * Last action before returning to the low level assmenbly code.
+ *
+ * Counterpart to irqentry_nmi_enter().
+ */
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
+
 #endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -398,3 +398,39 @@ noinstr void irqentry_exit(struct pt_reg
 			rcu_irq_exit();
 	}
 }
+
+irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+{
+	irqentry_state_t irq_state;
+
+	irq_state.lockdep = lockdep_hardirqs_enabled();
+
+	__nmi_enter();
+	lockdep_hardirqs_off(CALLER_ADDR0);
+	lockdep_hardirq_enter();
+	rcu_nmi_enter();
+
+	instrumentation_begin();
+	trace_hardirqs_off_finish();
+	ftrace_nmi_enter();
+	instrumentation_end();
+
+	return irq_state;
+}
+
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
+{
+	instrumentation_begin();
+	ftrace_nmi_exit();
+	if (irq_state.lockdep) {
+		trace_hardirqs_on_prepare();
+		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	}
+	instrumentation_end();
+
+	rcu_nmi_exit();
+	lockdep_hardirq_exit();
+	if (irq_state.lockdep)
+		lockdep_hardirqs_on(CALLER_ADDR0);
+	__nmi_exit();
+}






^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-10-16 10:57   ` Peter Zijlstra
@ 2020-10-17  3:32     ` Ira Weiny
  2020-10-19  9:35       ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Ira Weiny @ 2020-10-17  3:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 16, 2020 at 12:57:43PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 12:42:51PM -0700, ira.weiny@intel.com wrote:
> > From: Fenghua Yu <fenghua.yu@intel.com>
> > 
> > Define a helper, update_pkey_val(), which will be used to support both
> > Protection Key User (PKU) and the new Protection Key for Supervisor
> > (PKS) in subsequent patches.
> > 
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> > ---
> >  arch/x86/include/asm/pkeys.h |  2 ++
> >  arch/x86/kernel/fpu/xstate.c | 22 ++++------------------
> >  arch/x86/mm/pkeys.c          | 21 +++++++++++++++++++++
> >  3 files changed, 27 insertions(+), 18 deletions(-)
> 
> This is not from Fenghua.
> 
>   https://lkml.kernel.org/r/20200717085442.GX10769@hirez.programming.kicks-ass.net
> 
> This is your patch based on the code I wrote.

Ok, I apologize.  Yes the code below was all yours.

Is it ok to add?

Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>

?

Thanks,
Ira

> 
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index f5efb4007e74..3cf8f775f36d 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -208,3 +208,24 @@ static __init int setup_init_pkru(char *opt)
> >  	return 1;
> >  }
> >  __setup("init_pkru=", setup_init_pkru);
> > +
> > +/*
> > + * Update the pk_reg value and return it.
> > + *
> > + * Kernel users use the same flags as user space:
> > + *     PKEY_DISABLE_ACCESS
> > + *     PKEY_DISABLE_WRITE
> > + */
> > +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> > +{
> > +	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> > +
> > +	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> > +
> > +	if (flags & PKEY_DISABLE_ACCESS)
> > +		pk_reg |= PKR_AD_BIT << pkey_shift;
> > +	if (flags & PKEY_DISABLE_WRITE)
> > +		pk_reg |= PKR_WD_BIT << pkey_shift;
> > +
> > +	return pk_reg;
> > +}
> > -- 
> > 2.28.0.rc0.12.gb6a658bd00c9
> > 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch
  2020-10-16 11:12     ` Peter Zijlstra
@ 2020-10-17  5:14       ` Ira Weiny
  2020-10-19  9:37         ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Ira Weiny @ 2020-10-17  5:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Oct 16, 2020 at 01:12:26PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2020 at 11:31:45AM -0700, Dave Hansen wrote:
> > > +/**
> > > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > > + * serializing but still maintains ordering properties similar to WRPKRU.
> > > + * The current SDM section on PKRS needs updating but should be the same as
> > > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > > + *
> > > + * 	WRPKRU will never execute transiently. Memory accesses
> > > + * 	affected by PKRU register will not execute (even transiently)
> > > + * 	until all prior executions of WRPKRU have completed execution
> > > + * 	and updated the PKRU register.
> > > + */
> > > +void write_pkrs(u32 new_pkrs)
> > > +{
> > > +	u32 *pkrs;
> > > +
> > > +	if (!static_cpu_has(X86_FEATURE_PKS))
> > > +		return;
> > > +
> > > +	pkrs = get_cpu_ptr(&pkrs_cache);
> > > +	if (*pkrs != new_pkrs) {
> > > +		*pkrs = new_pkrs;
> > > +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > > +	}
> > > +	put_cpu_ptr(pkrs);
> > > +}
> > > 
> > 
> > It bugs me a *bit* that this is being called in a preempt-disabled
> > region, but we still bother with the get/put_cpu jazz.  Are there other
> > future call-sites for this that aren't in preempt-disabled regions?
> 
> So the previous version had a useful comment that got lost.

Ok Looking back I see what happened...  This comment...

 /*
  * PKRS is only temporarily changed during specific code paths.
  * Only a preemption during these windows away from the default
  * value would require updating the MSR.
  */

... was added to pks_sched_in() but that got simplified down because cleaning
up write_pkrs() made the code there obsolete.

> This stuff
> needs to fundamentally be preempt disabled,

I agree, the update to the percpu cache value and MSR can not be torn.

> so it either needs to
> explicitly do so, or have an assertion that preemption is indeed
> disabled.

However, I don't think I understand clearly.  Doesn't [get|put]_cpu_ptr()
handle the preempt_disable() for us?  Is it not sufficient to rely on that?

Dave's comment seems to be the opposite where we need to eliminate preempt
disable before calling write_pkrs().

FWIW I think I'm mistaken in my response to Dave regarding the
preempt_disable() in pks_update_protection().

Ira

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch
  2020-10-16 11:06   ` Peter Zijlstra
@ 2020-10-17  5:37     ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-17  5:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 16, 2020 at 01:06:36PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 12:42:53PM -0700, ira.weiny@intel.com wrote:
> 
> > @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >  
> >  	if ((tifp ^ tifn) & _TIF_SLD)
> >  		switch_to_sld(tifn);
> > +
> > +	pks_sched_in();
> >  }
> >  
> 
> You seem to have lost the comment proposed here:
> 
>   https://lkml.kernel.org/r/20200717083140.GW10769@hirez.programming.kicks-ass.net
> 
> It is useful and important information that the wrmsr normally doesn't
> happen.

Added back in here.

> 
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index 3cf8f775f36d..30f65dd3d0c5 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> >  
> >  	return pk_reg;
> >  }
> > +
> > +DEFINE_PER_CPU(u32, pkrs_cache);
> > +
> > +/**
> > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > + * serializing but still maintains ordering properties similar to WRPKRU.
> > + * The current SDM section on PKRS needs updating but should be the same as
> > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > + *
> > + * 	WRPKRU will never execute transiently. Memory accesses
> > + * 	affected by PKRU register will not execute (even transiently)
> > + * 	until all prior executions of WRPKRU have completed execution
> > + * 	and updated the PKRU register.
> 
> (whitespace damage; space followed by tabstop)

Fixed thanks.

> 
> > + */
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +	u32 *pkrs;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_PKS))
> > +		return;
> > +
> > +	pkrs = get_cpu_ptr(&pkrs_cache);
> > +	if (*pkrs != new_pkrs) {
> > +		*pkrs = new_pkrs;
> > +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +	}
> > +	put_cpu_ptr(pkrs);
> > +}
> 
> looks familiar that... :-)

Added you as a co-developer if that is ok?

Ira

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API
  2020-10-16 11:07   ` Peter Zijlstra
@ 2020-10-17  5:42     ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-17  5:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 16, 2020 at 01:07:47PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 12:42:54PM -0700, ira.weiny@intel.com wrote:
> > +static inline void pks_update_protection(int pkey, unsigned long protection)
> > +{
> > +	current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
> > +						     pkey, protection);
> > +	preempt_disable();
> > +	write_pkrs(current->thread.saved_pkrs);
> > +	preempt_enable();
> > +}
> 
> write_pkrs() already disables preemption itself. Wrapping it in yet
> another layer is useless.

I was thinking the update to saved_pkrs needed this protection as well and that
was to be included in the preemption disable.  But that too is incorrect.

I've removed this preemption disable.

Thanks,
Ira

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference
  2020-10-16 12:55     ` Thomas Gleixner
@ 2020-10-19  5:37       ` Ira Weiny
  2020-10-19  9:32         ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Ira Weiny @ 2020-10-19  5:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 16, 2020 at 02:55:21PM +0200, Thomas Gleixner wrote:
> On Fri, Oct 16 2020 at 13:45, Peter Zijlstra wrote:
> > On Fri, Oct 09, 2020 at 12:42:55PM -0700, ira.weiny@intel.com wrote:
> >> @@ -238,7 +236,7 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
> >>  
> >>  	rcu_nmi_exit();
> >>  	lockdep_hardirq_exit();
> >> -	if (restore)
> >> +	if (irq_state->exit_rcu)
> >>  		lockdep_hardirqs_on(CALLER_ADDR0);
> >>  	__nmi_exit();
> >>  }
> >
> > That's not nice.. The NMI path is different from the IRQ path and has a
> > different variable. Yes, this works, but *groan*.
> >
> > Maybe union them if you want to avoid bloating the structure, but the
> > above makes it really hard to read.
> 
> Right, and also that nmi entry thing should not be in x86. Something
> like the untested below as first cleanup.

Ok, I see what Peter was talking about.  I've added this patch to the series.

> 
> Thanks,
> 
>         tglx
> ----
> Subject: x86/entry: Move nmi entry/exit into common code
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Fri, 11 Sep 2020 10:09:56 +0200
> 
> Add blurb here.

How about:

To prepare for saving PKRS values across NMI's we lift the
idtentry_[enter|exit]_nmi() to the common code.  Rename them to
irqentry_nmi_[enter|exit]() to reflect the new generic nature and store the
state in the same irqentry_state_t structure as the other irqentry_*()
functions.  Finally, differentiate the state being stored between the NMI and
IRQ path by adding 'lockdep' to irqentry_state_t.

?

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/entry/common.c         |   34 ----------------------------------
>  arch/x86/include/asm/idtentry.h |    3 ---
>  arch/x86/kernel/cpu/mce/core.c  |    6 +++---
>  arch/x86/kernel/nmi.c           |    6 +++---
>  arch/x86/kernel/traps.c         |   13 +++++++------
>  include/linux/entry-common.h    |   20 ++++++++++++++++++++
>  kernel/entry/common.c           |   36 ++++++++++++++++++++++++++++++++++++
>  7 files changed, 69 insertions(+), 49 deletions(-)
> 

[snip]

> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
>  #ifndef irqentry_state
>  typedef struct irqentry_state {
>  	bool	exit_rcu;
> +	bool	lockdep;
>  } irqentry_state_t;

Building on what Peter said do you agree this should be made into a union?

It may not be strictly necessary in this patch but I think it would reflect the
mutual exclusivity better and could be changed easy enough in the follow on
patch which adds the pkrs state.

Ira

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference
  2020-10-19  5:37       ` Ira Weiny
@ 2020-10-19  9:32         ` Thomas Gleixner
  2020-10-19 20:26           ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2020-10-19  9:32 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Sun, Oct 18 2020 at 22:37, Ira Weiny wrote:
> On Fri, Oct 16, 2020 at 02:55:21PM +0200, Thomas Gleixner wrote:
>> Subject: x86/entry: Move nmi entry/exit into common code
>> From: Thomas Gleixner <tglx@linutronix.de>
>> Date: Fri, 11 Sep 2020 10:09:56 +0200
>> 
>> Add blurb here.
>
> How about:
>
> To prepare for saving PKRS values across NMI's we lift the
> idtentry_[enter|exit]_nmi() to the common code.  Rename them to
> irqentry_nmi_[enter|exit]() to reflect the new generic nature and store the
> state in the same irqentry_state_t structure as the other irqentry_*()
> functions.  Finally, differentiate the state being stored between the NMI and
> IRQ path by adding 'lockdep' to irqentry_state_t.

No. This has absolutely nothing to do with PKRS. It's a cleanup valuable
by itself and that's how it should have been done right away.

So the proper changelog is:

  Lockdep state handling on NMI enter and exit is nothing specific to
  X86. It's not any different on other architectures. Also the extra
  state type is not necessary, irqentry_state_t can carry the necessary
  information as well.

  Move it to common code and extend irqentry_state_t to carry lockdep
  state.

>> --- a/include/linux/entry-common.h
>> +++ b/include/linux/entry-common.h
>> @@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
>>  #ifndef irqentry_state
>>  typedef struct irqentry_state {
>>  	bool	exit_rcu;
>> +	bool	lockdep;
>>  } irqentry_state_t;
>
> Building on what Peter said do you agree this should be made into a union?
>
> It may not be strictly necessary in this patch but I think it would reflect the
> mutual exclusivity better and could be changed easy enough in the follow on
> patch which adds the pkrs state.

Why the heck should it be changed in a patch which adds something
completely different?

Either it's mutually exclusive or not and if so it want's to be done in
this patch and not in a change which extends the struct for other
reasons.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-10-17  3:32     ` Ira Weiny
@ 2020-10-19  9:35       ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2020-10-19  9:35 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Andrew Morton,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Oct 16, 2020 at 08:32:03PM -0700, Ira Weiny wrote:
> On Fri, Oct 16, 2020 at 12:57:43PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 09, 2020 at 12:42:51PM -0700, ira.weiny@intel.com wrote:
> > > From: Fenghua Yu <fenghua.yu@intel.com>
> > > 
> > > Define a helper, update_pkey_val(), which will be used to support both
> > > Protection Key User (PKU) and the new Protection Key for Supervisor
> > > (PKS) in subsequent patches.
> > > 
> > > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> > > ---
> > >  arch/x86/include/asm/pkeys.h |  2 ++
> > >  arch/x86/kernel/fpu/xstate.c | 22 ++++------------------
> > >  arch/x86/mm/pkeys.c          | 21 +++++++++++++++++++++
> > >  3 files changed, 27 insertions(+), 18 deletions(-)
> > 
> > This is not from Fenghua.
> > 
> >   https://lkml.kernel.org/r/20200717085442.GX10769@hirez.programming.kicks-ass.net
> > 
> > This is your patch based on the code I wrote.
> 
> Ok, I apologize.  Yes the code below was all yours.
> 
> Is it ok to add?
> 
> Co-developed-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> 

Sure, thanks!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch
  2020-10-17  5:14       ` Ira Weiny
@ 2020-10-19  9:37         ` Peter Zijlstra
  2020-10-19 18:48           ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2020-10-19  9:37 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Oct 16, 2020 at 10:14:10PM -0700, Ira Weiny wrote:
> > so it either needs to
> > explicitly do so, or have an assertion that preemption is indeed
> > disabled.
> 
> However, I don't think I understand clearly.  Doesn't [get|put]_cpu_ptr()
> handle the preempt_disable() for us? 

It does.

> Is it not sufficient to rely on that?

It is.

> Dave's comment seems to be the opposite where we need to eliminate preempt
> disable before calling write_pkrs().
> 
> FWIW I think I'm mistaken in my response to Dave regarding the
> preempt_disable() in pks_update_protection().

Dave's concern is that we're calling with with preemption already
disabled so disabling it again is superfluous.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch
  2020-10-19  9:37         ` Peter Zijlstra
@ 2020-10-19 18:48           ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-19 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Mon, Oct 19, 2020 at 11:37:14AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 16, 2020 at 10:14:10PM -0700, Ira Weiny wrote:
> > > so it either needs to
> > > explicitly do so, or have an assertion that preemption is indeed
> > > disabled.
> > 
> > However, I don't think I understand clearly.  Doesn't [get|put]_cpu_ptr()
> > handle the preempt_disable() for us? 
> 
> It does.
> 
> > Is it not sufficient to rely on that?
> 
> It is.
> 
> > Dave's comment seems to be the opposite where we need to eliminate preempt
> > disable before calling write_pkrs().
> > 
> > FWIW I think I'm mistaken in my response to Dave regarding the
> > preempt_disable() in pks_update_protection().
> 
> Dave's concern is that we're calling with with preemption already
> disabled so disabling it again is superfluous.

Ok, thanks, and after getting my head straight I think I agree with him, and
you.

Thanks I've reworked the code to removed the superfluous calls.  Sorry about
being so dense...  :-D

Ira

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference
  2020-10-19  9:32         ` Thomas Gleixner
@ 2020-10-19 20:26           ` Ira Weiny
  2020-10-19 21:12             ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Ira Weiny @ 2020-10-19 20:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Mon, Oct 19, 2020 at 11:32:50AM +0200, Thomas Gleixner wrote:
> On Sun, Oct 18 2020 at 22:37, Ira Weiny wrote:
> > On Fri, Oct 16, 2020 at 02:55:21PM +0200, Thomas Gleixner wrote:
> >> Subject: x86/entry: Move nmi entry/exit into common code
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> Date: Fri, 11 Sep 2020 10:09:56 +0200
> >> 
> >> Add blurb here.
> >
> > How about:
> >
> > To prepare for saving PKRS values across NMI's we lift the
> > idtentry_[enter|exit]_nmi() to the common code.  Rename them to
> > irqentry_nmi_[enter|exit]() to reflect the new generic nature and store the
> > state in the same irqentry_state_t structure as the other irqentry_*()
> > functions.  Finally, differentiate the state being stored between the NMI and
> > IRQ path by adding 'lockdep' to irqentry_state_t.
> 
> No. This has absolutely nothing to do with PKRS. It's a cleanup valuable
> by itself and that's how it should have been done right away.
> 
> So the proper changelog is:
> 
>   Lockdep state handling on NMI enter and exit is nothing specific to
>   X86. It's not any different on other architectures. Also the extra
>   state type is not necessary, irqentry_state_t can carry the necessary
>   information as well.
> 
>   Move it to common code and extend irqentry_state_t to carry lockdep
>   state.

Ok sounds good, thanks.

> 
> >> --- a/include/linux/entry-common.h
> >> +++ b/include/linux/entry-common.h
> >> @@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
> >>  #ifndef irqentry_state
> >>  typedef struct irqentry_state {
> >>  	bool	exit_rcu;
> >> +	bool	lockdep;
> >>  } irqentry_state_t;
> >
> > Building on what Peter said do you agree this should be made into a union?
> >
> > It may not be strictly necessary in this patch but I think it would reflect the
> > mutual exclusivity better and could be changed easy enough in the follow on
> > patch which adds the pkrs state.
> 
> Why the heck should it be changed in a patch which adds something
> completely different?

Because the PKRS stuff is used in both NMI and IRQ state.

> 
> Either it's mutually exclusive or not and if so it want's to be done in
> this patch and not in a change which extends the struct for other
> reasons.

Sorry, let me clarify.  After this patch we have.

typedef union irqentry_state {
	bool	exit_rcu;
	bool	lockdep;
} irqentry_state_t;

Which reflects the mutual exclusion of the 2 variables.

But then when the pkrs stuff is added the union changes back to a structure and
looks like this.

typedef struct irqentry_state {
#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
        u32 pkrs;
        u32 thread_pkrs;
#endif  
	union {
		bool    exit_rcu;
		bool	lockdep;
	};
} irqentry_state_t;

Because the pkrs information is in addition to exit_rcu OR lockdep.

So this is what I meant by 'could be changed easy enough in the follow on
patch'.

Is that clear?

Ira

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference
  2020-10-19 20:26           ` Ira Weiny
@ 2020-10-19 21:12             ` Thomas Gleixner
  2020-10-20 14:10               ` Ira Weiny
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2020-10-19 21:12 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Mon, Oct 19 2020 at 13:26, Ira Weiny wrote:
> On Mon, Oct 19, 2020 at 11:32:50AM +0200, Thomas Gleixner wrote:
> Sorry, let me clarify.  After this patch we have.
>
> typedef union irqentry_state {
> 	bool	exit_rcu;
> 	bool	lockdep;
> } irqentry_state_t;
>
> Which reflects the mutual exclusion of the 2 variables.

Huch? From the patch I gave you:

 #ifndef irqentry_state
 typedef struct irqentry_state {
 	bool    exit_rcu;
+       bool    lockdep;
 } irqentry_state_t;
 #endif

How is that a union?

> But then when the pkrs stuff is added the union changes back to a structure and
> looks like this.

So you want:

  1) Move stuff to struct irqentry_state (my patch)

  2) Change it to a union and pass it as pointer at the same time

  3) Change it back to struct to add PKRS

> Is that clear?

What's clear is that the above is nonsense. We can just do

 #ifndef irqentry_state
 typedef struct irqentry_state {
 	union {
         	bool    exit_rcu;
                bool    lockdep;
        };        
 } irqentry_state_t;
 #endif

right in the patch which I gave you. Because that actually makes sense.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference
  2020-10-19 21:12             ` Thomas Gleixner
@ 2020-10-20 14:10               ` Ira Weiny
  0 siblings, 0 replies; 47+ messages in thread
From: Ira Weiny @ 2020-10-20 14:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Andrew Morton, Fenghua Yu,
	linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

On Mon, Oct 19, 2020 at 11:12:44PM +0200, Thomas Gleixner wrote:
> On Mon, Oct 19 2020 at 13:26, Ira Weiny wrote:
> > On Mon, Oct 19, 2020 at 11:32:50AM +0200, Thomas Gleixner wrote:
> > Sorry, let me clarify.  After this patch we have.
> >
> > typedef union irqentry_state {
> > 	bool	exit_rcu;
> > 	bool	lockdep;
> > } irqentry_state_t;
> >
> > Which reflects the mutual exclusion of the 2 variables.
> 
> Huch? From the patch I gave you:
> 
>  #ifndef irqentry_state
>  typedef struct irqentry_state {
>  	bool    exit_rcu;
> +       bool    lockdep;
>  } irqentry_state_t;
>  #endif
> 
> How is that a union?

I was proposing to make it a union.

> 
> > But then when the pkrs stuff is added the union changes back to a structure and
> > looks like this.
> 
> So you want:
> 
>   1) Move stuff to struct irqentry_state (my patch)
> 
>   2) Change it to a union and pass it as pointer at the same time

No, I would have made it a union in your patch.

Pass by reference would remain largely the same.

> 
>   3) Change it back to struct to add PKRS

Yes.  :-/

> 
> > Is that clear?
> 
> What's clear is that the above is nonsense. We can just do
> 
>  #ifndef irqentry_state
>  typedef struct irqentry_state {
>  	union {
>          	bool    exit_rcu;
>                 bool    lockdep;
>         };        
>  } irqentry_state_t;
>  #endif
> 
> right in the patch which I gave you. Because that actually makes sense.

Ok I'm very sorry.  I was thinking that having a struct containing nothing but
an anonymous union would be unacceptable as a stand alone item in your patch.
In my experience other maintainers would have rejected such a change and
would have asked; 'why not just make it a union'?

I'm very happy skipping the gymnastics on individual patches in favor of making
the whole series work out in the end.

Thank you for your help again.  :-)

Ira

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, back to index

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-09 19:42 [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 ira.weiny
2020-10-09 19:42 ` [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h ira.weiny
2020-10-13 17:46   ` Dave Hansen
2020-10-13 19:44     ` Ira Weiny
2020-10-09 19:42 ` [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support ira.weiny
2020-10-13 17:50   ` Dave Hansen
2020-10-13 23:56     ` Ira Weiny
2020-10-16 10:57   ` Peter Zijlstra
2020-10-17  3:32     ` Ira Weiny
2020-10-19  9:35       ` Peter Zijlstra
2020-10-09 19:42 ` [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS) ira.weiny
2020-10-13 18:23   ` Dave Hansen
2020-10-14  2:08     ` Ira Weiny
2020-10-09 19:42 ` [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
2020-10-13 18:31   ` Dave Hansen
2020-10-14 22:36     ` Ira Weiny
2020-10-16 11:12     ` Peter Zijlstra
2020-10-17  5:14       ` Ira Weiny
2020-10-19  9:37         ` Peter Zijlstra
2020-10-19 18:48           ` Ira Weiny
2020-10-16 11:06   ` Peter Zijlstra
2020-10-17  5:37     ` Ira Weiny
2020-10-09 19:42 ` [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API ira.weiny
2020-10-13 18:43   ` Dave Hansen
2020-10-15  1:08     ` Ira Weiny
2020-10-16 11:07   ` Peter Zijlstra
2020-10-17  5:42     ` Ira Weiny
2020-10-09 19:42 ` [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference ira.weiny
2020-10-16 11:45   ` Peter Zijlstra
2020-10-16 12:55     ` Thomas Gleixner
2020-10-19  5:37       ` Ira Weiny
2020-10-19  9:32         ` Thomas Gleixner
2020-10-19 20:26           ` Ira Weiny
2020-10-19 21:12             ` Thomas Gleixner
2020-10-20 14:10               ` Ira Weiny
2020-10-09 19:42 ` [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
2020-10-13 18:52   ` Dave Hansen
2020-10-15  3:46     ` Ira Weiny
2020-10-15  4:06       ` Dave Hansen
2020-10-15  4:18         ` Ira Weiny
2020-10-09 19:42 ` [PATCH RFC V3 8/9] x86/fault: Report the PKRS state on fault ira.weiny
2020-10-13 18:56   ` Dave Hansen
2020-10-15  4:13     ` Ira Weiny
2020-10-09 19:42 ` [PATCH RFC V3 9/9] x86/pks: Add PKS test code ira.weiny
2020-10-13 19:02   ` Dave Hansen
2020-10-15  4:46     ` Ira Weiny
2020-10-09 20:18 ` [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3 Ira Weiny

Linux-kselftest Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-kselftest/0 linux-kselftest/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-kselftest linux-kselftest/ https://lore.kernel.org/linux-kselftest \
		linux-kselftest@vger.kernel.org
	public-inbox-index linux-kselftest

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kselftest


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git