* [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support
@ 2020-07-17  7:20 ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 01/17] x86/pkeys: Create pkeys_internal.h ira.weiny
                   ` (17 more replies)
  0 siblings, 18 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

This RFC series has been reviewed by Dave Hansen.

Changes from RFC:
	Clean up commit messages based on Peter Zijlstra's and Dave Hansen's
		feedback
	Fix static branch anti-pattern
	New patch:
	(memremap: Convert devmap static branch to {inc,dec})
		This was the code I used as a model for my static branch, which
		I now believe was the wrong pattern.
	New Patch:
	(x86/entry: Preserve PKRS MSR through exceptions)
		This attempts to preserve the per-logical-processor MSR, and the
		reference counting, across exceptions.  I'd really like feedback
		on this because I _think_ it should work, but I'm afraid I'm
		missing something as my testing has shown a lot of spotty
		crashes which don't make sense to me.

This patch set introduces a new page protection mechanism for supervisor pages,
Protection Keys Supervisor (PKS), and an initial user of it, persistent memory
(PMEM).

PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections.  Supervisor
pkeys work in a similar fashion to user space pkeys: like user page pkeys
(PKU), supervisor pkeys are checked in addition to the normal paging
protections, and access or writes can be disabled via an MSR update, without
TLB flushes, when permissions change.  A page mapping is assigned to a domain
by setting a pkey in the page table entry.

Unlike user pkeys, no new instructions are added; rather, WRMSR/RDMSR are used
to update the PKRS register.

XSAVE is not supported for the PKRS MSR.  To reduce software complexity the
implementation saves/restores the MSR across context switches but not during
irqs.  This is a compromise which results in a hardening against unwanted
access without an absolute restriction.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

The other keys (1-15) are allocated by an allocator, which prepares us for key
contention from day one.  Kernel users should be prepared for the allocator to
fail, either because of key exhaustion or because PKS is not supported on the
architecture and/or CPU instance.
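
A minimal sketch of the expected calling convention (pks_key_alloc() and
pks_key_free() are introduced later in this series; the "foo" name and the
fallback handling here are illustrative):

	static int foo_init(void)
	{
		int pkey = pks_key_alloc("foo");

		if (pkey < 0)
			return 0;	/* no PKS or keys exhausted: run unprotected */

		/* ... assign pkey to mappings, use it, and eventually ... */
		pks_key_free(pkey);
		return 0;
	}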

Protecting against stray writes is particularly important for PMEM because,
unlike writes to anonymous memory, writes to PMEM persist across a reboot.
Thus data corruption could result in permanent loss of data.

The following attributes of PKS make it well suited as a mechanism to protect
PMEM from stray access within the kernel:

   1) Fast switching of permissions
   2) Prevents access without page table manipulations
   3) Works on a per thread basis
   4) No TLB flushes required

The second half of this series thus uses the PKS mechanism to protect PMEM from
stray access.

PKS is available with 4 and 5 level paging.  Like PKRU, it takes 4 bits from
the PTE to store the pkey within the entry.


Implementation details
----------------------

Modifications of task struct in patches:
	(x86/pks: Preserve the PKRS MSR on context switch)
	(memremap: Add zone device access protection)

Because pkey access is per-thread, 2 modifications are made to the task struct.
The first is a saved copy of the MSR, preserved across context switches.  The
second is a reference count of access to the device domain, used to correctly
handle kmap nesting.
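
Roughly, the additions look like the following (saved_pkrs is added in the
context switch patch; the name of the reference count field here is just
illustrative):

	/* In x86's struct thread_struct: */
	u32		saved_pkrs;		/* per-task copy of the PKRS MSR */

	/* In struct task_struct: */
	unsigned int	dev_page_access_ref;	/* kmap nesting depth for the device domain */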


Maintain the PKS setting in a re-entrant manner in patches:
	(memremap: Add zone device access protection)
	(x86/entry: Preserve PKRS MSR through exceptions)

Using local_irq_save() seems to be the safest and fastest way to keep kmap
re-entrant, but there may be a better way.  spin_lock_irq() and atomic counters
were considered; however, atomic counters do not properly protect the pkey
update and spin_lock_irq() would deadlock.  Suggestions are welcome.
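
A rough sketch of the pattern described above (the function and field names
are illustrative, not the exact ones used later in the series;
pks_update_protection() is the API added in patch 5):

	static void dev_access_enable(void)
	{
		unsigned long flags;

		/* Keep the ref count and the pkey update atomic wrt re-entry */
		local_irq_save(flags);
		if (current->dev_page_access_ref++ == 0)
			pks_update_protection(dev_page_pkey, 0);
		local_irq_restore(flags);
	}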

Also, preserving the PKS state requires the exception handling code to store
the ref count during exception processing.  This seems like a layering
violation but it works.


The use of kmap in patch:
	(kmap: Add stray write protection for device pages)

To keep access to PMEM pages general, we piggy-back on the kmap() interface,
as there are many places in the kernel which do not have, nor should be
required to have, a priori knowledge that a page is PMEM.  The modifications to
the kmap code are careful to quickly determine which pages don't require
special handling, to reduce overhead for non-PMEM pages.
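
The shape of the kmap() hook is roughly the following (helper names are
illustrative; the real changes are in the "kmap: Add stray write protection
for device pages" patch):

	static inline void *kmap(struct page *page)
	{
		might_sleep();
		/* Cheap check which is false for all non device (non PMEM) pages */
		if (page_is_access_protected(page))
			dev_access_enable();
		return page_address(page);
	}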



Breakdown of patches
--------------------

Implement PKS within x86 arch:

	x86/pkeys: Create pkeys_internal.h
	x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
	x86/pks: Enable Protection Keys Supervisor (PKS)
	x86/pks: Preserve the PKRS MSR on context switch
	x86/pks: Add PKS kernel API
	x86/pks: Add a debugfs file for allocated PKS keys
	Documentation/pkeys: Update documentation for kernel pkeys
	x86/pks: Add PKS Test code

pre-req bug fixes for dax:

	fs/dax: Remove unused size parameter
	drivers/dax: Expand lock scope to cover the use of addresses

Add stray write protection to PMEM:

	memremap: Add zone device access protection
	kmap: Add stray write protection for device pages
	dax: Stray write protection for dax_direct_access()
	nvdimm/pmem: Stray write protection for pmem->virt_addr
	[dax|pmem]: Enable stray write protection


Fenghua Yu (4):
  x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  x86/pks: Enable Protection Keys Supervisor (PKS)
  x86/pks: Add PKS kernel API
  x86/pks: Add a debugfs file for allocated PKS keys

Ira Weiny (13):
  x86/pkeys: Create pkeys_internal.h
  x86/pks: Preserve the PKRS MSR on context switch
  Documentation/pkeys: Update documentation for kernel pkeys
  x86/pks: Add PKS Test code
  memremap: Convert devmap static branch to {inc,dec}
  fs/dax: Remove unused size parameter
  drivers/dax: Expand lock scope to cover the use of addresses
  memremap: Add zone device access protection
  kmap: Add stray write protection for device pages
  dax: Stray write protection for dax_direct_access()
  nvdimm/pmem: Stray write protection for pmem->virt_addr
  [dax|pmem]: Enable stray write protection
  x86/entry: Preserve PKRS MSR across exceptions

 Documentation/core-api/protection-keys.rst  |  81 +++-
 arch/x86/Kconfig                            |   1 +
 arch/x86/entry/common.c                     |  78 +++-
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/idtentry.h             |   2 +
 arch/x86/include/asm/msr-index.h            |   1 +
 arch/x86/include/asm/pgtable.h              |  13 +-
 arch/x86/include/asm/pgtable_types.h        |   4 +
 arch/x86/include/asm/pkeys.h                |  43 ++
 arch/x86/include/asm/pkeys_internal.h       |  36 ++
 arch/x86/include/asm/processor.h            |  13 +
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/cpu/common.c                |  17 +
 arch/x86/kernel/fpu/xstate.c                |  17 +-
 arch/x86/kernel/process.c                   |  34 ++
 arch/x86/mm/fault.c                         |  16 +-
 arch/x86/mm/pkeys.c                         | 174 +++++++-
 drivers/dax/device.c                        |   2 +
 drivers/dax/super.c                         |   5 +-
 drivers/nvdimm/pmem.c                       |   6 +
 fs/dax.c                                    |  13 +-
 include/linux/highmem.h                     |  32 +-
 include/linux/memremap.h                    |   1 +
 include/linux/mm.h                          |  33 ++
 include/linux/pkeys.h                       |  18 +
 include/linux/sched.h                       |   3 +
 init/init_task.c                            |   3 +
 kernel/fork.c                               |   3 +
 lib/Kconfig.debug                           |  12 +
 lib/Makefile                                |   3 +
 lib/pks/Makefile                            |   3 +
 lib/pks/pks_test.c                          | 452 ++++++++++++++++++++
 mm/Kconfig                                  |  15 +
 mm/memremap.c                               | 105 ++++-
 tools/testing/selftests/x86/Makefile        |   3 +-
 tools/testing/selftests/x86/test_pks.c      |  65 +++
 36 files changed, 1243 insertions(+), 67 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_internal.h
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH RFC V2 01/17] x86/pkeys: Create pkeys_internal.h
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support ira.weiny
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions.

Share code between them by creating a header with common defines, move
those defines into this header, change their names to reflect the new
use, and include the header where needed.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/include/asm/pgtable.h        | 13 ++++++-------
 arch/x86/include/asm/pkeys.h          |  2 ++
 arch/x86/include/asm/pkeys_internal.h | 11 +++++++++++
 arch/x86/include/asm/processor.h      |  1 +
 arch/x86/kernel/fpu/xstate.c          |  8 ++++----
 arch/x86/mm/pkeys.c                   | 14 ++++++--------
 6 files changed, 30 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_internal.h

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 76aa21e8128d..30e97fc8a683 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1364,9 +1364,7 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
-#define PKRU_AD_BIT 0x1
-#define PKRU_WD_BIT 0x2
-#define PKRU_BITS_PER_PKEY 2
+#include <asm/pkeys_internal.h>
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 extern u32 init_pkru_value;
@@ -1376,18 +1374,19 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+	return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
 	/*
 	 * Access-disable disables writes too so we need to check
 	 * both bits here.
 	 */
-	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+	return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
 }
 
 static inline u16 pte_flags_pkey(unsigned long pte_flags)
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 2ff9b98812b7..be8b3e448f76 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
+#include <asm/pkeys_internal.h>
+
 #define ARCH_DEFAULT_PKEY	0
 
 /*
diff --git a/arch/x86/include/asm/pkeys_internal.h b/arch/x86/include/asm/pkeys_internal.h
new file mode 100644
index 000000000000..a9f086f1e4b4
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_internal.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_INTERNAL_H
+#define _ASM_X86_PKEYS_INTERNAL_H
+
+#define PKR_AD_BIT 0x1
+#define PKR_WD_BIT 0x2
+#define PKR_BITS_PER_PKEY 2
+
+#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 03b7c4ca425a..7da9855b5068 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -27,6 +27,7 @@ struct vm86;
 #include <asm/unwind_hints.h>
 #include <asm/vmxfeatures.h>
 #include <asm/vdso/processor.h>
+#include <asm/pkeys_internal.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index bda2e5eaca0e..fc1ec2986e03 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -955,7 +955,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val)
 {
 	u32 old_pkru;
-	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+	int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
 	u32 new_pkru_bits = 0;
 
 	/*
@@ -974,16 +974,16 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 
 	/* Set the bits we need in PKRU:  */
 	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKRU_AD_BIT;
+		new_pkru_bits |= PKR_AD_BIT;
 	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKRU_WD_BIT;
+		new_pkru_bits |= PKR_WD_BIT;
 
 	/* Shift the bits in to the correct place in PKRU for pkey: */
 	new_pkru_bits <<= pkey_shift;
 
 	/* Get old PKRU and mask off any old bits in place: */
 	old_pkru = read_pkru();
-	old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
 
 	/* Write old part along with new part: */
 	write_pkru(old_pkru | new_pkru_bits);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 8873ed1438a9..f5efb4007e74 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -111,19 +111,17 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey
 	return vma_pkey(vma);
 }
 
-#define PKRU_AD_KEY(pkey)	(PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY))
-
 /*
  * Make the default PKRU value (at execve() time) as restrictive
  * as possible.  This ensures that any threads clone()'d early
  * in the process's lifetime will not accidentally get access
  * to data which is pkey-protected later on.
  */
-u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) |
-		      PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) |
-		      PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) |
-		      PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) |
-		      PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15);
+u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) |
+		      PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) |
+		      PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) |
+		      PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) |
+		      PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15);
 
 /*
  * Called from the FPU code when creating a fresh set of FPU
@@ -173,7 +171,7 @@ static ssize_t init_pkru_write_file(struct file *file,
 	 * up immediately if someone attempts to disable access
 	 * or writes to pkey 0.
 	 */
-	if (new_init_pkru & (PKRU_AD_BIT|PKRU_WD_BIT))
+	if (new_init_pkru & (PKR_AD_BIT|PKR_WD_BIT))
 		return -EINVAL;
 
 	WRITE_ONCE(init_pkru_value, new_init_pkru);
-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 01/17] x86/pkeys: Create pkeys_internal.h ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  8:54   ` Peter Zijlstra
  2020-07-17  7:20 ` [PATCH RFC V2 03/17] x86/pks: Enable Protection Keys Supervisor (PKS) ira.weiny
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Fenghua Yu, Ira Weiny, x86, Dave Hansen, Dan Williams,
	Vishal Verma, Andrew Morton, linux-doc, linux-kernel,
	linux-nvdimm, linux-fsdevel, linux-mm, linux-kselftest

From: Fenghua Yu <fenghua.yu@intel.com>

Define a helper, get_new_pkr(), which will be used to support both
Protection Keys User (PKU) and the new Protection Keys Supervisor (PKS)
in subsequent patches.
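
For illustration, a worked example of the bit arithmetic the helper performs
(values taken from the PKR_* defines added in the previous patch):

	/* get_new_pkr(old_pkr = 0, pkey = 2, init_val = PKEY_DISABLE_WRITE) */
	pkey_shift   = 2 * PKR_BITS_PER_PKEY;		/* = 4 */
	new_pkr_bits = PKR_WD_BIT << pkey_shift;	/* = 0x2 << 4 = 0x20 */
	old_pkr     &= ~((PKR_AD_BIT | PKR_WD_BIT) << pkey_shift);
	return old_pkr | new_pkr_bits;			/* = 0x20 -> WD=1 for key 2 */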

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/pkeys.h |  2 ++
 arch/x86/kernel/fpu/xstate.c | 17 +++--------------
 arch/x86/mm/pkeys.c          | 28 ++++++++++++++++++++++++++++
 3 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index be8b3e448f76..34cef29fed20 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -136,4 +136,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 	return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index fc1ec2986e03..1def71dc8105 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -954,9 +954,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val)
 {
-	u32 old_pkru;
-	int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
-	u32 new_pkru_bits = 0;
+	u32 old_pkru, new_pkru;
 
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
@@ -972,21 +970,12 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	 */
 	WARN_ON_ONCE(pkey >= arch_max_pkey());
 
-	/* Set the bits we need in PKRU:  */
-	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKR_AD_BIT;
-	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKR_WD_BIT;
-
-	/* Shift the bits in to the correct place in PKRU for pkey: */
-	new_pkru_bits <<= pkey_shift;
-
 	/* Get old PKRU and mask off any old bits in place: */
 	old_pkru = read_pkru();
-	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
+	new_pkru = get_new_pkr(old_pkru, pkey, init_val);
 
 	/* Write old part along with new part: */
-	write_pkru(old_pkru | new_pkru_bits);
+	write_pkru(new_pkru);
 
 	return 0;
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f5efb4007e74..a5c680d32930 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -208,3 +208,31 @@ static __init int setup_init_pkru(char *opt)
 	return 1;
 }
 __setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Get a new pkey register value from the user values specified.
+ *
+ * Kernel users use the same flags as user space:
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ */
+u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val)
+{
+	int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
+	u32 new_pkr_bits = 0;
+
+	/* Set the bits we need in the register:  */
+	if (init_val & PKEY_DISABLE_ACCESS)
+		new_pkr_bits |= PKR_AD_BIT;
+	if (init_val & PKEY_DISABLE_WRITE)
+		new_pkr_bits |= PKR_WD_BIT;
+
+	/* Shift the bits in to the correct place: */
+	new_pkr_bits <<= pkey_shift;
+
+	/* Mask off any old bits in place: */
+	old_pkr &= ~((PKR_AD_BIT | PKR_WD_BIT) << pkey_shift);
+
+	/* Return the old part along with the new part: */
+	return old_pkr | new_pkr_bits;
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH RFC V2 03/17] x86/pks: Enable Protection Keys Supervisor (PKS)
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 01/17] x86/pkeys: Create pkeys_internal.h ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Fenghua Yu, Ira Weiny, x86, Dave Hansen, Dan Williams,
	Vishal Verma, Andrew Morton, linux-doc, linux-kernel,
	linux-nvdimm, linux-fsdevel, linux-mm, linux-kselftest

From: Fenghua Yu <fenghua.yu@intel.com>

Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
specific, manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Kernel users can thus define 'domains' of page mappings which have an
extra level of protection beyond those specified in the supervisor page
table entries.

Define ARCH_HAS_SUPERVISOR_PKEYS to distinguish this functionality from
the existing ARCH_HAS_PKEYS and then enable PKS when configured and
indicated by the CPU instance.  While not strictly necessary for this
patch, ARCH_HAS_SUPERVISOR_PKEYS gates this functionality throughout the
rest of the series, so it is introduced here.

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/Kconfig                            |  1 +
 arch/x86/include/asm/cpufeatures.h          |  1 +
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/cpu/common.c                | 15 +++++++++++++++
 mm/Kconfig                                  |  2 ++
 5 files changed, 21 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0abf779..c3ecbed2cfa0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1872,6 +1872,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 	depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select ARCH_HAS_PKEYS
+	select ARCH_HAS_SUPERVISOR_PKEYS
 	help
 	  Memory Protection Keys provides a mechanism for enforcing
 	  page-based protections, but without requiring modification of the
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 02dabc9e77b0..a832ed8820c0 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -351,6 +351,7 @@
 #define X86_FEATURE_CLDEMOTE		(16*32+25) /* CLDEMOTE instruction */
 #define X86_FEATURE_MOVDIRI		(16*32+27) /* MOVDIRI instruction */
 #define X86_FEATURE_MOVDIR64B		(16*32+28) /* MOVDIR64B instruction */
+#define X86_FEATURE_PKS			(16*32+31) /* Protection Keys for Supervisor pages */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV	(17*32+ 0) /* MCA overflow recovery support */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT		24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS		_BITUL(X86_CR4_PKS_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 95c090a45b4b..f34bcefeda42 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1430,6 +1430,20 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c)
 #endif
 }
 
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ * Configure PKS if the cpu supports the feature.
+ */
+static void setup_pks(void)
+{
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
+		return;
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	cr4_set_bits(X86_CR4_PKS);
+}
+
 /*
  * This does the hard work of actually picking apart the CPU stuff...
  */
@@ -1521,6 +1535,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
 	x86_init_rdrand(c);
 	setup_pku(c);
+	setup_pks();
 
 	/*
 	 * Clear/Set all flags overridden by options, need do it
diff --git a/mm/Kconfig b/mm/Kconfig
index f2104cc0d35c..e541d2c0dcac 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config ARCH_HAS_SUPERVISOR_PKEYS
+	bool
 
 config PERCPU_STATS
 	bool "Collect percpu memory statistics"
-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (2 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 03/17] x86/pks: Enable Protection Keys Supervisor (PKS) ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  8:31   ` Peter Zijlstra
  2020-07-17  8:59   ` Peter Zijlstra
  2020-07-17  7:20 ` [PATCH RFC V2 05/17] x86/pks: Add PKS kernel API ira.weiny
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Vishal Verma, Andrew Morton, linux-doc, linux-kernel,
	linux-nvdimm, linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is defined as a per-logical-processor register.  This
isolates memory access by CPU.  Unfortunately, the MSR is not managed
by XSAVE.  Therefore, we must preserve the protections for individual
tasks even if they are context switched out and placed on another CPU
later.

Define a saved PKRS value in the task struct, as well as a cached
per-logical-processor MSR value which mirrors the MSR value of the
current CPU.  Initialize all tasks with the default MSR value.  Then, on
schedule-in, check the saved task MSR against the per-CPU value; if they
differ, write the MSR, otherwise skip the MSR write and avoid its
overhead.

Follow on patches will update the saved PKRS as well as the MSR if
needed.

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/include/asm/msr-index.h      |  1 +
 arch/x86/include/asm/pkeys_internal.h | 20 +++++++++++++++
 arch/x86/include/asm/processor.h      | 12 +++++++++
 arch/x86/kernel/cpu/common.c          |  2 ++
 arch/x86/kernel/process.c             | 35 +++++++++++++++++++++++++++
 arch/x86/mm/pkeys.c                   | 13 ++++++++++
 6 files changed, 83 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e8370e64a155..b6ffdfc3f388 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -727,6 +727,7 @@
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 
+#define MSR_IA32_PKRS			0x000006E1
 
 #define MSR_TSX_FORCE_ABORT		0x0000010F
 
diff --git a/arch/x86/include/asm/pkeys_internal.h b/arch/x86/include/asm/pkeys_internal.h
index a9f086f1e4b4..05257cdc7200 100644
--- a/arch/x86/include/asm/pkeys_internal.h
+++ b/arch/x86/include/asm/pkeys_internal.h
@@ -8,4 +8,24 @@
 
 #define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
 
+/*
+ * Define a default PKRS value for each task.
+ *
+ * Key 0 has no restriction.  All other keys are set to the most restrictive
+ * value which is access disabled (AD=1).
+ *
+ * NOTE: This needs to be a macro to be used as part of the INIT_THREAD macro.
+ */
+#define INIT_PKRS_VALUE (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \
+			 PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \
+			 PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \
+			 PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
+			 PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+void write_pkrs(u32 pkrs_val);
+#else
+static inline void write_pkrs(u32 pkrs_val) { }
+#endif
+
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7da9855b5068..704d9f28fd4e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -545,6 +545,11 @@ struct thread_struct {
 
 	unsigned int		sig_on_uaccess_err:1;
 
+#ifdef	CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+	/* Saved Protection key register for supervisor mappings */
+	u32			saved_pkrs;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
@@ -907,8 +912,15 @@ static inline void spin_lock_prefetch(const void *x)
 #define STACK_TOP		TASK_SIZE_LOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#define INIT_THREAD_PKRS	.saved_pkrs = INIT_PKRS_VALUE,
+#else
+#define INIT_THREAD_PKRS
+#endif
+
 #define INIT_THREAD  {						\
 	.addr_limit		= KERNEL_DS,			\
+	INIT_THREAD_PKRS					\
 }
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f34bcefeda42..b8241936cbbf 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -56,6 +56,7 @@
 #include <asm/intel-family.h>
 #include <asm/cpu_device_id.h>
 #include <asm/uv/uv.h>
+#include <asm/pkeys_internal.h>
 
 #include "cpu.h"
 
@@ -1442,6 +1443,7 @@ static void setup_pks(void)
 		return;
 
 	cr4_set_bits(X86_CR4_PKS);
+	write_pkrs(INIT_PKRS_VALUE);
 }
 
 /*
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f362ce0d5ac0..d69250a7c1bf 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -42,6 +42,7 @@
 #include <asm/spec-ctrl.h>
 #include <asm/io_bitmap.h>
 #include <asm/proto.h>
+#include <asm/pkeys_internal.h>
 
 #include "process.h"
 
@@ -184,6 +185,36 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	return ret;
 }
 
+/*
+ * NOTE: We wrap pks_init_task() and pks_sched_in() with
+ * CONFIG_ARCH_HAS_SUPERVISOR_PKEYS because using IS_ENABLED() fails
+ * due to the lack of task_struct->saved_pkrs in this configuration.
+ * Furthermore, we place them here because of the complexity introduced by
+ * header conflicts introduced to get the task_struct definition in the pkeys
+ * headers.
+ */
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+DECLARE_PER_CPU(u32, pkrs_cache);
+static inline void pks_init_task(struct task_struct *tsk)
+{
+	/* New tasks get the most restrictive PKRS value */
+	tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
+}
+static inline void pks_sched_in(void)
+{
+	u64 current_pkrs = current->thread.saved_pkrs;
+
+	/* Only update the MSR when current's pkrs is different from the MSR. */
+	if (this_cpu_read(pkrs_cache) == current_pkrs)
+		return;
+
+	write_pkrs(current_pkrs);
+}
+#else
+static inline void pks_init_task(struct task_struct *tsk) { }
+static inline void pks_sched_in(void) { }
+#endif
+
 void flush_thread(void)
 {
 	struct task_struct *tsk = current;
@@ -192,6 +223,8 @@ void flush_thread(void)
 	memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
 
 	fpu__clear_all(&tsk->thread.fpu);
+
+	pks_init_task(tsk);
 }
 
 void disable_TSC(void)
@@ -655,6 +688,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
 
 	if ((tifp ^ tifn) & _TIF_SLD)
 		switch_to_sld(tifn);
+
+	pks_sched_in();
 }
 
 /*
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index a5c680d32930..0f86f2374bd7 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -236,3 +236,16 @@ u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val)
 	/* Return the old part along with the new part: */
 	return old_pkr | new_pkr_bits;
 }
+
+DEFINE_PER_CPU(u32, pkrs_cache);
+
+/*
+ * Write the PKey Register Supervisor.  This must be run with preemption
+ * disabled as it does not guarantee the atomicity of updating the pkrs_cache
+ * and MSR on its own.
+ */
+void write_pkrs(u32 pkrs_val)
+{
+	this_cpu_write(pkrs_cache, pkrs_val);
+	wrmsrl(MSR_IA32_PKRS, pkrs_val);
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH RFC V2 05/17] x86/pks: Add PKS kernel API
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (3 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 06/17] x86/pks: Add a debugfs file for allocated PKS keys ira.weiny
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Fenghua Yu, Ira Weiny, x86, Dave Hansen, Dan Williams,
	Vishal Verma, Andrew Morton, linux-doc, linux-kernel,
	linux-nvdimm, linux-fsdevel, linux-mm, linux-kselftest

From: Fenghua Yu <fenghua.yu@intel.com>

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.

Add an API to allocate, use, and free a protection key which identifies
such a domain.

We export 2 new symbols, pks_key_alloc() and pks_key_free(), while
pks_update_protection() is exposed as an inline function via a header
file.

pks_key_alloc() reserves pkey 0 for default kernel pages.  The other 15
keys are dynamically allocated to allow better use of the limited key
space.

This, and the fact that PKS may not be available on all architectures,
means callers of the allocator _must_ be prepared for it to fail and take
appropriate action to run without their allocated domain.  This is not
anticipated to be a problem as these protections only serve to harden
memory and users should be no worse off than before the introduction of
PKS.

PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey) aid in setting page table
entry bits by kernel users.  Note these defines will be used in follow
on patches but are included here for a complete interface.

pks_update_protection() is inlined for performance and allows kernel
users to change the protections for the domain identified by the
specified pkey.  It is undefined behavior to call this on a pkey not
allocated by the allocator, and it will WARN_ON() if called on
architectures which do not support PKS.  (Again, callers are expected to
check the return of pks_key_alloc() before using this API further.)

It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
serializing but still maintains ordering properties similar to WRPKRU.
The current SDM section on PKRS needs updating but should be the same as
that of WRPKRU.  So to quote from the WRPKRU text:

	WRPKRU will never execute speculatively. Memory accesses
	affected by PKRU register will not execute (even speculatively)
	until all prior executions of WRPKRU have completed execution
	and updated the PKRU register.

Finally, pks_key_free() allows a user to return the key to the
allocator for use by others.

The interface maintains Access Disabled (AD=1) for all keys not
currently allocated.  Therefore, the user can depend on access being
disabled when pks_key_alloc() returns a key.
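
A minimal usage sketch based on the API above (error handling abbreviated;
the "my_driver" name and the vmap() of 'pages' are illustrative):

	int pkey = pks_key_alloc("my_driver");

	if (pkey < 0)
		return pkey;	/* be prepared to run without PKS */

	/* Map pages into the new domain; access starts disabled (AD=1) */
	addr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL_PKEY(pkey));

	/* Temporarily allow this thread to access the domain */
	pks_update_protection(pkey, 0);
	memcpy(addr, src, len);
	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);

	vunmap(addr);
	pks_key_free(pkey);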

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/pgtable_types.h  |  4 ++
 arch/x86/include/asm/pkeys.h          | 30 ++++++++++
 arch/x86/include/asm/pkeys_internal.h |  4 ++
 arch/x86/mm/pkeys.c                   | 79 +++++++++++++++++++++++++++
 include/linux/pkeys.h                 | 14 +++++
 5 files changed, 131 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 816b31c68550..2ab45ef89c7d 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -73,6 +73,8 @@
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0)
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
 #else
@@ -229,6 +231,8 @@ enum page_cache_mode {
 #define PAGE_KERNEL_IO		__pgprot_mask(__PAGE_KERNEL_IO)
 #define PAGE_KERNEL_IO_NOCACHE	__pgprot_mask(__PAGE_KERNEL_IO_NOCACHE)
 
+#define PAGE_KERNEL_PKEY(pkey)	__pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey))
+
 #endif	/* __ASSEMBLY__ */
 
 /*         xwr */
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 34cef29fed20..e30ea907abb6 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -138,4 +138,34 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 
 u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val);
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+int pks_key_alloc(const char *const pkey_user);
+void pks_key_free(int pkey);
+u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val);
+
+/*
+ * pks_update_protection - Update the protection of the specified key
+ *
+ * @pkey: Key for the domain to change
+ * @protection: protection bits to be used
+ *
+ * Protection utilizes the same protection bits specified for User pkeys
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ *
+ * This is not a global update.  It only affects the current running thread.
+ *
+ * It is undefined and a bug for users to call this without having allocated a
+ * pkey and using it as pkey here.
+ */
+static inline void pks_update_protection(int pkey, unsigned long protection)
+{
+	current->thread.saved_pkrs = get_new_pkr(current->thread.saved_pkrs,
+						 pkey, protection);
+	preempt_disable();
+	write_pkrs(current->thread.saved_pkrs);
+	preempt_enable();
+}
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/include/asm/pkeys_internal.h b/arch/x86/include/asm/pkeys_internal.h
index 05257cdc7200..e34f380c66d1 100644
--- a/arch/x86/include/asm/pkeys_internal.h
+++ b/arch/x86/include/asm/pkeys_internal.h
@@ -22,6 +22,10 @@
 			 PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
 			 PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
 
+/*  PKS supports 16 keys. Key 0 is reserved for the kernel. */
+#define        PKS_KERN_DEFAULT_KEY    0
+#define        PKS_NUM_KEYS            16
+
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
 void write_pkrs(u32 pkrs_val);
 #else
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 0f86f2374bd7..16f735c12fcd 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -3,6 +3,9 @@
  * Intel Memory Protection Keys management
  * Copyright (c) 2015, Intel Corporation.
  */
+#undef pr_fmt
+#define pr_fmt(fmt) "x86/pkeys: " fmt
+
 #include <linux/debugfs.h>		/* debugfs_create_u32()		*/
 #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
 #include <linux/pkeys.h>                /* PKEY_*                       */
@@ -249,3 +252,79 @@ void write_pkrs(u32 pkrs_val)
 	this_cpu_write(pkrs_cache, pkrs_val);
 	wrmsrl(MSR_IA32_PKRS, pkrs_val);
 }
+
+DEFINE_MUTEX(pks_lock);
+static const char pks_key_user0[] = "kernel";
+
+/* Store names of allocated keys for debug.  Key 0 is reserved for the kernel.  */
+static const char *pks_key_users[PKS_NUM_KEYS] = {
+	pks_key_user0
+};
+
+/*
+ * Each key is represented by a bit.  Bit 0 is set for key 0 and reserved for
+ * its use.  We use ulong for the bit operations but only 16 bits are used.
+ */
+static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY;
+
+/*
+ * pks_key_alloc - Allocate a PKS key
+ *
+ * @pkey_user: String stored for debugging of key exhaustion.  The caller is
+ * responsible to maintain this memory until pks_key_free().
+ */
+int pks_key_alloc(const char * const pkey_user)
+{
+	int nr, old_bit, pkey;
+
+	might_sleep();
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return -EINVAL;
+
+	mutex_lock(&pks_lock);
+	/* Find a free bit (0) in the bit map. */
+	old_bit = 1;
+	while (old_bit) {
+		nr = ffz(pks_key_allocation_map);
+		old_bit = __test_and_set_bit(nr, &pks_key_allocation_map);
+	}
+
+	if (nr < PKS_NUM_KEYS) {
+		pkey = nr;
+		/* for debugging key exhaustion */
+		pks_key_users[pkey] = pkey_user;
+	} else {
+		pkey = -ENOSPC;
+		pr_info("Cannot allocate supervisor key for %s.\n",
+			pkey_user);
+	}
+
+	mutex_unlock(&pks_lock);
+	return pkey;
+}
+EXPORT_SYMBOL_GPL(pks_key_alloc);
+
+/*
+ * pks_key_free - Free a previously allocate PKS key
+ *
+ * @pkey: Key to be free'ed
+ */
+void pks_key_free(int pkey)
+{
+	might_sleep();
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	if (pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY)
+		return;
+
+	mutex_lock(&pks_lock);
+	__clear_bit(pkey, &pks_key_allocation_map);
+	pks_key_users[pkey] = NULL;
+	/* Restore to default AD=1 and WD=0. */
+	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
+	mutex_unlock(&pks_lock);
+}
+EXPORT_SYMBOL_GPL(pks_key_free);
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 2955ba976048..e4bff77d7b49 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -50,4 +50,18 @@ static inline void copy_init_pkru_to_fpregs(void)
 
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
+#ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+static inline int pks_key_alloc(const char * const pkey_user)
+{
+	return -EINVAL;
+}
+static inline void pks_key_free(int pkey)
+{
+}
+static inline void pks_update_protection(int pkey, unsigned long protection)
+{
+	WARN_ON_ONCE(1);
+}
+#endif /* ! CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #endif /* _LINUX_PKEYS_H */
-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH RFC V2 06/17] x86/pks: Add a debugfs file for allocated PKS keys
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (4 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 05/17] x86/pks: Add PKS kernel API ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 07/17] Documentation/pkeys: Update documentation for kernel pkeys ira.weiny
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Ira Weiny, Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Fenghua Yu <fenghua.yu@intel.com>

The sysadmin may need to know which PKS keys are currently being used.

Add a debugfs file to show the allocated PKS keys and their names.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/mm/pkeys.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 16f735c12fcd..e565fadd74d7 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -328,3 +328,43 @@ void pks_key_free(int pkey)
 	mutex_unlock(&pks_lock);
 }
 EXPORT_SYMBOL_GPL(pks_key_free);
+
+static int pks_keys_allocated_show(struct seq_file *m, void *p)
+{
+	int i;
+
+	mutex_lock(&pks_lock);
+	for (i = PKS_KERN_DEFAULT_KEY; i < PKS_NUM_KEYS; i++) {
+		/* It is ok for pks_key_users[i] to be NULL */
+		if (test_bit(i, &pks_key_allocation_map))
+			seq_printf(m, "%d: %s\n", i, pks_key_users[i]);
+	}
+	mutex_unlock(&pks_lock);
+
+	return 0;
+}
+
+static int pks_keys_allocated_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, pks_keys_allocated_show, NULL);
+}
+
+static const struct file_operations pks_keys_allocated_fops = {
+	.open		= pks_keys_allocated_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init pks_keys_initcall(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_PKS)) {
+		/* Create a debugfs file to show allocated PKS keys. */
+		debugfs_create_file("pks_keys_allocated", 0400,
+				    arch_debugfs_dir, NULL,
+				    &pks_keys_allocated_fops);
+	}
+
+	return 0;
+}
+late_initcall(pks_keys_initcall);
-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH RFC V2 07/17] Documentation/pkeys: Update documentation for kernel pkeys
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (5 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 06/17] x86/pks: Add a debugfs file for allocated PKS keys ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 08/17] x86/pks: Add PKS Test code ira.weiny
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

Future Intel CPUs will support Protection Keys Supervisor (PKS).

Update the protection key documentation to cover pkeys on supervisor
pages.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/core-api/protection-keys.rst | 81 +++++++++++++++++-----
 1 file changed, 63 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..5ac400a5a306 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,33 @@
 Memory Protection Keys
 ======================
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
 Memory Protection Keys provides a mechanism for enforcing page-based
 protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+when an application changes protection domains.
+
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later.  It will be available in future non-server
+Intel parts and future AMD processors.
+
+Future Intel processors will support Protection Keys for Supervisor pages
+(PKS).
+
+For anyone wishing to test or use user space pkeys, it is available in Amazon's
+EC2 C5 instances and is known to work there using an Ubuntu 17.04 image.
+
+pkeys work by dedicating 4 previously reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.  User and Supervisor pages are
+treated separately.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+Protections for each page are controlled with per-CPU registers, one for each
+type of page (User and Supervisor).  Each of these 32-bit registers stores two
+separate bits (Access Disable and Write Disable) for each key.
+
+For Userspace the register is user-accessible (rdpkru/wrpkru).  For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being CPU registers, these are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register.  The feature is only available in 64-bit mode,
@@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs.  These
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-========
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+============================
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +109,37 @@ with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==========================
+
+PKS is intended to harden against unwanted access to kernel pages.  But it does
+not completely restrict access under all conditions.  For example the MSR
+setting is not saved/restored during irqs.  Thus the use of PKS is a mitigation
+strategy rather than a form of strict security.
+
+The following calls are used to allocate, use, and deallocate a pkey which
+defines a 'protection domain' within the kernel.  Setting a pkey value in a
+supervisor mapping adds that mapping to the protection domain.  Then calls can be
+used to enable/disable read and/or write access to all of the pages mapped with
+that key:
+
+        int pks_key_alloc(const char * const pkey_user);
+        #define PAGE_KERNEL_PKEY(pkey)
+        #define _PAGE_PKEY(pkey)
+        void pks_update_protection(int pkey, unsigned long protection);
+        void pks_key_free(int pkey);
+
+In-kernel users must be prepared to set PAGE_KERNEL_PKEY() permission in the
+page table entries for the mappings they want to protect.
+
+WARNING: It is imperative that callers check for errors from pks_key_alloc()
+because pkeys are a limited resource and so callers should be prepared to work
+without PKS support.
+
+For admins a debugfs interface provides a list of the current keys in use at:
+
+        /sys/kernel/debug/x86/pks_keys_allocated
+
+Some example code can be found in lib/pks/pks_test.c
-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH RFC V2 08/17] x86/pks: Add PKS Test code
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (6 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 07/17] Documentation/pkeys: Update documentation for kernel pkeys ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 09/17] memremap: Convert devmap static branch to {inc,dec} ira.weiny
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Vishal Verma, Andrew Morton, linux-doc, linux-kernel,
	linux-nvdimm, linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

The core PKS functionality provides an interface for kernel users to
reserve keys for their domains, set up page tables with those keys, and
control access to those domains when needed.

Define test code which exercises the core functionality of PKS via a
debugfs entry.  Basic checks can be triggered on boot with a kernel
command line option while both basic and preemption checks can be
triggered with separate debugfs values.

debugfs controls are:

'0' -- Run access tests with a single pkey
'1' -- Set up the pkey register with no access for the pkey allocated to
       this fd
'2' -- Check that the pkey register updated in '1' is still the same.
       (To be used after a forced context switch.)
'3' -- Allocate all pkeys possible and run tests on each pkey allocated.
       DEFAULT when run at boot.

Closing the fd will clean up and release the pkey; therefore, to fully
exercise context switch testing a user space program is provided in:

	.../tools/testing/selftests/x86/test_pks.c

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/include/asm/pkeys.h           |   9 +
 arch/x86/mm/fault.c                    |  16 +-
 include/linux/pkeys.h                  |   4 +
 lib/Kconfig.debug                      |  12 +
 lib/Makefile                           |   3 +
 lib/pks/Makefile                       |   3 +
 lib/pks/pks_test.c                     | 452 +++++++++++++++++++++++++
 tools/testing/selftests/x86/Makefile   |   3 +-
 tools/testing/selftests/x86/test_pks.c |  65 ++++
 9 files changed, 562 insertions(+), 5 deletions(-)
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index e30ea907abb6..097abca7784c 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -168,4 +168,13 @@ static inline void pks_update_protection(int pkey, unsigned long protection)
 }
 #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
 
+#if defined(CONFIG_PKS_TESTING)
+bool pks_test_armed_and_clear(void);
+#else
+static inline bool pks_test_armed_and_clear(void)
+{
+	return false;
+}
+#endif
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5e41949453cc..1384190a900c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/pkeys.h>
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1105,11 +1106,18 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		   unsigned long address)
 {
 	/*
-	 * Protection keys exceptions only happen on user pages.  We
-	 * have no user pages in the kernel portion of the address
-	 * space, so do not expect them here.
+	 * If we get a protection key exception it could be because we are
+	 * running the PKS test.  If so, pks_test_armed_and_clear() will clear
+	 * the protection mechanism and we can safely return.
+	 *
+	 * Otherwise we warn the user that something has gone wrong and
+	 * continue with the fault.
 	 */
-	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
+	if (hw_error_code & X86_PF_PK) {
+		if (pks_test_armed_and_clear())
+			return;
+		WARN_ON_ONCE(hw_error_code & X86_PF_PK);
+	}
 
 	/* Was the fault spurious, caused by lazy TLB invalidation? */
 	if (spurious_kernel_fault(hw_error_code, address))
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index e4bff77d7b49..1d84ab7c12d4 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -48,6 +48,10 @@ static inline void copy_init_pkru_to_fpregs(void)
 {
 }
 
+static inline bool pks_test_armed_and_clear(void)
+{
+	return false;
+}
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 #ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9ad9210d70a1..aa876ebb4c8b 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2329,6 +2329,18 @@ config HYPERV_TESTING
 	help
 	  Select this option to enable Hyper-V vmbus testing.
 
+config PKS_TESTING
+	bool "PKey(S)upervisor testing"
+	default n
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+	help
+	  Select this option to enable testing of PKS core software and
+	  hardware.  The PKS core provides a mechanism to allocate keys as well
+	  as maintain the protection settings across context switches.
+	  Answer N if you don't know what supervisor keys are.
+
+	  If unsure, say N.
+
 endmenu # "Kernel Testing and Coverage"
 
 endmenu # Kernel hacking
diff --git a/lib/Makefile b/lib/Makefile
index b1c42c10073b..667dea28cf7b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -318,3 +318,6 @@ obj-$(CONFIG_OBJAGG) += objagg.o
 # KUnit tests
 obj-$(CONFIG_LIST_KUNIT_TEST) += list-test.o
 obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o
+
+# PKS test
+obj-y += pks/
diff --git a/lib/pks/Makefile b/lib/pks/Makefile
new file mode 100644
index 000000000000..7d1df7563db9
--- /dev/null
+++ b/lib/pks/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_PKS_TESTING) += pks_test.o
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
new file mode 100644
index 000000000000..6d8172734f97
--- /dev/null
+++ b/lib/pks/pks_test.c
@@ -0,0 +1,452 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2020 Intel Corporation. All rights reserved.
+ *
+ * Implement PKS testing
+ * Access to run this test can be with a command line parameter
+ * ("pks-test-on-boot") or more detailed tests can be triggered through:
+ *
+ *    /sys/kernel/debug/x86/run_pks
+ *
+ *  debugfs controls are:
+ *
+ *  '0' -- Run access tests with a single pkey
+ *
+ *  '1' -- Set up the pkey register with no access for the pkey allocated to
+ *         this fd
+ *  '2' -- Check that the pkey register updated in '1' is still the same.  (To
+ *         be used after a forced context switch.)
+ *
+ *  '3' -- Allocate all pkeys possible and run tests on each pkey allocated.
+ *         DEFAULT when run at boot.
+ *
+ *  Closing the fd will cleanup and release the pkey.
+ *
+ *  A companion user space program is provided in:
+ *
+ *          .../tools/testing/selftests/x86/test_pks.c
+ *
+ *  which will better test the context switching.
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/pgtable.h>
+
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/pkeys.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+
+#define PKS_TEST_MEM_SIZE (PAGE_SIZE)
+
+/*
+ * run_on_boot default '= false' which checkpatch complains about initializing;
+ * so we don't
+ */
+static bool run_on_boot;
+static struct dentry *pks_test_dentry;
+
+/*
+ * We must lock the following globals for brief periods while the fault handler
+ * checks/updates them.
+ */
+static DEFINE_SPINLOCK(test_lock);
+static int test_armed_key;
+static unsigned long prev_cnt;
+static unsigned long fault_cnt;
+
+struct pks_test_ctx {
+	bool pass;
+	bool pks_cpu_enabled;
+	int pkey;
+	char data[64];
+};
+
+static pte_t *walk_table(void *ptr)
+{
+	struct page *page = NULL;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ret = NULL;
+
+	pgdp = pgd_offset_k((unsigned long)ptr);
+	if (pgd_none(*pgdp) || pgd_bad(*pgdp))
+		goto error;
+
+	p4dp = p4d_offset(pgdp, (unsigned long)ptr);
+	if (p4d_none(*p4dp) || p4d_bad(*p4dp))
+		goto error;
+
+	pudp = pud_offset(p4dp, (unsigned long)ptr);
+	if (pud_none(*pudp) || pud_bad(*pudp))
+		goto error;
+
+	pmdp = pmd_offset(pudp, (unsigned long)ptr);
+	if (pmd_none(*pmdp) || pmd_bad(*pmdp))
+		goto error;
+
+	ret = pte_offset_map(pmdp, (unsigned long)ptr);
+	if (pte_present(*ret)) {
+		page = pte_page(*ret);
+		if (!page) {
+			pte_unmap(ret);
+			goto error;
+		}
+		pr_info("page 0x%lx; flags 0x%lx\n",
+		       (unsigned long)page, page->flags);
+	}
+
+error:
+	return ret;
+}
+
+/**
+ * pks_test_armed_and_clear() is exported so that the fault handler can detect
+ * and report back status of intentional faults.
+ *
+ * NOTE: It clears the protection key from the page such that the fault handler
+ * will not re-trigger.
+ */
+bool pks_test_armed_and_clear(void)
+{
+	bool armed = (test_armed_key != 0);
+
+	if (armed) {
+		/* Enable read and write to stop faults */
+		pks_update_protection(test_armed_key, 0);
+		fault_cnt++;
+	}
+
+	return armed;
+}
+EXPORT_SYMBOL(pks_test_armed_and_clear);
+
+static bool exception_caught(void)
+{
+	bool ret = (fault_cnt != prev_cnt);
+
+	prev_cnt = fault_cnt;
+	return ret;
+}
+
+static void report_pkey_settings(void *unused)
+{
+	u8 pkey;
+	unsigned long long msr = 0;
+	unsigned int cpu = smp_processor_id();
+
+	rdmsrl(MSR_IA32_PKRS, msr);
+
+	pr_info("for CPU %d : 0x%llx\n", cpu, msr);
+	for (pkey = 0; pkey < PKS_NUM_KEYS; pkey++) {
+		int ad, wd;
+
+		ad = (msr >> (pkey * PKR_BITS_PER_PKEY)) & PKEY_DISABLE_ACCESS;
+		wd = (msr >> (pkey * PKR_BITS_PER_PKEY)) & PKEY_DISABLE_WRITE;
+		pr_info("   %u: A:%d W:%d\n", pkey, ad, wd);
+	}
+}
+
+struct pks_access_test {
+	int ad;
+	int wd;
+	bool write;
+	bool exception;
+};
+
+static struct pks_access_test pkey_test_ary[] = {
+	/*  disable both */
+	{ PKEY_DISABLE_ACCESS, PKEY_DISABLE_WRITE,   true,  true },
+	{ PKEY_DISABLE_ACCESS, PKEY_DISABLE_WRITE,   false, true },
+
+	/*  enable both */
+	{ 0, 0, true,  false },
+	{ 0, 0, false, false },
+
+	/*  enable read only */
+	{ 0, PKEY_DISABLE_WRITE,  true,  true },
+	{ 0, PKEY_DISABLE_WRITE,  false, false },
+};
+
+static int run_access_test(struct pks_test_ctx *ctx,
+			   struct pks_access_test *test,
+			   void *ptr)
+{
+	int ret = 0;
+	bool exception;
+
+	pks_update_protection(ctx->pkey, test->ad | test->wd);
+
+	spin_lock(&test_lock);
+	test_armed_key = ctx->pkey;
+
+	if (test->write)
+		memcpy(ptr, ctx->data, 8);
+	else
+		memcpy(ctx->data, ptr, 8);
+
+	exception = exception_caught();
+
+	test_armed_key = 0;
+	spin_unlock(&test_lock);
+
+	if (test->exception != exception) {
+		pr_err("pkey test FAILED: ad %d; wd %d; write %s; exception %s != %s\n",
+			test->ad, test->wd,
+			test->write ? "TRUE" : "FALSE",
+			test->exception ? "TRUE" : "FALSE",
+			exception ? "TRUE" : "FALSE");
+		ret = -EFAULT;
+	}
+
+	return ret;
+}
+
+static void test_mem_access(struct pks_test_ctx *ctx)
+{
+	int i, rc;
+	u8 pkey;
+	void *ptr = NULL;
+	pte_t *ptep;
+
+	ptr = __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START, VMALLOC_END,
+				   GFP_KERNEL, PAGE_KERNEL_PKEY(ctx->pkey),
+				   0, NUMA_NO_NODE, __builtin_return_address(0));
+	if (!ptr) {
+		pr_err("Failed to vmalloc page???\n");
+		ctx->pass = false;
+		return;
+	}
+
+	ptep = walk_table(ptr);
+	if (!ptep) {
+		pr_err("Failed to walk table???\n");
+		ctx->pass = false;
+		goto done;
+	}
+
+	pkey = pte_flags_pkey(ptep->pte);
+	pr_info("ptep flags 0x%lx pkey %u\n",
+	       (unsigned long)ptep->pte, pkey);
+
+	if (pkey != ctx->pkey) {
+		pr_err("invalid pkey found: %u, test_pkey: %u\n",
+			pkey, ctx->pkey);
+		ctx->pass = false;
+		goto unmap;
+	}
+
+	if (!ctx->pks_cpu_enabled) {
+		pr_err("not CPU enabled; skipping access tests...\n");
+		ctx->pass = true;
+		goto unmap;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(pkey_test_ary); i++) {
+		rc = run_access_test(ctx, &pkey_test_ary[i], ptr);
+
+		/*  only save last error is fine */
+		if (rc)
+			ctx->pass = false;
+	}
+
+unmap:
+	pte_unmap(ptep);
+done:
+	vfree(ptr);
+}
+
+static void pks_run_test(struct pks_test_ctx *ctx)
+{
+	ctx->pass = true;
+
+	pr_info("\n");
+	pr_info("\n");
+	pr_info("     ***** BEGIN: Testing (CPU enabled : %s) *****\n",
+		ctx->pks_cpu_enabled ? "TRUE" : "FALSE");
+
+	if (ctx->pks_cpu_enabled)
+		on_each_cpu(report_pkey_settings, NULL, 1);
+
+	pr_info("           BEGIN: pkey %d Testing\n", ctx->pkey);
+	test_mem_access(ctx);
+	pr_info("           END: PAGE_KERNEL_PKEY Testing : %s\n",
+		ctx->pass ? "PASS" : "FAIL");
+
+	pr_info("     ***** END: Testing *****\n");
+	pr_info("\n");
+	pr_info("\n");
+}
+
+static ssize_t pks_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	struct pks_test_ctx *ctx = file->private_data;
+	char buf[32];
+	unsigned int len;
+
+	if (!ctx)
+		len = sprintf(buf, "not run\n");
+	else
+		len = sprintf(buf, "%s\n", ctx->pass ? "PASS" : "FAIL");
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static struct pks_test_ctx *alloc_ctx(const char *name)
+{
+	struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+
+	if (!ctx) {
+		pr_err("Failed to allocate memory for test context\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pkey = pks_key_alloc(name);
+	if (ctx->pkey <= 0) {
+		pr_err("Failed to allocate memory for test context\n");
+		kfree(ctx);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pks_cpu_enabled = cpu_feature_enabled(X86_FEATURE_PKS);
+	sprintf(ctx->data, "%s", "DEADBEEF");
+	return ctx;
+}
+
+static void free_ctx(struct pks_test_ctx *ctx)
+{
+	pks_key_free(ctx->pkey);
+	kfree(ctx);
+}
+
+static void run_all(void)
+{
+	struct pks_test_ctx *ctx[PKS_NUM_KEYS];
+	static char name[PKS_NUM_KEYS][64];
+	int i;
+
+	for (i = 1; i < PKS_NUM_KEYS; i++) {
+		sprintf(name[i], "pks ctx %d", i);
+		ctx[i] = alloc_ctx((const char *)name[i]);
+	}
+
+	for (i = 1; i < PKS_NUM_KEYS; i++) {
+		if (!IS_ERR(ctx[i]))
+			pks_run_test(ctx[i]);
+	}
+
+	for (i = 1; i < PKS_NUM_KEYS; i++) {
+		if (!IS_ERR(ctx[i]))
+			free_ctx(ctx[i]);
+	}
+}
+
+static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
+			      size_t count, loff_t *ppos)
+{
+	char buf[2];
+	struct pks_test_ctx *ctx = file->private_data;
+
+	if (copy_from_user(buf, user_buf, 1))
+		return -EFAULT;
+	buf[1] = '\0';
+
+	/*
+	 * Test "3" will test allocating all keys. Do it first without
+	 * using "ctx".
+	 */
+	if (!strcmp(buf, "3"))
+		run_all();
+
+	if (!ctx) {
+		ctx = alloc_ctx("pks test");
+		if (IS_ERR(ctx))
+			return -ENOMEM;
+		file->private_data = ctx;
+	}
+
+	if (!strcmp(buf, "0"))
+		pks_run_test(ctx);
+
+	/* start of context switch test */
+	if (!strcmp(buf, "1")) {
+		/* Ensure a known state to test context switch */
+		pks_update_protection(ctx->pkey,
+				      PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
+	}
+
+	/* After context switch msr should be restored */
+	if (!strcmp(buf, "2") && ctx->pks_cpu_enabled) {
+		unsigned long reg_pkrs;
+		int access;
+
+		rdmsrl(MSR_IA32_PKRS, reg_pkrs);
+
+		access = (reg_pkrs >> (ctx->pkey * PKR_BITS_PER_PKEY)) &
+			  PKEY_ACCESS_MASK;
+		if (access != (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)) {
+			ctx->pass = false;
+			pr_err("Context switch check failed\n");
+		}
+	}
+
+	return count;
+}
+
+static int pks_release_file(struct inode *inode, struct file *file)
+{
+	struct pks_test_ctx *ctx = file->private_data;
+
+	if (!ctx)
+		return 0;
+
+	free_ctx(ctx);
+	return 0;
+}
+
+static const struct file_operations fops_init_pks = {
+	.read = pks_read_file,
+	.write = pks_write_file,
+	.llseek = default_llseek,
+	.release = pks_release_file,
+};
+
+static int __init parse_pks_test_options(char *str)
+{
+	run_on_boot = true;
+
+	return 0;
+}
+early_param("pks-test-on-boot", parse_pks_test_options);
+
+static int __init pks_test_init(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_PKS)) {
+		if (run_on_boot)
+			run_all();
+
+		pks_test_dentry = debugfs_create_file("run_pks", 0600, arch_debugfs_dir,
+						      NULL, &fops_init_pks);
+	}
+
+	return 0;
+}
+late_initcall(pks_test_init);
+
+static void __exit pks_test_exit(void)
+{
+	debugfs_remove(pks_test_dentry);
+	pr_info("test exit\n");
+}
+module_exit(pks_test_exit);
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index d2796ea98c5a..3572dfb25c0a 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -13,7 +13,8 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
 			check_initial_reg_state sigreturn iopl ioperm \
 			test_vdso test_vsyscall mov_ss_trap \
-			syscall_arg_fault
+			syscall_arg_fault test_pks
+
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
new file mode 100644
index 000000000000..8037a2a9ff5f
--- /dev/null
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <assert.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+int main(void)
+{
+	cpu_set_t cpuset;
+	char result[32];
+	pid_t pid;
+	int fd;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	/* Two processes run on CPU 0 so that they go through context switch. */
+	sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
+
+	pid = fork();
+	if (pid == 0) {
+		fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
+		if (fd < 0) {
+			printf("cannot open file\n");
+			return -1;
+		}
+
+		/* Allocate test_pkey1 and run test. */
+		write(fd, "0", 1);
+
+		/* Arm for context switch test */
+		write(fd, "1", 1);
+
+		/* Context switch out... */
+		sleep(4);
+
+		/* Check msr restored */
+		write(fd, "2", 1);
+	} else {
+		sleep(2);
+
+		fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
+		if (fd < 0) {
+			printf("cannot open file\n");
+			return -1;
+		}
+
+		/* run test with alternate pkey */
+		write(fd, "0", 1);
+	}
+
+	read(fd, result, 10);
+	printf("#PF, context switch, pkey allocation and free tests: %s\n",
+	       result);
+
+	close(fd);
+
+	return 0;
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH RFC V2 09/17] memremap: Convert devmap static branch to {inc,dec}
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (7 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 08/17] x86/pks: Add PKS Test code ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 10/17] fs/dax: Remove unused size parameter ira.weiny
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

While reviewing Protection Key Supervisor support it was pointed out
that using a counter to track static branch enable was an anti-pattern
which was better solved using the provided static_branch_{inc,dec}
functions.[1]

Fix up devmap_managed_key to work the same way.  This is also safer:
the old code had a very small (very unlikely) race when multiple
callers tried to enable at the same time, which static_branch_inc()
avoids.

[1] https://lore.kernel.org/lkml/20200714194031.GI5523@worktop.programming.kicks-ass.net/
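
Roughly, and purely as an illustration (this sketch is not part of the
patch -- the actual change is the diff below, and old_style_get() /
new_style_get() are made-up names), the difference between the two
patterns is:

static atomic_t enable_count;

static void old_style_get(void)
{
	/*
	 * Old pattern: there is a window between the atomic_inc_return()
	 * and the static_branch_enable() in which a second caller can see
	 * a count > 1 and carry on while the key is still disabled.
	 */
	if (atomic_inc_return(&enable_count) == 1)
		static_branch_enable(&devmap_managed_key);
}

static void new_style_get(void)
{
	/* New pattern: the static key keeps its own reference count. */
	static_branch_inc(&devmap_managed_key);
}

static void new_style_put(void)
{
	static_branch_dec(&devmap_managed_key);
}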

To: Dan Williams <dan.j.williams@intel.com>
To: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 mm/memremap.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 03e38b7a38f1..9fb9ad500e78 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -40,12 +40,10 @@ EXPORT_SYMBOL_GPL(memremap_compat_align);
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
 EXPORT_SYMBOL(devmap_managed_key);
-static atomic_t devmap_managed_enable;
 
 static void devmap_managed_enable_put(void)
 {
-	if (atomic_dec_and_test(&devmap_managed_enable))
-		static_branch_disable(&devmap_managed_key);
+	static_branch_dec(&devmap_managed_key);
 }
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
@@ -56,8 +54,7 @@ static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
 		return -EINVAL;
 	}
 
-	if (atomic_inc_return(&devmap_managed_enable) == 1)
-		static_branch_enable(&devmap_managed_key);
+	static_branch_inc(&devmap_managed_key);
 	return 0;
 }
 #else
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH RFC V2 10/17] fs/dax: Remove unused size parameter
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (8 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 09/17] memremap: Convert devmap static branch to {inc,dec} ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 11/17] drivers/dax: Expand lock scope to cover the use of addresses ira.weiny
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, Ben Widawsky, Dan Williams, x86, Dave Hansen,
	Vishal Verma, Andrew Morton, Fenghua Yu, linux-doc, linux-kernel,
	linux-nvdimm, linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

Passing size to copy_user_dax() implies it can copy variable sizes of
data when in fact it calls copy_user_page(), which copies exactly one
page.

We are safe because the only caller uses PAGE_SIZE anyway so just remove
the variable for clarity.

While we are at it, change copy_user_dax() to copy_cow_page_dax() to
make it clear that it is a single-purpose helper for this one case, not
an implementation of what dax_iomap_actor() does.

Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/dax.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..3e0babeb0365 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -680,21 +680,20 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 	return __dax_invalidate_entry(mapping, index, false);
 }
 
-static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
-		sector_t sector, size_t size, struct page *to,
-		unsigned long vaddr)
+static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_dev,
+			     sector_t sector, struct page *to, unsigned long vaddr)
 {
 	void *vto, *kaddr;
 	pgoff_t pgoff;
 	long rc;
 	int id;
 
-	rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
 	if (rc)
 		return rc;
 
 	id = dax_read_lock();
-	rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, NULL);
+	rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(PAGE_SIZE), &kaddr, NULL);
 	if (rc < 0) {
 		dax_read_unlock(id);
 		return rc;
@@ -1305,8 +1304,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			clear_user_highpage(vmf->cow_page, vaddr);
 			break;
 		case IOMAP_MAPPED:
-			error = copy_user_dax(iomap.bdev, iomap.dax_dev,
-					sector, PAGE_SIZE, vmf->cow_page, vaddr);
+			error = copy_cow_page_dax(iomap.bdev, iomap.dax_dev,
+						  sector, vmf->cow_page, vaddr);
 			break;
 		default:
 			WARN_ON_ONCE(1);
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH RFC V2 11/17] drivers/dax: Expand lock scope to cover the use of addresses
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (9 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 10/17] fs/dax: Remove unused size parameter ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 12/17] memremap: Add zone device access protection ira.weiny
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, Dan Williams, x86, Dave Hansen, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

The addition of PKS protection to dax read lock/unlock will require that
the address returned by dax_direct_access() be protected by this lock.

While not technically necessary for this series, this corrects the
locking by ensuring that the use of kaddr and end_kaddr are covered by
the dax read lock/unlock.

Change the lock scope to cover the kaddr and end_kaddr use.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/dax/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 8e32345be0f7..021739768093 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -103,11 +103,11 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
 	id = dax_read_lock();
 	len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
 	len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
-	dax_read_unlock(id);
 
 	if (len < 1 || len2 < 1) {
 		pr_debug("%s: error: dax access failed (%ld)\n",
 				bdevname(bdev, buf), len < 1 ? len : len2);
+		dax_read_unlock(id);
 		return false;
 	}
 
@@ -137,6 +137,7 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
 		put_dev_pagemap(end_pgmap);
 
 	}
+	dax_read_unlock(id);
 
 	if (!dax_enabled) {
 		pr_debug("%s: error: dax support not enabled\n",
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH RFC V2 12/17] memremap: Add zone device access protection
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (10 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 11/17] drivers/dax: Expand lock scope to cover the use of addresses ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  9:10   ` Peter Zijlstra
                     ` (2 more replies)
  2020-07-17  7:20 ` [PATCH RFC V2 13/17] kmap: Add stray write protection for device pages ira.weiny
                   ` (5 subsequent siblings)
  17 siblings, 3 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

Device managed memory exposes itself to the kernel direct map which
allows stray pointers to access these device memories.

Stray pointers to normal memory may result in a crash or other
undesirable behavior which, while unfortunate, is usually recoverable
with a reboot.  Stray writes to areas such as non-volatile memory are
permanent in nature and thus more likely to result in permanent user
data loss than a stray write to other memory areas.

Set up an infrastructure for extra device access protection. Then
implement the new protection using the new Protection Keys Supervisor
(PKS) on architectures which support it.

To enable this extra protection devices specify a flag in the pgmap to
indicate that these areas wish to use additional protection.

Kernel code which intends to access this memory can do so
automatically through the kmap infrastructure, which calls into
dev_access_[enable|disable]() described here.  The kmap support is
implemented in a follow-on patch.

In addition, users can directly enable/disable the access through
dev_access_[enable|disable]() if they have a priori knowledge of the
type of pages they are accessing.

All calls to enable/disable protection flow through
dev_access_[enable|disable]() and are nestable through the use of a
per-task reference count.  This reference count does two things.

1) Allows a thread to nest calls to disable protection such that the
   first call to re-enable protection does not 'break' the last access of
   the pmem device memory.

2) Provides faster performance by avoiding lots of MSR writes.  For
   example, looping over a sequence of pmem pages.

IRQ context borrows the reference count of the interrupted task.  This
is a trade off vs saving/restoring on interrupt entry/exit.  The
following example shows how this works:

...
	// ref == 0
        dev_access_enable()  // ref += 1 ==> disable protection
                irq()
                        dev_access_enable()   // ref += 1 ==> 2
                        dev_access_disable()  // ref -= 1 ==> 1
        do_pmem_thing()  // all good here
        dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
...

While this does leave some openings for stray writes during IRQs, the
overall protection is much stronger after this patch, and implementing
save/restore on interrupt entry/exit would have been a much more
complicated implementation.  So we compromise.

The pkey value is never freed, as this too optimizes the
implementation: the fast paths are either on or off via the static
branch conditional.
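
As a rough usage sketch (illustrative only; copy_to_protected_pmem(),
pmem_addr and len are made up here -- the real callers are added by the
kmap, dax and pmem patches later in this series):

static void copy_to_protected_pmem(void *pmem_addr, const void *src, size_t len)
{
	dev_access_enable();		/* ref 0 -> 1: pkey access is opened */
	memcpy(pmem_addr, src, len);
	dev_access_disable();		/* ref 1 -> 0: protection re-established */
}

Nested enable/disable pairs only move the reference count; the MSR is
written only on the 0 <-> 1 transitions.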

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/memremap.h |  1 +
 include/linux/mm.h       | 33 ++++++++++++++
 include/linux/sched.h    |  3 ++
 init/init_task.c         |  3 ++
 kernel/fork.c            |  3 ++
 mm/Kconfig               | 13 ++++++
 mm/memremap.c            | 98 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 154 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 5f5b2df06e61..87a9772b1aa7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -90,6 +90,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROT_ENABLED	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..99d0914e26f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1123,6 +1123,39 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_protection_static_key);
+
+/*
+ * We make page_is_access_protected() as quick as possible.
+ *    1) If no mappings have been enabled with extra protection we skip this
+ *       entirely
+ *    2) Skip pages which are not ZONE_DEVICE
+ *    3) Only then check if this particular page was mapped with extra
+ *       protections.
+ */
+static inline bool page_is_access_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROT_ENABLED)
+		return true;
+	return false;
+}
+
+void dev_access_enable(void);
+void dev_access_disable(void);
+#else
+static inline bool page_is_access_protected(struct page *page)
+{
+	return false;
+}
+static inline void dev_access_enable(void) { }
+static inline void dev_access_disable(void) { }
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 692e327d7455..2a8dbbb371ee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1313,6 +1313,9 @@ struct task_struct {
 	struct callback_head		mce_kill_me;
 #endif
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int			dev_page_access_ref;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index 15089d15010a..17766b059606 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -204,6 +204,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECURITY
 	.security	= NULL,
 #endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	.dev_page_access_ref = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index efc5493203ae..a6c14b962a27 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -957,6 +957,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	tsk->dev_page_access_ref = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index e541d2c0dcac..f6029f3c2c89 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -798,6 +798,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_DEVICE_ACCESS_PROTECTION
+	bool "Device memory access protection"
+	depends on ZONE_DEVICE
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+
+	help
+	  Enable the option of having access protections on device memory
+	  areas.  This protects against access to device memory which is not
+	  intended such as stray writes.  This feature is particularly useful
+	  to protect against corruption of persistent memory.
+
+	  If in doubt, say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 9fb9ad500e78..513d9c3a48de 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,12 +6,16 @@
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
 #include <linux/pfn_t.h>
+#include <linux/pkeys.h>
 #include <linux/swap.h>
 #include <linux/mmzone.h>
 #include <linux/swapops.h>
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/xarray.h>
+#include <uapi/asm-generic/mman-common.h>
+
+#define PKEY_INVALID (INT_MIN)
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -67,6 +71,97 @@ static void devmap_managed_enable_put(void)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+/*
+ * Note all devices which have asked for protections share the same key.  The
+ * key may, or may not, have been provided by the core.  If not, protection
+ * will remain disabled.  The key acquisition is attempted at init time and
+ * never again.  So we don't have to worry about dev_page_pkey changing.
+ */
+static int dev_page_pkey = PKEY_INVALID;
+DEFINE_STATIC_KEY_FALSE(dev_protection_static_key);
+EXPORT_SYMBOL(dev_protection_static_key);
+
+static pgprot_t dev_protection_enable_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
+		pgprotval_t val = pgprot_val(prot);
+
+		static_branch_inc(&dev_protection_static_key);
+		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
+	}
+	return prot;
+}
+
+static void dev_protection_enable_put(struct dev_pagemap *pgmap)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID)
+		static_branch_dec(&dev_protection_static_key);
+}
+
+void dev_access_disable(void)
+{
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return;
+
+	local_irq_save(flags);
+	current->dev_page_access_ref--;
+	if (current->dev_page_access_ref == 0)
+		pks_update_protection(dev_page_pkey, PKEY_DISABLE_ACCESS);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(dev_access_disable);
+
+void dev_access_enable(void)
+{
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return;
+
+	local_irq_save(flags);
+	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
+	if (current->dev_page_access_ref == 0)
+		pks_update_protection(dev_page_pkey, 0);
+	current->dev_page_access_ref++;
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(dev_access_enable);
+
+/**
+ * dev_access_protection_init: Configure a PKS key domain for device pages
+ *
+ * The domain defaults to the protected state.  Device page mappings should set
+ * the PGMAP_PROT_ENABLED flag when mapping pages.
+ *
+ * Note the pkey is never freed.  This is run at init time and we either get
+ * the key or we do not.  We need to do this to maintain a constant key (or
+ * not) as device memory is added or removed.
+ */
+static int __init __dev_access_protection_init(void)
+{
+	int pkey = pks_key_alloc("Device Memory");
+
+	if (pkey < 0)
+		return 0;
+
+	dev_page_pkey = pkey;
+
+	return 0;
+}
+subsys_initcall(__dev_access_protection_init);
+#else
+static pgprot_t dev_protection_enable_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	return prot;
+}
+static void dev_protection_enable_put(struct dev_pagemap *pgmap)
+{
+}
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct resource *res)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end),
@@ -156,6 +251,7 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	pgmap_array_delete(res);
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put();
+	dev_protection_enable_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -191,6 +287,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	int error, is_ram;
 	bool need_devmap_managed = true;
 
+	params.pgprot = dev_protection_enable_get(pgmap, params.pgprot);
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH RFC V2 13/17] kmap: Add stray write protection for device pages
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (11 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 12/17] memremap: Add zone device access protection ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  9:21   ` Peter Zijlstra
  2020-07-17  7:20 ` [PATCH RFC V2 14/17] dax: Stray write protection for dax_direct_access() ira.weiny
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

Device managed pages may have additional protections.  These protections
need to be removed prior to valid use by kernel users.

Check for special treatment of device managed pages in kmap and take
action if needed.  We use kmap as an interface for generic kernel code
because under normal circumstances it would be a bug for general kernel
code to not use kmap prior to accessing kernel memory.  Therefore, this
should allow any valid kernel users to seamlessly use these pages
without issues.

Because of the critical nature of kmap it must be pointed out that the
overhead on regular DRAM is carefully implemented to be as fast as
possible.  Furthermore, the underlying MSR write required on protected
device pages is cheaper than a normal MSR write.

Specifically, WRMSR(MSR_IA32_PKRS) is not serializing but still
maintains ordering properties similar to WRPKRU.  The current SDM
section on PKRS needs updating but should be the same as that of WRPKRU.
So to quote from the WRPKRU text:

	WRPKRU will never execute speculatively. Memory accesses
	affected by PKRU register will not execute (even speculatively)
	until all prior executions of WRPKRU have completed execution
	and updated the PKRU register.

Still, this will make accessing pmem more expensive from the kernel,
but the overhead is minimized and many pmem users access this memory
through user page mappings, which are not affected at all.
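
Purely as an illustration (zero_device_page() is a made-up caller; the
protection handling is entirely inside kmap()/kunmap()), a generic user
now looks like:

static void zero_device_page(struct page *page)
{
	void *addr = kmap(page);	/* opens access if this is a protected device page */

	memset(addr, 0, PAGE_SIZE);
	kunmap(page);			/* restores the protection */
}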

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/highmem.h | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index d6e82e3de027..7f809d8d5a94 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -8,6 +8,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/memremap.h>
 
 #include <asm/cacheflush.h>
 
@@ -31,6 +32,20 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
 
 #include <asm/kmap_types.h>
 
+static inline void enable_access(struct page *page)
+{
+	if (!page_is_access_protected(page))
+		return;
+	dev_access_enable();
+}
+
+static inline void disable_access(struct page *page)
+{
+	if (!page_is_access_protected(page))
+		return;
+	dev_access_disable();
+}
+
 #ifdef CONFIG_HIGHMEM
 extern void *kmap_atomic_high_prot(struct page *page, pgprot_t prot);
 extern void kunmap_atomic_high(void *kvaddr);
@@ -55,6 +70,11 @@ static inline void *kmap(struct page *page)
 	else
 		addr = kmap_high(page);
 	kmap_flush_tlb((unsigned long)addr);
+	/*
+	 * Even non-highmem pages may have additional access protections which
+	 * need to be checked and potentially enabled.
+	 */
+	enable_access(page);
 	return addr;
 }
 
@@ -63,6 +83,11 @@ void kunmap_high(struct page *page);
 static inline void kunmap(struct page *page)
 {
 	might_sleep();
+	/*
+	 * Even non-highmem pages may have additional access protections which
+	 * need to be checked and potentially disabled.
+	 */
+	disable_access(page);
 	if (!PageHighMem(page))
 		return;
 	kunmap_high(page);
@@ -85,6 +110,7 @@ static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 {
 	preempt_disable();
 	pagefault_disable();
+	enable_access(page);
 	if (!PageHighMem(page))
 		return page_address(page);
 	return kmap_atomic_high_prot(page, prot);
@@ -137,6 +163,7 @@ static inline unsigned long totalhigh_pages(void) { return 0UL; }
 static inline void *kmap(struct page *page)
 {
 	might_sleep();
+	enable_access(page);
 	return page_address(page);
 }
 
@@ -146,6 +173,7 @@ static inline void kunmap_high(struct page *page)
 
 static inline void kunmap(struct page *page)
 {
+	disable_access(page);
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
 	kunmap_flush_on_unmap(page_address(page));
 #endif
@@ -155,6 +183,7 @@ static inline void *kmap_atomic(struct page *page)
 {
 	preempt_disable();
 	pagefault_disable();
+	enable_access(page);
 	return page_address(page);
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
@@ -216,7 +245,8 @@ static inline void kmap_atomic_idx_pop(void)
 #define kunmap_atomic(addr)                                     \
 do {                                                            \
 	BUILD_BUG_ON(__same_type((addr), struct page *));       \
-	kunmap_atomic_high(addr);                                  \
+	disable_access(kmap_to_page(addr));                     \
+	kunmap_atomic_high(addr);                               \
 	pagefault_enable();                                     \
 	preempt_enable();                                       \
 } while (0)
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH RFC V2 14/17] dax: Stray write protection for dax_direct_access()
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (12 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 13/17] kmap: Add stray write protection for device pages ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  9:22   ` Peter Zijlstra
  2020-07-17  7:20 ` [PATCH RFC V2 15/17] nvdimm/pmem: Stray write protection for pmem->virt_addr ira.weiny
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

dax_direct_access() is a special case of accessing pmem via a page
offset and without a struct page.

Because the dax driver is well aware of the special protections its
memory is mapped with, call dev_access_[en|dis]able() directly instead
of incurring the unnecessary overhead of looking up a page to kmap.

Like kmap though, leverage the existing dax_read[un]lock() functions
because they are already required to surround the use of the memory
returned from dax_direct_access().
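
The resulting calling pattern is roughly the following (an illustrative
sketch only; read_one_page() and buf are made up and error handling is
trimmed):

static int read_one_page(struct dax_device *dax_dev, pgoff_t pgoff, void *buf)
{
	void *kaddr;
	long nr;
	int id;

	id = dax_read_lock();			/* now also calls dev_access_enable() */
	nr = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
	if (nr >= 1)
		memcpy(buf, kaddr, PAGE_SIZE);	/* kaddr is used only under the lock */
	dax_read_unlock(id);			/* dev_access_disable() re-arms the protection */

	return nr >= 1 ? 0 : -EIO;
}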

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/dax/super.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 021739768093..e8d0a28e6ed2 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -30,12 +30,14 @@ static DEFINE_SPINLOCK(dax_host_lock);
 
 int dax_read_lock(void)
 {
+	dev_access_enable();
 	return srcu_read_lock(&dax_srcu);
 }
 EXPORT_SYMBOL_GPL(dax_read_lock);
 
 void dax_read_unlock(int id)
 {
+	dev_access_disable();
 	srcu_read_unlock(&dax_srcu, id);
 }
 EXPORT_SYMBOL_GPL(dax_read_unlock);
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH RFC V2 15/17] nvdimm/pmem: Stray write protection for pmem->virt_addr
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (13 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 14/17] dax: Stray write protection for dax_direct_access() ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  7:20 ` [PATCH RFC V2 16/17] [dax|pmem]: Enable stray write protection ira.weiny
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

The pmem driver uses a cached virtual address to access its memory
directly.  Because the nvdimm driver is well aware of the special
protections its memory is mapped with, call dev_access_[en|dis]able()
around the direct pmem->virt_addr (pmem_addr) usage instead of
incurring the unnecessary overhead of looking up a page to kmap.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/nvdimm/pmem.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index d25e66fd942d..46c11a09b813 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -148,7 +148,9 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
 	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
 		return BLK_STS_IOERR;
 
+	dev_access_enable();
 	rc = read_pmem(page, page_off, pmem_addr, len);
+	dev_access_disable();
 	flush_dcache_page(page);
 	return rc;
 }
@@ -180,11 +182,13 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
 	 * after clear poison.
 	 */
 	flush_dcache_page(page);
+	dev_access_enable();
 	write_pmem(pmem_addr, page, page_off, len);
 	if (unlikely(bad_pmem)) {
 		rc = pmem_clear_poison(pmem, pmem_off, len);
 		write_pmem(pmem_addr, page, page_off, len);
 	}
+	dev_access_disable();
 
 	return rc;
 }
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH RFC V2 16/17] [dax|pmem]: Enable stray write protection
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (14 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 15/17] nvdimm/pmem: Stray write protection for pmem->virt_addr ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  9:25   ` Peter Zijlstra
  2020-07-17  7:20 ` [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
  2020-07-24 22:19 ` [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support Kees Cook
  17 siblings, 1 reply; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

Protecting against stray writes is particularly important for PMEM
because, unlike writes to anonymous memory, writes to PMEM persist
across a reboot.  Thus data corruption could result in permanent loss
of data.  Therefore, there is no option presented to the user.

Enable stray write protection by setting the pgmap flag which requests
it.  Note that if Zone Device Access Protection is not supported, this
flag has no effect.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/dax/device.c  | 2 ++
 drivers/nvdimm/pmem.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 4c0af2eb7e19..884f66d73d32 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -430,6 +430,8 @@ int dev_dax_probe(struct device *dev)
 	}
 
 	dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
+	dev_dax->pgmap.flags |= PGMAP_PROT_ENABLED;
+
 	addr = devm_memremap_pages(dev, &dev_dax->pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 46c11a09b813..9416a660eede 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -427,6 +427,8 @@ static int pmem_attach_disk(struct device *dev,
 		return -EBUSY;
 	}
 
+	pmem->pgmap.flags |= PGMAP_PROT_ENABLED;
+
 	q = blk_alloc_queue(pmem_make_request, dev_to_node(dev));
 	if (!q)
 		return -ENOMEM;
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (15 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 16/17] [dax|pmem]: Enable stray write protection ira.weiny
@ 2020-07-17  7:20 ` ira.weiny
  2020-07-17  9:30   ` Peter Zijlstra
                     ` (4 more replies)
  2020-07-24 22:19 ` [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support Kees Cook
  17 siblings, 5 replies; 73+ messages in thread
From: ira.weiny @ 2020-07-17  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra
  Cc: Ira Weiny, Dave Hansen, x86, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is not managed by XSAVE.  It is already preserved through a
context switch but this support leaves exception handling code open to
memory accesses which the interrupted process has allowed.

Close this hole by preserving the current task's PKRS value, resetting
the PKRS MSR on exception entry, and then restoring the state on
exception exit.

Notice that, in addition to the MSR value itself, the saved task state
must also include the device memory protection reference count which is
maintained to keep kmap re-entrant.  The following shows how this works:

...
	// ref == 0
	dev_access_enable()  // ref += 1 ==> disable protection
		irq()
			// enable protection
			// ref = 0
			_handler()
				dev_access_enable()   // ref += 1 ==> disable protection
				dev_access_disable()  // ref -= 1 ==> enable protection
			// WARN_ON(ref != 0)
			// disable protection
	do_pmem_thing()  // all good here
	dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
...

Nested exceptions should also operate the same way with each exception
storing the previous reference count all the way down.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
RFC NOTES:

First, I'm not sure if adding this state to idtentry_state and having
that state copied is the right way to go.  It seems like we should
start passing this by reference instead of by value.  But for now this
works as an RFC.  Comments?

Second, I'm not 100% happy with having to save the reference count in
the exception handler.  It seems like a very ugly layering violation but
I don't see a way around it at the moment.

Third, this patch has gone through a couple of revisions as I've had
crashes which just don't make sense to me.  One particular issue is
taking an MCE during memcpy_mcsafe() causing my WARN_ON() to fire.  The
code path was a pmem copy and the ref count should have been elevated
due to dev_access_enable(), but I don't know why
idtentry_enter()->idt_save_pkrs() was not called.

Finally, it looks like the entry/exit code is being refactored into
common code, so this is likely best handled there, since it can be
predicated on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS and handled in a generic
fashion.  But that is a ways off, I think.

This patch depends on:
	https://lore.kernel.org/lkml/159411021855.4006.13113751062324360868.tip-bot2@tip-bot2/
	Which has yet to land upstream.  I just pulled it into my test
	branch which is based on 5.8-rc5 to get the exception state
	tracking.  Hopefully it is self contained enough I'm not causing
	other issues with my tests.
---
 arch/x86/entry/common.c               | 78 ++++++++++++++++++++++++++-
 arch/x86/include/asm/idtentry.h       |  2 +
 arch/x86/include/asm/pkeys_internal.h |  3 +-
 arch/x86/kernel/process.c             |  1 -
 arch/x86/mm/pkeys.c                   |  2 +-
 5 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 0521546022cb..012bf7ecca0d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -41,6 +41,7 @@
 #include <asm/io_bitmap.h>
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
+#include <asm/pkeys_internal.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -558,6 +559,72 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not maintained with XSAVE so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory we save the
+ * current running value and set the default PKRS value for the duration of the
+ * exception.  Thus preventing exception handlers from having the access of the
+ * interrupted task.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access.  Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done.  Then restore it for the interrupted task.
+ */
+
+static void noinstr idt_save_pkrs(idtentry_state_t state)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	/*
+	 * Save the ref count of the current running process and set it to 0
+	 * for any irq users to properly track re-entrance
+	 */
+	state.pkrs_ref = current->dev_page_access_ref;
+	current->dev_page_access_ref = 0;
+#endif
+
+	state.pkrs = this_cpu_read(pkrs_cache);
+	if (state.pkrs != INIT_PKRS_VALUE)
+		write_pkrs(INIT_PKRS_VALUE);
+}
+
+static void noinstr idt_restore_pkrs(idtentry_state_t state)
+{
+	u32 pkrs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	pkrs = this_cpu_read(pkrs_cache);
+	if (state.pkrs != pkrs)
+		write_pkrs(state.pkrs);
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	WARN_ON_ONCE(current->dev_page_access_ref != 0);
+	/* Restore the interrupted process reference */
+	current->dev_page_access_ref = state.pkrs_ref;
+#endif
+}
+#else
+/* Define as macros to prevent conflict of inline and noinstr */
+#define idt_save_pkrs(state)
+#define idt_restore_pkrs(state)
+#endif
+
+
 /**
  * idtentry_enter - Handle state tracking on ordinary idtentries
  * @regs:	Pointer to pt_regs of interrupted context
@@ -604,6 +671,8 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
 		return ret;
 	}
 
+	idt_save_pkrs(ret);
+
 	/*
 	 * If this entry hit the idle task invoke rcu_irq_enter() whether
 	 * RCU is watching or not.
@@ -694,7 +763,12 @@ void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
 		prepare_exit_to_usermode(regs);
-	} else if (regs->flags & X86_EFLAGS_IF) {
+		return;
+	}
+
+	idt_restore_pkrs(state);
+
+	if (regs->flags & X86_EFLAGS_IF) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
 		 * carefully and needs the same ordering of lockdep/tracing
@@ -819,6 +893,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+		/* Normally called by idtentry_exit, we must restore pkrs here */
+		idt_restore_pkrs(state);
 		instrumentation_begin();
 		idtentry_exit_cond_resched(regs, true);
 		instrumentation_end();
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 7227225cf45d..92d5e43cda7f 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -14,6 +14,8 @@ void idtentry_enter_user(struct pt_regs *regs);
 void idtentry_exit_user(struct pt_regs *regs);
 
 typedef struct idtentry_state {
+	unsigned int pkrs_ref;
+	u32 pkrs;
 	bool exit_rcu;
 } idtentry_state_t;
 
diff --git a/arch/x86/include/asm/pkeys_internal.h b/arch/x86/include/asm/pkeys_internal.h
index e34f380c66d1..60e7b55dd141 100644
--- a/arch/x86/include/asm/pkeys_internal.h
+++ b/arch/x86/include/asm/pkeys_internal.h
@@ -27,7 +27,8 @@
 #define        PKS_NUM_KEYS            16
 
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 pkrs_val);
+DECLARE_PER_CPU(u32, pkrs_cache);
+void noinstr write_pkrs(u32 pkrs_val);
 #else
 static inline void write_pkrs(u32 pkrs_val) { }
 #endif
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index d69250a7c1bf..85f0cbd5faa5 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -194,7 +194,6 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
  * headers.
  */
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-DECLARE_PER_CPU(u32, pkrs_cache);
 static inline void pks_init_task(struct task_struct *tsk)
 {
 	/* New tasks get the most restrictive PKRS value */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index e565fadd74d7..cf9915fed6c9 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -247,7 +247,7 @@ DEFINE_PER_CPU(u32, pkrs_cache);
  * disabled as it does not guarantee the atomicity of updating the pkrs_cache
  * and MSR on its own.
  */
-void write_pkrs(u32 pkrs_val)
+void noinstr write_pkrs(u32 pkrs_val)
 {
 	this_cpu_write(pkrs_cache, pkrs_val);
 	wrmsrl(MSR_IA32_PKRS, pkrs_val);
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch
  2020-07-17  7:20 ` [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
@ 2020-07-17  8:31   ` Peter Zijlstra
  2020-07-17 21:39     ` Ira Weiny
  2020-07-17  8:59   ` Peter Zijlstra
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  8:31 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:43AM -0700, ira.weiny@intel.com wrote:

> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index f362ce0d5ac0..d69250a7c1bf 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -42,6 +42,7 @@
>  #include <asm/spec-ctrl.h>
>  #include <asm/io_bitmap.h>
>  #include <asm/proto.h>
> +#include <asm/pkeys_internal.h>
>  
>  #include "process.h"
>  
> @@ -184,6 +185,36 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
>  	return ret;
>  }
>  
> +/*
> + * NOTE: We wrap pks_init_task() and pks_sched_in() with
> + * CONFIG_ARCH_HAS_SUPERVISOR_PKEYS because using IS_ENABLED() fails
> + * due to the lack of task_struct->saved_pkrs in this configuration.
> + * Furthermore, we place them here because of the complexity introduced by
> + * header conflicts introduced to get the task_struct definition in the pkeys
> + * headers.
> + */

I don't see anything much useful in that comment.

> +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> +DECLARE_PER_CPU(u32, pkrs_cache);
> +static inline void pks_init_task(struct task_struct *tsk)
> +{
> +	/* New tasks get the most restrictive PKRS value */
> +	tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
> +}
> +static inline void pks_sched_in(void)
> +{
> +	u64 current_pkrs = current->thread.saved_pkrs;
> +
> +	/* Only update the MSR when current's pkrs is different from the MSR. */
> +	if (this_cpu_read(pkrs_cache) == current_pkrs)
> +		return;
> +
> +	write_pkrs(current_pkrs);

Should we write that like:

	/*
	 * PKRS is only temporarily changed during specific code paths.
	 * Only a preemption during these windows away from the default
	 * value would require updating the MSR.
	 */
	if (unlikely(this_cpu_read(pkrs_cache) != current_pkrs))
		write_pkrs(current_pkrs);

?

> +}
> +#else
> +static inline void pks_init_task(struct task_struct *tsk) { }
> +static inline void pks_sched_in(void) { }
> +#endif

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-07-17  7:20 ` [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support ira.weiny
@ 2020-07-17  8:54   ` Peter Zijlstra
  2020-07-17 20:52     ` Ira Weiny
  2020-07-17 22:36     ` Dave Hansen
  0 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  8:54 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:41AM -0700, ira.weiny@intel.com wrote:
> +/*
> + * Get a new pkey register value from the user values specified.
> + *
> + * Kernel users use the same flags as user space:
> + *     PKEY_DISABLE_ACCESS
> + *     PKEY_DISABLE_WRITE
> + */
> +u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val)
> +{
> +	int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
> +	u32 new_pkr_bits = 0;
> +
> +	/* Set the bits we need in the register:  */
> +	if (init_val & PKEY_DISABLE_ACCESS)
> +		new_pkr_bits |= PKR_AD_BIT;
> +	if (init_val & PKEY_DISABLE_WRITE)
> +		new_pkr_bits |= PKR_WD_BIT;
> +
> +	/* Shift the bits in to the correct place: */
> +	new_pkr_bits <<= pkey_shift;
> +
> +	/* Mask off any old bits in place: */
> +	old_pkr &= ~((PKR_AD_BIT | PKR_WD_BIT) << pkey_shift);
> +
> +	/* Return the old part along with the new part: */
> +	return old_pkr | new_pkr_bits;
> +}

This is unbelievable junk...

How about something like:

u32 update_pkey_reg(u32 pk_reg, int pkey, unsigned int flags)
{
	int pkey_shift = pkey * PKR_BITS_PER_PKEY;

	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);

	if (flags & PKEY_DISABLE_ACCESS)
		pk_reg |= PKR_AD_BIT << pkey_shift;
	if (flags & PKEY_DISABLE_WRITE)
		pk_reg |= PKR_WD_BIT << pkey_shift;

	return pk_reg;
}

Then we at least have a little clue wtf the thing does.. Yes I started
with a rename and then got annoyed at the implementation too.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch
  2020-07-17  7:20 ` [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
  2020-07-17  8:31   ` Peter Zijlstra
@ 2020-07-17  8:59   ` Peter Zijlstra
  2020-07-17 22:34     ` Ira Weiny
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  8:59 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:43AM -0700, ira.weiny@intel.com wrote:
> +/*
> + * Write the PKey Register Supervisor.  This must be run with preemption
> + * disabled as it does not guarantee the atomicity of updating the pkrs_cache
> + * and MSR on its own.
> + */
> +void write_pkrs(u32 pkrs_val)
> +{
> +	this_cpu_write(pkrs_cache, pkrs_val);
> +	wrmsrl(MSR_IA32_PKRS, pkrs_val);
> +}

Should we write that like:

void write_pkrs(u32 pkr)
{
	u32 *pkrs = get_cpu_ptr(pkrs_cache);
	if (*pkrs != pkr) {
		*pkrs = pkr;
		wrmsrl(MSR_IA32_PKRS, pkr);
	}
	put_cpu_ptrpkrs_cache);
}

given that we fundamentally need to serialize against schedule() here.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 12/17] memremap: Add zone device access protection
  2020-07-17  7:20 ` [PATCH RFC V2 12/17] memremap: Add zone device access protection ira.weiny
@ 2020-07-17  9:10   ` Peter Zijlstra
  2020-07-18  5:06     ` Ira Weiny
  2020-07-17  9:17   ` Peter Zijlstra
  2020-07-17  9:20   ` Peter Zijlstra
  2 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  9:10 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:51AM -0700, ira.weiny@intel.com wrote:
> +static pgprot_t dev_protection_enable_get(struct dev_pagemap *pgmap, pgprot_t prot)
> +{
> +	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
> +		pgprotval_t val = pgprot_val(prot);
> +
> +		static_branch_inc(&dev_protection_static_key);
> +		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
> +	}
> +	return prot;
> +}

Every other pgprot modifying function is called pgprot_*(), although I
suppose we have the exceptions phys_mem_access_prot() and dma_pgprot().

How about we call this one devm_pgprot() ?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 12/17] memremap: Add zone device access protection
  2020-07-17  7:20 ` [PATCH RFC V2 12/17] memremap: Add zone device access protection ira.weiny
  2020-07-17  9:10   ` Peter Zijlstra
@ 2020-07-17  9:17   ` Peter Zijlstra
  2020-07-18  5:51     ` Ira Weiny
  2020-07-17  9:20   ` Peter Zijlstra
  2 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  9:17 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:51AM -0700, ira.weiny@intel.com wrote:
> +void dev_access_disable(void)
> +{
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&dev_protection_static_key))
> +		return;
> +
> +	local_irq_save(flags);
> +	current->dev_page_access_ref--;
> +	if (current->dev_page_access_ref == 0)

	if (!--current->dev_page_access_ref)

> +		pks_update_protection(dev_page_pkey, PKEY_DISABLE_ACCESS);
> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(dev_access_disable);
> +
> +void dev_access_enable(void)
> +{
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&dev_protection_static_key))
> +		return;
> +
> +	local_irq_save(flags);
> +	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
> +	if (current->dev_page_access_ref == 0)
> +		pks_update_protection(dev_page_pkey, 0);
> +	current->dev_page_access_ref++;

	if (!current->dev_page_access_ref++)

> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(dev_access_enable);


Also, you probably want something like:

static __always_inline devm_access_disable(void)
{
	if (static_branch_unlikely(&dev_protection_static_key))
		__devm_access_disable();
}

static __always_inline devm_access_enable(void)
{
	if (static_branch_unlikely(&dev_protection_static_key))
		__devm_access_enable();
}
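
With the slow paths then being the existing bodies minus the static key
check, roughly (untested sketch, the out-of-line names are made up here):

void __devm_access_disable(void)
{
	unsigned long flags;

	local_irq_save(flags);
	if (!--current->dev_page_access_ref)
		pks_update_protection(dev_page_pkey, PKEY_DISABLE_ACCESS);
	local_irq_restore(flags);
}

void __devm_access_enable(void)
{
	unsigned long flags;

	local_irq_save(flags);
	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
	if (!current->dev_page_access_ref++)
		pks_update_protection(dev_page_pkey, 0);
	local_irq_restore(flags);
}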

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 12/17] memremap: Add zone device access protection
  2020-07-17  7:20 ` [PATCH RFC V2 12/17] memremap: Add zone device access protection ira.weiny
  2020-07-17  9:10   ` Peter Zijlstra
  2020-07-17  9:17   ` Peter Zijlstra
@ 2020-07-17  9:20   ` Peter Zijlstra
  2 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  9:20 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:51AM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Device managed memory exposes itself to the kernel direct map which
> allows stray pointers to access these device memories.
> 
> Stray pointers to normal memory may result in a crash or other
> undesirable behavior which, while unfortunate, is usually recoverable
> with a reboot.  Stray writes to areas such as non-volatile memory are
> permanent in nature and thus are more likely to result in permanent user
> data loss than a stray write to other memory areas.

> +		pks_update_protection(dev_page_pkey, PKEY_DISABLE_ACCESS);

So on the one hand you talk about the problem of stray writes, but then
you disable all access.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 13/17] kmap: Add stray write protection for device pages
  2020-07-17  7:20 ` [PATCH RFC V2 13/17] kmap: Add stray write protection for device pages ira.weiny
@ 2020-07-17  9:21   ` Peter Zijlstra
  2020-07-19  4:13     ` Ira Weiny
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  9:21 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:52AM -0700, ira.weiny@intel.com wrote:
> @@ -31,6 +32,20 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>  
>  #include <asm/kmap_types.h>
>  
> +static inline void enable_access(struct page *page)
> +{
> +	if (!page_is_access_protected(page))
> +		return;
> +	dev_access_enable();
> +}
> +
> +static inline void disable_access(struct page *page)
> +{
> +	if (!page_is_access_protected(page))
> +		return;
> +	dev_access_disable();
> +}

These are some very generic names, do we want them to be a little more
specific?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 14/17] dax: Stray write protection for dax_direct_access()
  2020-07-17  7:20 ` [PATCH RFC V2 14/17] dax: Stray write protection for dax_direct_access() ira.weiny
@ 2020-07-17  9:22   ` Peter Zijlstra
  2020-07-19  4:41     ` Ira Weiny
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  9:22 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:53AM -0700, ira.weiny@intel.com wrote:

> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -30,12 +30,14 @@ static DEFINE_SPINLOCK(dax_host_lock);
>  
>  int dax_read_lock(void)
>  {
> +	dev_access_enable();
>  	return srcu_read_lock(&dax_srcu);
>  }
>  EXPORT_SYMBOL_GPL(dax_read_lock);
>  
>  void dax_read_unlock(int id)
>  {
> +	dev_access_disable();
>  	srcu_read_unlock(&dax_srcu, id);
>  }
>  EXPORT_SYMBOL_GPL(dax_read_unlock);

This is inconsistently ordered.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 16/17] [dax|pmem]: Enable stray write protection
  2020-07-17  7:20 ` [PATCH RFC V2 16/17] [dax|pmem]: Enable stray write protection ira.weiny
@ 2020-07-17  9:25   ` Peter Zijlstra
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  9:25 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:55AM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Protecting against stray writes is particularly important for PMEM
> because, unlike writes to anonymous memory, writes to PMEM persists
> across a reboot.  Thus data corruption could result in permanent loss of
> data.  Therefore, there is no option presented to the user.
> 
> Enable stray write protection by setting the flag in pgmap which
> requests it.  Note that if Zone Device Access Protection is not supported
> this flag will have no effect.

The actual implementation is stray-access-protection, as noted earlier.
This inconsistency is throughout.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-17  7:20 ` [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
@ 2020-07-17  9:30   ` Peter Zijlstra
  2020-07-21 18:01     ` Ira Weiny
  2020-07-17  9:34   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  9:30 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> +static void noinstr idt_save_pkrs(idtentry_state_t state)

noinstr goes in the same place you would normally put inline, that's
before the return type, not after it.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-17  7:20 ` [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
  2020-07-17  9:30   ` Peter Zijlstra
@ 2020-07-17  9:34   ` Peter Zijlstra
  2020-07-17 10:06   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17  9:34 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> +/* Define as macros to prevent conflict of inline and noinstr */
> +#define idt_save_pkrs(state)
> +#define idt_restore_pkrs(state)

Use __always_inline

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-17  7:20 ` [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
  2020-07-17  9:30   ` Peter Zijlstra
  2020-07-17  9:34   ` Peter Zijlstra
@ 2020-07-17 10:06   ` Peter Zijlstra
  2020-07-22  5:27     ` Ira Weiny
  2020-07-22 16:21   ` Andy Lutomirski
  2020-07-23 19:53   ` Thomas Gleixner
  4 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-17 10:06 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> First I'm not sure if adding this state to idtentry_state and having
> that state copied is the right way to go.  It seems like we should start
> passing this by reference instead of value.  But for now this works as
> an RFC.  Comments?

As long as you keep sizeof(struct idtentry_state_t) <= sizeof(u64) or
possibly 2*sizeof(unsigned long), code gen shouldn't be too horrid IIRC.
You'll have to look at what the compiler makes of it.
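
(With the two fields added here that is 4+4 bytes on top of the bool, so
roughly 12 bytes after padding -- more than a single u64 but still within
2*sizeof(unsigned long) on x86-64, so it should still travel in registers.
Worth checking the generated code either way.)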

> Second, I'm not 100% happy with having to save the reference count in
> the exception handler.  It seems like a very ugly layering violation but
> I don't see a way around it at the moment.

So I've been struggling with that API, all the way from
pks_update_protection() to that dev_access_{en,dis}able(). I _really_
hate it, but I see how you ended up with it.

I wanted to propose something like:

u32 current_pkey_save(int pkey, unsigned flags)
{
	u32 *lpkr = get_cpu_ptr(&local_pkr);
	u32 pkr, saved = *lpkr;

	pkr = update_pkey_reg(saved, pkey, flags);
	if (pkr != saved)
		wrpkr(pkr);

	put_cpu_ptr(&local_pkr);
	return saved;
}

void current_pkey_restore(u32 pkr)
{
	u32 *lpkr = get_cpu_ptr(&local_pkr);
	if (*lpkr != pkr)
		wrpkr(pkr);
	put_cpu_ptr(&local_pkr);
}

Together with:

void pkey_switch(struct task_struct *prev, struct task_struct *next)
{
	prev->pkr = this_cpu_read(local_pkr);
	if (prev->pkr != next->pkr)
		wrpkr(next->pkr);
}

But that's actually hard to frob into the kmap() model :-( The upside is
that you only have 1 word of state, instead of the 2 you have now.
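
Usage would then be something like (call site made up, purely to illustrate):

	u32 pkr;

	pkr = current_pkey_save(dev_page_pkey, 0);	/* grant access */
	/* ... touch the device pages ... */
	current_pkey_restore(pkr);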

> Third, this patch has gone through a couple of revisions as I've had
> crashes which just don't make sense to me.  One particular issue I've
> had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
> The code path was a pmem copy and the ref count should have been
> elevated due to dev_access_enable() but why was
> idtentry_enter()->idt_save_pkrs() not called I don't know.

Because MCEs are NMI-like and don't go through the normal interrupt
path. MCEs are an abomination, please wear all the protective devices
you can lay hands on when delving into that.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-07-17  8:54   ` Peter Zijlstra
@ 2020-07-17 20:52     ` Ira Weiny
  2020-07-20  9:14       ` Peter Zijlstra
  2020-07-17 22:36     ` Dave Hansen
  1 sibling, 1 reply; 73+ messages in thread
From: Ira Weiny @ 2020-07-17 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 10:54:42AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:41AM -0700, ira.weiny@intel.com wrote:
> > +/*
> > + * Get a new pkey register value from the user values specified.
> > + *
> > + * Kernel users use the same flags as user space:
> > + *     PKEY_DISABLE_ACCESS
> > + *     PKEY_DISABLE_WRITE
> > + */
> > +u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val)
> > +{
> > +	int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
> > +	u32 new_pkr_bits = 0;
> > +
> > +	/* Set the bits we need in the register:  */
> > +	if (init_val & PKEY_DISABLE_ACCESS)
> > +		new_pkr_bits |= PKR_AD_BIT;
> > +	if (init_val & PKEY_DISABLE_WRITE)
> > +		new_pkr_bits |= PKR_WD_BIT;
> > +
> > +	/* Shift the bits in to the correct place: */
> > +	new_pkr_bits <<= pkey_shift;
> > +
> > +	/* Mask off any old bits in place: */
> > +	old_pkr &= ~((PKR_AD_BIT | PKR_WD_BIT) << pkey_shift);
> > +
> > +	/* Return the old part along with the new part: */
> > +	return old_pkr | new_pkr_bits;
> > +}
> 
> This is unbelievable junk...
> 
> How about something like:
> 
> u32 update_pkey_reg(u32 pk_reg, int pkey, unsigned int flags)
> {
> 	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> 
> 	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> 
> 	if (flags & PKEY_DISABLE_ACCESS)
> 		pk_reg |= PKR_AD_BIT << pkey_shift;
> 	if (flags & PKEY_DISABLE_WRITE)
> 		pk_reg |= PKR_WD_BIT << pkey_shift;
> 
> 	return pk_reg;
> }
> 
> Then we at least have a little clue wtf the thing does.. Yes I started
> with a rename and then got annoyed at the implementation too.

On the code I think this is fair.  I've also updated the calling function to be
a bit cleaner as well.

However, I think the name 'update' is a bit misleading.  Here is the new
calling code:

...
        pkru = read_pkru();
	pkru = update_pkey_reg(pkru, pkey, init_val);
	write_pkru(pkru);
...


I think it is odd to have a function called update_pkey_reg() called right
before a write_pkru().  Can we call this update_pkey_value?  or just 'val'?
Because write_pkru() actually updates the register.

Thanks for the review,
Ira


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch
  2020-07-17  8:31   ` Peter Zijlstra
@ 2020-07-17 21:39     ` Ira Weiny
  0 siblings, 0 replies; 73+ messages in thread
From: Ira Weiny @ 2020-07-17 21:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 10:31:40AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:43AM -0700, ira.weiny@intel.com wrote:
> 
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index f362ce0d5ac0..d69250a7c1bf 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -42,6 +42,7 @@
> >  #include <asm/spec-ctrl.h>
> >  #include <asm/io_bitmap.h>
> >  #include <asm/proto.h>
> > +#include <asm/pkeys_internal.h>
> >  
> >  #include "process.h"
> >  
> > @@ -184,6 +185,36 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
> >  	return ret;
> >  }
> >  
> > +/*
> > + * NOTE: We wrap pks_init_task() and pks_sched_in() with
> > + * CONFIG_ARCH_HAS_SUPERVISOR_PKEYS because using IS_ENABLED() fails
> > + * due to the lack of task_struct->saved_pkrs in this configuration.
> > + * Furthermore, we place them here because of the complexity introduced by
> > + * header conflicts introduced to get the task_struct definition in the pkeys
> > + * headers.
> > + */
> 
> I don't see anything much useful in that comment.

I'm happy to delete.  Internal reviews questioned the motive here so I added
the comment to explain why this style was chosen rather than the preferred
IS_ENABLED().
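
(For reference, the IS_ENABLED() form still has to name the field and so does
not build when CONFIG_ARCH_HAS_SUPERVISOR_PKEYS is off, e.g.:

	if (IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
		tsk->thread.saved_pkrs = INIT_PKRS_VALUE;

because thread_struct has no saved_pkrs member in that configuration, even
though the branch is dead.)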

I've deleted it now.

> 
> > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +DECLARE_PER_CPU(u32, pkrs_cache);
> > +static inline void pks_init_task(struct task_struct *tsk)
> > +{
> > +	/* New tasks get the most restrictive PKRS value */
> > +	tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
> > +}
> > +static inline void pks_sched_in(void)
> > +{
> > +	u64 current_pkrs = current->thread.saved_pkrs;
> > +
> > +	/* Only update the MSR when current's pkrs is different from the MSR. */
> > +	if (this_cpu_read(pkrs_cache) == current_pkrs)
> > +		return;
> > +
> > +	write_pkrs(current_pkrs);
> 
> Should we write that like:
> 
> 	/*
> 	 * PKRS is only temporarily changed during specific code paths.
> 	 * Only a preemption during these windows away from the default
> 	 * value would require updating the MSR.
> 	 */
> 	if (unlikely(this_cpu_read(pkrs_cache) != current_pkrs))
> 		write_pkrs(current_pkrs);
> 
> ?

Yes I think the unlikely is better.

Thanks,
Ira

> 
> > +}
> > +#else
> > +static inline void pks_init_task(struct task_struct *tsk) { }
> > +static inline void pks_sched_in(void) { }
> > +#endif

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch
  2020-07-17  8:59   ` Peter Zijlstra
@ 2020-07-17 22:34     ` Ira Weiny
  2020-07-20  9:15       ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Ira Weiny @ 2020-07-17 22:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 10:59:54AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:43AM -0700, ira.weiny@intel.com wrote:
> > +/*
> > + * Write the PKey Register Supervisor.  This must be run with preemption
> > + * disabled as it does not guarantee the atomicity of updating the pkrs_cache
> > + * and MSR on its own.
> > + */
> > +void write_pkrs(u32 pkrs_val)
> > +{
> > +	this_cpu_write(pkrs_cache, pkrs_val);
> > +	wrmsrl(MSR_IA32_PKRS, pkrs_val);
> > +}
> 
> Should we write that like:
> 
> void write_pkrs(u32 pkr)
> {
> 	u32 *pkrs = get_cpu_ptr(pkrs_cache);
> 	if (*pkrs != pkr) {
> 		*pkrs = pkr;
> 		wrmsrl(MSR_IA32_PKRS, pkr);
> 	}
> 	put_cpu_ptrpkrs_cache);
> }
> 
> given that we fundamentally need to serialize against schedule() here.

Yes.  That seems better.

That also means pks_sched_in() can be simplified to just

static inline void pks_sched_in(void)
{
	write_pkrs(current->thread.saved_pkrs);
}

Because of the built-in WRMSR avoidance.

However, pkrs_cache is static so I think I need to use {get,put}_cpu_var() here
don't I?

Thanks!
Ira


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-07-17  8:54   ` Peter Zijlstra
  2020-07-17 20:52     ` Ira Weiny
@ 2020-07-17 22:36     ` Dave Hansen
  2020-07-20  9:13       ` Peter Zijlstra
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Hansen @ 2020-07-17 22:36 UTC (permalink / raw)
  To: Peter Zijlstra, ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On 7/17/20 1:54 AM, Peter Zijlstra wrote:
> This is unbelievable junk...

Ouch!

This is from the original user pkeys implementation.

> How about something like:
> 
> u32 update_pkey_reg(u32 pk_reg, int pkey, unsigned int flags)
> {
> 	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> 
> 	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> 
> 	if (flags & PKEY_DISABLE_ACCESS)
> 		pk_reg |= PKR_AD_BIT << pkey_shift;
> 	if (flags & PKEY_DISABLE_WRITE)
> 		pk_reg |= PKR_WD_BIT << pkey_shift;
> 
> 	return pk_reg;
> }
> 
> Then we at least have a little clue wtf the thing does.. Yes I started
> with a rename and then got annoyed at the implementation too.

That's fine, if some comments get added.  It looks correct to me but
probably compiles down to pretty much the same thing as what was there.
 FWIW, I prefer the explicit masking off of two bit values to implicit
masking off with a mask generated from PKR_BITS_PER_PKEY.  It's
certainly more compact, but I usually don't fret over the lines of code.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 12/17] memremap: Add zone device access protection
  2020-07-17  9:10   ` Peter Zijlstra
@ 2020-07-18  5:06     ` Ira Weiny
  2020-07-20  9:16       ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Ira Weiny @ 2020-07-18  5:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 11:10:53AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:51AM -0700, ira.weiny@intel.com wrote:
> > +static pgprot_t dev_protection_enable_get(struct dev_pagemap *pgmap, pgprot_t prot)
> > +{
> > +	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
> > +		pgprotval_t val = pgprot_val(prot);
> > +
> > +		static_branch_inc(&dev_protection_static_key);
> > +		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
> > +	}
> > +	return prot;
> > +}
> 
> Every other pgprot modifying function is called pgprot_*(), although I
> suppose we have the exceptions phys_mem_access_prot() and dma_pgprot().

Yea...  this function kind of morphed.  The issue is that this is also a 'get'
with a corresponding 'put'.  So I'm at a loss for what makes sense between the
2 functions.

> 
> How about we call this one devm_pgprot() ?

Dan Williams mentioned to me that the devm is not an appropriate prefix.  Thus
the 'dev' prefix instead.

How about dev_pgprot_{get,put}()?
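
i.e. roughly the following, where the put side is just the mirror that drops
the static key reference (sketch only, not tested):

static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
{
	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
		static_branch_inc(&dev_protection_static_key);
		prot = __pgprot(pgprot_val(prot) | _PAGE_PKEY(dev_page_pkey));
	}
	return prot;
}

static void dev_pgprot_put(struct dev_pagemap *pgmap)
{
	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID)
		static_branch_dec(&dev_protection_static_key);
}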

Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 12/17] memremap: Add zone device access protection
  2020-07-17  9:17   ` Peter Zijlstra
@ 2020-07-18  5:51     ` Ira Weiny
  0 siblings, 0 replies; 73+ messages in thread
From: Ira Weiny @ 2020-07-18  5:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 11:17:18AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:51AM -0700, ira.weiny@intel.com wrote:
> > +void dev_access_disable(void)
> > +{
> > +	unsigned long flags;
> > +
> > +	if (!static_branch_unlikely(&dev_protection_static_key))
> > +		return;
> > +
> > +	local_irq_save(flags);
> > +	current->dev_page_access_ref--;
> > +	if (current->dev_page_access_ref == 0)
> 
> 	if (!--current->dev_page_access_ref)

It's not my style but I'm ok with it.

> 
> > +		pks_update_protection(dev_page_pkey, PKEY_DISABLE_ACCESS);
> > +	local_irq_restore(flags);
> > +}
> > +EXPORT_SYMBOL_GPL(dev_access_disable);
> > +
> > +void dev_access_enable(void)
> > +{
> > +	unsigned long flags;
> > +
> > +	if (!static_branch_unlikely(&dev_protection_static_key))
> > +		return;
> > +
> > +	local_irq_save(flags);
> > +	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
> > +	if (current->dev_page_access_ref == 0)
> > +		pks_update_protection(dev_page_pkey, 0);
> > +	current->dev_page_access_ref++;
> 
> 	if (!current->dev_page_access_ref++)

Sure.

> 
> > +	local_irq_restore(flags);
> > +}
> > +EXPORT_SYMBOL_GPL(dev_access_enable);
> 
> 
> Also, you probably want something like:
> 
> static __always_inline devm_access_disable(void)

Yes that is better.

However, again Dan and I agree devm is not the right prefix here.

I've updated.

Thanks!
Ira

> {
> 	if (static_branch_unlikely(&dev_protection_static_key))
> 		__devm_access_disable();
> }
> 
> static __always_inline devm_access_enable(void)
> {
> 	if (static_branch_unlikely(&dev_protection_static_key))
> 		__devm_access_enable();
> }

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 13/17] kmap: Add stray write protection for device pages
  2020-07-17  9:21   ` Peter Zijlstra
@ 2020-07-19  4:13     ` Ira Weiny
  2020-07-20  9:17       ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Ira Weiny @ 2020-07-19  4:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 11:21:39AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:52AM -0700, ira.weiny@intel.com wrote:
> > @@ -31,6 +32,20 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
> >  
> >  #include <asm/kmap_types.h>
> >  
> > +static inline void enable_access(struct page *page)
> > +{
> > +	if (!page_is_access_protected(page))
> > +		return;
> > +	dev_access_enable();
> > +}
> > +
> > +static inline void disable_access(struct page *page)
> > +{
> > +	if (!page_is_access_protected(page))
> > +		return;
> > +	dev_access_disable();
> > +}
> 
> These are some very generic names, do we want them to be a little more
> specific?

I had them named kmap_* but Dave (I think it was Dave) thought they did not
really apply strictly to kmap_*.

They are static to this file which I thought may be sufficient to 'uniqify'
them?

I'm ok to change them but that is how I arrived at this name.

Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 14/17] dax: Stray write protection for dax_direct_access()
  2020-07-17  9:22   ` Peter Zijlstra
@ 2020-07-19  4:41     ` Ira Weiny
  0 siblings, 0 replies; 73+ messages in thread
From: Ira Weiny @ 2020-07-19  4:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 11:22:43AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:53AM -0700, ira.weiny@intel.com wrote:
> 
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -30,12 +30,14 @@ static DEFINE_SPINLOCK(dax_host_lock);
> >  
> >  int dax_read_lock(void)
> >  {
> > +	dev_access_enable();
> >  	return srcu_read_lock(&dax_srcu);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_read_lock);
> >  
> >  void dax_read_unlock(int id)
> >  {
> > +	dev_access_disable();
> >  	srcu_read_unlock(&dax_srcu, id);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_read_unlock);
> 
> This is inconsistently ordered.

Thanks, good catch.

Fixed.
Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-07-17 22:36     ` Dave Hansen
@ 2020-07-20  9:13       ` Peter Zijlstra
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-20  9:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: ira.weiny, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Fenghua Yu, x86, Dave Hansen, Dan Williams,
	Vishal Verma, Andrew Morton, linux-doc, linux-kernel,
	linux-nvdimm, linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 03:36:12PM -0700, Dave Hansen wrote:
> On 7/17/20 1:54 AM, Peter Zijlstra wrote:
> > This is unbelievable junk...
> 
> Ouch!
> 
> This is from the original user pkeys implementation.

The thing I fell over most was new in this patch; the naming of that
function. It doesn't 'get' anything, nor does it allocate anything, so
'new' is out the window too.

> > How about something like:
> > 
> > u32 update_pkey_reg(u32 pk_reg, int pkey, unsigned int flags)
> > {
> > 	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> > 
> > 	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> > 
> > 	if (flags & PKEY_DISABLE_ACCESS)
> > 		pk_reg |= PKR_AD_BIT << pkey_shift;
> > 	if (flags & PKEY_DISABLE_WRITE)
> > 		pk_reg |= PKR_WD_BIT << pkey_shift;
> > 
> > 	return pk_reg;
> > }
> > 
> > Then we at least have a little clue wtf the thing does.. Yes I started
> > with a rename and then got annoyed at the implementation too.
> 
> That's fine, if some comments get added.

I'm not sure what you would want commented; the code is trivial.

> It looks correct to me but
> probably compiles down to pretty much the same thing as what was there.
>  FWIW, I prefer the explicit masking off of two bit values to implicit
> masking off with a mask generated from PKR_BITS_PER_PKEY.  It's
> certainly more compact, but I usually don't fret over the lines of code.

This way you're sure there are no bits missed. Both the shift and mask
use the same value.
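
E.g. for pkey 3, assuming PKR_BITS_PER_PKEY == 2 (one AD and one WD bit per
key):

	pkey_shift       = 3 * 2 = 6
	(1 << 2) - 1     = 0x3
	~(0x3 << 6)      = ~0xc0

so exactly the two bits belonging to key 3 are cleared before the new AD/WD
flags are OR'd back in at the same shift.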

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  2020-07-17 20:52     ` Ira Weiny
@ 2020-07-20  9:14       ` Peter Zijlstra
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-20  9:14 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 01:52:55PM -0700, Ira Weiny wrote:
> On Fri, Jul 17, 2020 at 10:54:42AM +0200, Peter Zijlstra wrote:

> > Then we at least have a little clue wtf the thing does.. Yes I started
> > with a rename and then got annoyed at the implementation too.
> 
> On the code I think this is fair.  I've also updated the calling function to be
> a bit cleaner as well.
> 
> However, I think the name 'update' is a bit misleading.  Here is the new
> calling code:
> 
> ...
>         pkru = read_pkru();
> 	pkru = update_pkey_reg(pkru, pkey, init_val);
> 	write_pkru(pkru);
> ...
> 
> 
> I think it is odd to have a function called update_pkey_reg() called right
> before a write_pkru().  Can we call this update_pkey_value?  or just 'val'?
> Because write_pkru() actually updates the register.

Fair enough, update_pkey_val() works fine for me.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch
  2020-07-17 22:34     ` Ira Weiny
@ 2020-07-20  9:15       ` Peter Zijlstra
  2020-07-20 18:35         ` Ira Weiny
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-20  9:15 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 03:34:07PM -0700, Ira Weiny wrote:
> On Fri, Jul 17, 2020 at 10:59:54AM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 17, 2020 at 12:20:43AM -0700, ira.weiny@intel.com wrote:
> > > +/*
> > > + * Write the PKey Register Supervisor.  This must be run with preemption
> > > + * disabled as it does not guarantee the atomicity of updating the pkrs_cache
> > > + * and MSR on its own.
> > > + */
> > > +void write_pkrs(u32 pkrs_val)
> > > +{
> > > +	this_cpu_write(pkrs_cache, pkrs_val);
> > > +	wrmsrl(MSR_IA32_PKRS, pkrs_val);
> > > +}
> > 
> > Should we write that like:
> > 
> > void write_pkrs(u32 pkr)
> > {
> > 	u32 *pkrs = get_cpu_ptr(pkrs_cache);
> > 	if (*pkrs != pkr) {
> > 		*pkrs = pkr;
> > 		wrmsrl(MSR_IA32_PKRS, pkr);
> > 	}
> > 	put_cpu_ptrpkrs_cache);
> > }
> > 
> > given that we fundamentally need to serialize against schedule() here.
> 
> Yes.  That seems better.
> 
> That also means pks_sched_in() can be simplified to just
> 
> static inline void pks_sched_in(void)
> {
> 	write_pkrs(current->thread.saved_pkrs);
> }
> 
> Because of the built-in WRMSR avoidance.
> 
> However, pkrs_cache is static so I think I need to use {get,put}_cpu_var() here
> don't I?

Or get_cpu_ptr(&pkrs_cache), sorry for the typo :-)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 12/17] memremap: Add zone device access protection
  2020-07-18  5:06     ` Ira Weiny
@ 2020-07-20  9:16       ` Peter Zijlstra
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-20  9:16 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 10:06:50PM -0700, Ira Weiny wrote:
> On Fri, Jul 17, 2020 at 11:10:53AM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 17, 2020 at 12:20:51AM -0700, ira.weiny@intel.com wrote:
> > > +static pgprot_t dev_protection_enable_get(struct dev_pagemap *pgmap, pgprot_t prot)
> > > +{
> > > +	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
> > > +		pgprotval_t val = pgprot_val(prot);
> > > +
> > > +		static_branch_inc(&dev_protection_static_key);
> > > +		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
> > > +	}
> > > +	return prot;
> > > +}
> > 
> > Every other pgprot modifying function is called pgprot_*(), although I
> > suppose we have the exceptions phys_mem_access_prot() and dma_pgprot().
> 
> Yea...  this function kind of morphed.  The issue is that this is also a 'get'
> with a corresponding 'put'.  So I'm at a loss for what makes sense between the
> 2 functions.
> 
> > 
> > How about we call this one devm_pgprot() ?
> 
> Dan Williams mentioned to me that the devm is not an appropriate prefix.  Thus
> the 'dev' prefix instead.
> 
> How about dev_pgprot_{get,put}()?

works for me, thanks!

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 13/17] kmap: Add stray write protection for device pages
  2020-07-19  4:13     ` Ira Weiny
@ 2020-07-20  9:17       ` Peter Zijlstra
  2020-07-21 16:31         ` Ira Weiny
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-20  9:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Sat, Jul 18, 2020 at 09:13:19PM -0700, Ira Weiny wrote:
> On Fri, Jul 17, 2020 at 11:21:39AM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 17, 2020 at 12:20:52AM -0700, ira.weiny@intel.com wrote:
> > > @@ -31,6 +32,20 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
> > >  
> > >  #include <asm/kmap_types.h>
> > >  
> > > +static inline void enable_access(struct page *page)
> > > +{
> > > +	if (!page_is_access_protected(page))
> > > +		return;
> > > +	dev_access_enable();
> > > +}
> > > +
> > > +static inline void disable_access(struct page *page)
> > > +{
> > > +	if (!page_is_access_protected(page))
> > > +		return;
> > > +	dev_access_disable();
> > > +}
> > 
> > These are some very generic names, do we want them to be a little more
> > specific?
> 
> I had them named kmap_* but Dave (I think it was Dave) thought they did not
> really apply strictly to kmap_*.
> 
> They are static to this file which I thought may be sufficient to 'uniqify'
> them?

They're static to a .h, which means they're all over the place ;-)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch
  2020-07-20  9:15       ` Peter Zijlstra
@ 2020-07-20 18:35         ` Ira Weiny
  0 siblings, 0 replies; 73+ messages in thread
From: Ira Weiny @ 2020-07-20 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Fenghua Yu, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Mon, Jul 20, 2020 at 11:15:54AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 03:34:07PM -0700, Ira Weiny wrote:
> > On Fri, Jul 17, 2020 at 10:59:54AM +0200, Peter Zijlstra wrote:
> > > On Fri, Jul 17, 2020 at 12:20:43AM -0700, ira.weiny@intel.com wrote:
> > > > +/*
> > > > + * Write the PKey Register Supervisor.  This must be run with preemption
> > > > + * disabled as it does not guarantee the atomicity of updating the pkrs_cache
> > > > + * and MSR on its own.
> > > > + */
> > > > +void write_pkrs(u32 pkrs_val)
> > > > +{
> > > > +	this_cpu_write(pkrs_cache, pkrs_val);
> > > > +	wrmsrl(MSR_IA32_PKRS, pkrs_val);
> > > > +}
> > > 
> > > Should we write that like:
> > > 
> > > void write_pkrs(u32 pkr)
> > > {
> > > 	u32 *pkrs = get_cpu_ptr(pkrs_cache);
> > > 	if (*pkrs != pkr) {
> > > 		*pkrs = pkr;
> > > 		wrmsrl(MSR_IA32_PKRS, pkr);
> > > 	}
> > > 	put_cpu_ptrpkrs_cache);
> > > }
> > > 
> > > given that we fundamentally need to serialize against schedule() here.
> > 
> > Yes.  That seems better.
> > 
> > That also means pks_sched_in() can be simplified to just
> > 
> > static inline void pks_sched_in(void)
> > {
> > 	write_pkrs(current->thread.saved_pkrs);
> > }
> > 
> > Because of the built-in WRMSR avoidance.
> > 
> > However, pkrs_cache is static so I think I need to use {get,put}_cpu_var() here
> > don't I?
> 
> Or get_cpu_ptr(&pkrs_cache), sorry for the typo :-)

Ah I see...  sorry, yes that will work.

Done,
Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 13/17] kmap: Add stray write protection for device pages
  2020-07-20  9:17       ` Peter Zijlstra
@ 2020-07-21 16:31         ` Ira Weiny
  0 siblings, 0 replies; 73+ messages in thread
From: Ira Weiny @ 2020-07-21 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	x86, Dave Hansen, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Mon, Jul 20, 2020 at 11:17:40AM +0200, Peter Zijlstra wrote:
> On Sat, Jul 18, 2020 at 09:13:19PM -0700, Ira Weiny wrote:
> > On Fri, Jul 17, 2020 at 11:21:39AM +0200, Peter Zijlstra wrote:
> > > On Fri, Jul 17, 2020 at 12:20:52AM -0700, ira.weiny@intel.com wrote:
> > > > @@ -31,6 +32,20 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
> > > >  
> > > >  #include <asm/kmap_types.h>
> > > >  
> > > > +static inline void enable_access(struct page *page)
> > > > +{
> > > > +	if (!page_is_access_protected(page))
> > > > +		return;
> > > > +	dev_access_enable();
> > > > +}
> > > > +
> > > > +static inline void disable_access(struct page *page)
> > > > +{
> > > > +	if (!page_is_access_protected(page))
> > > > +		return;
> > > > +	dev_access_disable();
> > > > +}
> > > 
> > > These are some very generic names, do we want them to be a little more
> > > specific?
> > 
> > I had them named kmap_* but Dave (I think it was Dave) thought they did not
> > really apply strictly to kmap_*.
> > 
> > They are static to this file which I thought may be sufficient to 'uniqify'
> > them?
> 
> They're static to a .h, which means they're all over the place ;-)

I've thought about it a bit.  I think I agree with both you and Dave.

How about:

dev_page_{en,dis}able_access()

??
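
i.e. the same helpers as before, just renamed (sketch):

static inline void dev_page_enable_access(struct page *page)
{
	if (!page_is_access_protected(page))
		return;
	dev_access_enable();
}

static inline void dev_page_disable_access(struct page *page)
{
	if (!page_is_access_protected(page))
		return;
	dev_access_disable();
}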

I've made that change for now.

Thanks,
Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-17  9:30   ` Peter Zijlstra
@ 2020-07-21 18:01     ` Ira Weiny
  2020-07-21 19:11       ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Ira Weiny @ 2020-07-21 18:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 11:30:41AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> > +static void noinstr idt_save_pkrs(idtentry_state_t state)
> 
> noinstr goes in the same place you would normally put inline, that's
> before the return type, not after it.
>

Sorry about that.  But that does not look to be consistent.

10:57:35 > git grep 'noinstr' arch/x86/entry/common.c
...
arch/x86/entry/common.c:idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
arch/x86/entry/common.c:void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
arch/x86/entry/common.c:void noinstr idtentry_enter_user(struct pt_regs *regs)
arch/x86/entry/common.c:void noinstr idtentry_exit_user(struct pt_regs *regs)
...

Are the above 'wrong'?  Is it worth me sending a patch?

I've changed my code,
Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-21 18:01     ` Ira Weiny
@ 2020-07-21 19:11       ` Peter Zijlstra
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-21 19:11 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Tue, Jul 21, 2020 at 11:01:34AM -0700, Ira Weiny wrote:
> On Fri, Jul 17, 2020 at 11:30:41AM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> > > +static void noinstr idt_save_pkrs(idtentry_state_t state)
> > 
> > noinstr goes in the same place you would normally put inline, that's
> > before the return type, not after it.
> >
> 
> Sorry about that.  But that does not look to be consistent.
> 
> 10:57:35 > git grep 'noinstr' arch/x86/entry/common.c
> ...
> arch/x86/entry/common.c:idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
> arch/x86/entry/common.c:void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
> arch/x86/entry/common.c:void noinstr idtentry_enter_user(struct pt_regs *regs)
> arch/x86/entry/common.c:void noinstr idtentry_exit_user(struct pt_regs *regs)
> ...
> 
> Are the above 'wrong'?  Is it worth me sending a patch?

I think I already fixed all those, please check tip/master.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-17 10:06   ` Peter Zijlstra
@ 2020-07-22  5:27     ` Ira Weiny
  2020-07-22  9:48       ` Peter Zijlstra
  2020-07-23 20:08       ` Thomas Gleixner
  0 siblings, 2 replies; 73+ messages in thread
From: Ira Weiny @ 2020-07-22  5:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> > First I'm not sure if adding this state to idtentry_state and having
> > that state copied is the right way to go.  It seems like we should start
> > passing this by reference instead of value.  But for now this works as
> > an RFC.  Comments?
> 
> As long as you keep sizeof(struct idtentry_state_t) <= sizeof(u64) or
> possibly 2*sizeof(unsigned long), code gen shouldn't be too horrid IIRC.
> You'll have to look at what the compiler makes of it.
> 
> > Second, I'm not 100% happy with having to save the reference count in
> > the exception handler.  It seems like a very ugly layering violation but
> > I don't see a way around it at the moment.
> 
> So I've been struggling with that API, all the way from
> pks_update_protection() to that dev_access_{en,dis}able(). I _really_
> hate it, but I see how you ended up with it.
> 
> I wanted to propose something like:
> 
> u32 current_pkey_save(int pkey, unsigned flags)
> {
> 	u32 *lpkr = get_cpu_ptr(&local_pkr);
> 	u32 pkr, saved = *lpkr;
> 
> 	pkr = update_pkey_reg(saved, pkey, flags);
> 	if (pkr != saved)
> 		wrpkr(pkr);
> 
> 	put_cpu_ptr(&local_pkr);
> 	return saved;
> }
> 
> void current_pkey_restore(u32 pkr)
> {
> 	u32 *lpkr = get_cpu_ptr(&local_pkr);
> 	if (*lpkr != pkr)
> 		wrpkr(pkr);
> 	put_cpu_ptr(&local_pkr);
> }
> 
> Together with:
> 
> void pkey_switch(struct task_struct *prev, struct task_struct *next)
> {
> 	prev->pkr = this_cpu_read(local_pkr);
> 	if (prev->pkr != next->pkr)
> 		wrpkr(next->pkr);
> }
> 
> But that's actually hard to frob into the kmap() model :-( The upside is
> that you only have 1 word of state, instead of the 2 you have now.

I'm still mulling over if what you have really helps us.  It seems like we may
be able to remove the percpu pkrs_cache variable...  But I'm hesitant to do so.

Regardless, that is not the big issue right now...

> 
> > Third, this patch has gone through a couple of revisions as I've had
> > crashes which just don't make sense to me.  One particular issue I've
> > had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
> > The code path was a pmem copy and the ref count should have been
> > elevated due to dev_access_enable() but why was
> > idtentry_enter()->idt_save_pkrs() not called I don't know.
> 
> Because MCEs are NMI-like and don't go through the normal interrupt
> path. MCEs are an abomination, please wear all the protective devices
> you can lay hands on when delving into that.

I've been really digging into this today and I'm very concerned that I'm
completely missing something WRT idtentry_enter() and idtentry_exit().

I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
with trace_printk()'s.

With this debug code, I have found an instance where it seems like
idtentry_enter() is called without a corresponding idtentry_exit().  This has
left the thread ref counter at 0 which results in very bad things happening
when __dev_access_disable() is called and the ref count goes negative.

Effectively this seems to be happening:

...
	// ref == 0
	dev_access_enable()  // ref += 1 ==> disable protection
		// exception  (which one I don't know)
			idtentry_enter()
				// ref = 0
				_handler() // or whatever code...
			// *_exit() not called [at least there is no trace_printk() output]...
			// Regardless of trace output, the ref is left at 0
	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
	(Bad stuff is bound to happen now...)
...

The ref count ends up completely messed up after this sequence...  and the PKRS
register gets out of sync as well.  This is starting to make some sense of how
I was getting what seemed like random crashes before.

Unfortunately I don't understand the idt entry/exit code well enough to see
clearly what is going on.  Is there any reason idtentry_exit() is not called
after idtentry_enter()?  Perhaps some NMI/MCE or 'not normal' exception code
path I'm not seeing?  In my searches I see a corresponding exit for every
enter.  But MCE is pretty hard to follow.

Also is there any chance that the process could be getting scheduled and that
is causing an issue?

Thanks,
Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-22  5:27     ` Ira Weiny
@ 2020-07-22  9:48       ` Peter Zijlstra
  2020-07-22 21:24         ` Ira Weiny
  2020-07-23 20:08       ` Thomas Gleixner
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2020-07-22  9:48 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Tue, Jul 21, 2020 at 10:27:09PM -0700, Ira Weiny wrote:

> I've been really digging into this today and I'm very concerned that I'm
> completely missing something WRT idtentry_enter() and idtentry_exit().
> 
> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> with trace_printk()'s.
> 
> With this debug code, I have found an instance where it seems like
> idtentry_enter() is called without a corresponding idtentry_exit().  This has
> left the thread ref counter at 0 which results in very bad things happening
> when __dev_access_disable() is called and the ref count goes negative.
> 
> Effectively this seems to be happening:
> 
> ...
> 	// ref == 0
> 	dev_access_enable()  // ref += 1 ==> disable protection
> 		// exception  (which one I don't know)
> 			idtentry_enter()
> 				// ref = 0
> 				_handler() // or whatever code...
> 			// *_exit() not called [at least there is no trace_printk() output]...
> 			// Regardless of trace output, the ref is left at 0
> 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> 	(Bad stuff is bound to happen now...)
> ...
> 
> The ref count ends up completely messed up after this sequence...  and the PKRS
> register gets out of sync as well.  This is starting to make some sense of how
> I was getting what seemed like random crashes before.
> 
> Unfortunately I don't understand the idt entry/exit code well enough to see
> clearly what is going on.  Is there any reason idtentry_exit() is not called
> after idtentry_enter()?  Perhaps some NMI/MCE or 'not normal' exception code
> path I'm not seeing?  In my searches I see a corresponding exit for every
> enter.  But MCE is pretty hard to follow.
> 
> Also is there any chance that the process could be getting scheduled and that
> is causing an issue?

Ooh, I think I see the problem. We also use idtentry_enter() for #PF,
and #PF can schedule, exactly as you observed. Argh!

This then means you need to go frob something in
arch/x86/include/asm/idtentry.h, which is somewhat unfortunate.

Thomas, would it make sense to add a type parameter to
idtentry_{enter,exit}() ?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-17  7:20 ` [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
                     ` (2 preceding siblings ...)
  2020-07-17 10:06   ` Peter Zijlstra
@ 2020-07-22 16:21   ` Andy Lutomirski
  2020-07-23 16:18     ` Fenghua Yu
  2020-07-23 19:53   ` Thomas Gleixner
  4 siblings, 1 reply; 73+ messages in thread
From: Andy Lutomirski @ 2020-07-22 16:21 UTC (permalink / raw)
  To: Weiny Ira
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, X86 ML, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, open list:DOCUMENTATION, LKML,
	linux-nvdimm, Linux FS Devel, Linux-MM,
	open list:KERNEL SELFTEST FRAMEWORK

On Fri, Jul 17, 2020 at 12:21 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> The PKRS MSR is not managed by XSAVE.  It is already preserved through a
> context switch but this support leaves exception handling code open to
> memory accesses which the interrupted process has allowed.
>
> Close this hole by preserve the current task's PKRS MSR, reset the PKRS
> MSR value on exception entry, and then restore the state on exception
> exit.

Should this live in pt_regs?

--Andy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-22  9:48       ` Peter Zijlstra
@ 2020-07-22 21:24         ` Ira Weiny
  0 siblings, 0 replies; 73+ messages in thread
From: Ira Weiny @ 2020-07-22 21:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Wed, Jul 22, 2020 at 11:48:53AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 21, 2020 at 10:27:09PM -0700, Ira Weiny wrote:
> 
> > I've been really digging into this today and I'm very concerned that I'm
> > completely missing something WRT idtentry_enter() and idtentry_exit().
> > 
> > I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> > with trace_printk()'s.
> > 
> > With this debug code, I have found an instance where it seems like
> > idtentry_enter() is called without a corresponding idtentry_exit().  This has
> > left the thread ref counter at 0 which results in very bad things happening
> > when __dev_access_disable() is called and the ref count goes negative.
> > 
> > Effectively this seems to be happening:
> > 
> > ...
> > 	// ref == 0
> > 	dev_access_enable()  // ref += 1 ==> disable protection
> > 		// exception  (which one I don't know)
> > 			idtentry_enter()
> > 				// ref = 0
> > 				_handler() // or whatever code...
> > 			// *_exit() not called [at least there is no trace_printk() output]...
> > 			// Regardless of trace output, the ref is left at 0
> > 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> > 	(Bad stuff is bound to happen now...)
> > ...
> > 
> > The ref count ends up completely messed up after this sequence...  and the PKRS
> > register gets out of sync as well.  This is starting to make some sense of how
> > I was getting what seemed like random crashes before.
> > 
> > Unfortunately I don't understand the idt entry/exit code well enough to see
> > clearly what is going on.  Is there any reason idtentry_exit() is not called
> > after idtentry_enter()?  Perhaps some NMI/MCE or 'not normal' exception code
> > path I'm not seeing?  In my searches I see a corresponding exit for every
> > enter.  But MCE is pretty hard to follow.
> > 
> > Also is there any chance that the process could be getting scheduled and that
> > is causing an issue?
> 
> Ooh, I think I see the problem. We also use idtentry_enter() for #PF,
> and #PF can schedule, exactly as you observed. Argh!

I dug more and I don't see this?  I threw a WARN_ON in the idt_save_pkrs().

It's showing most of these are coming from asm_sysvec_acpi_timer_interrupt().[1]

I don't see a call to schedule() between idtentry_enter() and *_exit()...

I do see page faults.  But that is not the first instance where the count gets
messed up.

> 
> This then means you need to go frob something in
> arch/x86/include/asm/idtentry.h, which is somewhat unfortunate.
> 
> Thomas, would it make sense to add a type parameter to
> idtentry_{enter,exit}() ?

Ira

[1] Example trace.

[   30.289514] ------------[ cut here ]------------
[   30.290706] WARNING: CPU: 0 PID: 485 at arch/x86/entry/common.c:596 idt_save_pkrs.isra.0.constprop.0+0x45/0xc0
[   30.293178] Modules linked in: kvm_intel(+) dax_pmem bochs_drm dax_pmem_core snd_hwdep kvm drm_vram_helper nd_pmem(+) snd_seq nd_btt snd_seq_device irqbypass snd_pcm drm_ttm_helper crct10dif_pclmul ttm crc32_pclmul drm_kms_helper snd_timer nd_e820 ghash_clmulni_intel libnvdimm cec snd drm pcspkr joydev soundcore i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter ip_tables crc32c_intel e1000e wmi pinctrl_sunrisepoint pinctrl_intel fuse
[   30.303437] CPU: 0 PID: 485 Comm: systemd-udevd Not tainted 5.8.0-rc5+ #7
[   30.305132] Hardware name: Intel Corporation XXXX/YYYY, BIOS EGSDCRB1.86B.0034.D09.2005061015 05/06/2020
[   30.307774] RIP: 0010:idt_save_pkrs.isra.0.constprop.0+0x45/0xc0
[   30.309281] Code: 85 c0 74 2d 45 8b 88 58 22 00 00 48 8b 35 bb 65 dd 00 ba 51 02 00 00 48 c7 c7 1f 97 a6 a8 65 8b 0d 20 7c 5a 57 e8 cb ba 77 ff <0f> 0b bb 01 00 00 00 65 48 8b 04 25 c0 7b 01 00 8b a8 58 22 00 00
[   30.313879] RSP: 0018:ff5ba73f008e73f0 EFLAGS: 00010046
[   30.315196] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
[   30.316975] RDX: ff24c78be0c17400 RSI: 0000000000000080 RDI: ff24c78be0c01300
[   30.318754] RBP: ff5ba73f008e7430 R08: ff24c78be0e3dd00 R09: ff24c78be0e71000
[   30.320533] R10: ff24c78be0e7110c R11: 0000000000000000 R12: ff5ba73f008e7468
[   30.322312] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   30.324091] FS:  00007efd21acb940(0000) GS:ff24c78be4000000(0000) knlGS:0000000000000000
[   30.326102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   30.327544] CR2: 0000559609d33940 CR3: 0000000833b70003 CR4: 0000000001761ef0
[   30.332593] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   30.337638] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[   30.342632] PKRU: 55555554
[   30.346545] Call Trace:
[   30.350342]  idtentry_enter+0x1a/0xa0
[   30.354389]  sysvec_apic_timer_interrupt+0x12/0xa0
[   30.358681]  ? __dev_access_enable+0x82/0xe0
[   30.362846]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[   30.367225] RIP: 0010:__dev_access_enable+0xad/0xe0
[   30.371517] Code: 8b 88 58 22 00 00 48 8b 35 d8 62 51 01 ba 86 00 00 00 48 c7 c7 22 9a 32 a8 65 8b 0d 1d 79 ce 57 e8 c8 b7 eb ff 48 89 df 57 9d <0f> 1f 44 00 00 5b 5d c3 8b bd 1c 23 00 00 8b 35 0f 8a 3d 01 31 d2
[   30.382320] RSP: 0018:ff5ba73f008e7510 EFLAGS: 00000246
[   30.386760] RAX: 0000000000000005 RBX: 0000000000000246 RCX: 0000000000000000
[   30.391651] RDX: ff24c78be0c17400 RSI: 0000000000000080 RDI: 0000000000000246
[   30.396525] RBP: ff24c78bb4fe4000 R08: ff24c78be0e3dd00 R09: ff24c78be0e71000
[   30.401393] R10: ff24c78be0e710dc R11: ff24c78bdbb81468 R12: ffd1de5920d38040
[   30.406246] R13: 0000000000000008 R14: 0000000000001000 R15: 0000000000000000
[   30.411101]  pmem_do_read+0x1a4/0x230 [nd_pmem]
[   30.415329]  pmem_make_request+0x147/0x2a0 [nd_pmem]
[   30.419678]  generic_make_request+0xeb/0x350
[   30.423846]  submit_bio+0x4b/0x1a0
[   30.427818]  ? bio_add_page+0x62/0x90
[   30.431825]  submit_bh_wbc+0x16a/0x190
[   30.435806]  block_read_full_page+0x344/0x460
[   30.439915]  ? blkdev_direct_IO+0x4a0/0x4a0
[   30.443960]  ? memcg_drain_all_list_lrus+0x1d0/0x1d0
[   30.448170]  do_read_cache_page+0x61e/0x890
[   30.452163]  ? do_read_cache_page+0x2af/0x890
[   30.456177]  read_part_sector+0x32/0xf7

...


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-22 16:21   ` Andy Lutomirski
@ 2020-07-23 16:18     ` Fenghua Yu
  2020-07-23 16:23       ` Dave Hansen
  0 siblings, 1 reply; 73+ messages in thread
From: Fenghua Yu @ 2020-07-23 16:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Weiny Ira, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Dave Hansen, X86 ML, Dan Williams, Vishal Verma,
	Andrew Morton, open list:DOCUMENTATION, LKML, linux-nvdimm,
	Linux FS Devel, Linux-MM, open list:KERNEL SELFTEST FRAMEWORK

On Wed, Jul 22, 2020 at 09:21:43AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 17, 2020 at 12:21 AM <ira.weiny@intel.com> wrote:
> >
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > The PKRS MSR is not managed by XSAVE.  It is already preserved through a
> > context switch but this support leaves exception handling code open to
> > memory accesses which the interrupted process has allowed.
> >
> > Close this hole by preserve the current task's PKRS MSR, reset the PKRS
> > MSR value on exception entry, and then restore the state on exception
> > exit.
> 
> Should this live in pt_regs?

The PKRS MSR has been preserved in thread_info during kernel entry. We
don't need to preserve it in another place (i.e. idtentry_state).

To avoid confusion, I think we need to change the above commit message to:

"Exception handling code is open to memory accesses which the interrupted
process has allowed.

Close this hole by resetting the PKRS MSR value on exception entry and
restoring the state on exception exit. The MSR was preserved in thread_info."

The patch needs to be changed accordingly, I think:

1. No need to define "pks" in struct idtentry_state because the MSR is
   already preserved in thread_info.
2. idt_save_pkrs() could be renamed as idt_reset_pkrs() to reset
   the MSR (no need to save it). "state.pkrs" can be replaced by
   "current->thread_info.pkrs" now.
3. The "pkrs_ref" could be defined in thread_info as well. But I'm not
   sure if it's better than defined in idtentry_state.
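
As a rough sketch of points 1 and 2 above (illustrative only; write_pkrs()
and saved_pkrs are names used in this thread, INIT_PKRS_VALUE is an assumed
constant, not code from the series), the entry/exit hooks could simply reset
to the init value and restore from thread_struct:

static __always_inline void idt_reset_pkrs(void)
{
        /* Reset to the init value (pkey 0, full access) on exception entry. */
        write_pkrs(INIT_PKRS_VALUE);
}

static __always_inline void idt_restore_pkrs(void)
{
        /* Restore the interrupted task's saved value on exception exit. */
        write_pkrs(current->thread.saved_pkrs);
}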

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 16:18     ` Fenghua Yu
@ 2020-07-23 16:23       ` Dave Hansen
  2020-07-23 16:52         ` Fenghua Yu
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Hansen @ 2020-07-23 16:23 UTC (permalink / raw)
  To: Fenghua Yu, Andy Lutomirski
  Cc: Weiny Ira, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Dave Hansen, X86 ML, Dan Williams, Vishal Verma,
	Andrew Morton, open list:DOCUMENTATION, LKML, linux-nvdimm,
	Linux FS Devel, Linux-MM, open list:KERNEL SELFTEST FRAMEWORK

On 7/23/20 9:18 AM, Fenghua Yu wrote:
> The PKRS MSR has been preserved in thread_info during kernel entry. We
> don't need to preserve it in another place (i.e. idtentry_state).

I'm missing how the PKRS MSR gets preserved in thread_info.  Could you
explain the mechanism by which this happens and point to the code
implementing it, please?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 16:23       ` Dave Hansen
@ 2020-07-23 16:52         ` Fenghua Yu
  2020-07-23 17:08           ` Andy Lutomirski
  0 siblings, 1 reply; 73+ messages in thread
From: Fenghua Yu @ 2020-07-23 16:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Weiny Ira, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Dave Hansen, X86 ML,
	Dan Williams, Vishal Verma, Andrew Morton,
	open list:DOCUMENTATION, LKML, linux-nvdimm, Linux FS Devel,
	Linux-MM, open list:KERNEL SELFTEST FRAMEWORK

Hi, Dave,

On Thu, Jul 23, 2020 at 09:23:13AM -0700, Dave Hansen wrote:
> On 7/23/20 9:18 AM, Fenghua Yu wrote:
> > The PKRS MSR has been preserved in thread_info during kernel entry. We
> > don't need to preserve it in another place (i.e. idtentry_state).
> 
> I'm missing how the PKRS MSR gets preserved in thread_info.  Could you
> explain the mechanism by which this happens and point to the code
> implementing it, please?

[Sorry, my mistake: I mean "thread_struct" instead of "thread_info".
Hopefully the typo doesn't change the essential part in my last email.]

The "saved_pkrs" is defined in thread_struct and context switched in
patch 04/17:
https://lore.kernel.org/lkml/20200717072056.73134-5-ira.weiny@intel.com/

Because there is no XSAVE support for the PKRS MSR, we preserve it in
"saved_pkrs" in thread_struct. It's initialized to 0 (init state, no
protection key) in fork() or exec(). It's updated to the right protection
value when a driver calls the updating API. The PKRS MSR is context
switched from "saved_pkrs" when switching to a task (unless the write is
optimized away because the cached MSR value is the same as the saved one).
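
For illustration, that shadowing and restore path might look roughly like
this (a sketch only; the per-CPU variable name, pks_sched_in() and the MSR
constant are assumptions, while write_pkrs() and saved_pkrs follow the
discussion in this thread):

static DEFINE_PER_CPU(u32, pkrs_cache);

void write_pkrs(u32 new_pkrs)
{
        /* Skip the costly WRMSR when the value does not change. */
        if (this_cpu_read(pkrs_cache) != new_pkrs) {
                this_cpu_write(pkrs_cache, new_pkrs);
                wrmsrl(MSR_IA32_PKRS, new_pkrs);
        }
}

static inline void pks_sched_in(struct task_struct *next)
{
        /* Restore the incoming task's saved value on context switch in. */
        write_pkrs(next->thread.saved_pkrs);
}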

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 16:52         ` Fenghua Yu
@ 2020-07-23 17:08           ` Andy Lutomirski
  2020-07-23 17:30             ` Dave Hansen
  2020-07-23 20:22             ` Thomas Gleixner
  0 siblings, 2 replies; 73+ messages in thread
From: Andy Lutomirski @ 2020-07-23 17:08 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Dave Hansen, Andy Lutomirski, Weiny Ira, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Peter Zijlstra, Dave Hansen,
	X86 ML, Dan Williams, Vishal Verma, Andrew Morton,
	open list:DOCUMENTATION, LKML, linux-nvdimm, Linux FS Devel,
	Linux-MM, open list:KERNEL SELFTEST FRAMEWORK


> On Jul 23, 2020, at 9:52 AM, Fenghua Yu <fenghua.yu@intel.com> wrote:
> 
> Hi, Dave,
> 
>> On Thu, Jul 23, 2020 at 09:23:13AM -0700, Dave Hansen wrote:
>>> On 7/23/20 9:18 AM, Fenghua Yu wrote:
>>> The PKRS MSR has been preserved in thread_info during kernel entry. We
>>> don't need to preserve it in another place (i.e. idtentry_state).
>> 
>> I'm missing how the PKRS MSR gets preserved in thread_info.  Could you
>> explain the mechanism by which this happens and point to the code
>> implementing it, please?
> 
> [Sorry, my mistake: I mean "thread_struct" instead of "thread_info".
> Hopefully the typo doesn't change the essential part in my last email.]
> 
> The "saved_pkrs" is defined in thread_struct and context switched in
> patch 04/17:
> https://lore.kernel.org/lkml/20200717072056.73134-5-ira.weiny@intel.com/
> 
> Because there is no XSAVE support the PKRS MSR, we preserve it in
> "saved_pkrs" in thread_struct. It's initialized as 0 (init state, no
> protection key) in fork() or exec(). It's updated to a right protection
> value when a driver calls the updating API. The PKRS MSR is context
> switched by "saved_pkrs" when switching to a task (unless optimized if the
> cached MSR is the same as the saved one).
> 
> 

Suppose some kernel code (a syscall or kernel thread) changes PKRS then
takes a page fault. The page fault handler needs a fresh PKRS. Then the
page fault handler (say a VMA’s .fault handler) changes PKRS.  Then we
get an interrupt. The interrupt *also* needs a fresh PKRS and the page
fault value needs to be saved somewhere.

So we have more than one saved value per thread, and thread_struct isn’t going to solve this problem.

But idtentry_state is also not great for a couple reasons.  Not all entries have idtentry_state, and the unwinder can’t find it for debugging. For that matter, the page fault logic probably wants to know the previous PKRS, so it should either be stashed somewhere findable or it should be explicitly passed around.

My suggestion is to enlarge pt_regs.  The save and restore logic can probably be in C, but pt_regs is the logical place to put a register that is saved and restored across all entries.

Whoever does this work will have the delightful job of figuring out whether BPF thinks that the layout of pt_regs is ABI and, if so, fixing the resulting mess.

The fact the new fields will go at the beginning of pt_regs will make this an entertaining prospect.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 17:08           ` Andy Lutomirski
@ 2020-07-23 17:30             ` Dave Hansen
  2020-07-23 20:23               ` Thomas Gleixner
  2020-07-23 20:22             ` Thomas Gleixner
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Hansen @ 2020-07-23 17:30 UTC (permalink / raw)
  To: Andy Lutomirski, Fenghua Yu
  Cc: Andy Lutomirski, Weiny Ira, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Dave Hansen, X86 ML,
	Dan Williams, Vishal Verma, Andrew Morton,
	open list:DOCUMENTATION, LKML, linux-nvdimm, Linux FS Devel,
	Linux-MM, open list:KERNEL SELFTEST FRAMEWORK

On 7/23/20 10:08 AM, Andy Lutomirski wrote:
> Suppose some kernel code (a syscall or kernel thread) changes PKRS
> then takes a page fault. The page fault handler needs a fresh PKRS.
> Then the page fault handler (say a VMA’s .fault handler) changes
> PKRS.  The we get an interrupt. The interrupt *also* needs a fresh
> PKRS and the page fault value needs to be saved somewhere.
> 
> So we have more than one saved value per thread, and thread_struct
> isn’t going to solve this problem.

Taking a step back...  This is all true only if we decide that we want
protection keys to provide protection during exceptions and interrupts.
 Right now, the code supports nesting:

	kmap(foo);
		kmap(bar);
		kunmap(bar);
	kunmap(foo);

with a reference count.  So, the nested kmap() will see the count
elevated and do nothing.
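
The reference-counted nesting could be sketched roughly as follows
(illustrative only; the counter matches the dev_page_access_ref discussed
in this thread, the pks_mk_*() helpers and PKS_KEY_PMEM are placeholder
names, not the series' API):

static inline void dev_access_enable(void)
{
        /* Only the outermost enable actually drops the protection. */
        if (current->dev_page_access_ref++ == 0)
                pks_mk_readwrite(PKS_KEY_PMEM);
}

static inline void dev_access_disable(void)
{
        /* Only the outermost disable restores the protection. */
        if (--current->dev_page_access_ref == 0)
                pks_mk_noaccess(PKS_KEY_PMEM);
}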

I'm generally OK with this getting merged without extending PKS
protection to interrupts and exceptions.  Ira and Fenghua should
certainly give it a go, but I'd like to see it as an add-on feature and
we can judge the benefits versus complexity separately.

Ira was looking at adding it because it _looked_ simple.  Andy has me
really scared about it now. :)


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-17  7:20 ` [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
                     ` (3 preceding siblings ...)
  2020-07-22 16:21   ` Andy Lutomirski
@ 2020-07-23 19:53   ` Thomas Gleixner
  2020-07-23 22:04     ` Ira Weiny
  4 siblings, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2020-07-23 19:53 UTC (permalink / raw)
  To: ira.weiny, Ingo Molnar, Borislav Petkov, Andy Lutomirski, Peter Zijlstra
  Cc: Ira Weiny, Dave Hansen, x86, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

Ira,

ira.weiny@intel.com writes:

> ...
> 	// ref == 0
> 	dev_access_enable()  // ref += 1 ==> disable protection
> 		irq()
> 			// enable protection
> 			// ref = 0
> 			_handler()
> 				dev_access_enable()   // ref += 1 ==> disable protection
> 				dev_access_disable()  // ref -= 1 ==> enable protection
> 			// WARN_ON(ref != 0)
> 			// disable protection
> 	do_pmem_thing()  // all good here
> 	dev_access_disable() // ref -= 1 ==> 0 ==> enable protection

...

> First I'm not sure if adding this state to idtentry_state and having
> that state copied is the right way to go.

Adding the state to idtentry_state is fine at least for most interrupts
and exceptions. Emphasis on most.

#PF does not work because #PF can schedule.

> It seems like we should start passing this by reference instead of
> value.  But for now this works as an RFC.  Comments?

Works as in compiles, right?

static void noinstr idt_save_pkrs(idtentry_state_t state)
{
        state.foo = 1;
}

How is that supposed to change the caller state? C programming basics.

> Second, I'm not 100% happy with having to save the reference count in
> the exception handler.  It seems like a very ugly layering violation but
> I don't see a way around it at the moment.

That state is strictly per task, right? So why do you want to store it
anywhere else than in task/thread storage? That solves your problem of
#PF scheduling nicely.

> Third, this patch has gone through a couple of revisions as I've had
> crashes which just don't make sense to me.  One particular issue I've
> had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
> The code path was a pmem copy and the ref count should have been
> elevated due to dev_access_enable() but why was
> idtentry_enter()->idt_save_pkrs() not called I don't know.

Because #MC does not go through idtentry_enter(). Neither do #NMI, #DB, #BP.

> Finally, it looks like the entry/exit code is being refactored into
> common code.  So likely this is best handled somewhat there.  Because
> this can be predicated on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS and handled
> in a generic fashion.  But that is a ways off I think.

The invocation of save/restore might be placed in generic code at least
for the common exception and interrupt entries.

> +static void noinstr idt_save_pkrs(idtentry_state_t state)

*state. See above.

> +#else
> +/* Define as macros to prevent conflict of inline and noinstr */
> +#define idt_save_pkrs(state)
> +#define idt_restore_pkrs(state)

Empty inlines do not need noinstr because they are optimized out. If you
want inlines in a noinstr section then use __always_inline

>  /**
>   * idtentry_enter - Handle state tracking on ordinary idtentries
>   * @regs:	Pointer to pt_regs of interrupted context
> @@ -604,6 +671,8 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
>  		return ret;
>  	}
>  
> +	idt_save_pkrs(ret);

No. This really has no business to be called before the state is
established. It's not something horribly urgent and write_pkrs() is NOT
noinstr and invokes wrmsrl() which is subject to tracing.

> +
> +	idt_restore_pkrs(state);

This one is placed correctly.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-22  5:27     ` Ira Weiny
  2020-07-22  9:48       ` Peter Zijlstra
@ 2020-07-23 20:08       ` Thomas Gleixner
  2020-07-23 20:15         ` Thomas Gleixner
  1 sibling, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2020-07-23 20:08 UTC (permalink / raw)
  To: Ira Weiny, Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, Dave Hansen, x86,
	Dan Williams, Vishal Verma, Andrew Morton, Fenghua Yu, linux-doc,
	linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

Ira Weiny <ira.weiny@intel.com> writes:
> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> I've been really digging into this today and I'm very concerned that I'm
> completely missing something WRT idtentry_enter() and idtentry_exit().
>
> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> with trace_printk()'s.
>
> With this debug code, I have found an instance where it seems like
> idtentry_enter() is called without a corresponding idtentry_exit().  This has
> left the thread ref counter at 0 which results in very bad things happening
> when __dev_access_disable() is called and the ref count goes negative.
>
> Effectively this seems to be happening:
>
> ...
> 	// ref == 0
> 	dev_access_enable()  // ref += 1 ==> disable protection
> 		// exception  (which one I don't know)
> 			idtentry_enter()
> 				// ref = 0
> 				_handler() // or whatever code...
> 			// *_exit() not called [at least there is no trace_printk() output]...
> 			// Regardless of trace output, the ref is left at 0
> 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> 	(Bad stuff is bound to happen now...)

Well, if any exception which calls idtentry_enter() would return without
going through idtentry_exit() then lots of bad stuff would happen even
without your patches.

> Also is there any chance that the process could be getting scheduled and that
> is causing an issue?

Only from #PF, but after the fault has been resolved and the task is
scheduled in again then the task returns through idtentry_exit() to the
place where it took the fault. That's not guaranteed to be on the same
CPU. If schedule is not aware of the fact that the exception turned off
stuff then you surely get into trouble. So you really want to store it
in the task itself then the context switch code can actually see the
state and act accordingly.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 20:08       ` Thomas Gleixner
@ 2020-07-23 20:15         ` Thomas Gleixner
  2020-07-24 17:23           ` Ira Weiny
  0 siblings, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2020-07-23 20:15 UTC (permalink / raw)
  To: Ira Weiny, Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, Dave Hansen, x86,
	Dan Williams, Vishal Verma, Andrew Morton, Fenghua Yu, linux-doc,
	linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest

Thomas Gleixner <tglx@linutronix.de> writes:

> Ira Weiny <ira.weiny@intel.com> writes:
>> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
>>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
>> I've been really digging into this today and I'm very concerned that I'm
>> completely missing something WRT idtentry_enter() and idtentry_exit().
>>
>> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
>> with trace_printk()'s.
>>
>> With this debug code, I have found an instance where it seems like
>> idtentry_enter() is called without a corresponding idtentry_exit().  This has
>> left the thread ref counter at 0 which results in very bad things happening
>> when __dev_access_disable() is called and the ref count goes negative.
>>
>> Effectively this seems to be happening:
>>
>> ...
>> 	// ref == 0
>> 	dev_access_enable()  // ref += 1 ==> disable protection
>> 		// exception  (which one I don't know)
>> 			idtentry_enter()
>> 				// ref = 0
>> 				_handler() // or whatever code...
>> 			// *_exit() not called [at least there is no trace_printk() output]...
>> 			// Regardless of trace output, the ref is left at 0
>> 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
>> 	(Bad stuff is bound to happen now...)
>
> Well, if any exception which calls idtentry_enter() would return without
> going through idtentry_exit() then lots of bad stuff would happen even
> without your patches.
>
>> Also is there any chance that the process could be getting scheduled and that
>> is causing an issue?
>
> Only from #PF, but after the fault has been resolved and the tasks is
> scheduled in again then the task returns through idtentry_exit() to the
> place where it took the fault. That's not guaranteed to be on the same
> CPU. If schedule is not aware of the fact that the exception turned off
> stuff then you surely get into trouble. So you really want to store it
> in the task itself then the context switch code can actually see the
> state and act accordingly.

Actually that's nasty as well, as you need a stack of PKRS values to
handle nested exceptions. But it might still be the most reasonable
thing to do. 7 PKRS values plus an index should be really sufficient,
that's 32 bytes total, not that bad.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 17:08           ` Andy Lutomirski
  2020-07-23 17:30             ` Dave Hansen
@ 2020-07-23 20:22             ` Thomas Gleixner
  2020-07-23 21:30               ` Andy Lutomirski
  1 sibling, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2020-07-23 20:22 UTC (permalink / raw)
  To: Andy Lutomirski, Fenghua Yu
  Cc: Dave Hansen, Andy Lutomirski, Weiny Ira, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Dave Hansen, X86 ML,
	Dan Williams, Vishal Verma, Andrew Morton,
	open list:DOCUMENTATION, LKML, linux-nvdimm, Linux FS Devel,
	Linux-MM, open list:KERNEL SELFTEST FRAMEWORK

Andy Lutomirski <luto@amacapital.net> writes:

> Suppose some kernel code (a syscall or kernel thread) changes PKRS
> then takes a page fault. The page fault handler needs a fresh
> PKRS. Then the page fault handler (say a VMA’s .fault handler) changes
> PKRS.  The we get an interrupt. The interrupt *also* needs a fresh
> PKRS and the page fault value needs to be saved somewhere.
>
> So we have more than one saved value per thread, and thread_struct
> isn’t going to solve this problem.

A stack of 7 entries and an index needs 32 bytes total which is a
reasonable amount and solves the problem including scheduling from #PF
nicely. Make it 15 and it's still only 64 bytes.
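
For illustration only, such a per-task stack could be as small as the
following (sizes follow the 7-entry example above; the helper names and
write_pkrs() usage are assumptions, not code from the series):

struct pkrs_stack {
        u32     val[7];
        u32     idx;            /* 7 * 4 bytes + 4 bytes index = 32 bytes */
};

static inline void pkrs_push(struct pkrs_stack *s, u32 cur, u32 fresh)
{
        /* Save the interrupted value on entry, then load a fresh one. */
        if (!WARN_ON_ONCE(s->idx >= ARRAY_SIZE(s->val)))
                s->val[s->idx++] = cur;
        write_pkrs(fresh);
}

static inline void pkrs_pop(struct pkrs_stack *s)
{
        /* Restore the value saved at entry on exception exit. */
        if (!WARN_ON_ONCE(s->idx == 0))
                write_pkrs(s->val[--s->idx]);
}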

> But idtentry_state is also not great for a couple reasons.  Not all
> entries have idtentry_state, and the unwinder can’t find it for
> debugging. For that matter, the page fault logic probably wants to
> know the previous PKRS, so it should either be stashed somewhere
> findable or it should be explicitly passed around.
>
> My suggestion is to enlarge pt_regs.  The save and restore logic can
> probably be in C, but pt_regs is the logical place to put a register
> that is saved and restored across all entries.

Kinda, but that still sucks because schedule from #PF will get it wrong
unless you do extra nasties.

> Whoever does this work will have the delightful job of figuring out
> whether BPF thinks that the layout of pt_regs is ABI and, if so,
> fixing the resulting mess.
>
> The fact the new fields will go at the beginning of pt_regs will make
> this an entertaining prospect.

Good luck with all of that.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 17:30             ` Dave Hansen
@ 2020-07-23 20:23               ` Thomas Gleixner
  0 siblings, 0 replies; 73+ messages in thread
From: Thomas Gleixner @ 2020-07-23 20:23 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Fenghua Yu
  Cc: Andy Lutomirski, Weiny Ira, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Dave Hansen, X86 ML, Dan Williams, Vishal Verma,
	Andrew Morton, open list:DOCUMENTATION, LKML, linux-nvdimm,
	Linux FS Devel, Linux-MM, open list:KERNEL SELFTEST FRAMEWORK

Dave Hansen <dave.hansen@intel.com> writes:

> On 7/23/20 10:08 AM, Andy Lutomirski wrote:
>> Suppose some kernel code (a syscall or kernel thread) changes PKRS
>> then takes a page fault. The page fault handler needs a fresh PKRS.
>> Then the page fault handler (say a VMA’s .fault handler) changes
>> PKRS.  The we get an interrupt. The interrupt *also* needs a fresh
>> PKRS and the page fault value needs to be saved somewhere.
>> 
>> So we have more than one saved value per thread, and thread_struct
>> isn’t going to solve this problem.
>
> Taking a step back...  This is all true only if we decide that we want
> protection keys to provide protection during exceptions and interrupts.
>  Right now, the code supports nesting:
>
> 	kmap(foo);
> 		kmap(bar);
> 		kunmap(bar);
> 	kunmap(foo);
>
> with a reference count.  So, the nested kmap() will see the count
> elevated and do nothing.

Hopefully with a big fat warning if the nested map requires a different
key than the outer one.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 20:22             ` Thomas Gleixner
@ 2020-07-23 21:30               ` Andy Lutomirski
  2020-07-23 22:14                 ` Thomas Gleixner
  0 siblings, 1 reply; 73+ messages in thread
From: Andy Lutomirski @ 2020-07-23 21:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, Dave Hansen, Andy Lutomirski, Weiny Ira, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Dave Hansen, X86 ML,
	Dan Williams, Vishal Verma, Andrew Morton,
	open list:DOCUMENTATION, LKML, linux-nvdimm, Linux FS Devel,
	Linux-MM, open list:KERNEL SELFTEST FRAMEWORK

> On Jul 23, 2020, at 1:22 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> Suppose some kernel code (a syscall or kernel thread) changes PKRS
>> then takes a page fault. The page fault handler needs a fresh
>> PKRS. Then the page fault handler (say a VMA’s .fault handler) changes
>> PKRS.  The we get an interrupt. The interrupt *also* needs a fresh
>> PKRS and the page fault value needs to be saved somewhere.
>>
>> So we have more than one saved value per thread, and thread_struct
>> isn’t going to solve this problem.
>
> A stack of 7 entries and an index needs 32bytes total which is a
> reasonable amount and solves the problem including scheduling from #PF
> nicely. Make it 15 and it's still only 64 bytes.
>
>> But idtentry_state is also not great for a couple reasons.  Not all
>> entries have idtentry_state, and the unwinder can’t find it for
>> debugging. For that matter, the page fault logic probably wants to
>> know the previous PKRS, so it should either be stashed somewhere
>> findable or it should be explicitly passed around.
>>
>> My suggestion is to enlarge pt_regs.  The save and restore logic can
>> probably be in C, but pt_regs is the logical place to put a register
>> that is saved and restored across all entries.
>
> Kinda, but that still sucks because schedule from #PF will get it wrong
> unless you do extra nasties.

This seems like we’re reinventing the wheel.  PKRS is not
fundamentally different from, say, RSP.  If we want to save it across
exceptions, we save it on entry and context-switch-out and restore it
on exit and context-switch-in.


>
>> Whoever does this work will have the delightful job of figuring out
>> whether BPF thinks that the layout of pt_regs is ABI and, if so,
>> fixing the resulting mess.
>>
>> The fact the new fields will go at the beginning of pt_regs will make
>> this an entertaining prospect.
>
> Good luck with all of that.

We can always cheat like this:

struct real_pt_regs {
  unsigned long pkrs;
  struct pt_regs regs;
};

and pass a pointer to regs around.  What BPF doesn't know about can't hurt it.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 19:53   ` Thomas Gleixner
@ 2020-07-23 22:04     ` Ira Weiny
  2020-07-23 23:41       ` Thomas Gleixner
  0 siblings, 1 reply; 73+ messages in thread
From: Ira Weiny @ 2020-07-23 22:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, Peter Zijlstra,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
> Ira,
> 
> ira.weiny@intel.com writes:
> 
> > ...
> > 	// ref == 0
> > 	dev_access_enable()  // ref += 1 ==> disable protection
> > 		irq()
> > 			// enable protection
> > 			// ref = 0
> > 			_handler()
> > 				dev_access_enable()   // ref += 1 ==> disable protection
> > 				dev_access_disable()  // ref -= 1 ==> enable protection
> > 			// WARN_ON(ref != 0)
> > 			// disable protection
> > 	do_pmem_thing()  // all good here
> > 	dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
> 
> ...
> 
> > First I'm not sure if adding this state to idtentry_state and having
> > that state copied is the right way to go.
> 
> Adding the state to idtentry_state is fine at least for most interrupts
> and exceptions. Emphasis on most.
> 
> #PF does not work because #PF can schedule.


Merging with your other response:

<quote: https://lore.kernel.org/lkml/87o8o6vvt0.fsf@nanos.tec.linutronix.de/>
Only from #PF, but after the fault has been resolved and the tasks is
scheduled in again then the task returns through idtentry_exit() to the
place where it took the fault. That's not guaranteed to be on the same
CPU. If schedule is not aware of the fact that the exception turned off
stuff then you surely get into trouble. So you really want to store it
in the task itself then the context switch code can actually see the
state and act accordingly.
</quote>

I think, after fixing my code (see below), using idtentry_state could still
work.  If the per-cpu cache and the MSR are updated in idtentry_exit() that
should carry the state to the new cpu, correct?

The exception may have turned off stuff but it is a bug if it did not also turn
it back on.

Ie the irq should not be doing:

	kmap_atomic()
	return;

... without a kunmap_atomic().

That is why I put the WARN_ON_ONCE() to ensure the ref count has been properly
returned to 0 by the exception handler.

FWIW the fixed code is working for my tests...

> 
> > It seems like we should start passing this by reference instead of
> > value.  But for now this works as an RFC.  Comments?
> 
> Works as in compiles, right?
> 
> static void noinstr idt_save_pkrs(idtentry_state_t state)
> {
>         state.foo = 1;
> }
> 
> How is that supposed to change the caller state? C programming basics.

<sigh>  I am so stupid.  I was not looking at this particular case but you are
100% correct...  I can't believe I did not see this.

In the above statement I was only thinking about the extra overhead I was
adding to idtentry_enter() and the callers of it.

"C programming basics" indeed... Once again sorry...

> 
> > Second, I'm not 100% happy with having to save the reference count in
> > the exception handler.  It seems like a very ugly layering violation but
> > I don't see a way around it at the moment.
> 
> That state is strict per task, right? So why do you want to store it
> anywhere else that in task/thread storage. That solves your problem of
> #PF scheduling nicely.

The problem is with the kmap code.  If an exception handler calls kmap_atomic()
[more importantly if they nest kmap_atomic() calls] the kmap code will need to
keep track of that re-entrant call.  With a separate reference counter, the
kmap code would have to know whether it is in irq context or not.

Here I'm attempting to make the task 'current->dev_page_access_ref' counter
serve double duty so that the kmap code is agnostic to the context it is in.

> 
> > Third, this patch has gone through a couple of revisions as I've had
> > crashes which just don't make sense to me.  One particular issue I've
> > had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
> > The code path was a pmem copy and the ref count should have been
> > elevated due to dev_access_enable() but why was
> > idtentry_enter()->idt_save_pkrs() not called I don't know.
> 
> Because #MC does not go through idtentry_enter(). Neither do #NMI, #DB, #BP.

And the above probably would work if I knew how to code in C...  :-/  I'm so
embarrassed.

> 
> > Finally, it looks like the entry/exit code is being refactored into
> > common code.  So likely this is best handled somewhat there.  Because
> > this can be predicated on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS and handled
> > in a generic fashion.  But that is a ways off I think.
> 
> The invocation of save/restore might be placed in generic code at least
> for the common exception and interrupt entries.

I think you are correct.  I'm testing now...

> 
> > +static void noinstr idt_save_pkrs(idtentry_state_t state)
> 
> *state. See above.

Yep...  :-/

> 
> > +#else
> > +/* Define as macros to prevent conflict of inline and noinstr */
> > +#define idt_save_pkrs(state)
> > +#define idt_restore_pkrs(state)
> 
> Empty inlines do not need noinstr because they are optimized out. If you
> want inlines in a noinstr section then use __always_inline

Peter made the same comment so I've changed to __always_inline.

> 
> >  /**
> >   * idtentry_enter - Handle state tracking on ordinary idtentries
> >   * @regs:	Pointer to pt_regs of interrupted context
> > @@ -604,6 +671,8 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
> >  		return ret;
> >  	}
> >  
> > +	idt_save_pkrs(ret);
> 
> No. This really has no business to be called before the state is
> established. It's not something horribly urgent and write_pkrs() is NOT
> noinstr and invokes wrmsrl() which is subject to tracing.

I don't understand.  I'm not calling it within instrumentation_{begin,end}() so
does that mean I can remove the noinstr?  I think I can actually.

Or do you mean call it at the end of idtentry_enter()?  Like this:

@@ -672,8 +672,6 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
                return ret;
        }
 
-       idt_save_pkrs(ret);
-
        /*
         * If this entry hit the idle task invoke rcu_irq_enter() whether
         * RCU is watching or not.
@@ -710,7 +708,7 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
                instrumentation_end();
 
                ret.exit_rcu = true;
-               return ret;
+               goto done;
        }
 
        /*
@@ -725,6 +723,8 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
        trace_hardirqs_off();
        instrumentation_end();
 
+done:
+       idt_save_pkrs(&ret);
        return ret;
 }


> 
> > +
> > +	idt_restore_pkrs(state);
> 
> This one is placed correctly.
> 
> Thanks,

No thank you...  I'm really sorry for wasting everyone's time on this one.

I'm still processing all your responses so forgive me if I've missed something.
And thank you so much for pointing out my mistake.  That seems to have fixed
all the problems I have seen thus far.  But I want to think on if there may be
more issues.

Thank you,
Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 21:30               ` Andy Lutomirski
@ 2020-07-23 22:14                 ` Thomas Gleixner
  0 siblings, 0 replies; 73+ messages in thread
From: Thomas Gleixner @ 2020-07-23 22:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Fenghua Yu, Dave Hansen, Andy Lutomirski, Weiny Ira, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Dave Hansen, X86 ML,
	Dan Williams, Vishal Verma, Andrew Morton,
	open list:DOCUMENTATION, LKML, linux-nvdimm, Linux FS Devel,
	Linux-MM, open list:KERNEL SELFTEST FRAMEWORK

Andy Lutomirski <luto@kernel.org> writes:
>> On Jul 23, 2020, at 1:22 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>> My suggestion is to enlarge pt_regs.  The save and restore logic can
>>> probably be in C, but pt_regs is the logical place to put a register
>>> that is saved and restored across all entries.
>>
>> Kinda, but that still sucks because schedule from #PF will get it wrong
>> unless you do extra nasties.
>
> This seems like we’re reinventing the wheel.  PKRS is not
> fundamentally different from, say, RSP.  If we want to save it across
> exceptions, we save it on entry and context-switch-out and restore it
> on exit and context-switch-in.

It's fundamentally different from RSP because it has state (refcount)
attached, which RSP clearly has not. If you get rid of the state then
yes.

>>> Whoever does this work will have the delightful job of figuring out
>>> whether BPF thinks that the layout of pt_regs is ABI and, if so,
>>> fixing the resulting mess.
>>>
>>> The fact the new fields will go at the beginning of pt_regs will make
>>> this an entertaining prospect.
>>
>> Good luck with all of that.
>
> We can always cheat like this:
>
> struct real_pt_regs {
>   unsigned long pkrs;
>   struct pt_regs regs;
> };
>
> and pass a pointer to regs around.  What BPF doesn't know about can't hurt it.

Yes, but that's the easy part of the problem :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 22:04     ` Ira Weiny
@ 2020-07-23 23:41       ` Thomas Gleixner
  2020-07-24 21:24         ` Thomas Gleixner
  0 siblings, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2020-07-23 23:41 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, Peter Zijlstra,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

Ira,

Ira Weiny <ira.weiny@intel.com> writes:
> On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
> I think, after fixing my code (see below), using idtentry_state could still
> work.  If the per-cpu cache and the MSR is updated in idtentry_exit() that
> should carry the state to the new cpu, correct?

I'm way too tired to think about that now. Will have a look tomorrow
with brain awake.

>> > It seems like we should start passing this by reference instead of
>> > value.  But for now this works as an RFC.  Comments?
>> 
>> Works as in compiles, right?
>> 
>> static void noinstr idt_save_pkrs(idtentry_state_t state)
>> {
>>         state.foo = 1;
>> }
>> 
>> How is that supposed to change the caller state? C programming basics.
>
> <sigh>  I am so stupid.  I was not looking at this particular case but you are
> 100% correct...  I can't believe I did not see this.
>
> In the above statement I was only thinking about the extra overhead I was
> adding to idtentry_enter() and the callers of it.

Fun. That statement immediately caught my attention and made me look at
that function.

> "C programming basics" indeed... Once again sorry...

Don't worry.

One interesting design bug of the human brain is that it tricks you into
seeing what you expect to see no matter how hard you try not to fall for
that. You can spend days staring at the obvious without seeing it. The
saying 'you can't see the forest for the trees' exists for a reason.

Yes, I know it's embarrassing, but that happens and it happens to all of
us no matter how experienced we are. Just search the LKML archives for
'brown paperbag'. You'll find amazing things.

If you show your problem to people who are not involved in that at all
there is a high probability that it immediately snaps for one of
them. But there is no guarantee, just look at this mail thread and the
number of people who did not notice.

Move on and accept the fact that it will happen again :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 20:15         ` Thomas Gleixner
@ 2020-07-24 17:23           ` Ira Weiny
  2020-07-24 17:29             ` Andy Lutomirski
  0 siblings, 1 reply; 73+ messages in thread
From: Ira Weiny @ 2020-07-24 17:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Thu, Jul 23, 2020 at 10:15:17PM +0200, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> 
> > Ira Weiny <ira.weiny@intel.com> writes:
> >> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
> >>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> >> I've been really digging into this today and I'm very concerned that I'm
> >> completely missing something WRT idtentry_enter() and idtentry_exit().
> >>
> >> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> >> with trace_printk()'s.
> >>
> >> With this debug code, I have found an instance where it seems like
> >> idtentry_enter() is called without a corresponding idtentry_exit().  This has
> >> left the thread ref counter at 0 which results in very bad things happening
> >> when __dev_access_disable() is called and the ref count goes negative.
> >>
> >> Effectively this seems to be happening:
> >>
> >> ...
> >> 	// ref == 0
> >> 	dev_access_enable()  // ref += 1 ==> disable protection
> >> 		// exception  (which one I don't know)
> >> 			idtentry_enter()
> >> 				// ref = 0
> >> 				_handler() // or whatever code...
> >> 			// *_exit() not called [at least there is no trace_printk() output]...
> >> 			// Regardless of trace output, the ref is left at 0
> >> 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> >> 	(Bad stuff is bound to happen now...)
> >
> > Well, if any exception which calls idtentry_enter() would return without
> > going through idtentry_exit() then lots of bad stuff would happen even
> > without your patches.
> >
> >> Also is there any chance that the process could be getting scheduled and that
> >> is causing an issue?
> >
> > Only from #PF, but after the fault has been resolved and the tasks is
> > scheduled in again then the task returns through idtentry_exit() to the
> > place where it took the fault. That's not guaranteed to be on the same
> > CPU. If schedule is not aware of the fact that the exception turned off
> > stuff then you surely get into trouble. So you really want to store it
> > in the task itself then the context switch code can actually see the
> > state and act accordingly.
> 
> Actually thats nasty as well as you need a stack of PKRS values to
> handle nested exceptions. But it might be still the most reasonable
> thing to do. 7 PKRS values plus an index should be really sufficient,
> that's 32bytes total, not that bad.

I've thought about this a bit more and unless I'm wrong I think the
idtentry_state provides for that because each nested exception has its own
idtentry_state, doesn't it?

Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-24 17:23           ` Ira Weiny
@ 2020-07-24 17:29             ` Andy Lutomirski
  2020-07-24 19:43               ` Ira Weiny
  0 siblings, 1 reply; 73+ messages in thread
From: Andy Lutomirski @ 2020-07-24 17:29 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, x86, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest


> On Jul 24, 2020, at 10:23 AM, Ira Weiny <ira.weiny@intel.com> wrote:
> 
> On Thu, Jul 23, 2020 at 10:15:17PM +0200, Thomas Gleixner wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>> 
>>> Ira Weiny <ira.weiny@intel.com> writes:
>>>> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
>>>>>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
>>>>> I've been really digging into this today and I'm very concerned that I'm
>>>>> completely missing something WRT idtentry_enter() and idtentry_exit().
>>>>> 
>>>>> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
>>>>> with trace_printk()'s.
>>>>> 
>>>>> With this debug code, I have found an instance where it seems like
>>>>> idtentry_enter() is called without a corresponding idtentry_exit().  This has
>>>>> left the thread ref counter at 0 which results in very bad things happening
>>>>> when __dev_access_disable() is called and the ref count goes negative.
>>>>> 
>>>>> Effectively this seems to be happening:
>>>>> 
>>>>> ...
>>>>>    // ref == 0
>>>>>    dev_access_enable()  // ref += 1 ==> disable protection
>>>>>        // exception  (which one I don't know)
>>>>>            idtentry_enter()
>>>>>                // ref = 0
>>>>>                _handler() // or whatever code...
>>>>>            // *_exit() not called [at least there is no trace_printk() output]...
>>>>>            // Regardless of trace output, the ref is left at 0
>>>>>    dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
>>>>>    (Bad stuff is bound to happen now...)
>>> 
>>> Well, if any exception which calls idtentry_enter() would return without
>>> going through idtentry_exit() then lots of bad stuff would happen even
>>> without your patches.
>>> 
>>>> Also is there any chance that the process could be getting scheduled and that
>>>> is causing an issue?
>>> 
>>> Only from #PF, but after the fault has been resolved and the tasks is
>>> scheduled in again then the task returns through idtentry_exit() to the
>>> place where it took the fault. That's not guaranteed to be on the same
>>> CPU. If schedule is not aware of the fact that the exception turned off
>>> stuff then you surely get into trouble. So you really want to store it
>>> in the task itself then the context switch code can actually see the
>>> state and act accordingly.
>> 
>> Actually thats nasty as well as you need a stack of PKRS values to
>> handle nested exceptions. But it might be still the most reasonable
>> thing to do. 7 PKRS values plus an index should be really sufficient,
>> that's 32bytes total, not that bad.
> 
> I've thought about this a bit more and unless I'm wrong I think the
> idtentry_state provides for that because each nested exception has it's own
> idtentry_state doesn't it?

Only the ones that use idtentry_enter() instead of, say, nmi_enter().

> 
> Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-24 17:29             ` Andy Lutomirski
@ 2020-07-24 19:43               ` Ira Weiny
  0 siblings, 0 replies; 73+ messages in thread
From: Ira Weiny @ 2020-07-24 19:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, x86, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 24, 2020 at 10:29:23AM -0700, Andy Lutomirski wrote:
> 
> > On Jul 24, 2020, at 10:23 AM, Ira Weiny <ira.weiny@intel.com> wrote:
> > 
> > On Thu, Jul 23, 2020 at 10:15:17PM +0200, Thomas Gleixner wrote:
> >> Thomas Gleixner <tglx@linutronix.de> writes:
> >> 
> >>> Ira Weiny <ira.weiny@intel.com> writes:
> >>>> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
> >>>>>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> >>>>> I've been really digging into this today and I'm very concerned that I'm
> >>>>> completely missing something WRT idtentry_enter() and idtentry_exit().
> >>>>> 
> >>>>> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> >>>>> with trace_printk()'s.
> >>>>> 
> >>>>> With this debug code, I have found an instance where it seems like
> >>>>> idtentry_enter() is called without a corresponding idtentry_exit().  This has
> >>>>> left the thread ref counter at 0 which results in very bad things happening
> >>>>> when __dev_access_disable() is called and the ref count goes negative.
> >>>>> 
> >>>>> Effectively this seems to be happening:
> >>>>> 
> >>>>> ...
> >>>>>    // ref == 0
> >>>>>    dev_access_enable()  // ref += 1 ==> disable protection
> >>>>>        // exception  (which one I don't know)
> >>>>>            idtentry_enter()
> >>>>>                // ref = 0
> >>>>>                _handler() // or whatever code...
> >>>>>            // *_exit() not called [at least there is no trace_printk() output]...
> >>>>>            // Regardless of trace output, the ref is left at 0
> >>>>>    dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> >>>>>    (Bad stuff is bound to happen now...)
> >>> 
> >>> Well, if any exception which calls idtentry_enter() would return without
> >>> going through idtentry_exit() then lots of bad stuff would happen even
> >>> without your patches.
> >>> 
> >>>> Also is there any chance that the process could be getting scheduled and that
> >>>> is causing an issue?
> >>> 
> >>> Only from #PF, but after the fault has been resolved and the tasks is
> >>> scheduled in again then the task returns through idtentry_exit() to the
> >>> place where it took the fault. That's not guaranteed to be on the same
> >>> CPU. If schedule is not aware of the fact that the exception turned off
> >>> stuff then you surely get into trouble. So you really want to store it
> >>> in the task itself then the context switch code can actually see the
> >>> state and act accordingly.
> >> 
> >> Actually thats nasty as well as you need a stack of PKRS values to
> >> handle nested exceptions. But it might be still the most reasonable
> >> thing to do. 7 PKRS values plus an index should be really sufficient,
> >> that's 32bytes total, not that bad.
> > 
> > I've thought about this a bit more and unless I'm wrong I think the
> > idtentry_state provides for that because each nested exception has its own
> > idtentry_state, doesn't it?
> 
> Only the ones that use idtentry_enter() instead of, say, nmi_enter().

Oh agreed...

But with this patch we are still better off than just preserving the MSR at
context switch.

I need to update the commit message here to make this clear though.

Thanks,
Ira

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-23 23:41       ` Thomas Gleixner
@ 2020-07-24 21:24         ` Thomas Gleixner
  2020-07-24 21:31           ` Thomas Gleixner
                             ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Thomas Gleixner @ 2020-07-24 21:24 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, Peter Zijlstra,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

Ira,

Thomas Gleixner <tglx@linutronix.de> writes:
> Ira Weiny <ira.weiny@intel.com> writes:
>> On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
>> I think, after fixing my code (see below), using idtentry_state could still
>> work.  If the per-cpu cache and the MSR is updated in idtentry_exit() that
>> should carry the state to the new cpu, correct?
>
> I'm way too tired to think about that now. Will have a look tomorrow
> with brain awake.

Not that I'm way more awake now, but at least I have the feeling that my
brain is not completely useless.

Let me summarize what I understood:

  1) A per CPU cache which shadows the current state of the MSR, i.e. the
     current valid key. You use that to avoid costly MSR writes if the
     key does not change.

  2) On idtentry you store the key on entry in idtentry_state, clear it
     in the MSR and shadow state if necessary and restore it on exit.

  3) On context switch out you save the per CPU cache value in the task
     and on context switch in you restore it from there.

Yes, that works (see below for #2) and sorry for my confusion yesterday
about storing this in task state.

#2 requires handling the exceptions which do not go through
idtentry_enter()/exit() separately, but that's a manageable amount. It's
the ones which use IDTENTRY_RAW or a variant of it.

#BP, #MC, #NMI, #DB, #DF need extra local storage as all the kernel
entries for those use nmi_enter()/exit(). So you can just create
wrappers around those. Something like this:

static __always_inline idtentry_state_t idtentry_nmi_enter(void)
{
     	idtentry_state_t state = {};

        nmi_enter();
        instrumentation_begin();
        state.key = save_and_clear_key();
        instrumentation_end();
}

static __always_inline void idtentry_nmi_exit(idtentry_state_t state)
{
        instrumentation_begin();
        restore_key(state.key);
        instrumentation_end();
        nmi_exit();
}

#UD and #PF use the raw entry variant as well but still invoke
idtentry_enter()/exit(). #PF does not need any work. #UD handles
WARN/BUG without going through idtentry_enter() first, but I don't think
that's an issue unless a non-zero key would prevent writing to the console
device. You surely can figure that out.
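
For completeness, the rough shape of #1 - #3 might look like the below.
This is only a sketch with made up names (shadow variable, default value,
thread field); it is not meant to dictate what your patches call things:

/* #1: per CPU shadow of the last value written to the PKRS MSR */
static DEFINE_PER_CPU(u32, pkrs_cache);

static inline void write_pkrs(u32 pkrs)
{
        /* Skip the costly WRMSR when the key state does not change */
        if (pkrs == this_cpu_read(pkrs_cache))
                return;
        this_cpu_write(pkrs_cache, pkrs);
        wrmsrl(MSR_IA32_PKRS, pkrs);
}

/* #2: stash the current value on entry, reset to the default, restore on exit */
static inline u32 save_and_clear_key(void)
{
        u32 cur = this_cpu_read(pkrs_cache);

        write_pkrs(PKRS_INIT_VALUE);
        return cur;
}

static inline void restore_key(u32 key)
{
        write_pkrs(key);
}

/* #3: context switch, e.g. from __switch_to() */
static inline void pkrs_sched_out(struct task_struct *prev)
{
        prev->thread.saved_pkrs = this_cpu_read(pkrs_cache);
}

static inline void pkrs_sched_in(struct task_struct *next)
{
        write_pkrs(next->thread.saved_pkrs);
}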

Hope that helps.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-24 21:24         ` Thomas Gleixner
@ 2020-07-24 21:31           ` Thomas Gleixner
  2020-07-25  0:09           ` Andy Lutomirski
  2020-07-27 20:59           ` Ira Weiny
  2 siblings, 0 replies; 73+ messages in thread
From: Thomas Gleixner @ 2020-07-24 21:31 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, Peter Zijlstra,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

Thomas Gleixner <tglx@linutronix.de> writes:
> static __always_inline idtentry_state_t idtentry_nmi_enter(void)
> {
>      	idtentry_state_t state = {};
>
>         nmi_enter();
>         instrumentation_begin();
>         state.key = save_and_clear_key();
>         instrumentation_end();

Clearly lacks a

        return state;

But I assume you already spotted it. Otherwise the compiler would have :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support
  2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
                   ` (16 preceding siblings ...)
  2020-07-17  7:20 ` [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
@ 2020-07-24 22:19 ` Kees Cook
  17 siblings, 0 replies; 73+ messages in thread
From: Kees Cook @ 2020-07-24 22:19 UTC (permalink / raw)
  To: ira.weiny
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, x86, Dave Hansen, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm,
	linux-fsdevel, linux-mm, linux-kselftest, Igor Stoppa,
	Nadav Amit

On Fri, Jul 17, 2020 at 12:20:39AM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> This RFC series has been reviewed by Dave Hansen.
> 
> Changes from RFC:
> 	Clean up commit messages based on Peter Zijlstra's and Dave Hansen's
> 		feedback
> 	Fix static branch anti-pattern
> 	New patch:
> 	(memremap: Convert devmap static branch to {inc,dec})
> 		This was the code I used as a model for my static branch which
> 		I believe is wrong now.
> 	New Patch:
> 	(x86/entry: Preserve PKRS MSR through exceptions)
> 		This attempts to preserve the per-logical-processor MSR, and
> 		reference counting during exceptions.  I'd really like feed
> 		back on this because I _think_ it should work but I'm afraid
> 		I'm missing something as my testing has shown a lot of spotty
> 		crashes which don't make sense to me.
> 
> This patch set introduces a new page protection mechanism for supervisor pages,
> Protection Key Supervisor (PKS) and an initial user of them, persistent memory,
> PMEM.
> 
> PKS enables protections on 'domains' of supervisor pages to limit supervisor
> mode access to those pages beyond the normal paging protections.  They work in
> a similar fashion to user space pkeys.  Like User page pkeys (PKU), supervisor
> pkeys are checked in addition to normal paging protections and Access or Writes
> can be disabled via a MSR update without TLB flushes when permissions change.
> A page mapping is assigned to a domain by setting a pkey in the page table
> entry.
> 
> Unlike User pkeys no new instructions are added; rather WRMSR/RDMSR are used to
> update the PKRS register.
> 
> XSAVE is not supported for the PKRS MSR.  To reduce software complexity the
> implementation saves/restores the MSR across context switches but not during
> irqs.  This is a compromise which results is a hardening of unwanted access
> without absolute restriction.
> 
> For consistent behavior with current paging protections, pkey 0 is reserved and
> configured to allow full access via the pkey mechanism, thus preserving the
> default paging protections on mappings with the default pkey value of 0.
> 
> Other keys, (1-15) are allocated by an allocator which prepares us for key
> contention from day one.  Kernel users should be prepared for the allocator to
> fail either because of key exhaustion or due to PKS not being supported on the
> arch and/or CPU instance.
> 
> Protecting against stray writes is particularly important for PMEM because,
> unlike writes to anonymous memory, writes to PMEM persists across a reboot.
> Thus data corruption could result in permanent loss of data.
> 
> The following attributes of PKS makes it perfect as a mechanism to protect PMEM
> from stray access within the kernel:
> 
>    1) Fast switching of permissions
>    2) Prevents access without page table manipulations
>    3) Works on a per thread basis
>    4) No TLB flushes required

Cool! This seems like it'd be very handy to make other types of kernel
data "read-only at rest" (as was long ago proposed via X86_CR0_WP[1],
which only provided two protection levels, not 15). For example, a few
other kinds of areas stand out to me as being in need of PKS markings
(i.e. only things that actually manipulate these areas should gain
temporary PK access):
- Page Tables themselves
- Identity mapping
- The "read-only at rest" stuff, though it'll need special plumbing to
  make it work with the slab allocator, etc (more like the later "static
  allocation" work[2]).

[1] https://lore.kernel.org/lkml/1490811363-93944-1-git-send-email-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/cover.1550097697.git.igor.stoppa@huawei.com/
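
As a rough illustration of what a consumer of such markings might look
like (the names here are hypothetical and may not match the kernel API
this series finally settles on):

static int my_pkey;

static int __init my_area_protect_init(void)
{
        my_pkey = pks_key_alloc("my_area");
        if (my_pkey < 0)                /* key exhaustion or no PKS support */
                return my_pkey;

        /*
         * Mappings pick up the key via their PTEs (something along the
         * lines of PAGE_KERNEL_PKEY(my_pkey)) and then default to no
         * access "at rest".
         */
        pks_mk_noaccess(my_pkey);
        return 0;
}

static void my_area_update(void)
{
        pks_mk_readwrite(my_pkey);      /* briefly open a write window */
        /* ... manipulate the protected area ... */
        pks_mk_noaccess(my_pkey);       /* and close it again */
}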

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-24 21:24         ` Thomas Gleixner
  2020-07-24 21:31           ` Thomas Gleixner
@ 2020-07-25  0:09           ` Andy Lutomirski
  2020-07-27 20:59           ` Ira Weiny
  2 siblings, 0 replies; 73+ messages in thread
From: Andy Lutomirski @ 2020-07-25  0:09 UTC (permalink / raw)
  To: Thomas Gleixner, Josh Poimboeuf
  Cc: Ira Weiny, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Peter Zijlstra, Dave Hansen, X86 ML, Dan Williams, Vishal Verma,
	Andrew Morton, Fenghua Yu, open list:DOCUMENTATION, LKML,
	linux-nvdimm, Linux FS Devel, Linux-MM,
	open list:KERNEL SELFTEST FRAMEWORK

On Fri, Jul 24, 2020 at 2:25 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Ira,
>
> Thomas Gleixner <tglx@linutronix.de> writes:
> > Ira Weiny <ira.weiny@intel.com> writes:
> >> On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
> >> I think, after fixing my code (see below), using idtentry_state could still
> >> work.  If the per-cpu cache and the MSR is updated in idtentry_exit() that
> >> should carry the state to the new cpu, correct?
> >
> > I'm way too tired to think about that now. Will have a look tomorrow
> > with brain awake.
>
> Not that I'm way more awake now, but at least I have the feeling that my
> brain is not completely useless.
>
> Let me summarize what I understood:
>
>   1) A per CPU cache which shadows the current state of the MSR, i.e. the
>      current valid key. You use that to avoid costly MSR writes if the
>      key does not change.
>
>   2) On idtentry you store the key on entry in idtentry_state, clear it
>      in the MSR and shadow state if necessary and restore it on exit.
>
>   3) On context switch out you save the per CPU cache value in the task
>      and on context switch in you restore it from there.
>
> Yes, that works (see below for #2) and sorry for my confusion yesterday
> about storing this in task state.
>
> #2 requires handling the exceptions which do not go through
> idtentry_enter()/exit() separately, but that's a manageable amount. It's
> the ones which use IDTENTRY_RAW or a variant of it.
>
> #BP, #MC, #NMI, #DB, #DF need extra local storage as all the kernel
> entries for those use nmi_enter()/exit(). So you can just create
> wrappers around those. Something like this:
>
> static __always_inline idtentry_state_t idtentry_nmi_enter(void)
> {
>         idtentry_state_t state = {};
>
>         nmi_enter();
>         instrumentation_begin();
>         state.key = save_and_clear_key();
>         instrumentation_end();
> }
>
> static __always_inline void idtentry_nmi_exit(idtentry_state_t state)
> {
>         instrumentation_begin();
>         restore_key(state.key);
>         instrumentation_end();
>         nmi_exit();
> }
>
> #UD and #PF use the raw entry variant as well but still invoke
> idtentry_enter()/exit(). #PF does not need any work. #UD handles
> WARN/BUG without going through idtentry_enter() first, but I don't think
> that's an issue unless a non-zero key would prevent writing to the console
> device. You surely can figure that out.

Putting on my mm maintainer hat for a moment, I really think that we
want oopses to print PKRS along with all the other registers when we
inevitably oops due to a page fault.  And we probably also want it in
the nasty nested case where we get infinite page faults and eventually
double fault.

I'm sure it's *possible* to wire this up if we stick it in
idtentry_state, but it's trivial if we stick it in pt_regs.  I'm okay
with doing the save/restore in C (in fact, I prefer that), but I think
that either the value should be stuck in pt_regs or we should find a
way to teach the unwinder to locate idtentry_state.
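
(Just to sketch what I mean: if the entry code stashed PKRS somewhere the
dumper can get at (pt_regs, ideally), the oops path could print it next to
the other registers. The feature bit and MSR name below are the ones this
series introduces; the hook point is made up:)

static void show_pkrs(void)
{
        u64 pkrs;

        if (!cpu_feature_enabled(X86_FEATURE_PKS))
                return;

        /*
         * Live value only; the value at the time of the fault would come
         * from wherever the entry code stashed it.
         */
        rdmsrl(MSR_IA32_PKRS, pkrs);
        printk(KERN_DEFAULT "PKRS: %016llx\n", pkrs);
}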

And, if we go with idtentry_state, we should make a signature change
to nmi_enter() to also provide idtentry_state or some equivalent
object.

--Andy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions
  2020-07-24 21:24         ` Thomas Gleixner
  2020-07-24 21:31           ` Thomas Gleixner
  2020-07-25  0:09           ` Andy Lutomirski
@ 2020-07-27 20:59           ` Ira Weiny
  2 siblings, 0 replies; 73+ messages in thread
From: Ira Weiny @ 2020-07-27 20:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, Peter Zijlstra,
	Dave Hansen, x86, Dan Williams, Vishal Verma, Andrew Morton,
	Fenghua Yu, linux-doc, linux-kernel, linux-nvdimm, linux-fsdevel,
	linux-mm, linux-kselftest

On Fri, Jul 24, 2020 at 11:24:58PM +0200, Thomas Gleixner wrote:
> Ira,
> 
> Thomas Gleixner <tglx@linutronix.de> writes:
> > Ira Weiny <ira.weiny@intel.com> writes:
> >> On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
> >> I think, after fixing my code (see below), using idtentry_state could still
> >> work.  If the per-cpu cache and the MSR is updated in idtentry_exit() that
> >> should carry the state to the new cpu, correct?
> >
> > I'm way too tired to think about that now. Will have a look tomorrow
> > with brain awake.
> 
> Not that I'm way more awake now, but at least I have the feeling that my
> brain is not completely useless.
> 
> Let me summarize what I understood:
> 
>   1) A per CPU cache which shadows the current state of the MSR, i.e. the
>      current valid key. You use that to avoid costly MSR writes if the
>      key does not change.

Yes

> 
>   2) On idtentry you store the key on entry in idtentry_state, clear it
>      in the MSR and shadow state if necessary and restore it on exit.

Yes, though I've subsequently found a bug here; but yes, that was the intention.
:-D

I also maintain a ref count of the number of nested kmap calls to ensure
that kmap_atomic() is nestable during an exception, independent of the number
of nested calls in the interrupted thread.
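
Roughly like this (the field and key names below are just for illustration,
not necessarily what the patch uses):

static int dev_page_pkey;       /* allocated elsewhere, e.g. at memremap init */

static inline void dev_access_enable(void)
{
        /* Only the 0 -> 1 transition actually drops the protection */
        if (current->dev_page_access_ref++ == 0)
                pks_mk_readwrite(dev_page_pkey);
}

static inline void dev_access_disable(void)
{
        /* Re-protect only when the outermost enable/disable pairs back up */
        if (--current->dev_page_access_ref == 0)
                pks_mk_noaccess(dev_page_pkey);
}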

>   3) On context switch out you save the per CPU cache value in the task
>      and on context switch in you restore it from there.

yes

> 
> Yes, that works (see below for #2) and sorry for my confusion yesterday
> about storing this in task state.

No problem.

> 
> #2 requires handling the exceptions which do not go through
> idtentry_enter()/exit() separately, but that's a manageable amount. It's
> the ones which use IDTENTRY_RAW or a variant of it.
> 
> #BP, #MC, #NMI, #DB, #DF need extra local storage as all the kernel
> entries for those use nmi_enter()/exit(). So you can just create
> wrappers around those. Something like this:
> 
> static __always_inline idtentry_state_t idtentry_nmi_enter(void)
> {
>      	idtentry_state_t state = {};
> 
>         nmi_enter();
>         instrumentation_begin();
>         state.key = save_and_clear_key();
>         instrumentation_end();
> }
> 
> static __always_inline void idtentry_nmi_exit(idtentry_state_t state)
> {
>         instrumentation_begin();
>         restore_key(state.key);
>         instrumentation_end();
>         nmi_exit();
> }
> 

Thanks!

> #UD and #PF use the raw entry variant as well but still invoke
> idtentry_enter()/exit(). #PF does not need any work. #UD handles
> WARN/BUG without going through idtentry_enter() first, but I don't think
> that's an issue unless a non-zero key would prevent writing to the console
> device. You surely can figure that out.
> 
> Hope that helps.

Yes it does, thank you.  I'm also trying to simplify the API per Peter's
comments while refactoring this.

Ira


^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2020-07-27 20:59 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-17  7:20 [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 01/17] x86/pkeys: Create pkeys_internal.h ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support ira.weiny
2020-07-17  8:54   ` Peter Zijlstra
2020-07-17 20:52     ` Ira Weiny
2020-07-20  9:14       ` Peter Zijlstra
2020-07-17 22:36     ` Dave Hansen
2020-07-20  9:13       ` Peter Zijlstra
2020-07-17  7:20 ` [PATCH RFC V2 03/17] x86/pks: Enable Protection Keys Supervisor (PKS) ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 04/17] x86/pks: Preserve the PKRS MSR on context switch ira.weiny
2020-07-17  8:31   ` Peter Zijlstra
2020-07-17 21:39     ` Ira Weiny
2020-07-17  8:59   ` Peter Zijlstra
2020-07-17 22:34     ` Ira Weiny
2020-07-20  9:15       ` Peter Zijlstra
2020-07-20 18:35         ` Ira Weiny
2020-07-17  7:20 ` [PATCH RFC V2 05/17] x86/pks: Add PKS kernel API ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 06/17] x86/pks: Add a debugfs file for allocated PKS keys ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 07/17] Documentation/pkeys: Update documentation for kernel pkeys ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 08/17] x86/pks: Add PKS Test code ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 09/17] memremap: Convert devmap static branch to {inc,dec} ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 10/17] fs/dax: Remove unused size parameter ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 11/17] drivers/dax: Expand lock scope to cover the use of addresses ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 12/17] memremap: Add zone device access protection ira.weiny
2020-07-17  9:10   ` Peter Zijlstra
2020-07-18  5:06     ` Ira Weiny
2020-07-20  9:16       ` Peter Zijlstra
2020-07-17  9:17   ` Peter Zijlstra
2020-07-18  5:51     ` Ira Weiny
2020-07-17  9:20   ` Peter Zijlstra
2020-07-17  7:20 ` [PATCH RFC V2 13/17] kmap: Add stray write protection for device pages ira.weiny
2020-07-17  9:21   ` Peter Zijlstra
2020-07-19  4:13     ` Ira Weiny
2020-07-20  9:17       ` Peter Zijlstra
2020-07-21 16:31         ` Ira Weiny
2020-07-17  7:20 ` [PATCH RFC V2 14/17] dax: Stray write protection for dax_direct_access() ira.weiny
2020-07-17  9:22   ` Peter Zijlstra
2020-07-19  4:41     ` Ira Weiny
2020-07-17  7:20 ` [PATCH RFC V2 15/17] nvdimm/pmem: Stray write protection for pmem->virt_addr ira.weiny
2020-07-17  7:20 ` [PATCH RFC V2 16/17] [dax|pmem]: Enable stray write protection ira.weiny
2020-07-17  9:25   ` Peter Zijlstra
2020-07-17  7:20 ` [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions ira.weiny
2020-07-17  9:30   ` Peter Zijlstra
2020-07-21 18:01     ` Ira Weiny
2020-07-21 19:11       ` Peter Zijlstra
2020-07-17  9:34   ` Peter Zijlstra
2020-07-17 10:06   ` Peter Zijlstra
2020-07-22  5:27     ` Ira Weiny
2020-07-22  9:48       ` Peter Zijlstra
2020-07-22 21:24         ` Ira Weiny
2020-07-23 20:08       ` Thomas Gleixner
2020-07-23 20:15         ` Thomas Gleixner
2020-07-24 17:23           ` Ira Weiny
2020-07-24 17:29             ` Andy Lutomirski
2020-07-24 19:43               ` Ira Weiny
2020-07-22 16:21   ` Andy Lutomirski
2020-07-23 16:18     ` Fenghua Yu
2020-07-23 16:23       ` Dave Hansen
2020-07-23 16:52         ` Fenghua Yu
2020-07-23 17:08           ` Andy Lutomirski
2020-07-23 17:30             ` Dave Hansen
2020-07-23 20:23               ` Thomas Gleixner
2020-07-23 20:22             ` Thomas Gleixner
2020-07-23 21:30               ` Andy Lutomirski
2020-07-23 22:14                 ` Thomas Gleixner
2020-07-23 19:53   ` Thomas Gleixner
2020-07-23 22:04     ` Ira Weiny
2020-07-23 23:41       ` Thomas Gleixner
2020-07-24 21:24         ` Thomas Gleixner
2020-07-24 21:31           ` Thomas Gleixner
2020-07-25  0:09           ` Andy Lutomirski
2020-07-27 20:59           ` Ira Weiny
2020-07-24 22:19 ` [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support Kees Cook
