* [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>


I'm looking for Intel acks on the series prior to submitting to maintainers.
Most of the changes from V8 to V9 were in getting the tests straightened out,
but there are also some improvements in the actual code.


Changes for V9

Review and update all commit messages.
Update cover letter below

PKS Core
	Separate user and supervisor pkey code in the headers
		create linux/pks.h for supervisor calls
		This facilitated making the pmem code more efficient 
	Completely rearchitect the test code
		[After Dave Hansen and Rick Edgecombe found issues in the test
			code it was easier to rearchitect the code completely
			rather than attempt to fix it.]
		Remove pks_test_callback in favor of using fault hooks
			Fault hooks also isolate the fault callbacks from being
			false positives if non-test consumers are running
		Add a PKS_TEST_RUN_ALL Kconfig option which is mutually
			exclusive with any non-test PKS consumer
			PKS_TEST_RUN_ALL takes over all pkey callbacks
		Ensure that each test runs within its own context and cannot
			run while any other test is running.
		Ensure test session and context memory is cleaned up on file
			close
		Use pr_debug() and dynamic debug for in-kernel debug messages
		Enhance test_pks selftest
			Add the ability to run all tests, not just the
				context switch test
			Standardize output [PASS][FAIL][SKIP]
			Add a '-d' option which enables dynamic debug to see
				the kernel debug messages

	Incorporate feedback from Rick Edgecombe
		Update all pkey types to u8
		Fix up test code barriers
	Move patch declaring PKS_INIT_VALUE ahead of the patch which enables
		PKS so that PKS_INIT_VALUE can be used when pks_setup() is
		first created
	From Dan Williams
		Use macros instead of an enum for a pkey allocation scheme
			which is predicated on the config options of consumers
			This almost worked perfectly.  It required a bit of
			tweaking to be able to allocate all of the keys.

	From Dave Hansen
		Reposition some code to be near/similar to user pkeys
			s/pks_write_current/x86_pkrs_load
			s/pks_saved_pkrs/pkrs
		Update Documentation
		s/PKR_{RW,AD,WD}_KEY/PKR_{RW,AD,WD}_MASK
		Consistently use lower case for pkey
		Update commit messages
		Add Acks

PMEM Stray Write
	Following the pks_mk_*() to pks_set_*() function rename
		s/pgmap_mk_*/pgmap_set_*/
		s/dax_mk_*/dax_set_*/
	From Dan Williams
		Avoid adding new dax operations by teaching dax_device about pgmap
		Remove pgmap_protection_flag_invalid() patch (Just let
			kmap'ings fail)


PKS/PMEM Stray write protection
===============================

This series is broken into 2 parts.

	1) Introduce Protection Key Supervisor (PKS), testing, and
	   documentation
	2) Use PKS to protect PMEM from stray writes

Introduce Protection Key Supervisor (PKS)
-----------------------------------------

PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to pages beyond the normal paging protections.  PKS works in a
similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
checked in addition to normal paging protections.  Page mappings are
assigned to a domain by setting a 4-bit pkey in the PTE of that mapping.

Unlike PKU, permissions are changed via an MSR update.  This update avoids TLB
flushes, making it a more efficient way to alter protections than PTE updates.

Also, unlike PTE updates, PKS permission changes apply only to the current
processor.  Therefore a permission change affects only the running thread and
not any other CPU or process.  This allows protections to remain in place on
other CPUs for additional protection and isolation.
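
As an illustration, a consumer which has allocated a pkey might bracket its
writes roughly as follows.  This is only a sketch: it assumes the pks_set_*()
calls introduced later in this series take the pkey as their only argument,
and PKS_KEY_MY_FEATURE is a hypothetical key.

	static void my_feature_write(void *dst, const void *src, size_t len)
	{
		/* Open a write window on this CPU only */
		pks_set_readwrite(PKS_KEY_MY_FEATURE);

		memcpy(dst, src, len);

		/* Close the window; other CPUs never gained access */
		pks_set_noaccess(PKS_KEY_MY_FEATURE);
	}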

Even though PKS updates are thread local, XSAVE is not supported for the PKRS
MSR.  Therefore this implementation saves and restores the MSR across context
switches and during exceptions within software.  Nested exceptions are
supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections because PTEs naturally have a pkey value of 0.

The other keys (1-15) are statically allocated by kernel consumers when
configured.  This is done by adding the appropriate PKS_NEW_KEY and
PKS_DECLARE_INIT_VALUE macros to pks-keys.h.
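
For example, a hypothetical consumer 'MY_FEATURE' would reserve a key and a
write-disabled default value roughly as follows (this mirrors the example in
the documentation added by this series):

	#define PKS_KEY_MY_FEATURE	PKS_NEW_KEY(PKS_KEY_DEFAULT, CONFIG_MY_FEATURE)
	#define PKS_KEY_MAX		PKS_NEW_KEY(PKS_KEY_MY_FEATURE, 1)

	#define PKS_KEY_MY_FEATURE_INIT	PKS_DECLARE_INIT_VALUE(PKS_KEY_MY_FEATURE, \
								WD, CONFIG_MY_FEATURE)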

Two PKS consumers, PKS_TEST and PMEM stray write protection, are included in
this series.  When the number of users grows larger, the sharing of keys will
need to be resolved depending on the needs of the users at that time.  Many
methods have been contemplated but the number of kernel users and use cases
envisioned is still quite small, much less than the 15 available keys.

To summarize, the following are key attributes of PKS.

	1) Fast switching of permissions
		1a) Prevents access without page table manipulations
		1b) No TLB flushes required
	2) Works on a per thread basis, thus allowing protections to be
	   preserved on threads which are not actively accessing data through
	   the mapping.

PKS is available with both 4- and 5-level paging.  For this reason, and to
simplify the implementation, the feature is restricted to x86_64.


Use PKS to protect PMEM from stray writes
-----------------------------------------

DAX leverages the direct-map to enable 'struct page' services for PMEM.  Given
that PMEM capacity may be an order of magnitude larger than System RAM, it
presents a large vulnerability surface to stray writes.  Such a stray write
becomes a silent data corruption bug.

Stray pointers to System RAM may result in a crash or other undesirable
behavior which, while unfortunate, is usually recoverable with a reboot.
Stray writes to PMEM are permanent in nature and thus are more likely to result
in permanent user data loss.  Given that PMEM access from the kernel is limited
to a constrained set of locations (PMEM driver, Filesystem-DAX, direct-I/O, and
any properly kmap'ed page), it is amenable to PKS protection.

Set up an infrastructure for extra device access protection. Then implement the
protection using the new Protection Keys Supervisor (PKS) on architectures
which support it.

Because PMEM pages are all associated with a struct dev_pagemap, and because
flags in struct page are a scarce resource, the flag indicating protection is
stored in struct dev_pagemap.  All PMEM is protected by the same pkey, so a
single flag in each dev_pagemap is all that is needed.
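
A minimal sketch of the resulting check (PGMAP_PROTECTION and
devmap_protected() are introduced in the patches below; the struct page
linkage is simplified here):

	static bool devmap_protected(struct page *page)
	{
		struct dev_pagemap *pgmap = page->pgmap;

		return pgmap && (pgmap->flags & PGMAP_PROTECTION);
	}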

General access in the kernel is supported by modifying the kmap infrastructure
to detect if a page is PKS protected and enable access until the corresponding
unmap is called.
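
From the caller's perspective nothing changes; a thread-local access remains a
normal kmap_local_page() sequence, with the PKS permission change handled
inside the mapping calls (sketch):

	void *addr = kmap_local_page(page);	/* enables access if protected */

	memcpy(addr, src, len);
	kunmap_local(addr);			/* disables access again */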

Because PKS is a thread local mechanism and because kmap was never really
intended to create a long term mapping, this implementation does not support
the kmap()/kunmap() calls.  Calling kmap() on a PMEM protected page is allowed
but accessing that mapping will cause a fault.

Originally this series modified many of the kmap call sites to indicate they
were thread local.[1]  An attempt to support kmap()[2] was also made.  But now
that kmap_local_page() has been developed[3] and is in more widespread use,
kmap() can safely be left unsupported.

How the fault is handled is configurable via a new module parameter
memremap.pks_fault_mode.  Two modes are supported.

	'relaxed' (default) -- WARN_ONCE, disable the protection and allow
	                       access

	'strict' -- prevent any unguarded access to a protected dev_pagemap
		    range
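
For example, the stricter mode can be requested on the kernel command line (or
at module load time) with:

	memremap.pks_fault_mode=strict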

This 'safety valve' has already been useful in the development of this
series.


[1] https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/

[2] https://lore.kernel.org/lkml/87mtycqcjf.fsf@nanos.tec.linutronix.de/

[3] https://lore.kernel.org/lkml/20210128061503.1496847-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210210062221.3023586-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210205170030.856723-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210217024826.3466046-1-ira.weiny@intel.com/


----------------------------------------------------------------------------
Changes for V8

Feedback from Thomas
	* clean up noinstr mess
	* Fix static PKEY allocation mess
	* Ensure all functions are consistently named.
	* Split up patches to do 1 thing per patch
	* pkey_update_pkval() implementation
	* Streamline the use of pks_write_pkrs() by not disabling preemption
		- Leave this to the callers who require it.
		- Use documentation and lockdep to prevent errors
	* Clean up commit messages to explain in detail _why_ each patch is
		there.

Feedback from Dave H.
	* Leave out pks_mk_readonly() as it is not used by the PMEM use case

Feedback from Peter Anvin
	* Replace pks_abandon_pkey() with pks_update_exception()
		This is an even greater simplification in that it no longer
		attempts to shield users from faults.  The main use case for
		abandoning a key was to allow a system to continue running
		even with an error; this should be a rare event, so
		performance should not be an issue.

* Simplify ARCH_ENABLE_SUPERVISOR_PKEYS

* Update PKS Test code
	- Add default value test
	- Split up the test code into patches which follow each feature
	  addition
	- Simplify test code processing
	- Ensure consistent reporting of errors

* Ensure all entry points to the PKS code are protected by
	cpu_feature_enabled(X86_FEATURE_PKS)
	- At the same time make sure non-entry points or sub-functions to the
	  PKS code are not _unnecessarily_ protected by the feature check

* Update documentation
	- Use kernel docs to place the docs with the code for easier internal
	  developer use

* Adjust the PMEM use cases for the core changes

* Split the PMEM patches up to be 1 change per patch and help clarify review

* Review all header files and remove those no longer needed

* Review/update/clarify all commit messages

Fenghua Yu (1):
mm/pkeys: Define PKS page table macros

Ira Weiny (43):
entry: Create an internal irqentry_exit_cond_resched() call
Documentation/protection-keys: Clean up documentation for User Space
pkeys
x86/pkeys: Clarify PKRU_AD_KEY macro
x86/pkeys: Make PKRU macros generic
x86/fpu: Refactor arch_set_user_pkey_access()
mm/pkeys: Add Kconfig options for PKS
x86/pkeys: Add PKS CPU feature bit
x86/fault: Adjust WARN_ON for pkey fault
Documentation/pkeys: Add initial PKS documentation
mm/pkeys: Provide for PKS key allocation
x86/pkeys: Enable PKS on cpus which support it
mm/pkeys: PKS testing, add initial test code
x86/selftests: Add test_pks
x86/pkeys: Introduce pks_write_pkrs()
x86/pkeys: Preserve the PKS MSR on context switch
mm/pkeys: Introduce pks_set_readwrite()
mm/pkeys: Introduce pks_set_noaccess()
mm/pkeys: PKS testing, add a fault call back
mm/pkeys: PKS testing, add pks_set_*() tests
mm/pkeys: PKS testing, test context switching
x86/entry: Add auxiliary pt_regs space
entry: Split up irqentry_exit_cond_resched()
entry: Add calls for save/restore auxiliary pt_regs
x86/entry: Define arch_{save|restore}_auxiliary_pt_regs()
x86/pkeys: Preserve PKRS MSR across exceptions
x86/fault: Print PKS MSR on fault
mm/pkeys: PKS testing, Add exception test
mm/pkeys: Introduce pks_update_exception()
mm/pkeys: PKS testing, test pks_update_exception()
mm/pkeys: PKS testing, add test for all keys
mm/pkeys: Add pks_available()
memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION
memremap_pages: Introduce pgmap_protection_available()
memremap_pages: Introduce a PGMAP_PROTECTION flag
memremap_pages: Introduce devmap_protected()
memremap_pages: Reserve a PKS pkey for eventual use by PMEM
memremap_pages: Set PKS pkey in PTEs if requested
memremap_pages: Define pgmap_set_{readwrite|noaccess}() calls
memremap_pages: Add memremap.pks_fault_mode
kmap: Make kmap work for devmap protected pages
dax: Stray access protection for dax_direct_access()
nvdimm/pmem: Enable stray access protection
devdax: Enable stray access protection

Rick Edgecombe (1):
mm/pkeys: Introduce PKS fault callbacks

 .../admin-guide/kernel-parameters.txt       |  12 +
 Documentation/core-api/protection-keys.rst  | 130 ++-
 arch/x86/Kconfig                            |   6 +
 arch/x86/entry/calling.h                    |  20 +
 arch/x86/entry/common.c                     |   2 +-
 arch/x86/entry/entry_64.S                   |  22 +
 arch/x86/entry/entry_64_compat.S            |   6 +
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/disabled-features.h    |   8 +-
 arch/x86/include/asm/entry-common.h         |  15 +
 arch/x86/include/asm/msr-index.h            |   1 +
 arch/x86/include/asm/pgtable_types.h        |  22 +
 arch/x86/include/asm/pkeys.h                |   2 +
 arch/x86/include/asm/pkeys_common.h         |  18 +
 arch/x86/include/asm/pkru.h                 |  20 +-
 arch/x86/include/asm/pks.h                  |  46 ++
 arch/x86/include/asm/processor.h            |  15 +-
 arch/x86/include/asm/ptrace.h               |  21 +
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/asm-offsets_64.c            |  15 +
 arch/x86/kernel/cpu/common.c                |   2 +
 arch/x86/kernel/dumpstack.c                 |  32 +-
 arch/x86/kernel/fpu/xstate.c                |  22 +-
 arch/x86/kernel/head_64.S                   |   6 +
 arch/x86/kernel/process_64.c                |   3 +
 arch/x86/mm/fault.c                         |  17 +-
 arch/x86/mm/pkeys.c                         | 320 +++++++-
 drivers/dax/device.c                        |   2 +
 drivers/dax/super.c                         |  59 ++
 drivers/md/dm-writecache.c                  |   8 +-
 drivers/nvdimm/pmem.c                       |  26 +
 fs/dax.c                                    |   8 +
 fs/fuse/virtio_fs.c                         |   2 +
 include/linux/dax.h                         |   5 +
 include/linux/entry-common.h                |  15 +-
 include/linux/highmem-internal.h            |   4 +
 include/linux/memremap.h                    |   1 +
 include/linux/mm.h                          |  72 ++
 include/linux/pgtable.h                     |   4 +
 include/linux/pks-keys.h                    |  92 +++
 include/linux/pks.h                         |  73 ++
 include/linux/sched.h                       |   7 +
 include/uapi/asm-generic/mman-common.h      |   1 +
 init/init_task.c                            |   3 +
 kernel/entry/common.c                       |  44 +-
 kernel/sched/core.c                         |  40 +-
 lib/Kconfig.debug                           |  33 +
 lib/Makefile                                |   3 +
 lib/pks/Makefile                            |   3 +
 lib/pks/pks_test.c                          | 755 ++++++++++++++++++
 mm/Kconfig                                  |  32 +
 mm/memremap.c                               | 132 +++
 tools/testing/selftests/x86/Makefile        |   2 +-
 tools/testing/selftests/x86/test_pks.c      | 514 ++++++++++++
 54 files changed, 2617 insertions(+), 109 deletions(-)
create mode 100644 arch/x86/include/asm/pkeys_common.h
create mode 100644 arch/x86/include/asm/pks.h
create mode 100644 include/linux/pks-keys.h
create mode 100644 include/linux/pks.h
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c
create mode 100644 tools/testing/selftests/x86/test_pks.c

--
2.35.1


* [PATCH V9 01/45] entry: Create an internal irqentry_exit_cond_resched() call
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The static call to irqentry_exit_cond_resched() was not properly being
overridden when called from xen_pv_evtchn_do_upcall().

Define __irqentry_exit_cond_resched() as the static call and place the
override logic in irqentry_exit_cond_resched().

Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Update the commit message a bit

Because this was found via code inspection and does not actually fix an
observed bug, I've not added a Fixes tag.

But for reference:
Fixes: 40607ee97e4e ("preempt/dynamic: Provide irqentry_exit_cond_resched() static call")
---
 include/linux/entry-common.h |  5 ++++-
 kernel/entry/common.c        | 23 +++++++++++++--------
 kernel/sched/core.c          | 40 ++++++++++++++++++------------------
 3 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2e2b8d6140ed..ddaffc983e62 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -455,10 +455,13 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
  * Conditional reschedule with additional sanity checks.
  */
 void irqentry_exit_cond_resched(void);
+
+void __irqentry_exit_cond_resched(void);
 #ifdef CONFIG_PREEMPT_DYNAMIC
-DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+DECLARE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 #endif
 
+
 /**
  * irqentry_exit - Handle return from exception that used irqentry_enter()
  * @regs:	Pointer to pt_regs (exception entry regs)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bad713684c2e..490442a48332 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -380,7 +380,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	return ret;
 }
 
-void irqentry_exit_cond_resched(void)
+void __irqentry_exit_cond_resched(void)
 {
 	if (!preempt_count()) {
 		/* Sanity check RCU and thread stack */
@@ -392,9 +392,20 @@ void irqentry_exit_cond_resched(void)
 	}
 }
 #ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+DEFINE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 #endif
 
+void irqentry_exit_cond_resched(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPTION)) {
+#ifdef CONFIG_PREEMPT_DYNAMIC
+		static_call(__irqentry_exit_cond_resched)();
+#else
+		__irqentry_exit_cond_resched();
+#endif
+	}
+}
+
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
 	lockdep_assert_irqs_disabled();
@@ -420,13 +431,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		}
 
 		instrumentation_begin();
-		if (IS_ENABLED(CONFIG_PREEMPTION)) {
-#ifdef CONFIG_PREEMPT_DYNAMIC
-			static_call(irqentry_exit_cond_resched)();
-#else
-			irqentry_exit_cond_resched();
-#endif
-		}
+		irqentry_exit_cond_resched();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9745613d531c..f56db4bd9730 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6571,29 +6571,29 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
  * SC:might_resched
  * SC:preempt_schedule
  * SC:preempt_schedule_notrace
- * SC:irqentry_exit_cond_resched
+ * SC:__irqentry_exit_cond_resched
  *
  *
  * NONE:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- RET0
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
+ *   cond_resched                 <- __cond_resched
+ *   might_resched                <- RET0
+ *   preempt_schedule             <- NOP
+ *   preempt_schedule_notrace     <- NOP
+ *   __irqentry_exit_cond_resched <- NOP
  *
  * VOLUNTARY:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- __cond_resched
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
+ *   cond_resched                 <- __cond_resched
+ *   might_resched                <- __cond_resched
+ *   preempt_schedule             <- NOP
+ *   preempt_schedule_notrace     <- NOP
+ *   __irqentry_exit_cond_resched <- NOP
  *
  * FULL:
- *   cond_resched               <- RET0
- *   might_resched              <- RET0
- *   preempt_schedule           <- preempt_schedule
- *   preempt_schedule_notrace   <- preempt_schedule_notrace
- *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ *   cond_resched                 <- RET0
+ *   might_resched                <- RET0
+ *   preempt_schedule             <- preempt_schedule
+ *   preempt_schedule_notrace     <- preempt_schedule_notrace
+ *   __irqentry_exit_cond_resched <- __irqentry_exit_cond_resched
  */
 
 enum {
@@ -6629,7 +6629,7 @@ void sched_dynamic_update(int mode)
 	static_call_update(might_resched, __cond_resched);
 	static_call_update(preempt_schedule, __preempt_schedule_func);
 	static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-	static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+	static_call_update(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 
 	switch (mode) {
 	case preempt_dynamic_none:
@@ -6637,7 +6637,7 @@ void sched_dynamic_update(int mode)
 		static_call_update(might_resched, (void *)&__static_call_return0);
 		static_call_update(preempt_schedule, NULL);
 		static_call_update(preempt_schedule_notrace, NULL);
-		static_call_update(irqentry_exit_cond_resched, NULL);
+		static_call_update(__irqentry_exit_cond_resched, NULL);
 		pr_info("Dynamic Preempt: none\n");
 		break;
 
@@ -6646,7 +6646,7 @@ void sched_dynamic_update(int mode)
 		static_call_update(might_resched, __cond_resched);
 		static_call_update(preempt_schedule, NULL);
 		static_call_update(preempt_schedule_notrace, NULL);
-		static_call_update(irqentry_exit_cond_resched, NULL);
+		static_call_update(__irqentry_exit_cond_resched, NULL);
 		pr_info("Dynamic Preempt: voluntary\n");
 		break;
 
@@ -6655,7 +6655,7 @@ void sched_dynamic_update(int mode)
 		static_call_update(might_resched, (void *)&__static_call_return0);
 		static_call_update(preempt_schedule, __preempt_schedule_func);
 		static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-		static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+		static_call_update(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 		pr_info("Dynamic Preempt: full\n");
 		break;
 	}
-- 
2.35.1


* [PATCH V9 02/45] Documentation/protection-keys: Clean up documentation for User Space pkeys
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The documentation for user space pkeys was a bit dated, including things
such as Amazon and distribution testing information which are irrelevant
now.

Update the documentation.  This also streamlines adding the Supervisor
pkey documentation later on.

Cc: "Moger, Babu" <Babu.Moger@amd.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9:
	use pkey
	Change information on which CPUs have PKU
---
 Documentation/core-api/protection-keys.rst | 44 +++++++++++-----------
 1 file changed, 21 insertions(+), 23 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..bf28ac0401f3 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,31 +4,29 @@
 Memory Protection Keys
 ======================
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables when an
+application changes protection domains.
+
+Pkeys Userspace (PKU) is a feature which can be found on:
+        * Intel server CPUs, Skylake and later
+        * Intel client CPUs, Tiger Lake (11th Gen Core) and later
+        * Future AMD CPUs
+
+Pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.
+
+Protections for each key are defined with a per-CPU user-accessible register
+(PKRU).  Each of these is a 32-bit register storing two bits (Access Disable
+and Write Disable) for each of 16 keys.
+
+Being a CPU register, PKRU is inherently thread-local, potentially giving each
 thread a different set of protections from every other thread.
 
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register.  The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs.  These
-permissions are enforced on data access only and have no effect on
-instruction fetches.
+There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
+register.  The feature is only available in 64-bit mode, even though there is
+theoretically space in the PAE PTEs.  These permissions are enforced on data
+access only and have no effect on instruction fetches.
 
 Syscalls
 ========
-- 
2.35.1


* [PATCH V9 03/45] x86/pkeys: Clarify PKRU_AD_KEY macro
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

When changing the PKRU_AD_KEY macro to be used for PKS the name came
into question.[1]

The intent of PKRU_AD_KEY() is to set an initial value for the PKRU
register, but the macro itself only produces a mask value.

Clarify this by changing the name to PKRU_AD_MASK().

NOTE the checkpatch errors are ignored for the init_pkru_value to align
the values in the code.

[1] https://lore.kernel.org/lkml/eff862e2-bfaa-9e12-42b5-a12467d72a22@intel.com/

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	New Patch
---
 arch/x86/mm/pkeys.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index e44e938885b7..7418c367e328 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -110,7 +110,7 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey
 	return vma_pkey(vma);
 }
 
-#define PKRU_AD_KEY(pkey)	(PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY))
+#define PKRU_AD_MASK(pkey)	(PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY))
 
 /*
  * Make the default PKRU value (at execve() time) as restrictive
@@ -118,11 +118,14 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey
  * in the process's lifetime will not accidentally get access
  * to data which is pkey-protected later on.
  */
-u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) |
-		      PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) |
-		      PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) |
-		      PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) |
-		      PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15);
+u32 init_pkru_value = PKRU_AD_MASK( 1) | PKRU_AD_MASK( 2) |
+		      PKRU_AD_MASK( 3) | PKRU_AD_MASK( 4) |
+		      PKRU_AD_MASK( 5) | PKRU_AD_MASK( 6) |
+		      PKRU_AD_MASK( 7) | PKRU_AD_MASK( 8) |
+		      PKRU_AD_MASK( 9) | PKRU_AD_MASK(10) |
+		      PKRU_AD_MASK(11) | PKRU_AD_MASK(12) |
+		      PKRU_AD_MASK(13) | PKRU_AD_MASK(14) |
+		      PKRU_AD_MASK(15);
 
 static ssize_t init_pkru_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
-- 
2.35.1


* [PATCH V9 04/45] x86/pkeys: Make PKRU macros generic
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions and can share common defines.  Specifically PKS and PKU
each have:

	1. A single control register
	2. The same number of keys
	3. The same number of bits in the register per key
	4. Access and Write disable in the same bit locations

Given the above, share all the macros that synthesize and manipulate
register values between the two features.  Share these defines by moving
them into a new header, change their names to reflect the common use,
and include the header where needed.  This mostly takes the form of
converting names from the PKU-specific "PKRU" to a user/supervisor
agnostic "PKR".

Also while editing the code remove the use of 'we' from comments being
touched.

NOTE the checkpatch errors are ignored for the init_pkru_value to
align the values in the code.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9:
	From Dave Hansen
		Add detail to commit message
		Add Ack
		s/PKR_AD_KEY/PKR_AD_MASK/

Changes from v7:
	Rebased onto latest
---
 arch/x86/include/asm/pkeys_common.h | 11 +++++++++++
 arch/x86/include/asm/pkru.h         | 20 ++++++++------------
 arch/x86/kernel/fpu/xstate.c        | 10 +++++-----
 arch/x86/mm/pkeys.c                 | 20 +++++++++-----------
 4 files changed, 33 insertions(+), 28 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h

diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
new file mode 100644
index 000000000000..359b94cdcc0c
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_COMMON_H
+#define _ASM_X86_PKEYS_COMMON_H
+
+#define PKR_AD_BIT 0x1u
+#define PKR_WD_BIT 0x2u
+#define PKR_BITS_PER_PKEY 2
+
+#define PKR_AD_MASK(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_COMMON_H */
diff --git a/arch/x86/include/asm/pkru.h b/arch/x86/include/asm/pkru.h
index 74f0a2d34ffd..06980dd42946 100644
--- a/arch/x86/include/asm/pkru.h
+++ b/arch/x86/include/asm/pkru.h
@@ -3,10 +3,7 @@
 #define _ASM_X86_PKRU_H
 
 #include <asm/cpufeature.h>
-
-#define PKRU_AD_BIT 0x1u
-#define PKRU_WD_BIT 0x2u
-#define PKRU_BITS_PER_PKEY 2
+#include <asm/pkeys_common.h>
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 extern u32 init_pkru_value;
@@ -18,18 +15,17 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+	return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-	/*
-	 * Access-disable disables writes too so we need to check
-	 * both bits here.
-	 */
-	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+	/* Access-disable disables writes too so check both bits here. */
+	return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
 }
 
 static inline u32 read_pkru(void)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 7c7824ae7862..d090867c9de3 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1089,19 +1089,19 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	if (WARN_ON_ONCE(pkey >= arch_max_pkey()))
 		return -EINVAL;
 
-	/* Set the bits we need in PKRU:  */
+	/* Set the bits needed in PKRU:  */
 	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKRU_AD_BIT;
+		new_pkru_bits |= PKR_AD_BIT;
 	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKRU_WD_BIT;
+		new_pkru_bits |= PKR_WD_BIT;
 
 	/* Shift the bits in to the correct place in PKRU for pkey: */
-	pkey_shift = pkey * PKRU_BITS_PER_PKEY;
+	pkey_shift = pkey * PKR_BITS_PER_PKEY;
 	new_pkru_bits <<= pkey_shift;
 
 	/* Get old PKRU and mask off any old bits in place: */
 	old_pkru = read_pkru();
-	old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
 
 	/* Write old part along with new part: */
 	write_pkru(old_pkru | new_pkru_bits);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 7418c367e328..e1527b4619e1 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -110,22 +110,20 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey
 	return vma_pkey(vma);
 }
 
-#define PKRU_AD_MASK(pkey)	(PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY))
-
 /*
  * Make the default PKRU value (at execve() time) as restrictive
  * as possible.  This ensures that any threads clone()'d early
  * in the process's lifetime will not accidentally get access
  * to data which is pkey-protected later on.
  */
-u32 init_pkru_value = PKRU_AD_MASK( 1) | PKRU_AD_MASK( 2) |
-		      PKRU_AD_MASK( 3) | PKRU_AD_MASK( 4) |
-		      PKRU_AD_MASK( 5) | PKRU_AD_MASK( 6) |
-		      PKRU_AD_MASK( 7) | PKRU_AD_MASK( 8) |
-		      PKRU_AD_MASK( 9) | PKRU_AD_MASK(10) |
-		      PKRU_AD_MASK(11) | PKRU_AD_MASK(12) |
-		      PKRU_AD_MASK(13) | PKRU_AD_MASK(14) |
-		      PKRU_AD_MASK(15);
+u32 init_pkru_value = PKR_AD_MASK( 1) | PKR_AD_MASK( 2) |
+		      PKR_AD_MASK( 3) | PKR_AD_MASK( 4) |
+		      PKR_AD_MASK( 5) | PKR_AD_MASK( 6) |
+		      PKR_AD_MASK( 7) | PKR_AD_MASK( 8) |
+		      PKR_AD_MASK( 9) | PKR_AD_MASK(10) |
+		      PKR_AD_MASK(11) | PKR_AD_MASK(12) |
+		      PKR_AD_MASK(13) | PKR_AD_MASK(14) |
+		      PKR_AD_MASK(15);
 
 static ssize_t init_pkru_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
@@ -158,7 +156,7 @@ static ssize_t init_pkru_write_file(struct file *file,
 	 * up immediately if someone attempts to disable access
 	 * or writes to pkey 0.
 	 */
-	if (new_init_pkru & (PKRU_AD_BIT|PKRU_WD_BIT))
+	if (new_init_pkru & (PKR_AD_BIT|PKR_WD_BIT))
 		return -EINVAL;
 
 	WRITE_ONCE(init_pkru_value, new_init_pkru);
-- 
2.35.1


* [PATCH V9 05/45] x86/fpu: Refactor arch_set_user_pkey_access()
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Both PKU and PKS update their register values in the same way.  They can
therefore share the update code.

Define a helper, pkey_update_pkval(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.

pkey_update_pkval() contributed by Thomas

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Update for V8:
	From Rick Edgecombe
		Change pkey type to u8
	Replace the code Peter provided in update_pkey_reg() for
	Thomas' pkey_update_pkval()
		-- https://lore.kernel.org/lkml/20200717085442.GX10769@hirez.programming.kicks-ass.net/
---
 arch/x86/include/asm/pkeys.h |  2 ++
 arch/x86/kernel/fpu/xstate.c | 22 ++++------------------
 arch/x86/mm/pkeys.c          | 16 ++++++++++++++++
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 1d5f14aff5f6..26616cbe19e2 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -131,4 +131,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 	return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d090867c9de3..c8a8dadd9f87 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1071,8 +1071,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 			      unsigned long init_val)
 {
-	u32 old_pkru, new_pkru_bits = 0;
-	int pkey_shift;
+	u32 pkru;
 
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
@@ -1089,22 +1088,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	if (WARN_ON_ONCE(pkey >= arch_max_pkey()))
 		return -EINVAL;
 
-	/* Set the bits needed in PKRU:  */
-	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKR_AD_BIT;
-	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKR_WD_BIT;
-
-	/* Shift the bits in to the correct place in PKRU for pkey: */
-	pkey_shift = pkey * PKR_BITS_PER_PKEY;
-	new_pkru_bits <<= pkey_shift;
-
-	/* Get old PKRU and mask off any old bits in place: */
-	old_pkru = read_pkru();
-	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
-	/* Write old part along with new part: */
-	write_pkru(old_pkru | new_pkru_bits);
+	pkru = read_pkru();
+	pkru = pkey_update_pkval(pkru, pkey, init_val);
+	write_pkru(pkru);
 
 	return 0;
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index e1527b4619e1..7c90b2188c5f 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -193,3 +193,19 @@ static __init int setup_init_pkru(char *opt)
 	return 1;
 }
 __setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Kernel users use the same flags as user space:
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ */
+u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits)
+{
+	int shift = pkey * PKR_BITS_PER_PKEY;
+
+	if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
+		accessbits &= PKEY_ACCESS_MASK;
+
+	pkval &= ~(PKEY_ACCESS_MASK << shift);
+	return pkval | accessbits << shift;
+}
-- 
2.35.1


* [PATCH V9 06/45] mm/pkeys: Add Kconfig options for PKS
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Consumers wishing to implement additional protections on memory pages
can use PKS.  However, PKS is only available on some architectures.

For this reason PKS code, both in the core and in the consumers, is dead
code without PKS being both available and used.

Add Kconfig options to allow for the elimination of unneeded code by
detecting architecture PKS support (ARCH_HAS_SUPERVISOR_PKEYS) and
requiring an indication of consumer need (ARCH_ENABLE_SUPERVISOR_PKEYS).

In this patch ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first
kernel consumer sets it.

Cc: "Moger, Babu" <Babu.Moger@amd.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Dave Hansen
		Don't exclude AMD, cpu supported bits will properly turn
		the feature off.
		Clarify commit message
		Depend on CPU_SUP_INTEL

Changes for V8
	Split this out to a single change patch
---
 arch/x86/Kconfig | 1 +
 mm/Kconfig       | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9f5bd41bf660..459948622a73 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1868,6 +1868,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 	depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select ARCH_HAS_PKEYS
+	select ARCH_HAS_SUPERVISOR_PKEYS
 	help
 	  Memory Protection Keys provides a mechanism for enforcing
 	  page-based protections, but without requiring modification of the
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..46f2bb15aa4e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -804,6 +804,10 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config ARCH_HAS_SUPERVISOR_PKEYS
+	bool
+config ARCH_ENABLE_SUPERVISOR_PKEYS
+	bool
 
 config PERCPU_STATS
 	bool "Collect percpu memory statistics"
-- 
2.35.1


* [PATCH V9 07/45] x86/pkeys: Add PKS CPU feature bit
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Memory Protection Keys (pkeys) provides a mechanism for enforcing
page-based protections, but without requiring modification of the page
tables when an application changes protection domains.

The supervisor support for memory protection keys is referred to as
PKS (Protection Keys Supervisor).

Add the defines for the CPU support bit and the boilerplate disable
infrastructure predicated on the new ARCH_ENABLE_SUPERVISOR_PKEYS
Kconfig option.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Dave Hansen
		New commit message

Changes for V8
	Split this out into its own patch
---
 arch/x86/include/asm/cpufeatures.h       | 1 +
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 65d147974f8d..cb529b824a96 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -370,6 +370,7 @@
 #define X86_FEATURE_MOVDIR64B		(16*32+28) /* MOVDIR64B instruction */
 #define X86_FEATURE_ENQCMD		(16*32+29) /* ENQCMD and ENQCMDS instructions */
 #define X86_FEATURE_SGX_LC		(16*32+30) /* Software Guard Extensions Launch Control */
+#define X86_FEATURE_PKS			(16*32+31) /* Protection Keys for Supervisor pages */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV	(17*32+ 0) /* MCA overflow recovery support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8f28fafa98b3..66fdad8f3941 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,6 +44,12 @@
 # define DISABLE_OSPKE		(1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+# define DISABLE_PKS		0
+#else
+# define DISABLE_PKS		(1<<(X86_FEATURE_PKS & 31))
+#endif
+
 #ifdef CONFIG_X86_5LEVEL
 # define DISABLE_LA57	0
 #else
@@ -85,7 +91,7 @@
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD)
+			 DISABLE_ENQCMD|DISABLE_PKS)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK19	0
-- 
2.35.1


* [PATCH V9 08/45] x86/fault: Adjust WARN_ON for pkey fault
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Previously if a protection key fault occurred on a kernel address it
indicated something wrong because user page mappings are not supposed to
be in the kernel address space.

With the addition of PKS, pkey faults may now happen on kernel mappings.

If PKS is enabled, avoid the warning in the fault path.  Simplify the
comment.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	From Dave Hansen
		Clarify the comment and commit message
---
 arch/x86/mm/fault.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d0074c6ed31a..5599109d1124 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1148,11 +1148,11 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		   unsigned long address)
 {
 	/*
-	 * Protection keys exceptions only happen on user pages.  We
-	 * have no user pages in the kernel portion of the address
-	 * space, so do not expect them here.
+	 * PF_PK faults should only occur on kernel
+	 * addresses when supervisor pkeys are enabled.
 	 */
-	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
+	WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) &&
+		     (hw_error_code & X86_PF_PK));
 
 #ifdef CONFIG_X86_32
 	/*
-- 
2.35.1


* [PATCH V9 09/45] Documentation/pkeys: Add initial PKS documentation
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Add initial overview and configuration information about PKS.

Cc: "Moger, Babu" <Babu.Moger@amd.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Feedback from Dave Hansen
		Remove overview and move relevant text to the main pkey
		overview which covers both user and kernel keys.
		Add an example of using Kconfig
		Move MSR details to later patches
---
 Documentation/core-api/protection-keys.rst | 43 ++++++++++++++++++++--
 1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index bf28ac0401f3..13eedb0119e1 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -13,6 +13,11 @@ Pkeys Userspace (PKU) is a feature which can be found on:
         * Intel client CPUs, Tiger Lake (11th Gen Core) and later
         * Future AMD CPUs
 
+Protection Keys Supervisor (PKS) is a feature which can be found on:
+        * Sapphire Rapids (and later) "Scalable Processor" Server CPUs
+        * Future non-server Intel parts.
+        * qemu: https://www.qemu.org/2021/04/30/qemu-6-0-0/
+
 Pkeys work by dedicating 4 previously Reserved bits in each page table entry to
 a "protection key", giving 16 possible keys.
 
@@ -23,13 +28,20 @@ and Write Disable) for each of 16 keys.
 Being a CPU register, PKRU is inherently thread-local, potentially giving each
 thread a different set of protections from every other thread.
 
-There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
-register.  The feature is only available in 64-bit mode, even though there is
+For Userspace (PKU), there are two instructions (RDPKRU/WRPKRU) for reading and
+writing to the register.
+
+For Supervisor (PKS), the register (MSR_IA32_PKRS) is accessible only to the
+kernel through rdmsr and wrmsr.
+
+The feature is only available in 64-bit mode, even though there is
 theoretically space in the PAE PTEs.  These permissions are enforced on data
 access only and have no effect on instruction fetches.
 
-Syscalls
-========
+
+
+Syscalls for user space keys
+============================
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -96,3 +108,26 @@ with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==========================
+
+Kconfig
+-------
+
+Kernel users intending to use PKS support should depend on
+ARCH_HAS_SUPERVISOR_PKEYS, and select ARCH_ENABLE_SUPERVISOR_PKEYS to turn on
+this support within the core.  For example:
+
+.. code-block:: c
+
+        config MY_NEW_FEATURE
+                depends on ARCH_HAS_SUPERVISOR_PKEYS
+                select ARCH_ENABLE_SUPERVISOR_PKEYS
+
+This will make "MY_NEW_FEATURE" unavailable unless the architecture sets
+ARCH_HAS_SUPERVISOR_PKEYS.  It also makes it possible for multiple independent
+features to "select ARCH_ENABLE_SUPERVISOR_PKEYS".  If no features enable PKS
+by selecting ARCH_ENABLE_SUPERVISOR_PKEYS, PKS support will not be compiled
+into the kernel.
-- 
2.35.1


* [PATCH V9 10/45] mm/pkeys: Provide for PKS key allocation
From: ira.weiny @ 2022-03-10 17:19 UTC
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Kernel consumers of PKS need a way to allocate a PKS pkey and assign the
initial permissions for that key.  It is desirable to not allocate keys
for consumers which are not configured.

Introduce a macro to allocate keys sequentially based on which consumers
are configured.  In addition define a macro to set the proper permission
bits based on the actual pkey value allocated.

pks-keys.h is added as a new header with minimal header dependencies.
This allows the use of PKS_INIT_VALUE within other headers where the
additional includes from other pkey headers caused major conflicts.  The
main conflict was using PKS_INIT_VALUE for INIT_THREAD in
asm/processor.h.

Add documentation.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Reword the commit message
	Move this patch ahead of the enable patch so that the enable
		patch can use PKS_INIT_VALUE
	From Dan Williams
		Use Dan's macro magic
			enhanced it to account for the max number of
			keys
		Update documentation for the change
	From Dave Hansen
		use pkey
		s/PKR_RW_KEY/PKR_RW_MASK

Changes for V8
	Create pks-keys.h to solve header conflicts in subsequent
		patches.
	Remove create_initial_pkrs_value() which did not work
		Replace it with PKS_INIT_VALUE
		Fix up documentation to match
	s/PKR_RW_BIT/PKR_RW_KEY()/
	s/PKRS_INIT_VALUE/PKS_INIT_VALUE
	Split this off of the previous patch
	Update documentation and embed it in the code to help ensure it
	is kept up to date.

Changes for V7
	Create a dynamic pkrs_initial_value in early init code.
	Clean up comments
	Add comment to macro guard
---
 Documentation/core-api/protection-keys.rst |  5 ++
 arch/x86/include/asm/pkeys_common.h        |  9 ++-
 include/linux/pks-keys.h                   | 78 ++++++++++++++++++++++
 3 files changed, 91 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/pks-keys.h

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 13eedb0119e1..d501bd27ee29 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -131,3 +131,8 @@ ARCH_HAS_SUPERVISOR_PKEYS.  It also makes it possible for multiple independent
 features to "select ARCH_ENABLE_SUPERVISOR_PKEYS".  If no features enable PKS
 by selecting ARCH_ENABLE_SUPERVISOR_PKEYS, PKS support will not be compiled
 into the kernel.
+
+PKS Key Allocation
+------------------
+.. kernel-doc:: include/linux/pks-keys.h
+        :doc: PKS_KEY_ALLOCATION
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 359b94cdcc0c..b28a72dea22b 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -2,10 +2,17 @@
 #ifndef _ASM_X86_PKEYS_COMMON_H
 #define _ASM_X86_PKEYS_COMMON_H
 
+#define PKS_NUM_PKEYS 16
+#define PKS_ALL_AD (0x55555555UL)
+
 #define PKR_AD_BIT 0x1u
 #define PKR_WD_BIT 0x2u
 #define PKR_BITS_PER_PKEY 2
 
-#define PKR_AD_MASK(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+#define PKR_PKEY_SHIFT(pkey)	((pkey) * PKR_BITS_PER_PKEY)
+
+#define PKR_RW_MASK(pkey)	(0          << PKR_PKEY_SHIFT(pkey))
+#define PKR_AD_MASK(pkey)	(PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
+#define PKR_WD_MASK(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
 
 #endif /*_ASM_X86_PKEYS_COMMON_H */
diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
new file mode 100644
index 000000000000..c914afecb2d3
--- /dev/null
+++ b/include/linux/pks-keys.h
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PKS_KEYS_H
+#define _LINUX_PKS_KEYS_H
+
+/*
+ * The contents of this header should be limited to assigning PKS keys and
+ * default values to avoid intricate header dependencies.
+ */
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+#include <asm/pkeys_common.h>
+
+#define PKS_NEW_KEY(prev, config) \
+	(prev + __is_defined(config))
+#define PKS_DECLARE_INIT_VALUE(pkey, value, config) \
+	(PKR_##value##_MASK(pkey) * __is_defined(config))
+
+/**
+ * DOC: PKS_KEY_ALLOCATION
+ *
+ * Users reserve a key value in 5 steps.
+ *	1) Use PKS_NEW_KEY to create a new key
+ *	2) Ensure that the last key value is specified in the PKS_NEW_KEY macro
+ *	3) Adjust PKS_KEY_MAX to use the newly defined key value
+ *	4) Use PKS_DECLARE_INIT_VALUE to define an initial value
+ *	5) Add the new PKS default value to PKS_INIT_VALUE
+ *
+ * The PKS_NEW_KEY and PKS_DECLARE_INIT_VALUE macros require the Kconfig
+ * option to be specified to automatically adjust the number of keys used.
+ *
+ * PKS_KEY_DEFAULT must remain 0 with a default of PKS_DECLARE_INIT_VALUE(...,
+ * RW, ...) to support non-PKS protected pages.
+ *
+ * Example: to configure a key for 'MY_FEATURE' with a default of Write
+ * Disabled.
+ *
+ * .. code-block:: c
+ *
+ *	#define PKS_KEY_DEFAULT		0
+ *
+ *	// 1) Use PKS_NEW_KEY to create a new key
+ *	// 2) Ensure that the last key value is specified (eg PKS_KEY_DEFAULT)
+ *	#define PKS_KEY_MY_FEATURE PKS_NEW_KEY(PKS_KEY_DEFAULT, CONFIG_MY_FEATURE)
+ *
+ *	// 3) Adjust PKS_KEY_MAX
+ *	#define PKS_KEY_MAX	   PKS_NEW_KEY(PKS_KEY_MY_FEATURE, 1)
+ *
+ *	// 4) Define initial value
+ *	#define PKS_KEY_MY_FEATURE_INIT PKS_DECLARE_INIT_VALUE(PKS_KEY_MY_FEATURE, \
+ *								WD, CONFIG_MY_FEATURE)
+ *
+ *
+ *	// 5) Add initial value to PKS_INIT_VALUE
+ *	#define PKS_INIT_VALUE ((PKS_ALL_AD & PKS_ALL_AD_MASK) | \
+ *				PKS_KEY_DEFAULT_INIT | \
+ *				PKS_KEY_MY_FEATURE_INIT \
+ *				)
+ */
+
+/* PKS_KEY_DEFAULT must be 0 */
+#define PKS_KEY_DEFAULT		0
+#define PKS_KEY_MAX		PKS_NEW_KEY(PKS_KEY_DEFAULT, 1)
+
+/* PKS_KEY_DEFAULT_INIT must be RW */
+#define PKS_KEY_DEFAULT_INIT	PKS_DECLARE_INIT_VALUE(PKS_KEY_DEFAULT, RW, 1)
+
+#define PKS_ALL_AD_MASK \
+	GENMASK(PKS_NUM_PKEYS * PKR_BITS_PER_PKEY, \
+		PKS_KEY_MAX * PKR_BITS_PER_PKEY)
+
+#define PKS_INIT_VALUE ((PKS_ALL_AD & PKS_ALL_AD_MASK) | \
+			PKS_KEY_DEFAULT_INIT \
+			)
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+#endif /* _LINUX_PKS_KEYS_H */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 11/45] x86/pkeys: Enable PKS on cpus which support it
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (9 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 10/45] mm/pkeys: Provide for PKS key allocation ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 12/45] mm/pkeys: Define PKS page table macros ira.weiny
                   ` (34 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys for Supervisor pages (PKS) enables fast,
hardware-thread-specific manipulation of permission restrictions on
supervisor page mappings.  It uses a supervisor-specific MSR to assign
permissions to the pkeys.

When PKS is configured and the CPU supports PKS, initialize the MSR and
enable the hardware.

Add asm/pks.h to store new internal functions and structures such as
pks_setup().
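
For reference, each pkey occupies PKR_BITS_PER_PKEY (2) bits of the MSR,
using the layout defined in asm/pkeys_common.h earlier in the series.  A
sketch of how a permission mask for one pkey is computed from those
macros:

	/* Write Disable for pkey 1: PKR_WD_BIT (0x2) << (1 * 2) == 0x8 */
	u32 wd_pkey1 = PKR_WD_MASK(1);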

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Reword commit message
	Move this after the patch defining PKS_INIT_VALUE

Changes for V8
	Move setup_pks() into this patch with a default of all access
		for all pkeys.
	From Thomas
		s/setup_pks/pks_setup/
	Update Change log to better reflect exactly what this patch does.
---
 arch/x86/include/asm/msr-index.h            |  1 +
 arch/x86/include/asm/pks.h                  | 15 +++++++++++++++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/cpu/common.c                |  2 ++
 arch/x86/mm/pkeys.c                         | 17 +++++++++++++++++
 5 files changed, 37 insertions(+)
 create mode 100644 arch/x86/include/asm/pks.h

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a4a39c3e0f19..6b0a6e0300a4 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -787,6 +787,7 @@
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 
+#define MSR_IA32_PKRS			0x000006E1
 
 #define MSR_TSX_FORCE_ABORT		0x0000010F
 
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
new file mode 100644
index 000000000000..8180fc59790b
--- /dev/null
+++ b/arch/x86/include/asm/pks.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKS_H
+#define _ASM_X86_PKS_H
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+void pks_setup(void);
+
+#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+static inline void pks_setup(void) { }
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+#endif /* _ASM_X86_PKS_H */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT		24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS		_BITUL(X86_CR4_PKS_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7b8382c11788..83c1abce7d93 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -59,6 +59,7 @@
 #include <asm/cpu_device_id.h>
 #include <asm/uv/uv.h>
 #include <asm/sigframe.h>
+#include <asm/pks.h>
 
 #include "cpu.h"
 
@@ -1632,6 +1633,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
 	x86_init_rdrand(c);
 	setup_pku(c);
+	pks_setup();
 
 	/*
 	 * Clear/Set all flags overridden by options, need do it
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 7c90b2188c5f..f904376570f4 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -6,6 +6,7 @@
 #include <linux/debugfs.h>		/* debugfs_create_u32()		*/
 #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
 #include <linux/pkeys.h>                /* PKEY_*                       */
+#include <linux/pks-keys.h>
 #include <uapi/asm-generic/mman-common.h>
 
 #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
@@ -209,3 +210,19 @@ u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits)
 	pkval &= ~(PKEY_ACCESS_MASK << shift);
 	return pkval | accessbits << shift;
 }
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ */
+void pks_setup(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	wrmsrl(MSR_IA32_PKRS, PKS_INIT_VALUE);
+	cr4_set_bits(X86_CR4_PKS);
+}
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 12/45] mm/pkeys: Define PKS page table macros
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (10 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 11/45] x86/pkeys: Enable PKS on cpus which support it ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 13/45] mm/pkeys: PKS testing, add initial test code ira.weiny
                   ` (33 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Fenghua Yu <fenghua.yu@intel.com>

Kernel PKS consumers will need a way to assign their pkey to pages.

Define _PAGE_PKEY() and PAGE_KERNEL_PKEY() to allow users to set a pkey
on a PTE.

Add documentation.
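
A minimal usage sketch, mirroring the test code added later in this
series (alloc_protected_page() is an illustrative name):

	/* Map one page with the pkey encoded in the supervisor PTE */
	static void *alloc_protected_page(u8 pkey)
	{
		return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START,
					    VMALLOC_END, GFP_KERNEL,
					    PAGE_KERNEL_PKEY(pkey), 0,
					    NUMA_NO_NODE,
					    __builtin_return_address(0));
	}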

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>

---
Changes for V9
	From Dave Hansen
		s/PKey/pkey

Changes for V8
	Split out from the 'Add PKS kernel API' patch
	Include documentation in this patch
---
 Documentation/core-api/protection-keys.rst |  6 ++++++
 arch/x86/include/asm/pgtable_types.h       | 22 ++++++++++++++++++++++
 include/linux/pgtable.h                    |  4 ++++
 3 files changed, 32 insertions(+)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index d501bd27ee29..fe63acf5abbe 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -136,3 +136,9 @@ PKS Key Allocation
 ------------------
 .. kernel-doc:: include/linux/pks-keys.h
         :doc: PKS_KEY_ALLOCATION
+
+Adding pages to a pkey protected domain
+---------------------------------------
+
+.. kernel-doc:: arch/x86/include/asm/pgtable_types.h
+        :doc: PKS_KEY_ASSIGNMENT
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..e1d4535b525e 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -71,6 +71,22 @@
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+/**
+ * DOC: PKS_KEY_ASSIGNMENT
+ *
+ * The following macros are used to set a pkey value in a supervisor PTE.
+ *
+ * .. code-block:: c
+ *
+ *         #define _PAGE_PKEY(pkey)
+ *         #define PAGE_KERNEL_PKEY(pkey)
+ */
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0)
+#else
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
 #else
@@ -226,6 +242,12 @@ enum page_cache_mode {
 #define PAGE_KERNEL_IO		__pgprot_mask(__PAGE_KERNEL_IO)
 #define PAGE_KERNEL_IO_NOCACHE	__pgprot_mask(__PAGE_KERNEL_IO_NOCACHE)
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define PAGE_KERNEL_PKEY(pkey)	__pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey))
+#else
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 #endif	/* __ASSEMBLY__ */
 
 /*         xwr */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..bcef6b306fcb 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1511,6 +1511,10 @@ static inline bool arch_has_pfn_modify_check(void)
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif
 
+#ifndef PAGE_KERNEL_PKEY
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 /*
  * Page Table Modification bits for pgtbl_mod_mask.
  *
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 13/45] mm/pkeys: PKS testing, add initial test code
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (11 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 12/45] mm/pkeys: Define PKS page table macros ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 14/45] x86/selftests: Add test_pks ira.weiny
                   ` (32 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Define a PKS consumer for testing.

Two initial tests are created: one to check that the default values have
been properly assigned, and a second which purposely causes a fault.

Add documentation.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Simplify the commit message
	Simplify documentation in favor of using test_pks
	Complete re-arch of test code...
	Return -ENOENT for unknown tests
	Adjust the key allocation
	Reduce the globals used during fault detection
	Introduce a session structure to track information as long as the
		debugfs file remains open.
	Use pr_debug() for internal debug output.
	Document how to run tests from debugfs with trace_printk()
		output.
	Feedback from Rick Edgecombe
		Change pkey type to u8
		remove pks_test_exit
		set file data within the crash test to be cleaned up on
			file close
		Resolve when memory barriers are needed
	From Dave Hansen
		Place a lock around the execution of tests so that only
			a single thread execute at a time.

Changes for V8
	Ensure that unknown tests are flagged as failures.
	Split out the various tests into their own patches which test
		the functionality as the series goes.
	Move this basic test forward in the series

Changes for V7
	Add testing for pks_abandon_protections()
	Adjust pkrs_init_value
	Adjust for new defines
	Clean up comments
        Adjust test for static allocation of pkeys
        Use lookup_address() instead of follow_pte()
		follow_pte only works on IO and raw PFN mappings, use
		lookup_address() instead.  lookup_address() is
		constrained to architectures which support it.
---
 Documentation/core-api/protection-keys.rst |   6 +
 include/linux/pks-keys.h                   |   8 +-
 lib/Kconfig.debug                          |  12 +
 lib/Makefile                               |   3 +
 lib/pks/Makefile                           |   3 +
 lib/pks/pks_test.c                         | 301 +++++++++++++++++++++
 6 files changed, 331 insertions(+), 2 deletions(-)
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index fe63acf5abbe..4d99ca41c914 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -142,3 +142,9 @@ Adding pages to a pkey protected domain
 
 .. kernel-doc:: arch/x86/include/asm/pgtable_types.h
         :doc: PKS_KEY_ASSIGNMENT
+
+Testing
+-------
+
+.. kernel-doc:: lib/pks/pks_test.c
+        :doc: PKS_TEST
diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
index c914afecb2d3..43e4ae42db2e 100644
--- a/include/linux/pks-keys.h
+++ b/include/linux/pks-keys.h
@@ -60,17 +60,21 @@
 
 /* PKS_KEY_DEFAULT must be 0 */
 #define PKS_KEY_DEFAULT		0
-#define PKS_KEY_MAX		PKS_NEW_KEY(PKS_KEY_DEFAULT, 1)
+#define PKS_KEY_TEST		PKS_NEW_KEY(PKS_KEY_DEFAULT, CONFIG_PKS_TEST)
+#define PKS_KEY_MAX		PKS_NEW_KEY(PKS_KEY_TEST, 1)
 
 /* PKS_KEY_DEFAULT_INIT must be RW */
 #define PKS_KEY_DEFAULT_INIT	PKS_DECLARE_INIT_VALUE(PKS_KEY_DEFAULT, RW, 1)
+#define PKS_KEY_TEST_INIT	PKS_DECLARE_INIT_VALUE(PKS_KEY_TEST, AD, \
+							CONFIG_PKS_TEST)
 
 #define PKS_ALL_AD_MASK \
 	GENMASK(PKS_NUM_PKEYS * PKR_BITS_PER_PKEY, \
 		PKS_KEY_MAX * PKR_BITS_PER_PKEY)
 
 #define PKS_INIT_VALUE ((PKS_ALL_AD & PKS_ALL_AD_MASK) | \
-			PKS_KEY_DEFAULT_INIT \
+			PKS_KEY_DEFAULT_INIT | \
+			PKS_KEY_TEST_INIT \
 			)
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 14b89aa37c5c..5cab2100c133 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2685,6 +2685,18 @@ config HYPERV_TESTING
 	help
 	  Select this option to enable Hyper-V vmbus testing.
 
+config PKS_TEST
+	bool "PKey (S)upervisor testing"
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+	select ARCH_ENABLE_SUPERVISOR_PKEYS
+	help
+	  Select this option to enable testing of PKS core software and
+	  hardware.
+
+	  Answer N if you don't know what supervisor keys are.
+
+	  If unsure, say N.
+
 endmenu # "Kernel Testing and Coverage"
 
 source "Documentation/Kconfig"
diff --git a/lib/Makefile b/lib/Makefile
index 300f569c626b..038a93c89714 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -398,3 +398,6 @@ $(obj)/$(TEST_FORTIFY_LOG): $(addprefix $(obj)/, $(TEST_FORTIFY_LOGS)) FORCE
 ifeq ($(CONFIG_FORTIFY_SOURCE),y)
 $(obj)/string.o: $(obj)/$(TEST_FORTIFY_LOG)
 endif
+
+# PKS test
+obj-y += pks/
diff --git a/lib/pks/Makefile b/lib/pks/Makefile
new file mode 100644
index 000000000000..9daccba4f7c4
--- /dev/null
+++ b/lib/pks/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_PKS_TEST) += pks_test.o
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
new file mode 100644
index 000000000000..2fc92aaa54e8
--- /dev/null
+++ b/lib/pks/pks_test.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2022 Intel Corporation. All rights reserved.
+ */
+
+/**
+ * DOC: PKS_TEST
+ *
+ * When CONFIG_PKS_TEST is enabled, a debugfs file is created to facilitate
+ * in-kernel testing.  Tests can be triggered by writing a test number to
+ * /sys/kernel/debug/x86/run_pks
+ *
+ * Results and debug output can be seen through dynamic debug.
+ *
+ * Example:
+ *
+ * .. code-block:: sh
+ *
+ *	# Enable kernel debug
+ *	echo "file pks_test.c +pflm" > /sys/kernel/debug/dynamic_debug/control
+ *
+ *	# Run test
+ *	echo 0 > /sys/kernel/debug/x86/run_pks
+ *
+ *	# Turn off kernel debug
+ *	echo "file pks_test.c -p" > /sys/kernel/debug/dynamic_debug/control
+ *
+ *	# view kernel debugging output
+ *	dmesg -H | grep pks_test
+ */
+
+#include <linux/debugfs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/pks-keys.h>
+
+#define PKS_TEST_MEM_SIZE (PAGE_SIZE)
+
+#define CHECK_DEFAULTS		0
+#define RUN_CRASH_TEST		9
+
+static struct dentry *pks_test_dentry;
+
+DEFINE_MUTEX(test_run_lock);
+
+struct pks_test_ctx {
+	u8 pkey;
+	char data[64];
+	void *test_page;
+};
+
+static void debug_context(const char *label, struct pks_test_ctx *ctx)
+{
+	pr_debug("%s [%d] %s <-> %p\n",
+		     label,
+		     ctx->pkey,
+		     ctx->data,
+		     ctx->test_page);
+}
+
+struct pks_session_data {
+	struct pks_test_ctx *ctx;
+	bool need_unlock;
+	bool crash_armed;
+	bool last_test_pass;
+};
+
+static void debug_session(const char *label, struct pks_session_data *sd)
+{
+	pr_debug("%s ctx %p; unlock %d; crash %d; last test %s\n",
+		     label,
+		     sd->ctx,
+		     sd->need_unlock,
+		     sd->crash_armed,
+		     sd->last_test_pass ? "PASS" : "FAIL");
+
+}
+
+static void debug_result(const char *label, int test_num,
+			 struct pks_session_data *sd)
+{
+	pr_debug("%s [%d]: %s\n",
+		     label, test_num,
+		     sd->last_test_pass ? "PASS" : "FAIL");
+}
+
+static void *alloc_test_page(u8 pkey)
+{
+	return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START,
+				    VMALLOC_END, GFP_KERNEL,
+				    PAGE_KERNEL_PKEY(pkey), 0,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+
+static void free_ctx(struct pks_test_ctx *ctx)
+{
+	if (!ctx)
+		return;
+
+	vfree(ctx->test_page);
+	kfree(ctx);
+}
+
+static struct pks_test_ctx *alloc_ctx(u8 pkey)
+{
+	struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->pkey = pkey;
+	sprintf(ctx->data, "%s", "DEADBEEF");
+
+	ctx->test_page = alloc_test_page(ctx->pkey);
+	if (!ctx->test_page) {
+		pr_debug("Test page allocation failed\n");
+		kfree(ctx);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	debug_context("Context allocated", ctx);
+	return ctx;
+}
+
+static void set_ctx_data(struct pks_session_data *sd, struct pks_test_ctx *ctx)
+{
+	if (sd->ctx) {
+		pr_debug("Context data already set\n");
+		free_ctx(sd->ctx);
+	}
+	pr_debug("Setting context data; %p\n", ctx);
+	sd->ctx = ctx;
+}
+
+static void crash_it(struct pks_session_data *sd)
+{
+	struct pks_test_ctx *ctx;
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx)) {
+		pr_err("Failed to allocate context???\n");
+		sd->last_test_pass = false;
+		return;
+	}
+	set_ctx_data(sd, ctx);
+
+	pr_debug("Purposely faulting...\n");
+	memcpy(ctx->test_page, ctx->data, 8);
+
+	pr_err("ERROR: Should never get here...\n");
+	sd->last_test_pass = false;
+}
+
+static void check_pkey_settings(void *data)
+{
+	struct pks_session_data *sd = data;
+	unsigned long long msr = 0;
+	unsigned int cpu = smp_processor_id();
+
+	rdmsrl(MSR_IA32_PKRS, msr);
+	pr_debug("cpu %d 0x%llx\n", cpu, msr);
+	if (msr != PKS_INIT_VALUE) {
+		pr_err("cpu %d value incorrect : 0x%llx expected 0x%lx\n",
+			cpu, msr, PKS_INIT_VALUE);
+		sd->last_test_pass = false;
+	}
+}
+
+static void arm_or_run_crash_test(struct pks_session_data *sd)
+{
+
+	/*
+	 * WARNING: Test "9" will crash.
+	 * Arm the test.
+	 * A second "9" will run the test.
+	 */
+	if (!sd->crash_armed) {
+		pr_debug("Arming crash test\n");
+		sd->crash_armed = true;
+		return;
+	}
+
+	sd->crash_armed = false;
+	crash_it(sd);
+}
+
+static ssize_t pks_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	struct pks_session_data *sd = file->private_data;
+	char buf[64];
+	unsigned int len;
+
+	len = sprintf(buf, "%s\n", sd->last_test_pass ? "PASS" : "FAIL");
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
+			      size_t count, loff_t *ppos)
+{
+	struct pks_session_data *sd = file->private_data;
+	long test_num;
+	char buf[2];
+
+	pr_debug("Begin...\n");
+	sd->last_test_pass = false;
+
+	if (copy_from_user(buf, user_buf, 1))
+		return -EFAULT;
+	buf[1] = '\0';
+
+	if (kstrtol(buf, 0, &test_num))
+		return -EINVAL;
+
+	if (mutex_lock_interruptible(&test_run_lock))
+		return -EBUSY;
+
+	sd->need_unlock = true;
+	sd->last_test_pass = true;
+
+	switch (test_num) {
+	case RUN_CRASH_TEST:
+		pr_debug("crash test\n");
+		arm_or_run_crash_test(file->private_data);
+		goto unlock_test;
+	case CHECK_DEFAULTS:
+		pr_debug("check defaults test: 0x%lx\n", PKS_INIT_VALUE);
+		on_each_cpu(check_pkey_settings, file->private_data, 1);
+		break;
+	default:
+		pr_debug("Unknown test\n");
+		sd->last_test_pass = false;
+		count = -ENOENT;
+		break;
+	}
+
+	/* Clear arming on any test run */
+	pr_debug("Clearing crash test arm\n");
+	sd->crash_armed = false;
+
+unlock_test:
+	/*
+	 * Normal exit; clear up the locking flag
+	 */
+	sd->need_unlock = false;
+	mutex_unlock(&test_run_lock);
+	debug_result("Test complete", test_num, sd);
+	return count;
+}
+
+static int pks_open_file(struct inode *inode, struct file *file)
+{
+	struct pks_session_data *sd = kzalloc(sizeof(*sd), GFP_KERNEL);
+
+	if (!sd)
+		return -ENOMEM;
+
+	debug_session("Allocated session", sd);
+	file->private_data = sd;
+
+	return 0;
+}
+
+static int pks_release_file(struct inode *inode, struct file *file)
+{
+	struct pks_session_data *sd = file->private_data;
+
+	debug_session("Freeing session", sd);
+
+	/*
+	 * Some tests may fault and not return through the normal write
+	 * syscall.  The crash test is specifically designed to do this.  Clean
+	 * up the run lock when the file is closed if the write syscall does
+	 * not exit normally.
+	 */
+	if (sd->need_unlock)
+		mutex_unlock(&test_run_lock);
+	free_ctx(sd->ctx);
+	kfree(sd);
+	return 0;
+}
+
+static const struct file_operations fops_init_pks = {
+	.read = pks_read_file,
+	.write = pks_write_file,
+	.llseek = default_llseek,
+	.open = pks_open_file,
+	.release = pks_release_file,
+};
+
+static int __init pks_test_init(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_PKS))
+		pks_test_dentry = debugfs_create_file("run_pks", 0600, arch_debugfs_dir,
+						      NULL, &fops_init_pks);
+
+	return 0;
+}
+late_initcall(pks_test_init);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 14/45] x86/selftests: Add test_pks
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (12 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 13/45] mm/pkeys: PKS testing, add initial test code ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 15/45] x86/pkeys: Introduce pks_write_pkrs() ira.weiny
                   ` (31 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The PKS kernel tests are clumsy to run using debugfs directly.  It is
much nicer to have a user space application trigger the execution of
those tests.

Create test_pks as a selftest.

Output is as follows.

$ ./test_pks_64 -h
Usage: ./test_pks_64 [-h,-d] [test]
	--help,-h   This help
	--debug,-d  Output kernel debug via dynamic debug if available

        Run all PKS tests or the [test] specified.

	[test] can be one of:
	       'check_defaults'
	       'create_fault' (Not included in run all)

$ ./test_pks_64
[RUN]	check_defaults
[OK]

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9:
	New Patch
---
 Documentation/core-api/protection-keys.rst |   3 +
 tools/testing/selftests/x86/Makefile       |   2 +-
 tools/testing/selftests/x86/test_pks.c     | 353 +++++++++++++++++++++
 3 files changed, 357 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 4d99ca41c914..23330a7d53eb 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -148,3 +148,6 @@ Testing
 
 .. kernel-doc:: lib/pks/pks_test.c
         :doc: PKS_TEST
+
+.. kernel-doc:: tools/testing/selftests/x86/test_pks.c
+        :doc: PKS_TEST_USER
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 8a1f62ab3c8e..e08670596c14 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -13,7 +13,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
 			check_initial_reg_state sigreturn iopl ioperm \
 			test_vsyscall mov_ss_trap \
-			syscall_arg_fault fsgsbase_restore sigaltstack
+			syscall_arg_fault fsgsbase_restore sigaltstack test_pks
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
new file mode 100644
index 000000000000..df5bde9bfdbe
--- /dev/null
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -0,0 +1,353 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2022 Intel Corporation. All rights reserved.
+ */
+
+/**
+ * DOC: PKS_TEST_USER
+ *
+ * To assist in executing the tests 'test_pks' can be built from the
+ * tools/testing directory.  See the help output for details.
+ *
+ * .. code-block:: sh
+ *
+ *	$ cd tools/testing/selftests/x86
+ *	$ make test_pks
+ *	$ ./test_pks_64 -h
+ *	...
+ */
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <string.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <time.h>
+
+#define DYN_DBG_CNT_FILE "/sys/kernel/debug/dynamic_debug/control"
+#define PKS_TEST_FILE "/sys/kernel/debug/x86/run_pks"
+
+/* Values from the kernel */
+#define CHECK_DEFAULTS		"0"
+#define RUN_CRASH_TEST		"9"
+
+time_t g_start_time;
+int g_debug;
+
+#define PRINT_DEBUG(fmt, ...) \
+	do { \
+		if (g_debug) \
+			printf("%s: " fmt, __func__, ##__VA_ARGS__); \
+	} while (0)
+
+#define PRINT_ERROR(fmt, ...) \
+	fprintf(stderr, "%s: " fmt, __func__, ##__VA_ARGS__)
+
+static int do_simple_test(const char *debugfs_str);
+
+/*
+ * The crash test is a special case which is not included in the run all
+ * option.  Do not add it here.
+ */
+enum {
+	TEST_DEFAULTS = 0,
+	MAX_TESTS,
+} tests;
+
+/* Special */
+#define CREATE_FAULT_TEST_NAME "create_fault"
+
+struct test_item {
+	char *name;
+	const char *debugfs_str;
+	int (*test_fn)(const char *debugfs_str);
+} test_list[] = {
+	{ "check_defaults", CHECK_DEFAULTS, do_simple_test }
+};
+
+static char *get_test_name(int test_num)
+{
+	if (test_num > MAX_TESTS)
+		return "<UNKNOWN>";
+	/* Special: not in run all */
+	if (test_num == MAX_TESTS)
+		return CREATE_FAULT_TEST_NAME;
+	return test_list[test_num].name;
+}
+
+static int get_test_num(char *test_name)
+{
+	int i;
+
+	/* Special: not in run all */
+	if (strcmp(test_name, CREATE_FAULT_TEST_NAME) == 0)
+		return MAX_TESTS;
+
+	for (i = 0; i < MAX_TESTS; i++)
+		if (strcmp(test_name, test_list[i].name) == 0)
+			return i;
+	return -1;
+}
+
+static void print_help_and_exit(char *argv0)
+{
+	int i;
+
+	printf("Usage: %s [-h,-d] [test]\n", argv0);
+	printf("	--help,-h   This help\n");
+	printf("	--debug,-d  Output kernel debug via dynamic debug if available\n");
+	printf("\n");
+	printf("        Run all PKS tests or the [test] specified.\n");
+	printf("\n");
+	printf("	[test] can be one of:\n");
+
+	for (i = 0; i < MAX_TESTS; i++)
+		printf("	       '%s'\n", get_test_name(i));
+
+	/* Special: not in run all */
+	printf("	       '%s' (Not included in run all)\n",
+		CREATE_FAULT_TEST_NAME);
+
+	printf("\n");
+}
+
+/*
+ * Do a simple test of writing the debugfs value and reading back for 'PASS'
+ */
+static int do_simple_test(const char *debugfs_str)
+{
+	char str[16];
+	int fd, rc = 0;
+
+	fd = open(PKS_TEST_FILE, O_RDWR);
+	if (fd < 0) {
+		PRINT_DEBUG("Failed to open test file : %s\n", PKS_TEST_FILE);
+		return -ENOENT;
+	}
+
+	rc = write(fd, debugfs_str, strlen(debugfs_str));
+	if (rc < 0) {
+		rc = -errno;
+		goto close_file;
+	}
+
+	rc = read(fd, str, 16);
+	if (rc < 0)
+		goto close_file;
+
+	str[15] = '\0';
+
+	if (strncmp(str, "PASS", 4)) {
+		PRINT_ERROR("result: %s\n", str);
+		rc = -EFAULT;
+		goto close_file;
+	}
+
+	rc = 0;
+
+close_file:
+	close(fd);
+	return rc;
+}
+
+/*
+ * This test is special in that it requires the option to be written 2 times.
+ * In addition because it creates a fault it is not included in the run all
+ * test suite.
+ */
+static int create_fault(void)
+{
+	char str[16];
+	int fd, rc = 0;
+
+	fd = open(PKS_TEST_FILE, O_RDWR);
+	if (fd < 0) {
+		PRINT_DEBUG("Failed to open test file : %s\n", PKS_TEST_FILE);
+		return -ENOENT;
+	}
+
+	rc = write(fd, "9", 1);
+	if (rc < 0) {
+		rc = -errno;
+		goto close_file;
+	}
+
+	rc = write(fd, "9", 1);
+	if (rc < 0)
+		goto close_file;
+
+	rc = read(fd, str, 16);
+	if (rc < 0)
+		goto close_file;
+
+	str[15] = '\0';
+
+	if (strncmp(str, "PASS", 4)) {
+		PRINT_ERROR("result: %s\n", str);
+		rc = -EFAULT;
+		goto close_file;
+	}
+
+	rc = 0;
+
+close_file:
+	close(fd);
+	return rc;
+}
+
+static int run_one(int test_num)
+{
+	int ret;
+
+	printf("[RUN]\t%s\n", get_test_name(test_num));
+
+	if (test_num == MAX_TESTS)
+		/* Special: not in run all */
+		ret = create_fault();
+	else
+		ret = test_list[test_num].test_fn(test_list[test_num].debugfs_str);
+
+	if (ret == -ENOENT) {
+		printf("[SKIP] Test not supported\n");
+		return 0;
+	} else if (ret) {
+		printf("[FAIL]\n");
+		return 1;
+	}
+
+	printf("[OK]\n");
+	return 0;
+}
+
+static int run_all(void)
+{
+	int i, rc = 0;
+
+	for (i = 0; i < MAX_TESTS; i++) {
+		int ret = run_one(i);
+
+		/* sticky fail */
+		if (ret)
+			rc = ret;
+	}
+
+	return rc;
+}
+
+#define STR_LEN 256
+
+/* Debug output in the kernel is through dynamic debug */
+static void setup_debug(void)
+{
+	char str[STR_LEN];
+	int fd, rc;
+
+	g_start_time = time(NULL);
+
+	fd = open(DYN_DBG_CNT_FILE, O_RDWR);
+	if (fd < 0) {
+		PRINT_ERROR("Dynamic debug not available: Failed to open: %s\n",
+			DYN_DBG_CNT_FILE);
+		return;
+	}
+
+	snprintf(str, STR_LEN, "file pks_test.c +pflm");
+
+	rc = write(fd, str, strlen(str));
+	if (rc != strlen(str))
+		PRINT_ERROR("ERROR: Failed to set up dynamic debug...\n");
+
+	close(fd);
+}
+
+static void print_debug(void)
+{
+	char str[STR_LEN];
+	struct tm *tm;
+	int fd, rc;
+
+	fd = open(DYN_DBG_CNT_FILE, O_RDWR);
+	if (fd < 0)
+		return;
+
+	snprintf(str, STR_LEN, "file pks_test.c -p");
+
+	rc = write(fd, str, strlen(str));
+	if (rc != strlen(str))
+		PRINT_ERROR("ERROR: Failed to turn off dynamic debug...\n");
+
+	close(fd);
+
+	/*
+	 * dmesg is not accurate with time stamps so back up the start time a
+	 * bit to ensure all the output from this run is dumped.
+	 */
+	g_start_time -= 5;
+	tm = localtime(&g_start_time);
+
+	snprintf(str, STR_LEN,
+		 "dmesg -H --since '%d-%d-%d %d:%d:%d' | grep pks_test",
+		 tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
+		 tm->tm_hour, tm->tm_min, tm->tm_sec);
+	system(str);
+	printf("\tDebug output command (approximate start time):\n\t\t%s\n",
+		str);
+}
+
+int main(int argc, char *argv[])
+{
+	int flag_all = 1;
+	int test_num = 0;
+	int rc;
+
+	while (1) {
+		static struct option long_options[] = {
+			{"help",	no_argument,	0,	'h' },
+			{"debug",	no_argument,	0,	'd' },
+			{0,		0,		0,	0 }
+		};
+		int option_index = 0;
+		int c;
+
+		c = getopt_long(argc, argv, "hd", long_options, &option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'h':
+			print_help_and_exit(argv[0]);
+			return 0;
+		case 'd':
+			g_debug++;
+			break;
+		default:
+			print_help_and_exit(argv[0]);
+			exit(-1);
+		}
+	}
+
+	if (optind < argc) {
+		test_num = get_test_num(argv[optind]);
+		if (test_num < 0) {
+			printf("[RUN]\t'%s'\n[SKIP]\tInvalid test\n", argv[optind]);
+			return 1;
+		}
+
+		flag_all = 0;
+	}
+
+	if (g_debug)
+		setup_debug();
+
+	if (flag_all)
+		rc = run_all();
+	else
+		rc = run_one(test_num);
+
+	if (g_debug)
+		print_debug();
+
+	return rc;
+}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 15/45] x86/pkeys: Introduce pks_write_pkrs()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (13 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 14/45] x86/selftests: Add test_pks ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 16/45] x86/pkeys: Preserve the PKS MSR on context switch ira.weiny
                   ` (30 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Writing to MSRs is inefficient.  Even though the underlying PKS
register, MSR_IA32_PKRS, is not serializing, writes to the MSR should be
avoided where possible, especially when updates are made in critical
paths such as the scheduler or the entry code.

Introduce pks_write_pkrs().  pks_write_pkrs() avoids writing
MSR_IA32_PKRS if the pkrs value has not changed for the current CPU.
Most of the callers are in a non-preemptable code path.  Therefore,
avoid calling preempt_{disable,enable}() to protect the per-cpu cache
and instead rely on outer calls for this protection.  Do the same with
checks to X86_FEATURE_PKS.

While unlikely, PKS_INIT_VALUE may be 0 on startup.  Because the per-cpu
cache also starts at 0, pks_write_pkrs() would then skip the MSR update.
Therefore, keep the MSR write in pks_setup() to ensure the MSR is
initialized at least one time.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	From Dave Hansen
		Update commit message with a bit more detail about why
			this optimization is needed
		Update the code comments as well.

Changes for V8
	From Thomas
		Remove get/put_cpu_ptr() and make this a 'lower level
		call.  This makes it preemption unsafe but it is called
		mostly where preemption is already disabled.  Add this
		as a predicate of the call and those calls which need to
		can disable preemption.
		Add lockdep assert for preemption
	Ensure MSR gets written even if the PKS_INIT_VALUE is 0.
	Completely re-write the commit message.
	s/write_pkrs/pks_write_pkrs/
	Split this off into a singular patch

Changes for V7
	Create a dynamic pkrs_initial_value in early init code.
	Clean up comments
	Add comment to macro guard
---
 arch/x86/mm/pkeys.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f904376570f4..10521f1a292e 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -213,15 +213,56 @@ u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits)
 
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
+static DEFINE_PER_CPU(u32, pkrs_cache);
+
+/*
+ * pks_write_pkrs() - Write the pkrs of the current CPU
+ * @new_pkrs: New value to write to the current CPU register
+ *
+ * Optimizes the MSR writes by maintaining a per cpu cache.
+ *
+ * Context: must be called with preemption disabled
+ * Context: must only be called if PKS is enabled
+ *
+ * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
+ * serializing but still maintains ordering properties similar to WRPKRU.
+ * The current SDM section on PKRS needs updating but should be the same as
+ * that of WRPKRU.  Quote from the WRPKRU text:
+ *
+ *     WRPKRU will never execute transiently. Memory accesses
+ *     affected by PKRU register will not execute (even transiently)
+ *     until all prior executions of WRPKRU have completed execution
+ *     and updated the PKRU register.
+ */
+static inline void pks_write_pkrs(u32 new_pkrs)
+{
+	u32 pkrs = __this_cpu_read(pkrs_cache);
+
+	lockdep_assert_preemption_disabled();
+
+	if (pkrs != new_pkrs) {
+		__this_cpu_write(pkrs_cache, new_pkrs);
+		wrmsrl(MSR_IA32_PKRS, new_pkrs);
+	}
+}
+
 /*
  * PKS is independent of PKU and either or both may be supported on a CPU.
+ *
+ * Context: must be called with preemption disabled
  */
 void pks_setup(void)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+	/*
+	 * If the PKS_INIT_VALUE is 0 then pks_write_pkrs() will fail to
+	 * initialize the MSR.  Do a single write here to ensure the MSR is
+	 * written at least one time.
+	 */
 	wrmsrl(MSR_IA32_PKRS, PKS_INIT_VALUE);
+	pks_write_pkrs(PKS_INIT_VALUE);
 	cr4_set_bits(X86_CR4_PKS);
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 16/45] x86/pkeys: Preserve the PKS MSR on context switch
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (14 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 15/45] x86/pkeys: Introduce pks_write_pkrs() ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 17/45] mm/pkeys: Introduce pks_set_readwrite() ira.weiny
                   ` (29 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The PKS MSR (PKRS) is a per-logical-processor register.  Unfortunately,
the MSR is not managed by XSAVE.  Therefore, software must save/restore
the MSR value on context switch.

Allocate space in thread_struct to hold the saved MSR value.  Ensure all
tasks, including the init_task are properly initialized.  Set the CPU
PKRS value when a task is scheduled.

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	From Dave Hansen
		Clarify the commit message
		s/pks_saved_pkrs/pkrs/
		s/pks_write_current/x86_pkrs_load/
		Change x86_pkrs_load to take the next thread instead of
			'current'

Changes for V8
	From Thomas
		Ensure pkrs_write_current() does not suffer the overhead
		of preempt disable.
		Fix setting of initial value
		Remove flawed and broken create_initial_pkrs_value() in
			favor of a much simpler and robust macro default
		Update function names to be consistent.

	s/pkrs_write_current/pks_write_current
		This is a more consistent name
	s/saved_pkrs/pks_saved_pkrs
	s/pkrs_init_value/PKS_INIT_VALUE
	Remove pks_init_task()
		This function was added mainly to avoid the header file
		issue.  Adding pks-keys.h solved that and saves the
		complexity.

Changes for V7
	Move definitions from asm/processor.h to asm/pks.h
	s/INIT_PKRS_VALUE/pkrs_init_value
	Change pks_init_task()/pks_sched_in() to functions
	s/pks_sched_in/pks_write_current to be used more generically
	later in the series
---
 arch/x86/include/asm/pks.h       |  2 ++
 arch/x86/include/asm/processor.h | 15 ++++++++++++++-
 arch/x86/kernel/process_64.c     |  2 ++
 arch/x86/mm/pkeys.c              |  9 +++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index 8180fc59790b..a7bad7301783 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -5,10 +5,12 @@
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
 void pks_setup(void);
+void x86_pkrs_load(struct thread_struct *thread);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void pks_setup(void) { }
+static inline void x86_pkrs_load(struct thread_struct *thread) { }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2c5f12ae7d04..e3874c2d175e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_PROCESSOR_H
 #define _ASM_X86_PROCESSOR_H
 
+#include <linux/pks-keys.h>
+
 #include <asm/processor-flags.h>
 
 /* Forward declaration, a strange C thing */
@@ -527,6 +529,10 @@ struct thread_struct {
 	 * PKRU is the hardware itself.
 	 */
 	u32			pkru;
+#ifdef	CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+	/* Saved Protection key register for supervisor mappings */
+	u32			pkrs;
+#endif
 
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
@@ -769,7 +775,14 @@ static inline void spin_lock_prefetch(const void *x)
 #define KSTK_ESP(task)		(task_pt_regs(task)->sp)
 
 #else
-#define INIT_THREAD { }
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define INIT_THREAD  {			\
+	.pkrs = PKS_INIT_VALUE,		\
+}
+#else
+#define INIT_THREAD  { }
+#endif
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
 
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3402edec236c..e703cc451128 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -59,6 +59,7 @@
 /* Not included via unistd.h */
 #include <asm/unistd_32_ia32.h>
 #endif
+#include <asm/pks.h>
 
 #include "process.h"
 
@@ -612,6 +613,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	x86_fsgsbase_load(prev, next);
 
 	x86_pkru_load(prev, next);
+	x86_pkrs_load(next);
 
 	/*
 	 * Switch the PDA and FPU contexts.
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 10521f1a292e..39e4c2cbc279 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -246,6 +246,15 @@ static inline void pks_write_pkrs(u32 new_pkrs)
 	}
 }
 
+/* x86_pkrs_load() - Update CPU with the incoming thread pkrs value */
+void x86_pkrs_load(struct thread_struct *thread)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	pks_write_pkrs(thread->pkrs);
+}
+
 /*
  * PKS is independent of PKU and either or both may be supported on a CPU.
  *
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 17/45] mm/pkeys: Introduce pks_set_readwrite()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (15 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 16/45] x86/pkeys: Preserve the PKS MSR on context switch ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 18/45] mm/pkeys: Introduce pks_set_noaccess() ira.weiny
                   ` (28 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

When kernel code needs access to a PKS protected page, it must change
the protections for the pkey to Read/Write.

Define pks_set_readwrite() to update the specified pkey.  Define
pks_update_protection() as a helper to do the heavy lifting and allow
for subsequent pks_set_*() calls.

Define PKEY_READ_WRITE rather than use a magic value of '0' in
pks_update_protection().

Finally, ensure preemption is disabled around pks_write_pkrs() because
the context of this call cannot generally be predicted.

pks.h is created to avoid conflicts and header dependencies with the
user space pkey code.

Add documentation.
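
A hedged usage sketch (PKS_KEY_MY_FEATURE and pks_protected_page are
illustrative names; the pkey comes from the allocation scheme documented
earlier in the series):

	/* Grant the current thread read/write access to the domain */
	pks_set_readwrite(PKS_KEY_MY_FEATURE);
	memcpy(pks_protected_page, src, len);

Restoring the protection afterward is covered by pks_set_noaccess() in
the following patch.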

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Move MSR documentation note to this patch
	Move declarations to include/linux/pks.h
	From Rick Edgecombe
		Change pkey type to u8
	Validate pkey range in pks_update_protection
	From 0day
		Fix documentation link
	From Dave Hansen
		s/pks_mk_*/pks_set_*/
		Use pkey
		s/pks_saved_pkrs/pkrs/

Changes for V8
	Define PKEY_READ_WRITE
	Make the call inline
	Clean up the names
	Use pks_write_pkrs() with preemption disabled
	Split this out from 'Add PKS kernel API'
	Include documentation in this patch
---
 Documentation/core-api/protection-keys.rst | 15 +++++++++++
 arch/x86/mm/pkeys.c                        | 31 ++++++++++++++++++++++
 include/linux/pks.h                        | 31 ++++++++++++++++++++++
 include/uapi/asm-generic/mman-common.h     |  1 +
 4 files changed, 78 insertions(+)
 create mode 100644 include/linux/pks.h

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 23330a7d53eb..e6564f5336b7 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -143,6 +143,21 @@ Adding pages to a pkey protected domain
 .. kernel-doc:: arch/x86/include/asm/pgtable_types.h
         :doc: PKS_KEY_ASSIGNMENT
 
+Changing permissions of individual keys
+---------------------------------------
+
+.. kernel-doc:: include/linux/pks.h
+        :identifiers: pks_set_readwrite
+
+MSR details
+~~~~~~~~~~~
+
+WRMSR is typically an architecturally serializing instruction.  However,
+WRMSR(MSR_IA32_PKRS) is an exception.  It is not a serializing instruction and
+instead maintains ordering properties similar to WRPKRU.  Thus it is safe to
+immediately use a mapping when the pks_set*() functions returns.  Check the
+latest SDM for details.
+
 Testing
 -------
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 39e4c2cbc279..e4cbc79686ea 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -6,6 +6,7 @@
 #include <linux/debugfs.h>		/* debugfs_create_u32()		*/
 #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
 #include <linux/pkeys.h>                /* PKEY_*                       */
+#include <linux/pks.h>
 #include <linux/pks-keys.h>
 #include <uapi/asm-generic/mman-common.h>
 
@@ -275,4 +276,34 @@ void pks_setup(void)
 	cr4_set_bits(X86_CR4_PKS);
 }
 
+/*
+ * Do not call this directly, see pks_set*().
+ *
+ * @pkey: Key for the domain to change
+ * @protection: protection bits to be used
+ *
+ * Protection utilizes the same protection bits specified for User pkeys
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ *
+ */
+void pks_update_protection(u8 pkey, u8 protection)
+{
+	u32 pkrs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	if (WARN_ON_ONCE(pkey >= PKS_KEY_MAX))
+		return;
+
+	pkrs = current->thread.pkrs;
+	current->thread.pkrs = pkey_update_pkval(pkrs, pkey,
+						 protection);
+	preempt_disable();
+	pks_write_pkrs(current->thread.pkrs);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(pks_update_protection);
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pks.h b/include/linux/pks.h
new file mode 100644
index 000000000000..8b705a937b19
--- /dev/null
+++ b/include/linux/pks.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PKS_H
+#define _LINUX_PKS_H
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+#include <linux/types.h>
+
+#include <uapi/asm-generic/mman-common.h>
+
+void pks_update_protection(u8 pkey, u8 protection);
+
+/**
+ * pks_set_readwrite() - Make the domain Read/Write
+ * @pkey: the pkey for which the access should change.
+ *
+ * Allow all access, read and write, to the domain specified by pkey.  This is
+ * not a global update and only affects the current running thread.
+ */
+static inline void pks_set_readwrite(u8 pkey)
+{
+	pks_update_protection(pkey, PKEY_READ_WRITE);
+}
+
+#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+static inline void pks_set_readwrite(u8 pkey) {}
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+#endif /* _LINUX_PKS_H */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1567a3294c3d..3da6ac9e5ded 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@
 /* compatibility flags */
 #define MAP_FILE	0
 
+#define PKEY_READ_WRITE		0x0
 #define PKEY_DISABLE_ACCESS	0x1
 #define PKEY_DISABLE_WRITE	0x2
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 18/45] mm/pkeys: Introduce pks_set_noaccess()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (16 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 17/45] mm/pkeys: Introduce pks_set_readwrite() ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 19/45] mm/pkeys: Introduce PKS fault callbacks ira.weiny
                   ` (27 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

After a valid access, consumers will want to change PKS protections back
to No Access for their pkey.

Define pks_set_noaccess() to update the specified pkey.

Add documentation.
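
Together with pks_set_readwrite() this completes the typical access
window pattern (a sketch; the key name is illustrative):

	pks_set_readwrite(PKS_KEY_MY_FEATURE);
	/* ... access the PKS protected mapping ... */
	pks_set_noaccess(PKS_KEY_MY_FEATURE);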

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Move to pks.h
	Change pkey type to u8
	From 0day
		Fix documentation link
	From Dave Hansen
		use pkey
		s/pks_mk*/pks_set*/

Changes for V8
	Make the call inline
	Split this patch out from 'Add PKS kernel API'
	Include documentation in this patch
---
 Documentation/core-api/protection-keys.rst |  2 +-
 include/linux/pks.h                        | 13 +++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index e6564f5336b7..2ec35349ecfd 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -147,7 +147,7 @@ Changing permissions of individual keys
 ---------------------------------------
 
 .. kernel-doc:: include/linux/pks.h
-        :identifiers: pks_set_readwrite
+        :identifiers: pks_set_readwrite pks_set_noaccess
 
 MSR details
 ~~~~~~~~~~~
diff --git a/include/linux/pks.h b/include/linux/pks.h
index 8b705a937b19..9f18f8b4cbb1 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -10,6 +10,18 @@
 
 void pks_update_protection(u8 pkey, u8 protection);
 
+/**
+ * pks_set_noaccess() - Disable all access to the domain
+ * @pkey: the pkey for which the access should change.
+ *
+ * Disable all access to the domain specified by pkey.  This is not a global
+ * update and only affects the current running thread.
+ */
+static inline void pks_set_noaccess(u8 pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
+}
+
 /**
  * pks_set_readwrite() - Make the domain Read/Write
  * @pkey: the pkey for which the access should change.
@@ -24,6 +36,7 @@ static inline void pks_set_readwrite(u8 pkey)
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
+static inline void pks_set_noaccess(u8 pkey) {}
 static inline void pks_set_readwrite(u8 pkey) {}
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 19/45] mm/pkeys: Introduce PKS fault callbacks
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (17 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 18/45] mm/pkeys: Introduce pks_set_noaccess() ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 20/45] mm/pkeys: PKS testing, add a fault call back ira.weiny
                   ` (26 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Rick Edgecombe <rick.p.edgecombe@intel.com>

Some PKS consumers will want special handling of pkey permission
violations.  One such consumer is PMEM, which wants a mode that logs the
access violation, disables protection, and continues rather than oopsing
the machine.

Provide an API to assign callbacks for individual pkeys.

Since PKS faults do not provide the key that faulted, this information
needs to be recovered by walking the page tables and extracting it from
the leaf entry.  The key can then be used to call the proper callback.

Add documentation.

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
Changes for V9:
	Rework commit message
	Adjust for the use of linux/pks.h
	From the new key allocation: s/PKS_NR_CONSUMERS/PKS_KEY_MAX
	From Dave Hansen
		use pkey
		Fix conflicts with other users in the test code by
			moving this forward in the series

Changes for V8:
	Add pt_regs to the callback signature so that
		pks_update_exception() can be called if needed.
	Update commit message
	Determine if page is large prior to not present
	Update commit message with more clarity as to why this was kept
		separate from pks_abandon_protections() and
		pks_test_callback()
	Embed documentation in c file.
	Move handle_pks_key_fault() to pkeys.c
		s/handle_pks_key_fault/pks_handle_key_fault/
		This consolidates the PKS code nicely
	Add feature check to pks_handle_key_fault()
	From Rick Edgecombe
		Fix key value check
	From kernel test robot
		Add static to handle_pks_key_fault

Changes for V7:
	New patch
---
 Documentation/core-api/protection-keys.rst |  6 ++
 arch/x86/include/asm/pks.h                 | 10 +++
 arch/x86/mm/fault.c                        | 17 +++--
 arch/x86/mm/pkeys.c                        | 86 ++++++++++++++++++++++
 include/linux/pks.h                        |  3 +
 5 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 2ec35349ecfd..5fdc83a39d4e 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -149,6 +149,12 @@ Changing permissions of individual keys
 .. kernel-doc:: include/linux/pks.h
         :identifiers: pks_set_readwrite pks_set_noaccess
 
+Overriding Default Fault Behavior
+---------------------------------
+
+.. kernel-doc:: arch/x86/mm/pkeys.c
+        :doc: DEFINE_PKS_FAULT_CALLBACK
+
 MSR details
 ~~~~~~~~~~~
 
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index a7bad7301783..e9ad3ecd7ed0 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -7,11 +7,21 @@
 void pks_setup(void);
 void x86_pkrs_load(struct thread_struct *thread);
 
+bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
+			  unsigned long address);
+
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void pks_setup(void) { }
 static inline void x86_pkrs_load(struct thread_struct *thread) { }
 
+static inline bool pks_handle_key_fault(struct pt_regs *regs,
+					unsigned long hw_error_code,
+					unsigned long address)
+{
+	return false;
+}
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 #endif /* _ASM_X86_PKS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5599109d1124..e8934df1b886 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,7 @@
 #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
 #include <asm/irq_stack.h>
+#include <asm/pks.h>			/* pks_handle_key_fault() */
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -1147,12 +1148,16 @@ static void
 do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		   unsigned long address)
 {
-	/*
-	 * PF_PF faults should only occur on kernel
-	 * addresses when supervisor pkeys are enabled.
-	 */
-	WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) &&
-		     (hw_error_code & X86_PF_PK));
+	if (hw_error_code & X86_PF_PK) {
+		/*
+		 * PF_PF faults should only occur on kernel
+		 * addresses when supervisor pkeys are enabled.
+		 */
+		WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS));
+
+		if (pks_handle_key_fault(regs, hw_error_code, address))
+			return;
+	}
 
 #ifdef CONFIG_X86_32
 	/*
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index e4cbc79686ea..a3b27b7811da 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -12,6 +12,7 @@
 
 #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
 #include <asm/mmu_context.h>            /* vma_pkey()                   */
+#include <asm/trap_pf.h>		/* X86_PF_WRITE */
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
@@ -216,6 +217,91 @@ u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits)
 
 static DEFINE_PER_CPU(u32, pkrs_cache);
 
+/**
+ * DOC: DEFINE_PKS_FAULT_CALLBACK
+ *
+ * Users may also provide a fault handler which can handle a fault differently
+ * than an oops.  For example, if 'MY_FEATURE' wanted to define a handler, it
+ * can do so by adding the corresponding entry to the pks_key_callbacks array.
+ *
+ * .. code-block:: c
+ *
+ *	#ifdef CONFIG_MY_FEATURE
+ *	bool my_feature_pks_fault_callback(struct pt_regs *regs,
+ *					   unsigned long address, bool write)
+ *	{
+ *		if (my_feature_fault_is_ok)
+ *			return true;
+ *		return false;
+ *	}
+ *	#endif
+ *
+ *	static const pks_key_callback pks_key_callbacks[PKS_KEY_MAX] = {
+ *		[PKS_KEY_DEFAULT]            = NULL,
+ *	#ifdef CONFIG_MY_FEATURE
+ *		[PKS_KEY_MY_FEATURE]         = my_feature_pks_fault_callback,
+ *	#endif
+ *	};
+ */
+static const pks_key_callback pks_key_callbacks[PKS_KEY_MAX] = { 0 };
+
+static bool pks_call_fault_callback(struct pt_regs *regs, unsigned long address,
+				    bool write, u16 key)
+{
+	if (key >= PKS_KEY_MAX)
+		return false;
+
+	if (pks_key_callbacks[key])
+		return pks_key_callbacks[key](regs, address, write);
+
+	return false;
+}
+
+bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
+			  unsigned long address)
+{
+	bool write;
+	pgd_t pgd;
+	p4d_t p4d;
+	pud_t pud;
+	pmd_t pmd;
+	pte_t pte;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return false;
+
+	write = (hw_error_code & X86_PF_WRITE);
+
+	pgd = READ_ONCE(*(init_mm.pgd + pgd_index(address)));
+	if (!pgd_present(pgd))
+		return false;
+
+	p4d = READ_ONCE(*p4d_offset(&pgd, address));
+	if (p4d_large(p4d))
+		return pks_call_fault_callback(regs, address, write,
+					       pte_flags_pkey(p4d_val(p4d)));
+	if (!p4d_present(p4d))
+		return false;
+
+	pud = READ_ONCE(*pud_offset(&p4d, address));
+	if (pud_large(pud))
+		return pks_call_fault_callback(regs, address, write,
+					       pte_flags_pkey(pud_val(pud)));
+	if (!pud_present(pud))
+		return false;
+
+	pmd = READ_ONCE(*pmd_offset(&pud, address));
+	if (pmd_large(pmd))
+		return pks_call_fault_callback(regs, address, write,
+					       pte_flags_pkey(pmd_val(pmd)));
+	if (!pmd_present(pmd))
+		return false;
+
+	pte = READ_ONCE(*pte_offset_kernel(&pmd, address));
+	return pks_call_fault_callback(regs, address, write,
+				       pte_flags_pkey(pte_val(pte)));
+}
+
 /*
  * pks_write_pkrs() - Write the pkrs of the current CPU
  * @new_pkrs: New value to write to the current CPU register
diff --git a/include/linux/pks.h b/include/linux/pks.h
index 9f18f8b4cbb1..d0d8bf1aaa1d 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -34,6 +34,9 @@ static inline void pks_set_readwrite(u8 pkey)
 	pks_update_protection(pkey, PKEY_READ_WRITE);
 }
 
+typedef bool (*pks_key_callback)(struct pt_regs *regs, unsigned long address,
+				 bool write);
+
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void pks_set_noaccess(u8 pkey) {}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 20/45] mm/pkeys: PKS testing, add a fault call back
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (18 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 19/45] mm/pkeys: Introduce PKS fault callbacks ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 21/45] mm/pkeys: PKS testing, add pks_set_*() tests ira.weiny
                   ` (25 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

PKS testing will need to know when a fault occurs due to its actions so
that it can properly verify functionality.

Install a PKS fault handler for the PKS test pkey.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	New Patch
---
 arch/x86/mm/pkeys.c | 6 +++++-
 include/linux/pks.h | 7 +++++++
 lib/pks/pks_test.c  | 6 ++++++
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index a3b27b7811da..39867d39460b 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -243,7 +243,11 @@ static DEFINE_PER_CPU(u32, pkrs_cache);
  *	#endif
  *	};
  */
-static const pks_key_callback pks_key_callbacks[PKS_KEY_MAX] = { 0 };
+static const pks_key_callback pks_key_callbacks[PKS_KEY_MAX] = {
+#ifdef CONFIG_PKS_TEST
+	[PKS_KEY_TEST]		= pks_test_fault_callback,
+#endif
+};
 
 static bool pks_call_fault_callback(struct pt_regs *regs, unsigned long address,
 				    bool write, u16 key)
diff --git a/include/linux/pks.h b/include/linux/pks.h
index d0d8bf1aaa1d..208f88fcb48c 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -44,4 +44,11 @@ static inline void pks_set_readwrite(u8 pkey) {}
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
+#ifdef CONFIG_PKS_TEST
+
+bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
+			     bool write);
+
+#endif /* CONFIG_PKS_TEST */
+
 #endif /* _LINUX_PKS_H */
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 2fc92aaa54e8..37f2cd7d0f56 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -85,6 +85,12 @@ static void debug_result(const char *label, int test_num,
 		     sd->last_test_pass ? "PASS" : "FAIL");
 }
 
+bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
+			     bool write)
+{
+	return false;
+}
+
 static void *alloc_test_page(u8 pkey)
 {
 	return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 21/45] mm/pkeys: PKS testing, add pks_set_*() tests
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (19 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 20/45] mm/pkeys: PKS testing, add a fault call back ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 22/45] mm/pkeys: PKS testing, test context switching ira.weiny
                   ` (24 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Test that the pks_set_*() functions operate as intended.

First, verify that the pkey was properly set in the PTE.

Second, use the fault callback mechanism to detect if a fault occurred
when expected and, if so, clear the fault.

The test iterates over each of the following test cases.

	PKS_TEST_NO_ACCESS,	WRITE,	FAULT_EXPECTED
	PKS_TEST_NO_ACCESS,	READ,	FAULT_EXPECTED

	PKS_TEST_RDWR,		WRITE,	NO_FAULT_EXPECTED
	PKS_TEST_RDWR,		READ,	NO_FAULT_EXPECTED

Add documentation.
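
Condensed, a single iteration of the access test reduces to the
following handshake between run_access_test() and the fault callback
(error handling elided; see the diff below):

	pks_set_noaccess(ctx->pkey);	/* or pks_set_readwrite() per test case */
	ctx->fault_seen = false;
	set_context_for_fault(ctx);
	memcpy(ptr, ctx->data, 8);	/* may fault; the callback grants RW
					 * and sets ctx->fault_seen */
	pass = (test->fault == ctx->fault_seen);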

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Update commit message
	Clarify use of global state for faults to be used by all tests
	Add test to test_pks user app
	Remove an incorrect comment in the kdoc
	Change pkey type to u8
	From Dave Hansen
		s/pks_mk*/pks_set*/
	From Rick Edgecombe
		Use standard fault callback instead of the custom PKS
		test one

Changes for V8
	Remove readonly test, as that patch is not needed for PMEM
	Split this off into a patch which follows the pks_mk_*()
		patches.  Thus allowing for a better view of how the
		test works compared to the functionality added with
		those patches.
	Remove unneeded prints
---
 lib/pks/pks_test.c                     | 161 ++++++++++++++++++++++++-
 tools/testing/selftests/x86/test_pks.c |   5 +-
 2 files changed, 162 insertions(+), 4 deletions(-)

diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 37f2cd7d0f56..3e14c621bde6 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -33,11 +33,14 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
+#include <linux/pgtable.h>
+#include <linux/pks.h>
 #include <linux/pks-keys.h>
 
 #define PKS_TEST_MEM_SIZE (PAGE_SIZE)
 
 #define CHECK_DEFAULTS		0
+#define RUN_SINGLE		1
 #define RUN_CRASH_TEST		9
 
 static struct dentry *pks_test_dentry;
@@ -48,6 +51,7 @@ struct pks_test_ctx {
 	u8 pkey;
 	char data[64];
 	void *test_page;
+	bool fault_seen;
 };
 
 static void debug_context(const char *label, struct pks_test_ctx *ctx)
@@ -85,10 +89,103 @@ static void debug_result(const char *label, int test_num,
 		     sd->last_test_pass ? "PASS" : "FAIL");
 }
 
+/* Global data protected by test_run_lock */
+struct pks_test_ctx *g_ctx_under_test;
+
+/*
+ * Call set_context_for_fault() after the context has been set up and prior to
+ * the expected fault.
+ */
+static void set_context_for_fault(struct pks_test_ctx *ctx)
+{
+	g_ctx_under_test = ctx;
+	/* Ensure the state of the global context is correct prior to a fault */
+	barrier();
+}
+
 bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
 			     bool write)
 {
-	return false;
+	pr_debug("PKS Fault callback: ctx %p\n", g_ctx_under_test);
+
+	if (!g_ctx_under_test)
+		return false;
+
+	pks_set_readwrite(g_ctx_under_test->pkey);
+	g_ctx_under_test->fault_seen = true;
+	return true;
+}
+
+enum pks_access_mode {
+	PKS_TEST_NO_ACCESS,
+	PKS_TEST_RDWR,
+};
+
+#define PKS_WRITE true
+#define PKS_READ false
+#define PKS_FAULT_EXPECTED true
+#define PKS_NO_FAULT_EXPECTED false
+
+static char *get_mode_str(enum pks_access_mode mode)
+{
+	switch (mode) {
+	case PKS_TEST_NO_ACCESS:
+		return "No Access";
+	case PKS_TEST_RDWR:
+		return "Read Write";
+	}
+
+	return "";
+}
+
+struct pks_access_test {
+	enum pks_access_mode mode;
+	bool write;
+	bool fault;
+};
+
+static struct pks_access_test pkey_test_ary[] = {
+	{ PKS_TEST_NO_ACCESS,     PKS_WRITE,  PKS_FAULT_EXPECTED },
+	{ PKS_TEST_NO_ACCESS,     PKS_READ,   PKS_FAULT_EXPECTED },
+
+	{ PKS_TEST_RDWR,          PKS_WRITE,  PKS_NO_FAULT_EXPECTED },
+	{ PKS_TEST_RDWR,          PKS_READ,   PKS_NO_FAULT_EXPECTED },
+};
+
+static bool run_access_test(struct pks_test_ctx *ctx,
+			   struct pks_access_test *test,
+			   void *ptr)
+{
+	switch (test->mode) {
+	case PKS_TEST_NO_ACCESS:
+		pks_set_noaccess(ctx->pkey);
+		break;
+	case PKS_TEST_RDWR:
+		pks_set_readwrite(ctx->pkey);
+		break;
+	default:
+		pr_debug("BUG in test, invalid mode\n");
+		return false;
+	}
+
+	ctx->fault_seen = false;
+	set_context_for_fault(ctx);
+
+	if (test->write)
+		memcpy(ptr, ctx->data, 8);
+	else
+		memcpy(ctx->data, ptr, 8);
+
+	if (test->fault != ctx->fault_seen) {
+		pr_err("pkey test FAILED: mode %s; write %s; fault %s != %s\n",
+			get_mode_str(test->mode),
+			test->write ? "TRUE" : "FALSE",
+			test->fault ? "YES" : "NO",
+			ctx->fault_seen ? "YES" : "NO");
+		return false;
+	}
+
+	return true;
 }
 
 static void *alloc_test_page(u8 pkey)
@@ -108,6 +205,37 @@ static void free_ctx(struct pks_test_ctx *ctx)
 	kfree(ctx);
 }
 
+static bool test_ctx(struct pks_test_ctx *ctx)
+{
+	bool rc = true;
+	int i;
+	u8 pkey;
+	void *ptr = ctx->test_page;
+	pte_t *ptep = NULL;
+	unsigned int level;
+
+	ptep = lookup_address((unsigned long)ptr, &level);
+	if (!ptep) {
+		pr_err("Failed to lookup address???\n");
+		return false;
+	}
+
+	pkey = pte_flags_pkey(ptep->pte);
+	if (pkey != ctx->pkey) {
+		pr_err("invalid pkey found: %u, test_pkey: %u\n",
+			pkey, ctx->pkey);
+		return false;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(pkey_test_ary); i++) {
+		/* sticky fail */
+		if (!run_access_test(ctx, &pkey_test_ary[i], ptr))
+			rc = false;
+	}
+
+	return rc;
+}
+
 static struct pks_test_ctx *alloc_ctx(u8 pkey)
 {
 	struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
@@ -139,6 +267,23 @@ static void set_ctx_data(struct pks_session_data *sd, struct pks_test_ctx *ctx)
 	sd->ctx = ctx;
 }
 
+static bool run_single(struct pks_session_data *sd)
+{
+	struct pks_test_ctx *ctx;
+	bool rc;
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx))
+		return false;
+
+	set_ctx_data(sd, ctx);
+
+	rc = test_ctx(ctx);
+	pks_set_noaccess(ctx->pkey);
+
+	return rc;
+}
+
 static void crash_it(struct pks_session_data *sd)
 {
 	struct pks_test_ctx *ctx;
@@ -203,6 +348,12 @@ static ssize_t pks_read_file(struct file *file, char __user *user_buf,
 	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
 }
 
+static void cleanup_test(void)
+{
+	g_ctx_under_test = NULL;
+	mutex_unlock(&test_run_lock);
+}
+
 static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 			      size_t count, loff_t *ppos)
 {
@@ -235,6 +386,10 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 		pr_debug("check defaults test: 0x%lx\n", PKS_INIT_VALUE);
 		on_each_cpu(check_pkey_settings, file->private_data, 1);
 		break;
+	case RUN_SINGLE:
+		pr_debug("Single key\n");
+		sd->last_test_pass = run_single(file->private_data);
+		break;
 	default:
 		pr_debug("Unknown test\n");
 		sd->last_test_pass = false;
@@ -251,7 +406,7 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 	 * Normal exit; clear up the locking flag
 	 */
 	sd->need_unlock = false;
-	mutex_unlock(&test_run_lock);
+	cleanup_test();
 	debug_result("Test complete", test_num, sd);
 	return count;
 }
@@ -282,7 +437,7 @@ static int pks_release_file(struct inode *inode, struct file *file)
 	 * not exit normally.
 	 */
 	if (sd->need_unlock)
-		mutex_unlock(&test_run_lock);
+		cleanup_test();
 	free_ctx(sd->ctx);
 	kfree(sd);
 	return 0;
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
index df5bde9bfdbe..2c10b6c50416 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -31,6 +31,7 @@
 
 /* Values from the kernel */
 #define CHECK_DEFAULTS		"0"
+#define RUN_SINGLE		"1"
 #define RUN_CRASH_TEST		"9"
 
 time_t g_start_time;
@@ -53,6 +54,7 @@ static int do_simple_test(const char *debugfs_str);
  */
 enum {
 	TEST_DEFAULTS = 0,
+	TEST_SINGLE,
 	MAX_TESTS,
 } tests;
 
@@ -64,7 +66,8 @@ struct test_item {
 	const char *debugfs_str;
 	int (*test_fn)(const char *debugfs_str);
 } test_list[] = {
-	{ "check_defaults", CHECK_DEFAULTS, do_simple_test }
+	{ "check_defaults", CHECK_DEFAULTS, do_simple_test },
+	{ "single", RUN_SINGLE, do_simple_test }
 };
 
 static char *get_test_name(int test_num)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 22/45] mm/pkeys: PKS testing, test context switching
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (20 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 21/45] mm/pkeys: PKS testing, add pks_set_*() tests ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 23/45] x86/entry: Add auxiliary pt_regs space ira.weiny
                   ` (23 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

PKS software must maintain the PKRS value during a context switch.  Test
this by running two processes simultaneously on the same CPU while using
different permissions for the same pkey.

Leverage test_pks to create two processes scheduled on the same CPU.

On the kernel side, create two commands: one to set up the pkey prior to
the context switch (arm context switch) and a second to check the pkey
after the context switch (check context switch).
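
Condensed, the user space flow is (see do_context_switch() below; pipes
synchronize the two processes and both are pinned to the same CPU):

	/* child */
	write(fd, RUN_SINGLE, 1);	/* allocate the pkey, run the access test */
	write(fd, ARM_CTX_SWITCH, 1);	/* leave the test pkey read/write */
	/* block on a pipe: context switch out */

	/* parent (separate kernel file context) */
	write(fd, RUN_SINGLE, 1);	/* exercise the same pkey, perturbing PKRS */
	/* signal the child via the pipe */

	/* child */
	write(fd, CHECK_CTX_SWITCH, 1);	/* kernel verifies PKRS was restored */

Per the help text below, the test can then be invoked standalone, e.g.
'./test_pks -c 1 context_switch' to pin it to CPU 1.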

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	From Rick Edgecombe
		Ensure the parent/child threads don't cause each other
			to hang if one experiences a failure
	Adjust for the new test_pks user space component
	Adjust the debug output for '-d' option
	s/pks_mk_*/pks_set_*/
	Use new set_file_data() call

Changes for V8
	Split this off from the main testing patch
	Remove unneeded prints
---
 lib/pks/pks_test.c                     |  54 +++++++++
 tools/testing/selftests/x86/test_pks.c | 157 ++++++++++++++++++++++++-
 2 files changed, 207 insertions(+), 4 deletions(-)

diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 3e14c621bde6..16aa44cf498a 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -37,10 +37,14 @@
 #include <linux/pks.h>
 #include <linux/pks-keys.h>
 
+#include <uapi/asm-generic/mman-common.h>
+
 #define PKS_TEST_MEM_SIZE (PAGE_SIZE)
 
 #define CHECK_DEFAULTS		0
 #define RUN_SINGLE		1
+#define ARM_CTX_SWITCH		2
+#define CHECK_CTX_SWITCH	3
 #define RUN_CRASH_TEST		9
 
 static struct dentry *pks_test_dentry;
@@ -336,6 +340,48 @@ static void arm_or_run_crash_test(struct pks_session_data *sd)
 	crash_it(sd);
 }
 
+static void arm_ctx_switch(struct pks_session_data *sd)
+{
+	struct pks_test_ctx *ctx;
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx)) {
+		pr_err("Failed to allocate a context\n");
+		sd->last_test_pass = false;
+		return;
+	}
+
+	set_ctx_data(sd, ctx);
+
+	/* Ensure a known state to test context switch */
+	pks_set_readwrite(ctx->pkey);
+}
+
+static void check_ctx_switch(struct pks_session_data *sd)
+{
+	struct pks_test_ctx *ctx = sd->ctx;
+	unsigned long reg_pkrs;
+	int access;
+
+	sd->last_test_pass = true;
+
+	if (!ctx) {
+		pr_err("No Context switch configured\n");
+		sd->last_test_pass = false;
+		return;
+	}
+
+	rdmsrl(MSR_IA32_PKRS, reg_pkrs);
+
+	access = (reg_pkrs >> PKR_PKEY_SHIFT(ctx->pkey)) &
+		  PKEY_ACCESS_MASK;
+	if (access != 0) {
+		pr_err("Context switch check failed: pkey %u: 0x%x reg: 0x%lx\n",
+			ctx->pkey, access, reg_pkrs);
+		sd->last_test_pass = false;
+	}
+}
+
 static ssize_t pks_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
 {
@@ -390,6 +436,14 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 		pr_debug("Single key\n");
 		sd->last_test_pass = run_single(file->private_data);
 		break;
+	case ARM_CTX_SWITCH:
+		pr_debug("Arming Context switch test\n");
+		arm_ctx_switch(file->private_data);
+		break;
+	case CHECK_CTX_SWITCH:
+		pr_debug("Checking Context switch test\n");
+		check_ctx_switch(file->private_data);
+		break;
 	default:
 		pr_debug("Unknown test\n");
 		sd->last_test_pass = false;
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
index 2c10b6c50416..5a32645a6e6d 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -17,6 +17,7 @@
  *	...
  */
 #define _GNU_SOURCE
+#include <sched.h>
 #include <unistd.h>
 #include <getopt.h>
 #include <stdio.h>
@@ -32,10 +33,13 @@
 /* Values from the kernel */
 #define CHECK_DEFAULTS		"0"
 #define RUN_SINGLE		"1"
+#define ARM_CTX_SWITCH		"2"
+#define CHECK_CTX_SWITCH	"3"
 #define RUN_CRASH_TEST		"9"
 
 time_t g_start_time;
 int g_debug;
+unsigned long g_cpu;
 
 #define PRINT_DEBUG(fmt, ...) \
 	do { \
@@ -47,6 +51,7 @@ int g_debug;
 	fprintf(stderr, "%s: " fmt, __func__, ##__VA_ARGS__)
 
 static int do_simple_test(const char *debugfs_str);
+static int do_context_switch(const char *debugfs_str);
 
 /*
  * The crash test is a special case which is not included in the run all
@@ -55,6 +60,7 @@ static int do_simple_test(const char *debugfs_str);
 enum {
 	TEST_DEFAULTS = 0,
 	TEST_SINGLE,
+	TEST_CTX_SWITCH,
 	MAX_TESTS,
 } tests;
 
@@ -67,7 +73,8 @@ struct test_item {
 	int (*test_fn)(const char *debugfs_str);
 } test_list[] = {
 	{ "check_defaults", CHECK_DEFAULTS, do_simple_test },
-	{ "single", RUN_SINGLE, do_simple_test }
+	{ "single", RUN_SINGLE, do_simple_test },
+	{ "context_switch", ARM_CTX_SWITCH, do_context_switch }
 };
 
 static char *get_test_name(int test_num)
@@ -101,6 +108,7 @@ static void print_help_and_exit(char *argv0)
 	printf("Usage: %s [-h,-d] [test]\n", argv0);
 	printf("	--help,-h   This help\n");
 	printf("	--debug,-d  Output kernel debug via dynamic debug if available\n");
+	printf("	--cpu,-c <cpu>  Use 'cpu' for context switch (default 0)\n");
 	printf("\n");
 	printf("        Run all PKS tests or the [test] specified.\n");
 	printf("\n");
@@ -116,6 +124,143 @@ static void print_help_and_exit(char *argv0)
 	printf("\n");
 }
 
+/*
+ * debugfs_str is ignored for this test.
+ */
+static int do_context_switch(const char *debugfs_str)
+{
+	int switch_done[2];
+	int setup_done[2];
+	cpu_set_t cpuset;
+	char result[32];
+	char done = 'P';
+	int rc = 0;
+	pid_t pid;
+	int fd;
+
+	if (g_cpu >= sysconf(_SC_NPROCESSORS_ONLN)) {
+		PRINT_ERROR("CPU %lu is invalid\n", g_cpu);
+		g_cpu = sysconf(_SC_NPROCESSORS_ONLN) - 1;
+		PRINT_ERROR("   running on max CPU: %lu\n", g_cpu);
+	}
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(g_cpu, &cpuset);
+	/*
+	 * Ensure the two processes run on the same CPU so that they go through
+	 * a context switch.
+	 */
+	sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
+
+	if (pipe(setup_done)) {
+		PRINT_ERROR("ERROR: Failed to create pipe\n");
+		return -EIO;
+	}
+	if (pipe(switch_done)) {
+		PRINT_ERROR("ERROR: Failed to create pipe\n");
+		return -EIO;
+	}
+
+	fd = open(PKS_TEST_FILE, O_RDWR);
+	if (fd < 0) {
+		PRINT_DEBUG("Failed to open test file : %s\n", PKS_TEST_FILE);
+		return -ENOENT;
+	}
+
+	/* Avoid duplicated output after fork */
+	fflush(stderr);
+	fflush(stdout);
+
+	pid = fork();
+	if (pid == 0) {
+		char done = 'P';
+
+		g_cpu = sched_getcpu();
+		PRINT_DEBUG("Child: running on cpu %lu...\n", g_cpu);
+
+		/* Allocate and run test. */
+		write(fd, RUN_SINGLE, 1);
+
+		/* Arm for context switch test */
+		write(fd, ARM_CTX_SWITCH, 1);
+
+		PRINT_DEBUG("Child: Tell parent to go\n");
+		write(setup_done[1], &done, sizeof(done));
+
+		/* Context switch out... */
+		PRINT_DEBUG("Child: Waiting for parent...\n");
+		read(switch_done[0], &done, sizeof(done));
+
+		/* Check msr restored */
+		PRINT_DEBUG("Child: Checking result\n");
+		rc = write(fd, CHECK_CTX_SWITCH, 1);
+		if (rc < 0) {
+			if (errno == ENOENT) {
+				sprintf(result, "SKIP");
+				done = 'S';
+			} else {
+				sprintf(result, "FAIL");
+				done = 'F';
+			}
+			goto child_exit;
+		}
+
+		read(fd, result, 10);
+		if (strncmp(result, "PASS", 4))
+			done = 'F';
+
+child_exit:
+		PRINT_DEBUG("Child: Result (%c) %s\n", done, result);
+
+		/* Signal result */
+		write(setup_done[1], &done, sizeof(done));
+		close(fd);
+
+		exit(0);
+	}
+
+	PRINT_DEBUG("Parent: Waiting for child\n");
+	read(setup_done[0], &done, sizeof(done));
+	g_cpu = sched_getcpu();
+	PRINT_DEBUG("Parent: running on cpu %lu\n", g_cpu);
+
+	/* The parent needs a unique file context within the kernel */
+	close(fd);
+	fd = open(PKS_TEST_FILE, O_RDWR);
+	if (fd < 0) {
+		PRINT_ERROR("FATAL ERROR: cannot open %s\n", PKS_TEST_FILE);
+		PRINT_DEBUG("Parent: Signaling child 'fail'\n");
+		done = 'F';
+		write(switch_done[1], &done, sizeof(done));
+		return -ENOENT;
+	}
+
+	/* run test with the same pkey */
+	rc = write(fd, RUN_SINGLE, 1);
+
+	PRINT_DEBUG("Parent: Signaling child\n");
+	write(switch_done[1], &done, sizeof(done));
+
+	if (rc < 0) {
+		rc = -errno;
+		goto close_file;
+	}
+	rc = 0;
+
+	/* Wait for result */
+	read(setup_done[0], &done, sizeof(done));
+	if (done == 'S')
+		rc = -ENOENT;
+	if (done == 'F')
+		rc = -EFAULT;
+
+	PRINT_DEBUG("Parent: exiting with rc (%c) %d\n", done, rc);
+
+close_file:
+	close(fd);
+	return rc;
+}
+
 /*
  * Do a simple test of writing the debugfs value and reading back for 'PASS'
  */
@@ -307,9 +452,10 @@ int main(int argc, char *argv[])
 
 	while (1) {
 		static struct option long_options[] = {
-			{"help",	no_argument,	0,	'h' },
-			{"debug",	no_argument,	0,	'd' },
-			{0,		0,		0,	0 }
+			{"help",	no_argument,		0,	'h' },
+			{"debug",	no_argument,		0,	'd' },
+			{"cpu",		required_argument,	0,	'c' },
+			{0,		0,			0,	0 }
 		};
 		int option_index = 0;
 		int c;
@@ -325,6 +471,9 @@ int main(int argc, char *argv[])
 		case 'd':
 			g_debug++;
 			break;
+		case 'c':
+			g_cpu = strtoul(optarg, NULL, 0);
+			break;
 		default:
 			print_help_and_exit(argv[0]);
 			exit(-1);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 23/45] x86/entry: Add auxiliary pt_regs space
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (21 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 22/45] mm/pkeys: PKS testing, test context switching ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:19 ` [PATCH V9 24/45] entry: Split up irqentry_exit_cond_resched() ira.weiny
                   ` (22 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is not managed by XSAVE.  Therefore, for the MSR to survive
an exception, the current CPU MSR value needs to be saved somewhere on
entry to the exception and restored when returning to the previous
context.

Two possible places for preserving this state were considered,
irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2] However, Andy
Lutomirski came up with a way to hide additional values on the stack
which could be accessed as "extended_pt_regs".[3] This method allows any
function with current access to pt_regs to obtain access to the extra
information without expanding the use of irqentry_state_t and leaving
pt_regs intact for compatibility with outside tools like BPF.

Prepare the assembly code to add a hidden auxiliary pt_regs space.  To
simplify, the assembly code only adds space on the stack as defined by
the C code which needs it.  The use of this space is left to the C code
which is required to select ARCH_HAS_PTREGS_AUXILIARY to enable this
support.

Each nested exception gets another copy of this auxiliary space allowing
for any number of levels of exception handling.

Initially the space is left empty and results in no code changes because
ARCH_HAS_PTREGS_AUXILIARY is not set.  Subsequent patches adding data to
pt_regs_auxiliary must set ARCH_HAS_PTREGS_AUXILIARY or a build failure
will occur.  The use of ARCH_HAS_PTREGS_AUXILIARY also avoids the
introduction of 2 instructions (addq/subq) on every entry call when the
extra space is not needed.

32bit is specifically excluded as the current consumer of this, PKS,
will not support 32bit either.
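
For illustration only (the member and source of the value below are
made up, not part of this series): once a feature adds a member to
struct pt_regs_auxiliary and selects ARCH_HAS_PTREGS_AUXILIARY, any
function holding a pt_regs pointer can reach the hidden space:

	static void example_save_state(struct pt_regs *regs)
	{
		struct pt_regs_auxiliary *aux = &to_extended_pt_regs(regs)->aux;

		/* 'example_state' is a hypothetical pt_regs_auxiliary member */
		aux->example_state = this_cpu_read(example_state_cache);
	}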

Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
aided in its development.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t
[3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9:
	Update commit message

Changes for V8:
	Exclude 32bit
	Introduce ARCH_HAS_PTREGS_AUXILIARY to optimize this away when
		not needed.
	From Thomas
		s/EXTENDED_PT_REGS_SIZE/PT_REGS_AUX_SIZE
		Fix up PTREGS_AUX_SIZE macro to be based on the
			structures and used in assembly code via the
			nifty asm-offset macros
		Bound calls into C code with [PUSH|POP]_PTREGS_AUXILIARY
			instead of using a macro 'call'
	Split this patch out and put the PKS specific stuff in a
		separate patch

Changes for V7:
	Rebased to 5.14 entry code
	declare write_pkrs() in pks.h
	s/INIT_PKRS_VALUE/pkrs_init_value
	Remove unnecessary INIT_PKRS_VALUE def
	s/pkrs_save_set_irq/pkrs_save_irq/
		The initial value for exceptions is best managed
		completely within the pkey code.
---
 arch/x86/Kconfig                 |  4 ++++
 arch/x86/entry/calling.h         | 20 ++++++++++++++++++++
 arch/x86/entry/entry_64.S        | 22 ++++++++++++++++++++++
 arch/x86/entry/entry_64_compat.S |  6 ++++++
 arch/x86/include/asm/ptrace.h    | 18 ++++++++++++++++++
 arch/x86/kernel/asm-offsets_64.c | 15 +++++++++++++++
 arch/x86/kernel/head_64.S        |  6 ++++++
 7 files changed, 91 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 459948622a73..64348c94477e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1878,6 +1878,10 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 
 	  If unsure, say y.
 
+config ARCH_HAS_PTREGS_AUXILIARY
+	depends on X86_64
+	bool
+
 choice
 	prompt "TSX enable mode"
 	depends on CPU_SUP_INTEL
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index a4c061fb7c6e..d0ebf9b069c9 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -63,6 +63,26 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
+
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+.macro PUSH_PTREGS_AUXILIARY
+	/* add space for pt_regs_auxiliary */
+	subq $PTREGS_AUX_SIZE, %rsp
+.endm
+
+.macro POP_PTREGS_AUXILIARY
+	/* remove space for pt_regs_auxiliary */
+	addq $PTREGS_AUX_SIZE, %rsp
+.endm
+
+#else
+
+#define PUSH_PTREGS_AUXILIARY
+#define POP_PTREGS_AUXILIARY
+
+#endif
+
 .macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 466df3e50276..0684a8093965 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -332,7 +332,9 @@ SYM_CODE_END(ret_from_fork)
 		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
 	.endif
 
+	PUSH_PTREGS_AUXILIARY
 	call	\cfunc
+	POP_PTREGS_AUXILIARY
 
 	jmp	error_return
 .endm
@@ -435,7 +437,9 @@ SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
+	PUSH_PTREGS_AUXILIARY
 	call	\cfunc
+	POP_PTREGS_AUXILIARY
 
 	jmp	paranoid_exit
 
@@ -496,7 +500,9 @@ SYM_CODE_START(\asmsym)
 	 * stack.
 	 */
 	movq	%rsp, %rdi		/* pt_regs pointer */
+	PUSH_PTREGS_AUXILIARY
 	call	vc_switch_off_ist
+	POP_PTREGS_AUXILIARY
 	movq	%rax, %rsp		/* Switch to new stack */
 
 	UNWIND_HINT_REGS
@@ -507,7 +513,9 @@ SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
+	PUSH_PTREGS_AUXILIARY
 	call	kernel_\cfunc
+	POP_PTREGS_AUXILIARY
 
 	/*
 	 * No need to switch back to the IST stack. The current stack is either
@@ -542,7 +550,9 @@ SYM_CODE_START(\asmsym)
 	movq	%rsp, %rdi		/* pt_regs pointer into first argument */
 	movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
 	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
+	PUSH_PTREGS_AUXILIARY
 	call	\cfunc
+	POP_PTREGS_AUXILIARY
 
 	jmp	paranoid_exit
 
@@ -784,7 +794,9 @@ SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
 	UNWIND_HINT_REGS
 
+	PUSH_PTREGS_AUXILIARY
 	call	xen_pv_evtchn_do_upcall
+	POP_PTREGS_AUXILIARY
 
 	jmp	error_return
 SYM_CODE_END(exc_xen_hypervisor_callback)
@@ -984,7 +996,9 @@ SYM_CODE_START_LOCAL(error_entry)
 	/* Put us onto the real thread stack. */
 	popq	%r12				/* save return addr in %12 */
 	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
+	PUSH_PTREGS_AUXILIARY
 	call	sync_regs
+	POP_PTREGS_AUXILIARY
 	movq	%rax, %rsp			/* switch stack */
 	ENCODE_FRAME_POINTER
 	pushq	%r12
@@ -1040,7 +1054,9 @@ SYM_CODE_START_LOCAL(error_entry)
 	 * as if we faulted immediately after IRET.
 	 */
 	mov	%rsp, %rdi
+	PUSH_PTREGS_AUXILIARY
 	call	fixup_bad_iret
+	POP_PTREGS_AUXILIARY
 	mov	%rax, %rsp
 	jmp	.Lerror_entry_from_usermode_after_swapgs
 SYM_CODE_END(error_entry)
@@ -1146,7 +1162,9 @@ SYM_CODE_START(asm_exc_nmi)
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
+	PUSH_PTREGS_AUXILIARY
 	call	exc_nmi
+	POP_PTREGS_AUXILIARY
 
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
@@ -1182,6 +1200,8 @@ SYM_CODE_START(asm_exc_nmi)
 	 * +---------------------------------------------------------+
 	 * | pt_regs                                                 |
 	 * +---------------------------------------------------------+
+	 * | (Optionally) pt_regs_auxiliary                          |
+	 * +---------------------------------------------------------+
 	 *
 	 * The "original" frame is used by hardware.  Before re-enabling
 	 * NMIs, we need to be done with it, and we need to leave enough
@@ -1358,7 +1378,9 @@ end_repeat_nmi:
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
+	PUSH_PTREGS_AUXILIARY
 	call	exc_nmi
+	POP_PTREGS_AUXILIARY
 
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 0051cf5c792d..c6859d8acae4 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -136,7 +136,9 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
 .Lsysenter_flags_fixed:
 
 	movq	%rsp, %rdi
+	PUSH_PTREGS_AUXILIARY
 	call	do_SYSENTER_32
+	POP_PTREGS_AUXILIARY
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -253,7 +255,9 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 	UNWIND_HINT_REGS
 
 	movq	%rsp, %rdi
+	PUSH_PTREGS_AUXILIARY
 	call	do_fast_syscall_32
+	POP_PTREGS_AUXILIARY
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -410,6 +414,8 @@ SYM_CODE_START(entry_INT80_compat)
 	cld
 
 	movq	%rsp, %rdi
+	PUSH_PTREGS_AUXILIARY
 	call	do_int80_syscall_32
+	POP_PTREGS_AUXILIARY
 	jmp	swapgs_restore_regs_and_return_to_usermode
 SYM_CODE_END(entry_INT80_compat)
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 703663175a5a..5e7f6e48c0ab 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_X86_PTRACE_H
 #define _ASM_X86_PTRACE_H
 
+#include <linux/container_of.h>
 #include <asm/segment.h>
 #include <asm/page_types.h>
 #include <uapi/asm/ptrace.h>
@@ -91,6 +92,23 @@ struct pt_regs {
 /* top of stack page */
 };
 
+/*
+ * NOTE: Features which add data to pt_regs_auxiliary must select
+ * ARCH_HAS_PTREGS_AUXILIARY.  Failure to do so will result in a build failure.
+ */
+struct pt_regs_auxiliary {
+};
+
+struct pt_regs_extended {
+	struct pt_regs_auxiliary aux;
+	struct pt_regs pt_regs __aligned(8);
+};
+
+static inline struct pt_regs_extended *to_extended_pt_regs(struct pt_regs *regs)
+{
+	return container_of(regs, struct pt_regs_extended, pt_regs);
+}
+
 #endif /* !__i386__ */
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index b14533af7676..66f08ac3507a 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -4,6 +4,7 @@
 #endif
 
 #include <asm/ia32.h>
+#include <asm/ptrace.h>
 
 #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 #include <asm/kvm_para.h>
@@ -60,5 +61,19 @@ int main(void)
 	DEFINE(stack_canary_offset, offsetof(struct fixed_percpu_data, stack_canary));
 	BLANK();
 #endif
+
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+	/* Size of Auxiliary pt_regs data */
+	DEFINE(PTREGS_AUX_SIZE, sizeof(struct pt_regs_extended) -
+				sizeof(struct pt_regs));
+#else
+	/*
+	 * Adding data to struct pt_regs_auxiliary requires setting
+	 * ARCH_HAS_PTREGS_AUXILIARY
+	 */
+	BUILD_BUG_ON((sizeof(struct pt_regs_extended) -
+		      sizeof(struct pt_regs)) != 0);
+#endif
+
 	return 0;
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9c63fc5988cd..8418d9de8d70 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -336,8 +336,10 @@ SYM_CODE_START_NOALIGN(vc_boot_ghcb)
 	movq    %rsp, %rdi
 	movq	ORIG_RAX(%rsp), %rsi
 	movq	initial_vc_handler(%rip), %rax
+	PUSH_PTREGS_AUXILIARY
 	ANNOTATE_RETPOLINE_SAFE
 	call	*%rax
+	POP_PTREGS_AUXILIARY
 
 	/* Unwind pt_regs */
 	POP_REGS
@@ -414,7 +416,9 @@ SYM_CODE_START_LOCAL(early_idt_handler_common)
 	UNWIND_HINT_REGS
 
 	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
+	PUSH_PTREGS_AUXILIARY
 	call do_early_exception
+	POP_PTREGS_AUXILIARY
 
 	decl early_recursion_flag(%rip)
 	jmp restore_regs_and_return_to_kernel
@@ -438,7 +442,9 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
 	/* Call C handler */
 	movq    %rsp, %rdi
 	movq	ORIG_RAX(%rsp), %rsi
+	PUSH_PTREGS_AUXILIARY
 	call    do_vc_no_ghcb
+	POP_PTREGS_AUXILIARY
 
 	/* Unwind pt_regs */
 	POP_REGS
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 24/45] entry: Split up irqentry_exit_cond_resched()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (22 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 23/45] x86/entry: Add auxiliary pt_regs space ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-04-07  2:50   ` Ira Weiny
  2022-03-10 17:19 ` [PATCH V9 25/45] entry: Add calls for save/restore auxiliary pt_regs ira.weiny
                   ` (21 subsequent siblings)
  45 siblings, 1 reply; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Auxiliary pt_regs space needs to be manipulated by the generic
entry/exit code.

Normally irqentry_exit() would take care of handling any auxiliary
pt_regs on exit.  Unfortunately, the call to
irqentry_exit_cond_resched() from xen_pv_evtchn_do_upcall() bypasses the
normal irqentry_exit() call.  Because of this bypass,
irqentry_exit_cond_resched() is required to handle any auxiliary pt_regs
exit handling itself.  However, this prevents irqentry_exit() from
simply calling irqentry_exit_cond_resched() while maintaining control of
the auxiliary pt_regs.

Separate out the common functionality of irqentry_exit_cond_resched() so
that functionality can be used by irqentry_exit().  Add a pt_regs
parameter in anticipation of having irqentry_exit_cond_resched() handle
the auxiliary pt_regs separately from irqentry_exit().

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Update commit message

Changes for V8
	New Patch
---
 arch/x86/entry/common.c      | 2 +-
 include/linux/entry-common.h | 3 ++-
 kernel/entry/common.c        | 9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..f1ba770d035d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -309,7 +309,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
-		irqentry_exit_cond_resched();
+		irqentry_exit_cond_resched(regs);
 		instrumentation_end();
 		restore_inhcall(inhcall);
 	} else {
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index ddaffc983e62..14fd329847e7 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -451,10 +451,11 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
 
 /**
  * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
+ * @regs:	Pointer to pt_regs of interrupted context
  *
  * Conditional reschedule with additional sanity checks.
  */
-void irqentry_exit_cond_resched(void);
+void irqentry_exit_cond_resched(struct pt_regs *regs);
 
 void __irqentry_exit_cond_resched(void);
 #ifdef CONFIG_PREEMPT_DYNAMIC
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 490442a48332..f4210a7fc84d 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -395,7 +395,7 @@ void __irqentry_exit_cond_resched(void)
 DEFINE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 #endif
 
-void irqentry_exit_cond_resched(void)
+static void exit_cond_resched(void)
 {
 	if (IS_ENABLED(CONFIG_PREEMPTION)) {
 #ifdef CONFIG_PREEMPT_DYNAMIC
@@ -406,6 +406,11 @@ void irqentry_exit_cond_resched(void)
 	}
 }
 
+void irqentry_exit_cond_resched(struct pt_regs *regs)
+{
+	exit_cond_resched();
+}
+
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
 	lockdep_assert_irqs_disabled();
@@ -431,7 +436,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		}
 
 		instrumentation_begin();
-		irqentry_exit_cond_resched();
+		exit_cond_resched();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 25/45] entry: Add calls for save/restore auxiliary pt_regs
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (23 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 24/45] entry: Split up irqentry_exit_cond_resched() ira.weiny
@ 2022-03-10 17:19 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 26/45] x86/entry: Define arch_{save|restore}_auxiliary_pt_regs() ira.weiny
                   ` (20 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Some architectures have auxiliary pt_regs space which is available to
store extra information on the stack.  For ease of implementation the
common C code was left to fill in the data when needed.

Add calls to the architecture save and restore auxiliary pt_regs
functions.  Define empty calls for any architecture which does not have
auxiliary pt_regs.

NOTE: Due to the split nature of the Xen exit code,
irqentry_exit_cond_resched() requires an unbalanced call to
arch_restore_aux_pt_regs() regardless of the preemption configuration.
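
As a sketch of the hook contract (the state and helpers below are
hypothetical; x86 wires these hooks up to PKS in a later patch), an
architecture with auxiliary data would provide a pair along these
lines, using the x86 to_extended_pt_regs() helper from the earlier
patch:

	static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
	{
		/* stash the interrupted context's state in this frame's aux space */
		to_extended_pt_regs(regs)->aux.example_state = read_example_state();
	}

	static inline void arch_restore_aux_pt_regs(struct pt_regs *regs)
	{
		/* put the interrupted context's state back on exit */
		write_example_state(to_extended_pt_regs(regs)->aux.example_state);
	}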

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Update commit message

Changes for V8
	New patch which introduces a generic auxiliary pt_register save
		restore.
---
 include/linux/entry-common.h |  7 +++++++
 kernel/entry/common.c        | 16 ++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 14fd329847e7..b243f1cfd491 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -99,6 +99,13 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
 }
 #endif
 
+#ifndef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+static inline void arch_save_aux_pt_regs(struct pt_regs *regs) { }
+static inline void arch_restore_aux_pt_regs(struct pt_regs *regs) { }
+
+#endif
+
 /**
  * enter_from_user_mode - Establish state when coming from user mode
  *
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f4210a7fc84d..c778e9783361 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -323,7 +323,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 
 	if (user_mode(regs)) {
 		irqentry_enter_from_user_mode(regs);
-		return ret;
+		goto aux_save;
 	}
 
 	/*
@@ -362,7 +362,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 		instrumentation_end();
 
 		ret.exit_rcu = true;
-		return ret;
+		goto aux_save;
 	}
 
 	/*
@@ -377,6 +377,11 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	trace_hardirqs_off_finish();
 	instrumentation_end();
 
+aux_save:
+	instrumentation_begin();
+	arch_save_aux_pt_regs(regs);
+	instrumentation_end();
+
 	return ret;
 }
 
@@ -408,6 +413,7 @@ static void exit_cond_resched(void)
 
 void irqentry_exit_cond_resched(struct pt_regs *regs)
 {
+	arch_restore_aux_pt_regs(regs);
 	exit_cond_resched();
 }
 
@@ -415,6 +421,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
 	lockdep_assert_irqs_disabled();
 
+	instrumentation_begin();
+	arch_restore_aux_pt_regs(regs);
+	instrumentation_end();
+
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
 		irqentry_exit_to_user_mode(regs);
@@ -464,6 +474,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
+	arch_save_aux_pt_regs(regs);
 	instrumentation_end();
 
 	return irq_state;
@@ -472,6 +483,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
 	instrumentation_begin();
+	arch_restore_aux_pt_regs(regs);
 	ftrace_nmi_exit();
 	if (irq_state.lockdep) {
 		trace_hardirqs_on_prepare();
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 26/45] x86/entry: Define arch_{save|restore}_auxiliary_pt_regs()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (24 preceding siblings ...)
  2022-03-10 17:19 ` [PATCH V9 25/45] entry: Add calls for save/restore auxiliary pt_regs ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 27/45] x86/pkeys: Preserve PKRS MSR across exceptions ira.weiny
                   ` (19 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The x86 architecture supports the new auxiliary pt_regs space if
ARCH_HAS_PTREGS_AUXILIARY is enabled.

Define the callbacks within the x86 code required by the core entry code
when this support is enabled.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	New patch
---
 arch/x86/include/asm/entry-common.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 43184640b579..5fa5dd2d539c 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -95,4 +95,16 @@ static __always_inline void arch_exit_to_user_mode(void)
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
+{
+}
+
+static inline void arch_restore_aux_pt_regs(struct pt_regs *regs)
+{
+}
+
+#endif
+
 #endif
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 27/45] x86/pkeys: Preserve PKRS MSR across exceptions
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (25 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 26/45] x86/entry: Define arch_{save|restore}_auxiliary_pt_regs() ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 28/45] x86/fault: Print PKS MSR on fault ira.weiny
                   ` (18 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

PKRS is a per-logical-processor MSR which overlays additional protection
for pages which have been mapped with a protection key.  It is desirable
to protect PKS pages while executing exception code, while still
allowing the exception code itself to access PKS pages with the proper
pks_set_*() calls.

To do this the current thread value must be saved, the CPU MSR value set
to the default value during the exception, and the saved thread value
restored upon completion.  This can be done with the new auxiliary
pt_regs space.  Since each exception level gets its own copy of that
space, nested exceptions save and restore the value independently.

When PKS is configured, configure auxiliary pt_regs, add space to
pt_regs_auxiliary, and define save/restore functions.

Update the PKS test code to maintain functionality by clearing the saved
PKRS value before returning.

Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
aided in the development of the patch.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t
[3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9:
	Update commit message
	s/pks_thread_pkrs/pkrs/
	From Dave Hansen
		s/pks_saved_pkrs/pkrs/

Changes for V8:
	Tie this into the new generic auxiliary pt_regs support.
	Build this on the new irqentry_*() refactoring patches
	Split this patch off from the PKS portion of the auxiliary
		pt_regs functionality.
	From Thomas
		Fix noinstr mess
		s/write_pkrs/pks_write_pkrs
		s/pkrs_init_value/PKRS_INIT_VALUE
	Simplify the number and location of the save/restore calls.
		Cover entry from user space as well.

Changes for V7:
	Rebased to 5.14 entry code
	declare write_pkrs() in pks.h
	s/INIT_PKRS_VALUE/pkrs_init_value
	Remove unnecessary INIT_PKRS_VALUE def
	s/pkrs_save_set_irq/pkrs_save_irq/
		The initial value for exceptions is best managed
		completely within the pkey code.
---
 arch/x86/Kconfig                    |  3 ++-
 arch/x86/include/asm/entry-common.h |  3 +++
 arch/x86/include/asm/pks.h          |  4 ++++
 arch/x86/include/asm/ptrace.h       |  3 +++
 arch/x86/mm/pkeys.c                 | 32 +++++++++++++++++++++++++++++
 lib/pks/pks_test.c                  |  9 +++++++-
 6 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 64348c94477e..f13fd7a73535 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1879,8 +1879,9 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 	  If unsure, say y.
 
 config ARCH_HAS_PTREGS_AUXILIARY
+	def_bool y
 	depends on X86_64
-	bool
+	depends on ARCH_ENABLE_SUPERVISOR_PKEYS
 
 choice
 	prompt "TSX enable mode"
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 5fa5dd2d539c..803727b95b3a 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -8,6 +8,7 @@
 #include <asm/nospec-branch.h>
 #include <asm/io_bitmap.h>
 #include <asm/fpu/api.h>
+#include <asm/pks.h>
 
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_check_user_regs(struct pt_regs *regs)
@@ -99,10 +100,12 @@ static __always_inline void arch_exit_to_user_mode(void)
 
 static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
 {
+	pks_save_pt_regs(regs);
 }
 
 static inline void arch_restore_aux_pt_regs(struct pt_regs *regs)
 {
+	pks_restore_pt_regs(regs);
 }
 
 #endif
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index e9ad3ecd7ed0..b69e03a141fe 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -6,6 +6,8 @@
 
 void pks_setup(void);
 void x86_pkrs_load(struct thread_struct *thread);
+void pks_save_pt_regs(struct pt_regs *regs);
+void pks_restore_pt_regs(struct pt_regs *regs);
 
 bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
 			  unsigned long address);
@@ -14,6 +16,8 @@ bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
 
 static inline void pks_setup(void) { }
 static inline void x86_pkrs_load(struct thread_struct *thread) { }
+static inline void pks_save_pt_regs(struct pt_regs *regs) { }
+static inline void pks_restore_pt_regs(struct pt_regs *regs) { }
 
 static inline bool pks_handle_key_fault(struct pt_regs *regs,
 					unsigned long hw_error_code,
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 5e7f6e48c0ab..a3b00ad0d69b 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -97,6 +97,9 @@ struct pt_regs {
  * ARCH_HAS_PTREGS_AUXILIARY.  Failure to do so will result in a build failure.
  */
 struct pt_regs_auxiliary {
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+	u32 pkrs;
+#endif
 };
 
 struct pt_regs_extended {
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 39867d39460b..29885dfb0980 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -346,6 +346,38 @@ void x86_pkrs_load(struct thread_struct *thread)
 	pks_write_pkrs(thread->pkrs);
 }
 
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * To protect against exceptions having potentially privileged access to memory
+ * of an interrupted thread, save the current thread value and set the PKRS
+ * value to be used during the exception.
+ */
+void pks_save_pt_regs(struct pt_regs *regs)
+{
+	struct pt_regs_auxiliary *aux_pt_regs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+	aux_pt_regs->pkrs = current->thread.pkrs;
+	pks_write_pkrs(PKS_INIT_VALUE);
+}
+
+void pks_restore_pt_regs(struct pt_regs *regs)
+{
+	struct pt_regs_auxiliary *aux_pt_regs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+	current->thread.pkrs = aux_pt_regs->pkrs;
+	pks_write_pkrs(current->thread.pkrs);
+}
+
 /*
  * PKS is independent of PKU and either or both may be supported on a CPU.
  *
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 16aa44cf498a..86af2f61393d 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -34,11 +34,14 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/pgtable.h>
+#include <linux/pkeys.h>
 #include <linux/pks.h>
 #include <linux/pks-keys.h>
 
 #include <uapi/asm-generic/mman-common.h>
 
+#include <asm/ptrace.h>
+
 #define PKS_TEST_MEM_SIZE (PAGE_SIZE)
 
 #define CHECK_DEFAULTS		0
@@ -110,12 +113,16 @@ static void set_context_for_fault(struct pks_test_ctx *ctx)
 bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
 			     bool write)
 {
+	struct pt_regs_extended *ept_regs = to_extended_pt_regs(regs);
+	struct pt_regs_auxiliary *aux_pt_regs = &ept_regs->aux;
+	u32 pkrs = aux_pt_regs->pkrs;
+
 	pr_debug("PKS Fault callback: ctx %p\n", g_ctx_under_test);
 
 	if (!g_ctx_under_test)
 		return false;
 
-	pks_set_readwrite(g_ctx_under_test->pkey);
+	aux_pt_regs->pkrs = pkey_update_pkval(pkrs, g_ctx_under_test->pkey, 0);
 	g_ctx_under_test->fault_seen = true;
 	return true;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 28/45] x86/fault: Print PKS MSR on fault
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (26 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 27/45] x86/pkeys: Preserve PKRS MSR across exceptions ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 29/45] mm/pkeys: PKS testing, Add exception test ira.weiny
                   ` (17 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

If a PKS fault occurs it will be easier to debug if the PKS MSR value at
the time of the fault is known.

Add pks_show_regs() to __show_regs() to show the PKRS MSR on fault if
enabled.

An 'executive summary' of the pt_regs is saved in __die_header(), which
ensures that the registers of the first fault are preserved in the
event of multiple faults.  Teach this code about the extended pt_regs
such that the PKS code can get to the original pkrs value as well.
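With this in place the register dump of a PKS oops gains one line next
to the PKRU value.  For illustration only (the values below are made
up), the dump would then contain something like:

	PKRU: 55555554
	PKRS: 0x55555550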

Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	From Dave Hansen
		Move this output to __show_regs() next to the PKRU
			register dump

Changes for V8
	Split this into its own patch.
---
 arch/x86/include/asm/pks.h   |  3 +++
 arch/x86/kernel/dumpstack.c  | 32 ++++++++++++++++++++++++++++++--
 arch/x86/kernel/process_64.c |  1 +
 arch/x86/mm/pkeys.c          | 11 +++++++++++
 4 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index b69e03a141fe..de67d5b5a2af 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -8,6 +8,7 @@ void pks_setup(void);
 void x86_pkrs_load(struct thread_struct *thread);
 void pks_save_pt_regs(struct pt_regs *regs);
 void pks_restore_pt_regs(struct pt_regs *regs);
+void pks_show_regs(struct pt_regs *regs, const char *log_lvl);
 
 bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
 			  unsigned long address);
@@ -18,6 +19,8 @@ static inline void pks_setup(void) { }
 static inline void x86_pkrs_load(struct thread_struct *thread) { }
 static inline void pks_save_pt_regs(struct pt_regs *regs) { }
 static inline void pks_restore_pt_regs(struct pt_regs *regs) { }
+static inline void pks_show_regs(struct pt_regs *regs,
+				 const char *log_lvl) { }
 
 static inline bool pks_handle_key_fault(struct pt_regs *regs,
 					unsigned long hw_error_code,
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 53de044e5654..38be69d15431 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -27,8 +27,36 @@ int panic_on_unrecovered_nmi;
 int panic_on_io_nmi;
 static int die_counter;
 
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+static struct pt_regs_extended exec_summary_regs;
+
+static void save_exec_summary(struct pt_regs *regs)
+{
+	exec_summary_regs = *(to_extended_pt_regs(regs));
+}
+
+static struct pt_regs *retrieve_exec_summary(void)
+{
+	return &exec_summary_regs.pt_regs;
+}
+
+#else /* !CONFIG_ARCH_HAS_PTREGS_AUXILIARY */
+
 static struct pt_regs exec_summary_regs;
 
+static void save_exec_summary(struct pt_regs *regs)
+{
+	exec_summary_regs = *regs;
+}
+
+static struct pt_regs *retrieve_exec_summary(void)
+{
+	return &exec_summary_regs;
+}
+
+#endif /* CONFIG_ARCH_HAS_PTREGS_AUXILIARY */
+
 bool noinstr in_task_stack(unsigned long *stack, struct task_struct *task,
 			   struct stack_info *info)
 {
@@ -369,7 +397,7 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 	oops_exit();
 
 	/* Executive summary in case the oops scrolled away */
-	__show_regs(&exec_summary_regs, SHOW_REGS_ALL, KERN_DEFAULT);
+	__show_regs(retrieve_exec_summary(), SHOW_REGS_ALL, KERN_DEFAULT);
 
 	if (!signr)
 		return;
@@ -396,7 +424,7 @@ static void __die_header(const char *str, struct pt_regs *regs, long err)
 
 	/* Save the regs of the first oops for the executive summary later. */
 	if (!die_counter)
-		exec_summary_regs = *regs;
+		save_exec_summary(regs);
 
 	if (IS_ENABLED(CONFIG_PREEMPTION))
 		pr = IS_ENABLED(CONFIG_PREEMPT_RT) ? " PREEMPT_RT" : " PREEMPT";
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e703cc451128..68d998ea3571 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -140,6 +140,7 @@ void __show_regs(struct pt_regs *regs, enum show_regs_mode mode,
 
 	if (cpu_feature_enabled(X86_FEATURE_OSPKE))
 		printk("%sPKRU: %08x\n", log_lvl, read_pkru());
+	pks_show_regs(regs, log_lvl);
 }
 
 void release_thread(struct task_struct *dead_task)
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 29885dfb0980..7c8e4ea9f022 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -378,6 +378,17 @@ void pks_restore_pt_regs(struct pt_regs *regs)
 	pks_write_pkrs(current->thread.pkrs);
 }
 
+void pks_show_regs(struct pt_regs *regs, const char *log_lvl)
+{
+	struct pt_regs_auxiliary *aux_pt_regs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+	printk("%sPKRS: 0x%x\n", log_lvl, aux_pt_regs->pkrs);
+}
+
 /*
  * PKS is independent of PKU and either or both may be supported on a CPU.
  *
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 29/45] mm/pkeys: PKS testing, Add exception test
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (27 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 28/45] x86/fault: Print PKS MSR on fault ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 30/45] mm/pkeys: Introduce pks_update_exception() ira.weiny
                   ` (16 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

During an exception the interrupted thread's PKRS value is preserved
and the exception receives the default value for that pkey.  Upon
return from the exception the thread's PKRS value is restored.

Add a PKS test which forces a fault to check that this works as
intended.  Check that both the thread's and the exception's PKRS
state are correct at the beginning, during, and after the exception.
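
For illustration, the PKRS state the test expects to observe is roughly
(a sketch of the intended behavior, not literal test output):

	thread pkey:    WD -> saved into the aux pt_regs on fault; the
	                callback then sets it to RW so the faulting
	                access completes once the thread resumes
	exception pkey: AD (the PKS_INIT_VALUE default) -> toggled RW/AD
	                by the test -> discarded on return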

Add the test to the test_pks app.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Add test to test_pks
	Clean up the globals shared with the fault handler
	Use the PKS Test specific fault callback
	s/pks_mk*/pks_set*/
	Change pkey type to u8
	From Dave Hansen
		use pkey

Changes for V8
	Split this test off from the testing patch and place it after
	the exception saving code.
---
 arch/x86/mm/pkeys.c                    |   2 +-
 include/linux/pks.h                    |   6 ++
 lib/pks/pks_test.c                     | 133 +++++++++++++++++++++++++
 tools/testing/selftests/x86/test_pks.c |   5 +-
 4 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 7c8e4ea9f022..6327e32d7237 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -215,7 +215,7 @@ u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits)
 
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
-static DEFINE_PER_CPU(u32, pkrs_cache);
+__static_or_pks_test DEFINE_PER_CPU(u32, pkrs_cache);
 
 /**
  * DOC: DEFINE_PKS_FAULT_CALLBACK
diff --git a/include/linux/pks.h b/include/linux/pks.h
index 208f88fcb48c..224fc3bbd072 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -46,9 +46,15 @@ static inline void pks_set_readwrite(u8 pkey) {}
 
 #ifdef CONFIG_PKS_TEST
 
+#define __static_or_pks_test
+
 bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
 			     bool write);
 
+#else /* !CONFIG_PKS_TEST */
+
+#define __static_or_pks_test static
+
 #endif /* CONFIG_PKS_TEST */
 
 #endif /* _LINUX_PKS_H */
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 86af2f61393d..762f4a19cb7d 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -48,19 +48,30 @@
 #define RUN_SINGLE		1
 #define ARM_CTX_SWITCH		2
 #define CHECK_CTX_SWITCH	3
+#define RUN_EXCEPTION		4
 #define RUN_CRASH_TEST		9
 
+DECLARE_PER_CPU(u32, pkrs_cache);
+
 static struct dentry *pks_test_dentry;
 
 DEFINE_MUTEX(test_run_lock);
 
 struct pks_test_ctx {
 	u8 pkey;
+	bool pass;
 	char data[64];
 	void *test_page;
 	bool fault_seen;
+	bool validate_exp_handling;
 };
 
+static bool check_pkey_val(u32 pk_reg, u8 pkey, u32 expected)
+{
+	pk_reg = (pk_reg >> PKR_PKEY_SHIFT(pkey)) & PKEY_ACCESS_MASK;
+	return (pk_reg == expected);
+}
+
 static void debug_context(const char *label, struct pks_test_ctx *ctx)
 {
 	pr_debug("%s [%d] %s <-> %p\n",
@@ -96,6 +107,63 @@ static void debug_result(const char *label, int test_num,
 		     sd->last_test_pass ? "PASS" : "FAIL");
 }
 
+/*
+ * Check if the register @pkey value matches @expected value
+ *
+ * Both the cached and actual MSR must match.
+ */
+static bool check_pkrs(u8 pkey, u8 expected)
+{
+	bool ret = true;
+	u64 pkrs;
+	u32 *tmp_cache;
+
+	tmp_cache = get_cpu_ptr(&pkrs_cache);
+	if (!check_pkey_val(*tmp_cache, pkey, expected))
+		ret = false;
+	put_cpu_ptr(tmp_cache);
+
+	rdmsrl(MSR_IA32_PKRS, pkrs);
+	if (!check_pkey_val(pkrs, pkey, expected))
+		ret = false;
+
+	return ret;
+}
+
+static void validate_exception(struct pks_test_ctx *ctx, u32 thread_pkrs)
+{
+	u8 pkey = ctx->pkey;
+
+	/* Check that the thread state was saved */
+	if (!check_pkey_val(thread_pkrs, pkey, PKEY_DISABLE_WRITE)) {
+		pr_err("     FAIL: checking aux_pt_regs->thread_pkrs\n");
+		ctx->pass = false;
+	}
+
+	/* Check that the exception received the default of disabled access */
+	if (!check_pkrs(pkey, PKEY_DISABLE_ACCESS)) {
+		pr_err("     FAIL: PKRS cache and MSR\n");
+		ctx->pass = false;
+	}
+
+	/*
+	 * Ensure an update can occur during exception without affecting the
+	 * interrupted thread.  The interrupted thread is verified after the
+	 * exception returns.
+	 */
+	pks_set_readwrite(pkey);
+	if (!check_pkrs(pkey, 0)) {
+		pr_err("     FAIL: exception did not change register to 0\n");
+		ctx->pass = false;
+	}
+	pks_set_noaccess(pkey);
+	if (!check_pkrs(pkey, PKEY_DISABLE_ACCESS)) {
+		pr_err("     FAIL: exception did not change register to 0x%x\n",
+			PKEY_DISABLE_ACCESS);
+		ctx->pass = false;
+	}
+}
+
 /* Global data protected by test_run_lock */
 struct pks_test_ctx *g_ctx_under_test;
 
@@ -122,6 +190,16 @@ bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
 	if (!g_ctx_under_test)
 		return false;
 
+	if (g_ctx_under_test->validate_exp_handling) {
+		validate_exception(g_ctx_under_test, pkrs);
+		/*
+		 * Stop this check directly within the exception because the
+		 * fault handler cleanup code will invoke this callback again
+		 * while checking the PMD entry; there is no need to recheck.
+		 */
+		g_ctx_under_test->validate_exp_handling = false;
+	}
+
 	aux_pt_regs->pkrs = pkey_update_pkval(pkrs, g_ctx_under_test->pkey, 0);
 	g_ctx_under_test->fault_seen = true;
 	return true;
@@ -255,6 +333,7 @@ static struct pks_test_ctx *alloc_ctx(u8 pkey)
 		return ERR_PTR(-ENOMEM);
 
 	ctx->pkey = pkey;
+	ctx->pass = true;
 	sprintf(ctx->data, "%s", "DEADBEEF");
 
 	ctx->test_page = alloc_test_page(ctx->pkey);
@@ -295,6 +374,56 @@ static bool run_single(struct pks_session_data *sd)
 	return rc;
 }
 
+static bool run_exception_test(struct pks_session_data *sd)
+{
+	bool pass = true;
+	struct pks_test_ctx *ctx;
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx)) {
+		pr_debug("     FAIL: no context\n");
+		return false;
+	}
+
+	set_ctx_data(sd, ctx);
+
+	/*
+	 * Set the thread pkey value to something other than the default of
+	 * access disabled, but to something which still causes a fault:
+	 * disable writes.
+	 */
+	pks_update_protection(ctx->pkey, PKEY_DISABLE_WRITE);
+
+	ctx->validate_exp_handling = true;
+	set_context_for_fault(ctx);
+
+	memcpy(ctx->test_page, ctx->data, 8);
+
+	if (!ctx->fault_seen) {
+		pr_err("     FAIL: did not get an exception\n");
+		pass = false;
+	}
+
+	/*
+	 * The exception code has to enable access to keep the fault from
+	 * looping forever.  Therefore full access is seen here rather than
+	 * write disabled.
+	 *
+	 * However, this does verify that the exception state was independent
+	 * of the interrupted threads state because validate_exception()
+	 * disabled access during the exception.
+	 */
+	if (!check_pkrs(ctx->pkey, 0)) {
+		pr_err("     FAIL: PKRS not restored\n");
+		pass = false;
+	}
+
+	if (!ctx->pass)
+		pass = false;
+
+	return pass;
+}
+
 static void crash_it(struct pks_session_data *sd)
 {
 	struct pks_test_ctx *ctx;
@@ -451,6 +580,10 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 		pr_debug("Checking Context switch test\n");
 		check_ctx_switch(file->private_data);
 		break;
+	case RUN_EXCEPTION:
+		pr_debug("Exception checking\n");
+		sd->last_test_pass = run_exception_test(file->private_data);
+		break;
 	default:
 		pr_debug("Unknown test\n");
 		sd->last_test_pass = false;
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
index 5a32645a6e6d..817df7a14923 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -35,6 +35,7 @@
 #define RUN_SINGLE		"1"
 #define ARM_CTX_SWITCH		"2"
 #define CHECK_CTX_SWITCH	"3"
+#define RUN_EXCEPTION		"4"
 #define RUN_CRASH_TEST		"9"
 
 time_t g_start_time;
@@ -61,6 +62,7 @@ enum {
 	TEST_DEFAULTS = 0,
 	TEST_SINGLE,
 	TEST_CTX_SWITCH,
+	TEST_EXCEPTION,
 	MAX_TESTS,
 } tests;
 
@@ -74,7 +76,8 @@ struct test_item {
 } test_list[] = {
 	{ "check_defaults", CHECK_DEFAULTS, do_simple_test },
 	{ "single", RUN_SINGLE, do_simple_test },
-	{ "context_switch", ARM_CTX_SWITCH, do_context_switch }
+	{ "context_switch", ARM_CTX_SWITCH, do_context_switch },
+	{ "exception", RUN_EXCEPTION, do_simple_test }
 };
 
 static char *get_test_name(int test_num)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 30/45] mm/pkeys: Introduce pks_update_exception()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (28 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 29/45] mm/pkeys: PKS testing, Add exception test ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 31/45] mm/pkeys: PKS testing, test pks_update_exception() ira.weiny
                   ` (15 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Some PKS use cases will want to catch permissions violations with the
fault callback mechanism and optionally allow the access.

The pks_set_*() calls update the protection of the currently running
context.  They will not work to change the protections of a thread which
has been interrupted.  Therefore updating a thread from within an
exception requires a different method.

Introduce pks_update_exception() which updates the faulted thread's
protections in addition to the current context.

Add documentation.
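
To sketch the intended use, a hypothetical consumer callback (the names
and the pkey below are illustrative, not part of this series) could
look like:

	static bool my_fault_callback(struct pt_regs *regs,
				      unsigned long address, bool write)
	{
		/* Warn on the stray access rather than crash the kernel */
		WARN_ONCE(1, "unexpected pkey fault at 0x%lx\n", address);

		/*
		 * Relax the pkey in both the exception state and the
		 * interrupted thread (0 == read/write).
		 */
		pks_update_exception(regs, MY_HYPOTHETICAL_PKEY, 0);
		return true;	/* fault handled; retry the access */
	}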

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Add preemption disable around pkrs per cpu cache
	Update commit message
	Change pkey type to u8
	s/pks_saved_pkrs/pkrs

Changes for V8
	Remove the concept of abandoning a pkey in favor of using the
		custom fault handler via this new pks_update_exception()
		call
	Without an abandon call there is no need for an abandon mask on
		sched in, new thread creation, or within exceptions...
	This now lets all invalid accesses fault
	Ensure that all entry points into the PKS code have feature checks...
	Place abandon fault check before the test callback to ensure
		testing does not detect the double fault of the abandon
		code and flag it incorrectly as a fault.
	Change return type of pks_handle_abandoned_pkeys() to bool
---
 Documentation/core-api/protection-keys.rst |  3 ++
 arch/x86/mm/pkeys.c                        | 58 +++++++++++++++++++---
 include/linux/pks.h                        |  5 ++
 3 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 5fdc83a39d4e..22ad58a93423 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -149,6 +149,9 @@ Changing permissions of individual keys
 .. kernel-doc:: include/linux/pks.h
         :identifiers: pks_set_readwrite pks_set_noaccess
 
+.. kernel-doc:: arch/x86/mm/pkeys.c
+        :identifiers: pks_update_exception
+
 Overriding Default Fault Behavior
 ---------------------------------
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 6327e32d7237..9b2a6a62d433 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -409,6 +409,18 @@ void pks_setup(void)
 	cr4_set_bits(X86_CR4_PKS);
 }
 
+static void __pks_update_protection(u8 pkey, u8 protection)
+{
+	u32 pkrs;
+
+	pkrs = current->thread.pkrs;
+	current->thread.pkrs = pkey_update_pkval(pkrs, pkey, protection);
+
+	preempt_disable();
+	pks_write_pkrs(current->thread.pkrs);
+	preempt_enable();
+}
+
 /*
  * Do not call this directly, see pks_set*().
  *
@@ -422,21 +434,51 @@ void pks_setup(void)
  */
 void pks_update_protection(u8 pkey, u8 protection)
 {
-	u32 pkrs;
-
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
 	if (WARN_ON_ONCE(pkey >= PKS_KEY_MAX))
 		return;
 
-	pkrs = current->thread.pkrs;
-	current->thread.pkrs = pkey_update_pkval(pkrs, pkey,
-						 protection);
-	preempt_disable();
-	pks_write_pkrs(current->thread.pkrs);
-	preempt_enable();
+	__pks_update_protection(pkey, protection);
 }
 EXPORT_SYMBOL_GPL(pks_update_protection);
 
+/**
+ * pks_update_exception() - Update the protections of a faulted thread
+ *
+ * @regs: Faulting thread registers
+ * @pkey: pkey to update
+ * @protection: protection bits to use.
+ *
+ * CONTEXT: Exception
+ *
+ * pks_update_exception() updates the faulted thread's protections in addition
+ * to the protections within the exception.
+ *
+ * This is useful because the pks_set_*() functions will not work to change the
+ * protections of a thread which has been interrupted.  Only the current
+ * context is updated by those functions.  Therefore, if a PKS fault callback
+ * wants to update the faulted thread's protections it must call
+ * pks_update_exception().
+ */
+void pks_update_exception(struct pt_regs *regs, u8 pkey, u8 protection)
+{
+	struct pt_regs_extended *ept_regs;
+	u32 old;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	if (WARN_ON_ONCE(pkey >= PKS_KEY_MAX))
+		return;
+
+	__pks_update_protection(pkey, protection);
+
+	ept_regs = to_extended_pt_regs(regs);
+	old = ept_regs->aux.pkrs;
+	ept_regs->aux.pkrs = pkey_update_pkval(old, pkey, protection);
+}
+EXPORT_SYMBOL_GPL(pks_update_exception);
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pks.h b/include/linux/pks.h
index 224fc3bbd072..45156f358776 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -9,6 +9,7 @@
 #include <uapi/asm-generic/mman-common.h>
 
 void pks_update_protection(u8 pkey, u8 protection);
+void pks_update_exception(struct pt_regs *regs, u8 pkey, u8 protection);
 
 /**
  * pks_set_noaccess() - Disable all access to the domain
@@ -41,6 +42,10 @@ typedef bool (*pks_key_callback)(struct pt_regs *regs, unsigned long address,
 
 static inline void pks_set_noaccess(u8 pkey) {}
 static inline void pks_set_readwrite(u8 pkey) {}
+static inline void pks_update_exception(struct pt_regs *regs,
+					u8 pkey,
+					u8 protection)
+{ }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 31/45] mm/pkeys: PKS testing, test pks_update_exception()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (29 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 30/45] mm/pkeys: Introduce pks_update_exception() ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 32/45] mm/pkeys: PKS testing, add test for all keys ira.weiny
                   ` (14 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

A common use case for the custom fault callbacks will be for the
callback to warn of the violation and relax the permissions rather than
crash the kernel.  pks_update_exception() was added for this purpose.

Add a test which uses pks_update_exception() to clear the pkey
permissions.  Verify that the permissions are changed in the interrupted
thread.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Update the commit message
	Clean up test name
	Add test_pks support
	s/pks_mk_*/pks_set_*/
	Simplify the use of globals for the faults
	From Rick Edgecombe
		Use WRITE_ONCE to protect against races with the fault
		handler
		s/RUN_FAULT_ABANDON/RUN_FAULT_CALLBACK

Changes for V8
	New test developed just to double check for regressions while
	reworking the code.
---
 lib/pks/pks_test.c                     | 60 ++++++++++++++++++++++++++
 tools/testing/selftests/x86/test_pks.c |  5 ++-
 2 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 762f4a19cb7d..a9cd2a49abfa 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -49,6 +49,7 @@
 #define ARM_CTX_SWITCH		2
 #define CHECK_CTX_SWITCH	3
 #define RUN_EXCEPTION		4
+#define RUN_EXCEPTION_UPDATE	5
 #define RUN_CRASH_TEST		9
 
 DECLARE_PER_CPU(u32, pkrs_cache);
@@ -64,6 +65,7 @@ struct pks_test_ctx {
 	void *test_page;
 	bool fault_seen;
 	bool validate_exp_handling;
+	bool validate_update_exp;
 };
 
 static bool check_pkey_val(u32 pk_reg, u8 pkey, u32 expected)
@@ -164,6 +166,16 @@ static void validate_exception(struct pks_test_ctx *ctx, u32 thread_pkrs)
 	}
 }
 
+static bool handle_update_exception(struct pt_regs *regs, struct pks_test_ctx *ctx)
+{
+	pr_debug("Updating pkey %d during exception\n", ctx->pkey);
+
+	ctx->fault_seen = true;
+	pks_update_exception(regs, ctx->pkey, 0);
+
+	return true;
+}
+
 /* Global data protected by test_run_lock */
 struct pks_test_ctx *g_ctx_under_test;
 
@@ -190,6 +202,9 @@ bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
 	if (!g_ctx_under_test)
 		return false;
 
+	if (g_ctx_under_test->validate_update_exp)
+		return handle_update_exception(regs, g_ctx_under_test);
+
 	if (g_ctx_under_test->validate_exp_handling) {
 		validate_exception(g_ctx_under_test, pkrs);
 		/*
@@ -518,6 +533,47 @@ static void check_ctx_switch(struct pks_session_data *sd)
 	}
 }
 
+static bool run_exception_update(struct pks_session_data *sd)
+{
+	struct pks_test_ctx *ctx;
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx))
+		return false;
+
+	set_ctx_data(sd, ctx);
+
+	ctx->fault_seen = false;
+	ctx->validate_update_exp = true;
+	pks_set_noaccess(ctx->pkey);
+
+	set_context_for_fault(ctx);
+
+	/* fault */
+	memcpy(ctx->test_page, ctx->data, 8);
+
+	if (!ctx->fault_seen) {
+		pr_err("Failed to see the callback\n");
+		return false;
+	}
+
+	ctx->fault_seen = false;
+	ctx->validate_update_exp = false;
+
+	set_context_for_fault(ctx);
+
+	/* no fault */
+	memcpy(ctx->test_page, ctx->data, 8);
+
+	if (ctx->fault_seen) {
+		pr_err("Pkey %d failed to be set RD/WR in the callback\n",
+			ctx->pkey);
+		return false;
+	}
+
+	return true;
+}
+
 static ssize_t pks_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
 {
@@ -584,6 +640,10 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 		pr_debug("Exception checking\n");
 		sd->last_test_pass = run_exception_test(file->private_data);
 		break;
+	case RUN_EXCEPTION_UPDATE:
+		pr_debug("Fault clear test\n");
+		sd->last_test_pass = run_exception_update(file->private_data);
+		break;
 	default:
 		pr_debug("Unknown test\n");
 		sd->last_test_pass = false;
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
index 817df7a14923..243347e48228 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -36,6 +36,7 @@
 #define ARM_CTX_SWITCH		"2"
 #define CHECK_CTX_SWITCH	"3"
 #define RUN_EXCEPTION		"4"
+#define RUN_EXCEPTION_UPDATE	"5"
 #define RUN_CRASH_TEST		"9"
 
 time_t g_start_time;
@@ -63,6 +64,7 @@ enum {
 	TEST_SINGLE,
 	TEST_CTX_SWITCH,
 	TEST_EXCEPTION,
+	TEST_FAULT_CALLBACK,
 	MAX_TESTS,
 } tests;
 
@@ -77,7 +79,8 @@ struct test_item {
 	{ "check_defaults", CHECK_DEFAULTS, do_simple_test },
 	{ "single", RUN_SINGLE, do_simple_test },
 	{ "context_switch", ARM_CTX_SWITCH, do_context_switch },
-	{ "exception", RUN_EXCEPTION, do_simple_test }
+	{ "exception", RUN_EXCEPTION, do_simple_test },
+	{ "exception_update", RUN_EXCEPTION_UPDATE, do_simple_test }
 };
 
 static char *get_test_name(int test_num)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 32/45] mm/pkeys: PKS testing, add test for all keys
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (30 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 31/45] mm/pkeys: PKS testing, test pks_update_exception() ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 33/45] mm/pkeys: Add pks_available() ira.weiny
                   ` (13 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

To help validate hardware and QEMU it is necessary to run through all
the available pkeys and exercise the access checks.  However, running
such a test conflicts with normal PKS consumers.

Make a test, which is mutually exclusive from all other PKS consumers,
that loops through all the pkeys and tests the various access modes.

Update the documentation.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Update commit message
	Create ARCH_ENABLE_PKS_CONSUMER Kconfig to make this test mutually
		exclusive with any other PKS consumer

Changes for V8
	Split this off from the large testing patch
	Remove debugging version
---
 Documentation/core-api/protection-keys.rst | 12 +++----
 arch/x86/mm/pkeys.c                        | 10 ++++++
 include/linux/pks-keys.h                   |  5 +++
 lib/Kconfig.debug                          | 21 +++++++++++
 lib/pks/pks_test.c                         | 41 +++++++++++++++++++++-
 mm/Kconfig                                 |  9 +++++
 tools/testing/selftests/x86/test_pks.c     |  5 ++-
 7 files changed, 95 insertions(+), 8 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 22ad58a93423..68fe7a92cc98 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -117,20 +117,20 @@ Kconfig
 -------
 
 Kernel users intending to use PKS support should depend on
-ARCH_HAS_SUPERVISOR_PKEYS, and select ARCH_ENABLE_SUPERVISOR_PKEYS to turn on
-this support within the core.  For example:
+ARCH_HAS_SUPERVISOR_PKEYS, and select ARCH_ENABLE_PKS_CONSUMER to turn on this
+support within the core.  For example:
 
 .. code-block:: c
 
         config MY_NEW_FEATURE
                 depends on ARCH_HAS_SUPERVISOR_PKEYS
-                select ARCH_ENABLE_SUPERVISOR_PKEYS
+                select ARCH_ENABLE_PKS_CONSUMER
 
 This will make "MY_NEW_FEATURE" unavailable unless the architecture sets
 ARCH_HAS_SUPERVISOR_PKEYS.  It also makes it possible for multiple independent
-features to "select ARCH_ENABLE_SUPERVISOR_PKEYS".  If no features enable PKS
-by selecting ARCH_ENABLE_SUPERVISOR_PKEYS, PKS support will not be compiled
-into the kernel.
+features to "select ARCH_ENABLE_PKS_CONSUMER".  If no features enable PKS by
+selecting ARCH_ENABLE_PKS_CONSUMER, PKS support will not be compiled into the
+kernel.
 
 PKS Key Allocation
 ------------------
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 9b2a6a62d433..fd2ba269e64a 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -243,12 +243,22 @@ __static_or_pks_test DEFINE_PER_CPU(u32, pkrs_cache);
  *	#endif
  *	};
  */
+#ifndef CONFIG_PKS_TEST_ALL_KEYS
+
 static const pks_key_callback pks_key_callbacks[PKS_KEY_MAX] = {
 #ifdef CONFIG_PKS_TEST
 	[PKS_KEY_TEST]		= pks_test_fault_callback,
 #endif
 };
 
+#else /* CONFIG_PKS_TEST_ALL_KEYS */
+
+static const pks_key_callback pks_key_callbacks[PKS_KEY_MAX] = {
+	[1 ... (PKS_KEY_MAX-1)]	= pks_test_fault_callback,
+};
+
+#endif
+
 static bool pks_call_fault_callback(struct pt_regs *regs, unsigned long address,
 				    bool write, u16 key)
 {
diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
index 43e4ae42db2e..f7e82e462659 100644
--- a/include/linux/pks-keys.h
+++ b/include/linux/pks-keys.h
@@ -63,6 +63,11 @@
 #define PKS_KEY_TEST		PKS_NEW_KEY(PKS_KEY_DEFAULT, CONFIG_PKS_TEST)
 #define PKS_KEY_MAX		PKS_NEW_KEY(PKS_KEY_TEST, 1)
 
+#ifdef CONFIG_PKS_TEST_ALL_KEYS
+#undef PKS_KEY_MAX
+#define PKS_KEY_MAX PKS_NUM_PKEYS
+#endif
+
 /* PKS_KEY_DEFAULT_INIT must be RW */
 #define PKS_KEY_DEFAULT_INIT	PKS_DECLARE_INIT_VALUE(PKS_KEY_DEFAULT, RW, 1)
 #define PKS_KEY_TEST_INIT	PKS_DECLARE_INIT_VALUE(PKS_KEY_TEST, AD, \
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5cab2100c133..c9885c2ddea8 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2685,6 +2685,12 @@ config HYPERV_TESTING
 	help
 	  Select this option to enable Hyper-V vmbus testing.
 
+# PKS_TEST is a special PKS consumer and therefore sets
+# ARCH_ENABLE_SUPERVISOR_PKEYS directly rather than through
+# ARCH_ENABLE_PKS_CONSUMER
+#
+# This allows PKS_TEST_ALL_KEYS to remain mutually exclusive to any real PKS
+# consumer
 config PKS_TEST
 	bool "PKey (S)upervisor testing"
 	depends on ARCH_HAS_SUPERVISOR_PKEYS
@@ -2697,6 +2703,21 @@ config PKS_TEST
 
 	  If unsure, say N.
 
+config PKS_TEST_ALL_KEYS
+	bool "PKS test all keys"
+	depends on (PKS_TEST && !ARCH_ENABLE_PKS_CONSUMER)
+	help
+	  Select this option to enable testing of all the PKS keys available in
+	  the architecture.  This option is mutually exclusive with PKS
+	  consumers other than PKS_TEST.  This is because it will consume all
+	  PKS keys for testing purposes.
+
+	  Answer N if you don't know what supervisor keys are or want to have
+	  supervisor keys available for other consumers.
+
+	  If unsure, say N.
+
+
 endmenu # "Kernel Testing and Coverage"
 
 source "Documentation/Kconfig"
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index a9cd2a49abfa..e38a487c7065 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -50,12 +50,12 @@
 #define CHECK_CTX_SWITCH	3
 #define RUN_EXCEPTION		4
 #define RUN_EXCEPTION_UPDATE	5
+#define RUN_ALL_KEYS		6
 #define RUN_CRASH_TEST		9
 
 DECLARE_PER_CPU(u32, pkrs_cache);
 
 static struct dentry *pks_test_dentry;
-
 DEFINE_MUTEX(test_run_lock);
 
 struct pks_test_ctx {
@@ -439,6 +439,39 @@ static bool run_exception_test(struct pks_session_data *sd)
 	return pass;
 }
 
+#ifdef CONFIG_PKS_TEST_ALL_KEYS
+
+static bool run_all_keys(void)
+{
+	struct pks_test_ctx *ctx[PKS_NUM_PKEYS];
+	static char name[PKS_NUM_PKEYS][64];
+	int i;
+	bool rc = true;
+
+	for (i = 1; i < PKS_NUM_PKEYS; i++) {
+		sprintf(name[i], "pks ctx %d", i);
+		ctx[i] = alloc_ctx(i);
+	}
+
+	for (i = 1; i < PKS_NUM_PKEYS; i++) {
+		pr_debug("Running pkey '%d'\n", i);
+		if (!IS_ERR(ctx[i])) {
+			/* sticky fail */
+			if (!test_ctx(ctx[i]))
+				rc = false;
+		}
+	}
+
+	for (i = 1; i < PKS_NUM_PKEYS; i++) {
+		if (!IS_ERR(ctx[i]))
+			free_ctx(ctx[i]);
+	}
+
+	return rc;
+}
+
+#endif
+
 static void crash_it(struct pks_session_data *sd)
 {
 	struct pks_test_ctx *ctx;
@@ -644,6 +677,12 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 		pr_debug("Fault clear test\n");
 		sd->last_test_pass = run_exception_update(file->private_data);
 		break;
+#ifdef CONFIG_PKS_TEST_ALL_KEYS
+	case RUN_ALL_KEYS:
+		pr_debug("Run all\n");
+		sd->last_test_pass = run_all_keys();
+		goto unlock_test;
+#endif
 	default:
 		pr_debug("Unknown test\n");
 		sd->last_test_pass = false;
diff --git a/mm/Kconfig b/mm/Kconfig
index 46f2bb15aa4e..850372b6aaec 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -806,6 +806,15 @@ config ARCH_HAS_PKEYS
 	bool
 config ARCH_HAS_SUPERVISOR_PKEYS
 	bool
+
+config ARCH_ENABLE_PKS_CONSUMER
+	select ARCH_ENABLE_SUPERVISOR_PKEYS
+	bool
+
+# WARNING Do not set ARCH_ENABLE_SUPERVISOR_PKEYS directly use
+# ARCH_ENABLE_PKS_CONSUMER instead.
+#
+# See the PKey (S)upervisor testing (PKS_TEST) config option for details.
 config ARCH_ENABLE_SUPERVISOR_PKEYS
 	bool
 
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
index 243347e48228..a2e5554e7fdb 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -37,6 +37,7 @@
 #define CHECK_CTX_SWITCH	"3"
 #define RUN_EXCEPTION		"4"
 #define RUN_EXCEPTION_UPDATE	"5"
+#define RUN_ALL_KEYS		"6"
 #define RUN_CRASH_TEST		"9"
 
 time_t g_start_time;
@@ -65,6 +66,7 @@ enum {
 	TEST_CTX_SWITCH,
 	TEST_EXCEPTION,
 	TEST_FAULT_CALLBACK,
+	TEST_ALL,
 	MAX_TESTS,
 } tests;
 
@@ -80,7 +82,8 @@ struct test_item {
 	{ "single", RUN_SINGLE, do_simple_test },
 	{ "context_switch", ARM_CTX_SWITCH, do_context_switch },
 	{ "exception", RUN_EXCEPTION, do_simple_test },
-	{ "exception_update", RUN_EXCEPTION_UPDATE, do_simple_test }
+	{ "exception_update", RUN_EXCEPTION_UPDATE, do_simple_test },
+	{ "run_all", RUN_ALL_KEYS, do_simple_test }
 };
 
 static char *get_test_name(int test_num)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 33/45] mm/pkeys: Add pks_available()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (31 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 32/45] mm/pkeys: PKS testing, add test for all keys ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 34/45] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION ira.weiny
                   ` (12 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

If PKS is configured within the kernel but the CPU does not support PKS,
the PKS calls remain safe to execute even without protection.  However,
adding the overhead of these calls on CPUs which don't support PKS is
inefficient and best avoided.

Define pks_available() to allow users to check if PKS is enabled on the
current system.

The implementation of pks_available() is placed in the asm headers and
exposed via linux/pks.h so that consumers outside of the architecture
code get an inline call to cpu_feature_enabled().
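
A minimal sketch of the intended consumer pattern (hypothetical caller):

	#include <linux/pks.h>

	static bool my_feature_wants_pks(void)
	{
		/* Inexpensive inline check; false when the CPU lacks PKS */
		return pks_available();
	}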

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Driven by a request by Dan Williams to make this static inline
		Place this in pks.h to avoid header conflicts while
		allowing for an optimized call to cpu_feature_enabled()

Changes for V8
	s/pks_enabled/pks_available
---
 Documentation/core-api/protection-keys.rst |  3 +++
 arch/x86/include/asm/pks.h                 | 12 ++++++++++++
 include/linux/pks.h                        |  8 ++++++++
 3 files changed, 23 insertions(+)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 68fe7a92cc98..36621cbc2cc6 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -152,6 +152,9 @@ Changing permissions of individual keys
 .. kernel-doc:: arch/x86/mm/pkeys.c
         :identifiers: pks_update_exception
 
+.. kernel-doc:: arch/x86/include/asm/pks.h
+        :identifiers: pks_available
+
 Overriding Default Fault Behavior
 ---------------------------------
 
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index de67d5b5a2af..cab42aadea07 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -2,8 +2,20 @@
 #ifndef _ASM_X86_PKS_H
 #define _ASM_X86_PKS_H
 
+#include <asm/cpufeature.h>
+
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
+/**
+ * pks_available() - Is PKS available on this system
+ *
+ * Return if PKS is currently supported and enabled on this system.
+ */
+static inline bool pks_available(void)
+{
+	return cpu_feature_enabled(X86_FEATURE_PKS);
+}
+
 void pks_setup(void);
 void x86_pkrs_load(struct thread_struct *thread);
 void pks_save_pt_regs(struct pt_regs *regs);
diff --git a/include/linux/pks.h b/include/linux/pks.h
index 45156f358776..163c75992a8a 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -8,6 +8,9 @@
 
 #include <uapi/asm-generic/mman-common.h>
 
+#include <asm/pks.h>
+
+bool pks_available(void);
 void pks_update_protection(u8 pkey, u8 protection);
 void pks_update_exception(struct pt_regs *regs, u8 pkey, u8 protection);
 
@@ -40,6 +43,11 @@ typedef bool (*pks_key_callback)(struct pt_regs *regs, unsigned long address,
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
+static inline bool pks_available(void)
+{
+	return false;
+}
+
 static inline void pks_set_noaccess(u8 pkey) {}
 static inline void pks_set_readwrite(u8 pkey) {}
 static inline void pks_update_exception(struct pt_regs *regs,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 34/45] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (32 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 33/45] mm/pkeys: Add pks_available() ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 35/45] memremap_pages: Introduce pgmap_protection_available() ira.weiny
                   ` (11 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The persistent memory (PMEM) driver uses the memremap_pages facility to
provide 'struct page' metadata (vmemmap) for PMEM.  Given that PMEM
capacity may be orders of magnitude higher than that of System RAM, it
presents a large vulnerability surface to stray writes.  Unlike stray
writes to System RAM, which may result in a crash or other undesirable
behavior, stray writes to PMEM additionally are more likely to result in
permanent data loss. Reboot is not a remediation for PMEM corruption
like it is for System RAM.

Given that PMEM access from the kernel is limited to a constrained set
of locations (PMEM driver, Filesystem-DAX, and direct-I/O to a DAX
page), it is amenable to supervisor pkey protection.

Add a Kconfig option to configure additional devmap protections using
PKS.

Only PMEM which is advertised to the memory subsystem needs this
protection.  Therefore, the feature depends on NVDIMM_PFN.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Change this to enable arch pks consumer for mutual exclusion
		with testing all pkeys
	From Dan Williams
		Default to no
		Clean up commit message

Changes for V8
	Split this out from
		[PATCH V7 13/18] memremap_pages: Add access protection via supervisor Protection Keys (PKS)
---
 mm/Kconfig | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 850372b6aaec..ba8a557dcf81 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -776,6 +776,24 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config DEVMAP_ACCESS_PROTECTION
+	bool "Access protection for memremap_pages()"
+	depends on NVDIMM_PFN
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+	select ARCH_ENABLE_PKS_CONSUMER
+	default n
+
+	help
+	  Enable extra protections on device memory.  This protects against
+	  unintended access to devices such as stray writes.  This feature is
+	  particularly useful to protect against corruption of persistent
+	  memory.
+
+	  This depends on architecture support of supervisor PKeys and has no
+	  overhead if the architecture does not support them.
+
+	  If you have persistent memory say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 35/45] memremap_pages: Introduce pgmap_protection_available()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (33 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 34/45] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 36/45] memremap_pages: Introduce a PGMAP_PROTECTION flag ira.weiny
                   ` (10 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

PMEM will flag additional dev_pagemap protection through (struct
dev_pagemap)->flags.  However, it is more efficient to know whether that
protection is available before requesting it and having the mapping fail.

Define pgmap_protection_available() to check if protection is available
prior to being requested.  The name of pgmap_protection_available() was
specifically chosen to isolate the implementation of the protection from
higher level users.
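
A sketch of the check a hypothetical pagemap creator makes before asking
for protection (the PGMAP_PROTECTION flag itself arrives in a following
patch):

	if (pgmap_protection_available())
		pgmap->flags |= PGMAP_PROTECTION;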

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Clean up commit message
	From Dan Williams
		make call stack static inline throughout this call and
			pks_available() such that callers call
			cpu_feature_enabled() directly

Changes for V8
	Split this out into its own patch.
	s/pgmap_protection_enabled/pgmap_protection_available
---
 include/linux/mm.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5744a3fc4716..9ab799403004 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,7 @@
 #include <linux/sizes.h>
 #include <linux/sched.h>
 #include <linux/pgtable.h>
+#include <linux/pks.h>
 #include <linux/kasan.h>
 
 struct mempolicy;
@@ -1143,6 +1144,22 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+
+static inline bool pgmap_protection_available(void)
+{
+	return pks_available();
+}
+
+#else
+
+static inline bool pgmap_protection_available(void)
+{
+	return false;
+}
+
+#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define folio_ref_zero_or_close_to_overflow(folio) \
 	((unsigned int) folio_ref_count(folio) + 127u <= 127u)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 36/45] memremap_pages: Introduce a PGMAP_PROTECTION flag
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (34 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 35/45] memremap_pages: Introduce pgmap_protection_available() ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 37/45] memremap_pages: Introduce devmap_protected() ira.weiny
                   ` (9 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The persistent memory (PMEM) driver uses the memremap_pages facility to
provide 'struct page' metadata (vmemmap) for PMEM.  Given that PMEM
capacity may be orders of magnitude higher than that of System RAM, it
presents a large vulnerability surface to stray writes.  Unlike stray
writes to System RAM, which may result in a crash or other undesirable
behavior, stray writes to PMEM additionally are more likely to result in
permanent data loss. Reboot is not a remediation for PMEM corruption
like it is for System RAM.

Given that PMEM access from the kernel is limited to a constrained set
of locations (PMEM driver, Filesystem-DAX, and direct-I/O to a DAX
page), it is amenable to supervisor pkey protection.

Some systems which have configured DEVMAP_ACCESS_PROTECTION may not
have PMEM installed, or that PMEM may not be mapped into the direct
map.  In addition, some callers of memremap_pages() will not want the
mapped pages protected.

Define a new PGMAP flag to distinguish page maps which are protected.
Use this flag to enable runtime protection support.  A static key is
used to optimize the runtime support.

Specifying this flag on a system which can't support protections will
fail.  Callers are expected to check if protections are supported via
pgmap_protection_available().  An alternative, having callers specify
the flag and then check whether the returned dev_pagemap object was
protected, was considered but judged less efficient than a direct check
beforehand.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Clean up commit message

Changes for V8
	Split this out into its own patch
---
 include/linux/memremap.h |  1 +
 mm/memremap.c            | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acba..84402f73712c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -80,6 +80,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROTECTION	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..38d321cc59c2 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -63,6 +63,37 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+
+/*
+ * Note: all devices which have asked for protections share the same key.  The
+ * key may, or may not, have been provided by the core.  If not, protection
+ * will be disabled.  The key acquisition is attempted when the first ZONE
+ * DEVICE requests it and freed when all zones have been unmapped.
+ *
+ * Also this must be EXPORT_SYMBOL rather than EXPORT_SYMBOL_GPL because it is
+ * intended to be used in the kmap API.
+ */
+DEFINE_STATIC_KEY_FALSE(dev_pgmap_protection_static_key);
+EXPORT_SYMBOL(dev_pgmap_protection_static_key);
+
+static void devmap_protection_enable(void)
+{
+	static_branch_inc(&dev_pgmap_protection_static_key);
+}
+
+static void devmap_protection_disable(void)
+{
+	static_branch_dec(&dev_pgmap_protection_static_key);
+}
+
+#else /* !CONFIG_DEVMAP_ACCESS_PROTECTION */
+
+static void devmap_protection_enable(void) { }
+static void devmap_protection_disable(void) { }
+
+#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct range *range)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -162,6 +193,9 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put(pgmap);
+
+	if (pgmap->flags & PGMAP_PROTECTION)
+		devmap_protection_disable();
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -308,6 +342,12 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
 		return ERR_PTR(-EINVAL);
 
+	if (pgmap->flags & PGMAP_PROTECTION) {
+		if (!pgmap_protection_available())
+			return ERR_PTR(-EINVAL);
+		devmap_protection_enable();
+	}
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 37/45] memremap_pages: Introduce devmap_protected()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (35 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 36/45] memremap_pages: Introduce a PGMAP_PROTECTION flag ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 38/45] memremap_pages: Reserve a PKS pkey for eventual use by PMEM ira.weiny
                   ` (8 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Consumers of protected dev_pagemaps can check the PGMAP_PROTECTION flag to
see if the devmap is protected.  However, most contexts will have a struct
page rather than the dev_pagemap structure itself.

Define devmap_protected() to determine if a page is part of a
dev_pagemap mapping and if the page is protected by additional
protections.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/mm.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9ab799403004..4ca24329848a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1151,6 +1151,23 @@ static inline bool pgmap_protection_available(void)
 	return pks_available();
 }
 
+DECLARE_STATIC_KEY_FALSE(dev_pgmap_protection_static_key);
+
+/*
+ * devmap_protected() requires a reference on the page to ensure there is no
+ * races with dev_pagemap tear down.
+ */
+static inline bool devmap_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_pgmap_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROTECTION)
+		return true;
+	return false;
+}
+
 #else
 
 static inline bool pgmap_protection_available(void)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 38/45] memremap_pages: Reserve a PKS pkey for eventual use by PMEM
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (36 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 37/45] memremap_pages: Introduce devmap_protected() ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 39/45] memremap_pages: Set PKS pkey in PTEs if requested ira.weiny
                   ` (7 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Reserve a pkey for use by the memremap_pages facility and set its
default protection to Access Disabled.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Adjust for new key allocation
	From Dave Hansen
		use pkey
---
 include/linux/pks-keys.h | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
index f7e82e462659..32075ac54964 100644
--- a/include/linux/pks-keys.h
+++ b/include/linux/pks-keys.h
@@ -61,7 +61,9 @@
 /* PKS_KEY_DEFAULT must be 0 */
 #define PKS_KEY_DEFAULT		0
 #define PKS_KEY_TEST		PKS_NEW_KEY(PKS_KEY_DEFAULT, CONFIG_PKS_TEST)
-#define PKS_KEY_MAX		PKS_NEW_KEY(PKS_KEY_TEST, 1)
+#define PKS_KEY_PGMAP_PROTECTION \
+		PKS_NEW_KEY(PKS_KEY_TEST, CONFIG_DEVMAP_ACCESS_PROTECTION)
+#define PKS_KEY_MAX		PKS_NEW_KEY(PKS_KEY_PGMAP_PROTECTION, 1)
 
 #ifdef CONFIG_PKS_TEST_ALL_KEYS
 #undef PKS_KEY_MAX
@@ -72,6 +74,8 @@
 #define PKS_KEY_DEFAULT_INIT	PKS_DECLARE_INIT_VALUE(PKS_KEY_DEFAULT, RW, 1)
 #define PKS_KEY_TEST_INIT	PKS_DECLARE_INIT_VALUE(PKS_KEY_TEST, AD, \
 							CONFIG_PKS_TEST)
+#define PKS_KEY_PGMAP_INIT	PKS_DECLARE_INIT_VALUE(PKS_KEY_PGMAP_PROTECTION, \
+					AD, CONFIG_DEVMAP_ACCESS_PROTECTION)
 
 #define PKS_ALL_AD_MASK \
 	GENMASK(PKS_NUM_PKEYS * PKR_BITS_PER_PKEY, \
@@ -79,7 +83,8 @@
 
 #define PKS_INIT_VALUE ((PKS_ALL_AD & PKS_ALL_AD_MASK) | \
 			PKS_KEY_DEFAULT_INIT | \
-			PKS_KEY_TEST_INIT \
+			PKS_KEY_TEST_INIT | \
+			PKS_KEY_PGMAP_INIT \
 			)
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 39/45] memremap_pages: Set PKS pkey in PTEs if requested
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (37 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 38/45] memremap_pages: Reserve a PKS pkey for eventual use by PMEM ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 40/45] memremap_pages: Define pgmap_set_{readwrite|noaccess}() calls ira.weiny
                   ` (6 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

When a devmap caller requests protections, the dev_pagemap PTEs need to
have a pkey set.

When PGMAP_PROTECTION is requested, add the pkey to the page
protections.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	From Dave Hansen
		use pkey
---
 mm/memremap.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/memremap.c b/mm/memremap.c
index 38d321cc59c2..cefdf541bcc1 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -82,6 +82,14 @@ static void devmap_protection_enable(void)
 	static_branch_inc(&dev_pgmap_protection_static_key);
 }
 
+static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot)
+{
+	pgprotval_t val;
+
+	val = pgprot_val(prot);
+	return __pgprot(val | _PAGE_PKEY(PKS_KEY_PGMAP_PROTECTION));
+}
+
 static void devmap_protection_disable(void)
 {
 	static_branch_dec(&dev_pgmap_protection_static_key);
@@ -92,6 +100,10 @@ static void devmap_protection_disable(void)
 static void devmap_protection_enable(void) { }
 static void devmap_protection_disable(void) { }
 
+static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot)
+{
+	return prot;
+}
 #endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
 
 static void pgmap_array_delete(struct range *range)
@@ -346,6 +358,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		if (!pgmap_protection_available())
 			return ERR_PTR(-EINVAL);
 		devmap_protection_enable();
+		params.pgprot = devmap_protection_adjust_pgprot(params.pgprot);
 	}
 
 	switch (pgmap->type) {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH V9 40/45] memremap_pages: Define pgmap_set_{readwrite|noaccess}() calls
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (38 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 39/45] memremap_pages: Set PKS pkey in PTEs if requested ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 41/45] memremap_pages: Add memremap.pks_fault_mode ira.weiny
                   ` (5 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

A thread that wants to access memory protected by PGMAP protections must
first enable access, and then disable access when it is done.

Introduce pgmap_set_{readwrite|noaccess}() for this purpose.  The two
calls are destined to be used by the kmap API and take a struct page for
convenience.  They determine if the page is protected and, if so,
perform the requested operation.

Toggling between Read/Write and No Access was chosen as it fits well
with the accessibility of a kmap'ed page.  Discussions did occur
regarding making a finer grained mapping for Read Only but that is
something which can be added at a later date.

In addition, two lower level functions are exported.  They take the
dev_pagemap object directly for internal consumers which have direct
knowledge of the dev_pagemap.

All changes in the protections must be through the above calls.  They
abstract the protection implementation (currently the PKS api) from
upper layer consumers.

The calls are made nestable by the use of a per task reference count.
This ensures that the first call to re-enable protection does not
'break' the last access of the device memory.  Expansion of the task
struct is unavoidable due to the desire to keep kmap_local_page()
non-atomic and migratable.  The only other idea considered for tracking
the reference count was a per-cpu variable.  However, doing so would
make kmap_local_page() equivalent to kmap_atomic(), which is
undesirable.

Access to device memory during exceptions (#PF) is expected only from
user faults.  Therefore there is no need to maintain the reference count
during exceptions.

NOTE: It is not anticipated that any code path will directly nest these
calls.  For this reason, multiple reviewers, including Dan and Thomas,
asked why this reference counting was needed at this level rather than
in a higher level call such as kmap_local_page().  The reason is that
pgmap_set_readwrite() can nest with kmap_{atomic,local_page}().
Therefore this reference counting is pushed to the lower level to ensure
that any combination of calls is nestable.
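
For illustration only, the nesting which the count makes safe looks
something like this (a sketch; 'page', 'src', and 'len' are
hypothetical, and the kmap transitions assume the hooks added later in
this series):

	pgmap_set_readwrite(page);	/* count 0 -> 1: access enabled */
	addr = kmap_local_page(page);	/* count 1 -> 2 */
	memcpy(addr, src, len);
	kunmap_local(addr);		/* count 2 -> 1: access stays enabled */
	...
	pgmap_set_noaccess(page);	/* count 1 -> 0: access disabled */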

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	From Dan Williams
		Update the commit message with details on why the thread
			struct needs to be expanded.
	Following Dave Hansen's suggestion for the pks_mk_*() rename
		s/pgmap_mk_*/pgmap_set_*/

Changes for V8
	Split these functions into their own patch.
		This helps to clarify the commit message and usage.
---
 include/linux/mm.h    | 35 +++++++++++++++++++++++++++++++++++
 include/linux/sched.h |  7 +++++++
 init/init_task.c      |  3 +++
 mm/memremap.c         | 14 ++++++++++++++
 4 files changed, 59 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4ca24329848a..c85189b24eca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1168,8 +1168,43 @@ static inline bool devmap_protected(struct page *page)
 	return false;
 }
 
+void __pgmap_set_readwrite(struct dev_pagemap *pgmap);
+void __pgmap_set_noaccess(struct dev_pagemap *pgmap);
+
+static inline bool pgmap_check_pgmap_prot(struct page *page)
+{
+	if (!devmap_protected(page))
+		return false;
+
+	/*
+	 * There is no known use case to change permissions in an irq for pgmap
+	 * pages
+	 */
+	lockdep_assert_in_irq();
+	return true;
+}
+
+static inline void pgmap_set_readwrite(struct page *page)
+{
+	if (!pgmap_check_pgmap_prot(page))
+		return;
+	__pgmap_set_readwrite(page->pgmap);
+}
+
+static inline void pgmap_set_noaccess(struct page *page)
+{
+	if (!pgmap_check_pgmap_prot(page))
+		return;
+	__pgmap_set_noaccess(page->pgmap);
+}
+
 #else
 
+static inline void __pgmap_set_readwrite(struct dev_pagemap *pgmap) { }
+static inline void __pgmap_set_noaccess(struct dev_pagemap *pgmap) { }
+static inline void pgmap_set_readwrite(struct page *page) { }
+static inline void pgmap_set_noaccess(struct page *page) { }
+
 static inline bool pgmap_protection_available(void)
 {
 	return false;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..a79f2090e291 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,13 @@ struct task_struct {
 	struct callback_head		l1d_flush_kill;
 #endif
 
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	/*
+	 * NOTE: pgmap_prot_count is modified within a single thread of
+	 * execution.  So it does not need to be atomic_t.
+	 */
+	u32                             pgmap_prot_count;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index 73cc8f03511a..948b32cf8139 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	.pgmap_prot_count = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/mm/memremap.c b/mm/memremap.c
index cefdf541bcc1..6fa259748a0b 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -95,6 +95,20 @@ static void devmap_protection_disable(void)
 	static_branch_dec(&dev_pgmap_protection_static_key);
 }
 
+void __pgmap_set_readwrite(struct dev_pagemap *pgmap)
+{
+	if (!current->pgmap_prot_count++)
+		pks_set_readwrite(PKS_KEY_PGMAP_PROTECTION);
+}
+EXPORT_SYMBOL_GPL(__pgmap_set_readwrite);
+
+void __pgmap_set_noaccess(struct dev_pagemap *pgmap)
+{
+	if (!--current->pgmap_prot_count)
+		pks_set_noaccess(PKS_KEY_PGMAP_PROTECTION);
+}
+EXPORT_SYMBOL_GPL(__pgmap_set_noaccess);
+
 #else /* !CONFIG_DEVMAP_ACCESS_PROTECTION */
 
 static void devmap_protection_enable(void) { }
-- 
2.35.1



* [PATCH V9 41/45] memremap_pages: Add memremap.pks_fault_mode
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (39 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 40/45] memremap_pages: Define pgmap_set_{readwrite|noaccess}() calls ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 42/45] kmap: Make kmap work for devmap protected pages ira.weiny
                   ` (4 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

When PKS protections for PMEM are enabled, the kernel may capture stray
writes, or it may capture false positive access violations. An example
of a false positive access violation is a code path that neglects to
call kmap_{atomic,local_page}, but is otherwise a valid access. In the
false positive scenario there is no actual risk to data integrity, but
the kernel still needs to make a decision as to whether to report the
access violation and continue, or treat the violation as fatal. That
policy decision is captured in a new pks_fault_mode kernel parameter.

Two modes are available:

	'relaxed' (default) -- WARN_ONCE, remove the protections, and
	continue to operate.

	'strict' -- Stop kernel execution via fault.  This is the most
	protective of the PMEM memory but may be undesirable in some
	configurations.
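
For example, the mode can be selected at boot:

	memremap.pks_fault_mode=strict

or, since the parameter is 0644, at runtime (assuming the usual
module_param() sysfs path):

	echo relaxed > /sys/module/memremap/parameters/pks_fault_mode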

NOTE: There was some debate about whether a 3rd mode called 'silent'
should be available.  'silent' would be the same as 'relaxed' but would
not print any output.  While 'silent' would be nice for admins to reduce
console/log output, it would also reduce the motivation to fix invalid
accesses to the protected pmem pages.  Therefore, 'silent' is left out.

NOTE: The __param_check macro requires a type to correctly verify the
values passed as the module parameter.  Therefore a typedef is made for
pks_fault_modes, and the checkpatch warning regarding new typedefs is
ignored.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	From Dan Williams
		Clarify commit message
		Remove code comment regarding checkpatch
	From Rick Edgecombe
		Remove unnecessary initialization

Changes for V8
	Use pks_update_exception() instead of abandoning the pkey.
	Split out pgmap_protection_flag_invalid() into a separate patch
		for clarity.
	From Rick Edgecombe
		Fix sysfs_streq() checks
	From Randy Dunlap
		Fix Documentation closing parens

Changes for V7
	Leverage Rick Edgecombe's fault callback infrastructure to relax invalid
		uses and prevent crashes
	From Dan Williams
		Use sysfs_* calls for parameter
		Make pgmap_disable_protection inline
		Remove pfn from warn output
	Remove silent parameter option
---
 .../admin-guide/kernel-parameters.txt         | 12 ++++
 arch/x86/mm/pkeys.c                           |  4 ++
 include/linux/mm.h                            |  3 +
 mm/memremap.c                                 | 65 +++++++++++++++++++
 4 files changed, 84 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7123524a86b8..c9556843012d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4158,6 +4158,18 @@
 	pirq=		[SMP,APIC] Manual mp-table setup
 			See Documentation/x86/i386/IO-APIC.rst.
 
+	memremap.pks_fault_mode=	[X86] Control the behavior of page map
+			protection violations.
+			(depends on CONFIG_DEVMAP_ACCESS_PROTECTION)
+
+			Format: { relaxed | strict }
+
+			relaxed - Print a warning, disable the protection and
+				  continue execution.
+			strict - Stop kernel execution via fault
+
+			default: relaxed
+
 	plip=		[PPT,NET] Parallel port network link
 			Format: { parport<nr> | timid | 0 }
 			See also Documentation/admin-guide/parport.rst.
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index fd2ba269e64a..19ca3ef5389c 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -8,6 +8,7 @@
 #include <linux/pkeys.h>                /* PKEY_*                       */
 #include <linux/pks.h>
 #include <linux/pks-keys.h>
+#include <linux/mm.h>                   /* fault callback               */
 #include <uapi/asm-generic/mman-common.h>
 
 #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
@@ -249,6 +250,9 @@ static const pks_key_callback pks_key_callbacks[PKS_KEY_MAX] = {
 #ifdef CONFIG_PKS_TEST
 	[PKS_KEY_TEST]		= pks_test_fault_callback,
 #endif
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	[PKS_KEY_PGMAP_PROTECTION]   = pgmap_pks_fault_callback,
+#endif
 };
 
 #else /* CONFIG_PKS_TEST_ALL_KEYS */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c85189b24eca..34ed04a3ea74 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1198,6 +1198,9 @@ static inline void pgmap_set_noaccess(struct page *page)
 	__pgmap_set_noaccess(page->pgmap);
 }
 
+bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
+			      bool write);
+
 #else
 
 static inline void __pgmap_set_readwrite(struct dev_pagemap *pgmap) { }
diff --git a/mm/memremap.c b/mm/memremap.c
index 6fa259748a0b..aa2e40681bcf 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -13,6 +13,8 @@
 #include <linux/wait_bit.h>
 #include <linux/xarray.h>
 
+#include <uapi/asm-generic/mman-common.h>
+
 static DEFINE_XARRAY(pgmap_array);
 
 /*
@@ -95,6 +97,69 @@ static void devmap_protection_disable(void)
 	static_branch_dec(&dev_pgmap_protection_static_key);
 }
 
+typedef enum {
+	PKS_MODE_STRICT  = 0,
+	PKS_MODE_RELAXED = 1,
+} pks_fault_modes;
+
+pks_fault_modes pks_fault_mode = PKS_MODE_RELAXED;
+
+static int param_set_pks_fault_mode(const char *val, const struct kernel_param *kp)
+{
+	int ret = -EINVAL;
+
+	if (sysfs_streq(val, "relaxed")) {
+		pks_fault_mode = PKS_MODE_RELAXED;
+		ret = 0;
+	} else if (sysfs_streq(val, "strict")) {
+		pks_fault_mode = PKS_MODE_STRICT;
+		ret = 0;
+	}
+
+	return ret;
+}
+
+static int param_get_pks_fault_mode(char *buffer, const struct kernel_param *kp)
+{
+	int ret;
+
+	switch (pks_fault_mode) {
+	case PKS_MODE_STRICT:
+		ret = sysfs_emit(buffer, "strict\n");
+		break;
+	case PKS_MODE_RELAXED:
+		ret = sysfs_emit(buffer, "relaxed\n");
+		break;
+	default:
+		ret = sysfs_emit(buffer, "<unknown>\n");
+		break;
+	}
+
+	return ret;
+}
+
+static const struct kernel_param_ops param_ops_pks_fault_modes = {
+	.set = param_set_pks_fault_mode,
+	.get = param_get_pks_fault_mode,
+};
+
+#define param_check_pks_fault_modes(name, p) \
+	__param_check(name, p, pks_fault_modes)
+module_param(pks_fault_mode, pks_fault_modes, 0644);
+
+bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
+			      bool write)
+{
+	/* In strict mode just let the fault handler oops */
+	if (pks_fault_mode == PKS_MODE_STRICT)
+		return false;
+
+	WARN_ONCE(1, "Page map protection being disabled");
+	pks_update_exception(regs, PKS_KEY_PGMAP_PROTECTION, PKEY_READ_WRITE);
+	return true;
+}
+EXPORT_SYMBOL_GPL(pgmap_pks_fault_callback);
+
 void __pgmap_set_readwrite(struct dev_pagemap *pgmap)
 {
 	if (!current->pgmap_prot_count++)
-- 
2.35.1



* [PATCH V9 42/45] kmap: Make kmap work for devmap protected pages
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (40 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 41/45] memremap_pages: Add memremap.pks_fault_mode ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 43/45] dax: Stray access protection for dax_direct_access() ira.weiny
                   ` (3 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Today, kmap_{local_page,atomic}() handle granting access to HIGHMEM
pages without the caller needing to know if the page is HIGHMEM or not.
Use that existing infrastructure to grant access to PGMAP (PKS)
protected pages.

kmap_{local_page,atomic}() both create thread local mappings, so they
work well with the thread specific protections available within PKS.

On the other hand, the kmap() call is not changed.  kmap() allows a
mapping to be shared with other threads, while PKS protections operate
on a thread local basis.  For this reason, and because of the desire to
move away from such mappings, kmap() is left unsupported.

This behavior is safe because neither of the two current DAX-capable
filesystems (ext4 and xfs) performs such global mappings.  And the known
device drivers that would handle devmap pages do not use kmap().  Any
future filesystems that gain DAX support, or device drivers wanting to
support devmap protected pages, will need to use kmap_local_page().
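
As an illustration (a sketch only; 'pmem_page', 'dst', and 'len' are
hypothetical):

	/* Supported: the map/unmap pair brackets the PKS access window */
	addr = kmap_local_page(pmem_page);
	memcpy(dst, addr, len);
	kunmap_local(addr);

	/*
	 * Unsupported: kmap() still returns an address, but accesses
	 * through it will fault because PKS permissions are thread local.
	 */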

Note: HIGHMEM support is mutually exclusive with PGMAP protection.  The
rationale is mainly to reduce complexity, but also that direct-map
exposure is already mitigated by default on HIGHMEM systems, which by
definition do not have large capacities of memory in the direct map.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	From Dan Williams
		Update commit message
			Clarify why kmap is not 'compatible' with PKS
			Explain the HIGHMEM system exclusion more
	Remove pgmap_protection_flag_invalid() from kmap
	s/pks_mk*/pks_set*/

Changes for V8
	Reword commit message
---
 include/linux/highmem-internal.h | 4 ++++
 mm/Kconfig                       | 1 +
 2 files changed, 5 insertions(+)

diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
index 0a0b2b09b1b8..71605cf6044b 100644
--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -174,6 +174,7 @@ static inline void kunmap(struct page *page)
 
 static inline void *kmap_local_page(struct page *page)
 {
+	pgmap_set_readwrite(page);
 	return page_address(page);
 }
 
@@ -197,6 +198,7 @@ static inline void __kunmap_local(void *addr)
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
 	kunmap_flush_on_unmap(addr);
 #endif
+	pgmap_set_noaccess(kmap_to_page(addr));
 }
 
 static inline void *kmap_atomic(struct page *page)
@@ -206,6 +208,7 @@ static inline void *kmap_atomic(struct page *page)
 	else
 		preempt_disable();
 	pagefault_disable();
+	pgmap_set_readwrite(page);
 	return page_address(page);
 }
 
@@ -224,6 +227,7 @@ static inline void __kunmap_atomic(void *addr)
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
 	kunmap_flush_on_unmap(addr);
 #endif
+	pgmap_set_noaccess(kmap_to_page(addr));
 	pagefault_enable();
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		migrate_enable();
diff --git a/mm/Kconfig b/mm/Kconfig
index ba8a557dcf81..4e33ff11b7a9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -779,6 +779,7 @@ config ZONE_DEVICE
 config DEVMAP_ACCESS_PROTECTION
 	bool "Access protection for memremap_pages()"
 	depends on NVDIMM_PFN
+	depends on !HIGHMEM
 	depends on ARCH_HAS_SUPERVISOR_PKEYS
 	select ARCH_ENABLE_PKS_CONSUMER
 	default n
-- 
2.35.1



* [PATCH V9 43/45] dax: Stray access protection for dax_direct_access()
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (41 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 42/45] kmap: Make kmap work for devmap protected pages ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 44/45] nvdimm/pmem: Enable stray access protection ira.weiny
                   ` (2 subsequent siblings)
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

dax_direct_access() provides a way to obtain the direct map address of
PMEM memory.  With the new devmap protections, use of this address must
be bracketed by calls which enable and disable protection of those
pages.  These calls are only needed to guard actual access to the
memory; other uses of dax_direct_access() do not need these guards.

Introduce two new calls, dax_set_readwrite() and dax_set_noaccess().
Bracket all uses of the address returned by dax_direct_access() with
those calls.
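
For illustration, a typical bracketed access looks like this (a sketch
mirroring the call sites updated below):

	id = dax_read_lock();
	nr = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
	if (nr > 0) {
		dax_set_readwrite(dax_dev);
		memset(kaddr, 0, PAGE_SIZE);
		dax_flush(dax_dev, kaddr, PAGE_SIZE);
		dax_set_noaccess(dax_dev);
	}
	dax_read_unlock(id);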

For consumers who require a permanent address to the dax device, such as
the DM write cache, dax_map_protected() is used to query for additional
protections.

Update the DM write cache code to create a permanent mapping if
dax_map_protected() is true.

Cc: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Do not add a new dax operation.  Instead teach struct dax_device
		about the dev_pagemap PGMAP_PROTECTION flag and call the
		ops directly if needed.
	s/dax_mk_*/dax_set_*/

Changes for V8
	Rebase changes on 5.17-rc1
	Clean up the cover letter
		dax_read_lock() is not required
		s/dax_protected()/dax_map_protected()/
	Testing revealed a dax_flush() which was not properly protected.

Changes for V7
	Rework cover letter.
	Do not include a FS_DAX_LIMITED restriction for dcss.  It will
		simply not implement the protection and there is no need
		to special case this.
		Clean up commit message because I did not originally
		understand the nuance of the s390 device.
	Introduce dax_{protected,mk_readwrite,mk_noaccess}()
	From Dan Williams
		Remove old clean up cruft from previous versions
		Remove map_protected
	Remove 'global' parameters all calls
---
 drivers/dax/super.c        | 59 ++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-writecache.c |  8 +++++-
 fs/dax.c                   |  8 ++++++
 fs/fuse/virtio_fs.c        |  2 ++
 include/linux/dax.h        |  5 ++++
 5 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e3029389d809..6dbceffb43b4 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -28,6 +28,7 @@ struct dax_device {
 	void *private;
 	unsigned long flags;
 	const struct dax_operations *ops;
+	struct dev_pagemap *pgmap;
 };
 
 static dev_t dax_devt;
@@ -117,6 +118,8 @@ enum dax_device_flags {
  * @pgoff: offset in pages from the start of the device to translate
  * @nr_pages: number of consecutive pages caller can handle relative to @pfn
  * @kaddr: output parameter that returns a virtual address mapping of pfn
+ *         Direct access through this pointer must be guarded by calls to
+ *         dax_set_{readwrite,noaccess}()
  * @pfn: output parameter that returns an absolute pfn translation of @pgoff
  *
  * Return: negative errno if an error occurs, otherwise the number of
@@ -209,6 +212,56 @@ void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 #endif
 EXPORT_SYMBOL_GPL(dax_flush);
 
+bool dax_map_protected(struct dax_device *dax_dev)
+{
+	struct dev_pagemap *pgmap = dax_dev->pgmap;
+
+	if (!dax_alive(dax_dev))
+		return false;
+
+	return pgmap && (pgmap->flags & PGMAP_PROTECTION);
+}
+EXPORT_SYMBOL_GPL(dax_map_protected);
+
+/**
+ * dax_set_readwrite() - make protected dax devices read/write
+ * @dax_dev: the dax device representing the memory to access
+ *
+ * Any access of the kaddr memory returned from dax_direct_access() must be
+ * guarded by dax_set_readwrite() and dax_set_noaccess().  This ensures that any
+ * dax devices which have additional protections are allowed to relax those
+ * protections for the thread using this memory.
+ *
+ * NOTE these calls must be contained within a single thread of execution and
+ * both must be guarded by dax_read_lock(), which is also a requirement for
+ * dax_direct_access() anyway.
+ */
+void dax_set_readwrite(struct dax_device *dax_dev)
+{
+	if (!dax_map_protected(dax_dev))
+		return;
+
+	__pgmap_set_readwrite(dax_dev->pgmap);
+}
+EXPORT_SYMBOL_GPL(dax_set_readwrite);
+
+/**
+ * dax_set_noaccess() - restore protection to dax devices if needed
+ * @dax_dev: the dax device representing the memory to access
+ *
+ * See dax_direct_access() and dax_set_readwrite()
+ *
+ * NOTE Must be called prior to dax_read_unlock()
+ */
+void dax_set_noaccess(struct dax_device *dax_dev)
+{
+	if (!dax_map_protected(dax_dev))
+		return;
+
+	__pgmap_set_noaccess(dax_dev->pgmap);
+}
+EXPORT_SYMBOL_GPL(dax_set_noaccess);
+
 void dax_write_cache(struct dax_device *dax_dev, bool wc)
 {
 	if (wc)
@@ -248,6 +301,12 @@ void set_dax_nomc(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(set_dax_nomc);
 
+void set_dax_pgmap(struct dax_device *dax_dev, struct dev_pagemap *pgmap)
+{
+	dax_dev->pgmap = pgmap;
+}
+EXPORT_SYMBOL_GPL(set_dax_pgmap);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
 	lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 4f31591d2d25..5d6d7b6bad30 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -297,7 +297,13 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 		r = -EOPNOTSUPP;
 		goto err2;
 	}
-	if (da != p) {
+
+	/*
+	 * Force the write cache to map the pages directly if the dax device
+	 * mapping is protected or if the number of pages returned was not what
+	 * was requested.
+	 */
+	if (dax_map_protected(wc->ssd_dev->dax_dev) || da != p) {
 		long i;
 		wc->memory_map = NULL;
 		pages = kvmalloc_array(p, sizeof(struct page *), GFP_KERNEL);
diff --git a/fs/dax.c b/fs/dax.c
index cd03485867a7..c126520b41d5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -728,7 +728,9 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
 		return rc;
 	}
 	vto = kmap_atomic(vmf->cow_page);
+	dax_set_readwrite(iter->iomap.dax_dev);
 	copy_user_page(vto, kaddr, vmf->address, vmf->cow_page);
+	dax_set_noaccess(iter->iomap.dax_dev);
 	kunmap_atomic(vto);
 	dax_read_unlock(id);
 	return 0;
@@ -937,8 +939,10 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 	count = 1UL << dax_entry_order(entry);
 	index = xas->xa_index & ~(count - 1);
 
+	dax_set_readwrite(dax_dev);
 	dax_entry_mkclean(mapping, index, pfn);
 	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
+	dax_set_noaccess(dax_dev);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -1125,8 +1129,10 @@ static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
 
 	ret = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
 	if (ret > 0) {
+		dax_set_readwrite(dax_dev);
 		memset(kaddr + offset, 0, size);
 		dax_flush(dax_dev, kaddr + offset, size);
+		dax_set_noaccess(dax_dev);
 	}
 	return ret;
 }
@@ -1260,12 +1266,14 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		if (map_len > end - pos)
 			map_len = end - pos;
 
+		dax_set_readwrite(dax_dev);
 		if (iov_iter_rw(iter) == WRITE)
 			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
 		else
 			xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
+		dax_set_noaccess(dax_dev);
 
 		pos += xfer;
 		length -= xfer;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 9d737904d07c..542c8dc95021 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -774,8 +774,10 @@ static int virtio_fs_zero_page_range(struct dax_device *dax_dev,
 	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
 	if (rc < 0)
 		return rc;
+	dax_set_readwrite(dax_dev);
 	memset(kaddr, 0, nr_pages << PAGE_SHIFT);
 	dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
+	dax_set_noaccess(dax_dev);
 	return 0;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9fc5f99a0ae2..30fe49f9ec9d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -91,6 +91,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 
 void set_dax_nocache(struct dax_device *dax_dev);
 void set_dax_nomc(struct dax_device *dax_dev);
+void set_dax_pgmap(struct dax_device *dax_dev, struct dev_pagemap *pgmap);
 
 struct writeback_control;
 #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
@@ -187,6 +188,10 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 			size_t nr_pages);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
+bool dax_map_protected(struct dax_device *dax_dev);
+void dax_set_readwrite(struct dax_device *dax_dev);
+void dax_set_noaccess(struct dax_device *dax_dev);
+
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops);
 vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
-- 
2.35.1



* [PATCH V9 44/45] nvdimm/pmem: Enable stray access protection
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (42 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 43/45] dax: Stray access protection for dax_direct_access() ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-10 17:20 ` [PATCH V9 45/45] devdax: " ira.weiny
  2022-03-31 17:13 ` [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection Ira Weiny
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The persistent memory (PMEM) driver uses the memremap_pages facility to
provide 'struct page' metadata (vmemmap) for PMEM.  Given that PMEM
capacity may be orders of magnitude higher than that of System RAM, it
presents a large vulnerability surface to stray writes.  Unlike stray
writes to System RAM, which may result in a crash or other undesirable
behavior, stray writes to PMEM are more likely to result in permanent
data loss.  Reboot is not a remediation for PMEM corruption as it is for
System RAM.

Now that all valid kernel accesses to PMEM have been annotated with
{__}pgmap_set_{readwrite,noaccess}(), PGMAP_PROTECTION is safe to enable
in the pmem layer.

Set PGMAP_PROTECTION if pgmap protections are available and set the
pgmap property of the dax device for its use.

Internally, the pmem driver uses a cached virtual address,
pmem->virt_addr (pmem_addr).  Call __pgmap_set_{readwrite,noaccess}()
directly when PGMAP_PROTECTION is active on those mappings.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Remove the dax operations and pass the pgmap to the dax_device
		for its use.
	s/pgmap_mk_*/pgmap_set_*/
	s/pmem_mk_*/pmem_set_*/

Changes for V8
	Rebase to 5.17-rc1
	Remove global param
	Add internal structure which uses the pmem device and pgmap
		device directly in the *_mk_*() calls.
	Add pmem dax ops callbacks
	Use pgmap_protection_available()
	s/PGMAP_PKEY_PROTECT/PGMAP_PROTECTION
---
 drivers/nvdimm/pmem.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..2c7b18da7974 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -138,6 +138,18 @@ static blk_status_t read_pmem(struct page *page, unsigned int off,
 	return BLK_STS_OK;
 }
 
+static void pmem_set_readwrite(struct pmem_device *pmem)
+{
+	if (pmem->pgmap.flags & PGMAP_PROTECTION)
+		__pgmap_set_readwrite(&pmem->pgmap);
+}
+
+static void pmem_set_noaccess(struct pmem_device *pmem)
+{
+	if (pmem->pgmap.flags & PGMAP_PROTECTION)
+		__pgmap_set_noaccess(&pmem->pgmap);
+}
+
 static blk_status_t pmem_do_read(struct pmem_device *pmem,
 			struct page *page, unsigned int page_off,
 			sector_t sector, unsigned int len)
@@ -149,7 +161,11 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
 	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
 		return BLK_STS_IOERR;
 
+	/* Enable direct use of pmem->virt_addr */
+	pmem_set_readwrite(pmem);
 	rc = read_pmem(page, page_off, pmem_addr, len);
+	pmem_set_noaccess(pmem);
+
 	flush_dcache_page(page);
 	return rc;
 }
@@ -181,11 +197,15 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
 	 * after clear poison.
 	 */
 	flush_dcache_page(page);
+
+	/* Enable direct use of pmem->virt_addr */
+	pmem_set_readwrite(pmem);
 	write_pmem(pmem_addr, page, page_off, len);
 	if (unlikely(bad_pmem)) {
 		rc = pmem_clear_poison(pmem, pmem_off, len);
 		write_pmem(pmem_addr, page, page_off, len);
 	}
+	pmem_set_noaccess(pmem);
 
 	return rc;
 }
@@ -427,6 +447,8 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		if (pgmap_protection_available())
+			pmem->pgmap.flags |= PGMAP_PROTECTION;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -440,6 +462,8 @@ static int pmem_attach_disk(struct device *dev,
 		pmem->pgmap.range.end = res->end;
 		pmem->pgmap.nr_range = 1;
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		if (pgmap_protection_available())
+			pmem->pgmap.flags |= PGMAP_PROTECTION;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pmem->pfn_flags |= PFN_MAP;
 		bb_range = pmem->pgmap.range;
@@ -481,6 +505,8 @@ static int pmem_attach_disk(struct device *dev,
 	}
 	set_dax_nocache(dax_dev);
 	set_dax_nomc(dax_dev);
+	if (pmem->pgmap.flags & PGMAP_PROTECTION)
+		set_dax_pgmap(dax_dev, &pmem->pgmap);
 	if (is_nvdimm_sync(nd_region))
 		set_dax_synchronous(dax_dev);
 	rc = dax_add_host(dax_dev, disk);
-- 
2.35.1



* [PATCH V9 45/45] devdax: Enable stray access protection
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (43 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 44/45] nvdimm/pmem: Enable stray access protection ira.weiny
@ 2022-03-10 17:20 ` ira.weiny
  2022-03-31 17:13 ` [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection Ira Weiny
  45 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-03-10 17:20 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Device dax is primarily accessed through user space, and kernel access
is controlled through the kmap interfaces.

Now that all valid kernel-initiated accesses to dax devices have been
accounted for, turn on PGMAP_PROTECTION for device dax.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V9
	Add Review tag

Changes for V8
	Rebase to 5.17-rc1
	Use pgmap_protection_available()
	s/PGMAP_PKEYS_PROTECT/PGMAP_PROTECTION/
---
 drivers/dax/device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index d33a0613ed0c..cee375ef2cac 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -452,6 +452,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	if (dev_dax->align > PAGE_SIZE)
 		pgmap->vmemmap_shift =
 			order_base_2(dev_dax->align >> PAGE_SHIFT);
+	if (pgmap_protection_available())
+		pgmap->flags |= PGMAP_PROTECTION;
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.35.1



* Re: [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection
  2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (44 preceding siblings ...)
  2022-03-10 17:20 ` [PATCH V9 45/45] devdax: " ira.weiny
@ 2022-03-31 17:13 ` Ira Weiny
  45 siblings, 0 replies; 49+ messages in thread
From: Ira Weiny @ 2022-03-31 17:13 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

On Thu, Mar 10, 2022 at 09:19:34AM -0800, Ira wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> 
> I'm looking for Intel acks on the series prior to submitting to maintainers.
> Most of the changes from V8 to V9 was in getting the tests straightened out.
> But there are some improvements in the actual code.

Is there any feedback on this?

Ira


* Re: [PATCH V9 01/45] entry: Create an internal irqentry_exit_cond_resched() call
  2022-03-10 17:19 ` [PATCH V9 01/45] entry: Create an internal irqentry_exit_cond_resched() call ira.weiny
@ 2022-04-07  2:48   ` Ira Weiny
  0 siblings, 0 replies; 49+ messages in thread
From: Ira Weiny @ 2022-04-07  2:48 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

On Thu, Mar 10, 2022 at 09:19:35AM -0800, Ira wrote:
> From: Ira Weiny <ira.weiny@intel.com>

Rebasing to 5.18-rc1 revealed that a different fix has been applied for this
work.[1]

Please disregard this patch.

Ira

[1] 4624a14f4daa ("sched/preempt: Simplify irqentry_exit_cond_resched()
callers") 


* Re: [PATCH V9 24/45] entry: Split up irqentry_exit_cond_resched()
  2022-03-10 17:19 ` [PATCH V9 24/45] entry: Split up irqentry_exit_cond_resched() ira.weiny
@ 2022-04-07  2:50   ` Ira Weiny
  0 siblings, 0 replies; 49+ messages in thread
From: Ira Weiny @ 2022-04-07  2:50 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, Shankar, Ravi V, linux-kernel

On Thu, Mar 10, 2022 at 09:19:58AM -0800, Ira wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Auxiliary pt_regs space needs to be manipulated by the generic
> entry/exit code.

Because of a fix to the irqentry_exit_cond_resched() code[1], this patch needed
rework upon rebasing to 5.18-rc1.

The basic design of this patch remains, but the code is different.
irqentry_exit_cond_resched() still needs to have pt_regs passed into it.

However, this could be safely ignored for this review cycle as well.

As soon as I have a series based on 5.18 I'll resend the full series.

Thanks for understanding,
Ira

[1] 4624a14f4daa ("sched/preempt: Simplify irqentry_exit_cond_resched()
callers") 

> 
> Normally irqentry_exit() would take care of handling any auxiliary
> pt_regs on exit.  Unfortunately, the call to
> irqentry_exit_cond_resched() from xen_pv_evtchn_do_upcall() bypasses the
> normal irqentry_exit() call.  Because of this bypass,
> irqentry_exit_cond_resched() will be required to handle any auxiliary
> pt_regs exit handling.  However, this prevents irqentry_exit() from
> being able to call irqentry_exit_cond_resched() while maintaining
> control of the auxiliary pt_regs.
> 
> Separate out the common functionality of irqentry_exit_cond_resched() so
> that functionality can be used by irqentry_exit().  Add a pt_regs
> parameter in anticipation of having irqentry_exit_cond_resched() handle
> the auxiliary pt_regs separately from irqentry_exit().
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes for V9
> 	Update commit message
> 
> Changes for V8
> 	New Patch
> ---
>  arch/x86/entry/common.c      | 2 +-
>  include/linux/entry-common.h | 3 ++-
>  kernel/entry/common.c        | 9 +++++++--
>  3 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 6c2826417b33..f1ba770d035d 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -309,7 +309,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  
>  	inhcall = get_and_clear_inhcall();
>  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> -		irqentry_exit_cond_resched();
> +		irqentry_exit_cond_resched(regs);
>  		instrumentation_end();
>  		restore_inhcall(inhcall);
>  	} else {
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index ddaffc983e62..14fd329847e7 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -451,10 +451,11 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
>  
>  /**
>   * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
> + * @regs:	Pointer to pt_regs of interrupted context
>   *
>   * Conditional reschedule with additional sanity checks.
>   */
> -void irqentry_exit_cond_resched(void);
> +void irqentry_exit_cond_resched(struct pt_regs *regs);
>  
>  void __irqentry_exit_cond_resched(void);
>  #ifdef CONFIG_PREEMPT_DYNAMIC
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 490442a48332..f4210a7fc84d 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -395,7 +395,7 @@ void __irqentry_exit_cond_resched(void)
>  DEFINE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
>  #endif
>  
> -void irqentry_exit_cond_resched(void)
> +static void exit_cond_resched(void)
>  {
>  	if (IS_ENABLED(CONFIG_PREEMPTION)) {
>  #ifdef CONFIG_PREEMPT_DYNAMIC
> @@ -406,6 +406,11 @@ void irqentry_exit_cond_resched(void)
>  	}
>  }
>  
> +void irqentry_exit_cond_resched(struct pt_regs *regs)
> +{
> +	exit_cond_resched();
> +}
> +
>  noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  {
>  	lockdep_assert_irqs_disabled();
> @@ -431,7 +436,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  		}
>  
>  		instrumentation_begin();
> -		irqentry_exit_cond_resched();
> +		exit_cond_resched();
>  		/* Covers both tracing and lockdep */
>  		trace_hardirqs_on();
>  		instrumentation_end();
> -- 
> 2.35.1
> 
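Pieced together from the hunks above, the post-patch shape of
kernel/entry/common.c is roughly the following.  This is a reconstruction for
readability, not verbatim source: the branch bodies that the diff context
elides are inferred from the __irqentry_exit_cond_resched static-call hunks
earlier in the thread.

/* Internal helper carrying the common conditional-reschedule logic. */
static void exit_cond_resched(void)
{
	if (IS_ENABLED(CONFIG_PREEMPTION)) {
#ifdef CONFIG_PREEMPT_DYNAMIC
		/* Dynamic preempt: dispatch through the patchable site. */
		static_call(__irqentry_exit_cond_resched)();
#else
		__irqentry_exit_cond_resched();
#endif
	}
}

/*
 * External entry point used by xen_pv_evtchn_do_upcall().  @regs is
 * unused for now; it anticipates the auxiliary pt_regs handling added
 * later in the series.
 */
void irqentry_exit_cond_resched(struct pt_regs *regs)
{
	exit_cond_resched();
}

irqentry_exit() itself calls the internal exit_cond_resched() directly, so it
retains control of the auxiliary pt_regs, while the Xen upcall path uses the
regs-aware wrapper.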


end of thread

Thread overview: 49+ messages
-- links below jump to the message on this page --
2022-03-10 17:19 [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection ira.weiny
2022-03-10 17:19 ` [PATCH V9 01/45] entry: Create an internal irqentry_exit_cond_resched() call ira.weiny
2022-04-07  2:48   ` Ira Weiny
2022-03-10 17:19 ` [PATCH V9 02/45] Documentation/protection-keys: Clean up documentation for User Space pkeys ira.weiny
2022-03-10 17:19 ` [PATCH V9 03/45] x86/pkeys: Clarify PKRU_AD_KEY macro ira.weiny
2022-03-10 17:19 ` [PATCH V9 04/45] x86/pkeys: Make PKRU macros generic ira.weiny
2022-03-10 17:19 ` [PATCH V9 05/45] x86/fpu: Refactor arch_set_user_pkey_access() ira.weiny
2022-03-10 17:19 ` [PATCH V9 06/45] mm/pkeys: Add Kconfig options for PKS ira.weiny
2022-03-10 17:19 ` [PATCH V9 07/45] x86/pkeys: Add PKS CPU feature bit ira.weiny
2022-03-10 17:19 ` [PATCH V9 08/45] x86/fault: Adjust WARN_ON for pkey fault ira.weiny
2022-03-10 17:19 ` [PATCH V9 09/45] Documentation/pkeys: Add initial PKS documentation ira.weiny
2022-03-10 17:19 ` [PATCH V9 10/45] mm/pkeys: Provide for PKS key allocation ira.weiny
2022-03-10 17:19 ` [PATCH V9 11/45] x86/pkeys: Enable PKS on cpus which support it ira.weiny
2022-03-10 17:19 ` [PATCH V9 12/45] mm/pkeys: Define PKS page table macros ira.weiny
2022-03-10 17:19 ` [PATCH V9 13/45] mm/pkeys: PKS testing, add initial test code ira.weiny
2022-03-10 17:19 ` [PATCH V9 14/45] x86/selftests: Add test_pks ira.weiny
2022-03-10 17:19 ` [PATCH V9 15/45] x86/pkeys: Introduce pks_write_pkrs() ira.weiny
2022-03-10 17:19 ` [PATCH V9 16/45] x86/pkeys: Preserve the PKS MSR on context switch ira.weiny
2022-03-10 17:19 ` [PATCH V9 17/45] mm/pkeys: Introduce pks_set_readwrite() ira.weiny
2022-03-10 17:19 ` [PATCH V9 18/45] mm/pkeys: Introduce pks_set_noaccess() ira.weiny
2022-03-10 17:19 ` [PATCH V9 19/45] mm/pkeys: Introduce PKS fault callbacks ira.weiny
2022-03-10 17:19 ` [PATCH V9 20/45] mm/pkeys: PKS testing, add a fault call back ira.weiny
2022-03-10 17:19 ` [PATCH V9 21/45] mm/pkeys: PKS testing, add pks_set_*() tests ira.weiny
2022-03-10 17:19 ` [PATCH V9 22/45] mm/pkeys: PKS testing, test context switching ira.weiny
2022-03-10 17:19 ` [PATCH V9 23/45] x86/entry: Add auxiliary pt_regs space ira.weiny
2022-03-10 17:19 ` [PATCH V9 24/45] entry: Split up irqentry_exit_cond_resched() ira.weiny
2022-04-07  2:50   ` Ira Weiny
2022-03-10 17:19 ` [PATCH V9 25/45] entry: Add calls for save/restore auxiliary pt_regs ira.weiny
2022-03-10 17:20 ` [PATCH V9 26/45] x86/entry: Define arch_{save|restore}_auxiliary_pt_regs() ira.weiny
2022-03-10 17:20 ` [PATCH V9 27/45] x86/pkeys: Preserve PKRS MSR across exceptions ira.weiny
2022-03-10 17:20 ` [PATCH V9 28/45] x86/fault: Print PKS MSR on fault ira.weiny
2022-03-10 17:20 ` [PATCH V9 29/45] mm/pkeys: PKS testing, Add exception test ira.weiny
2022-03-10 17:20 ` [PATCH V9 30/45] mm/pkeys: Introduce pks_update_exception() ira.weiny
2022-03-10 17:20 ` [PATCH V9 31/45] mm/pkeys: PKS testing, test pks_update_exception() ira.weiny
2022-03-10 17:20 ` [PATCH V9 32/45] mm/pkeys: PKS testing, add test for all keys ira.weiny
2022-03-10 17:20 ` [PATCH V9 33/45] mm/pkeys: Add pks_available() ira.weiny
2022-03-10 17:20 ` [PATCH V9 34/45] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION ira.weiny
2022-03-10 17:20 ` [PATCH V9 35/45] memremap_pages: Introduce pgmap_protection_available() ira.weiny
2022-03-10 17:20 ` [PATCH V9 36/45] memremap_pages: Introduce a PGMAP_PROTECTION flag ira.weiny
2022-03-10 17:20 ` [PATCH V9 37/45] memremap_pages: Introduce devmap_protected() ira.weiny
2022-03-10 17:20 ` [PATCH V9 38/45] memremap_pages: Reserve a PKS pkey for eventual use by PMEM ira.weiny
2022-03-10 17:20 ` [PATCH V9 39/45] memremap_pages: Set PKS pkey in PTEs if requested ira.weiny
2022-03-10 17:20 ` [PATCH V9 40/45] memremap_pages: Define pgmap_set_{readwrite|noaccess}() calls ira.weiny
2022-03-10 17:20 ` [PATCH V9 41/45] memremap_pages: Add memremap.pks_fault_mode ira.weiny
2022-03-10 17:20 ` [PATCH V9 42/45] kmap: Make kmap work for devmap protected pages ira.weiny
2022-03-10 17:20 ` [PATCH V9 43/45] dax: Stray access protection for dax_direct_access() ira.weiny
2022-03-10 17:20 ` [PATCH V9 44/45] nvdimm/pmem: Enable stray access protection ira.weiny
2022-03-10 17:20 ` [PATCH V9 45/45] devdax: " ira.weiny
2022-03-31 17:13 ` [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection Ira Weiny
