* [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching
@ 2021-05-06  4:34 Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 01/11] powerpc: Add LKDTM accessor for patching addr Christopher M. Riedl
                   ` (10 more replies)
  0 siblings, 11 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

When compiled with CONFIG_STRICT_KERNEL_RWX, the kernel must create
temporary mappings when patching itself. These mappings temporarily
override the strict RWX text protections to permit a write. Currently,
powerpc allocates a per-CPU VM area for patching. Patching occurs as
follows:

	1. Map page in per-CPU VM area w/ PAGE_KERNEL protection
	2. Patch text
	3. Remove the temporary mapping

While the VM area is per-CPU, the mapping is actually inserted into the
kernel page tables. Presumably, this could allow another CPU to access
the normally write-protected text - either maliciously or accidentally -
via this same mapping if the address of the VM area is known. Ideally,
the mapping should be kept local to the CPU doing the patching [0].
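
For reference, a simplified sketch of the current flow (based on the
map_patch_area()/unmap_patch_area() code that patch 08 of this series
removes; 'pfn' is the page frame of the page containing 'addr'):

	text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
	patch_addr = (struct ppc_inst *)(text_poke_addr + offset_in_page(addr));

	map_kernel_page(text_poke_addr, pfn << PAGE_SHIFT, PAGE_KERNEL);	/* 1 */
	__patch_instruction(addr, instr, patch_addr);				/* 2 */
	unmap_patch_area(text_poke_addr);					/* 3 */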

x86 introduced "temporary mm" structs which allow the creation of
mappings local to a particular CPU [1]. This series intends to bring the
notion of a temporary mm to powerpc and harden powerpc by using such a
mapping for patching a kernel with strict RWX permissions.
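
With the temporary mm the flow becomes, roughly (a simplified sketch of
the map_patch()/unmap_patch() code added in patch 08; error handling,
pte table locking, and the Hash MMU pre-faulting are omitted):

	local_irq_save(flags);
	set_pte_at(patching_mm, patching_addr, ptep, pte);	/* writable alias, not in init_mm */
	use_temporary_mm(&temp_mm);				/* switch this CPU to patching_mm */
	__patch_instruction(addr, instr, patch_addr);
	pte_clear(patching_mm, patching_addr, ptep);		/* tear down the mapping */
	local_flush_tlb_mm(patching_mm);
	unuse_temporary_mm(&temp_mm);				/* switch back */
	local_irq_restore(flags);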

The first four patches implement an LKDTM test "proof-of-concept" which
exploits the potential vulnerability (i.e. the temporary mapping during
patching is exposed in the kernel page tables and accessible by other
CPUs) using a simple brute-force approach. This test is implemented for
both powerpc and x86_64. The test passes on powerpc with this new
series, fails on upstream powerpc, passes on upstream x86_64, and fails
on an older (ancient) x86_64 tree without the x86_64 temporary mm
patches. The remaining patches add support for and use a temporary mm
for code patching on powerpc.

Tested boot, ftrace, and repeated LKDTM "hijack":
	- QEMU+KVM (host: POWER9 Blackbird): Radix MMU w/ KUAP
	- QEMU+KVM (host: POWER9 Blackbird): Hash MMU w/o KUAP
	- QEMU+KVM (host: POWER9 Blackbird): Hash MMU w/ KUAP

Tested repeated LKDTM "hijack":
	- QEMU+KVM (host: AMD desktop): x86_64 upstream
	- QEMU+KVM (host: AMD desktop): x86_64 w/o percpu temp mm to
	  verify the LKDTM "hijack" fails

Tested boot and ftrace:
	- QEMU+TCG: ppc44x (bamboo)
	- QEMU+TCG: g5 (mac99)

I also tested with various extra config options enabled as suggested in
section 12) in Documentation/process/submit-checklist.rst.

	[ Apologies about the resend - some of the patches were dropped by
	  the CC'd linux-hardening list due to TLS problems between mailbox.org
	  and vger.kernel.org. I re-signed my patches w/ my IBM email; no other
	  changes. ]

v4:	* It's time to revisit this series again since @jpn and @mpe fixed
	  our known STRICT_*_RWX bugs on powerpc/64s.
	* Rebase on linuxppc/next:
          commit ee1bc694fbaec ("powerpc/kvm: Fix build error when PPC_MEM_KEYS/PPC_PSERIES=n")
	* Completely rework how map_patch() works on book3s64 Hash MMU
	* Split the LKDTM x86_64 and powerpc bits into separate patches
	* Annotate commit messages with changes from v3 instead of
	  listing them here completely out of context...

v3:	* Rebase on linuxppc/next: commit 9123e3a74ec7 ("Linux 5.9-rc1")
	* Move temporary mm implementation into code-patching.c where it
	  belongs
	* Implement LKDTM hijacker test on x86_64 (on IBM time oof)
	* Do not use address zero for the patching address in the
	  temporary mm (thanks @dja for pointing this out!)
	* Wrap the LKDTM test w/ CONFIG_SMP as suggested by Christophe
	  Leroy
	* Comments to clarify PTE pre-allocation and patching addr
	  selection

v2:	* Rebase on linuxppc/next:
	  commit 105fb38124a4 ("powerpc/8xx: Modify ptep_get()")
	* Always dirty pte when mapping patch
	* Use `ppc_inst_len` instead of `sizeof` on instructions
	* Declare LKDTM patching addr accessor in header where it belongs	

v1:	* Rebase on linuxppc/next (4336b9337824)
	* Save and restore second hw watchpoint
	* Use new ppc_inst_* functions for patching check and in LKDTM test

rfc-v2:	* Many fixes and improvements mostly based on extensive feedback
          and testing by Christophe Leroy (thanks!).
	* Make patching_mm and patching_addr static and move
	  '__ro_after_init' to after the variable name (more common in
	  other parts of the kernel)
	* Use 'asm/debug.h' header instead of 'asm/hw_breakpoint.h' to
	  fix PPC64e compile
	* Add comment explaining why we use BUG_ON() during the init
	  call to setup for patching later
	* Move ptep into patch_mapping to avoid walking page tables a
	  second time when unmapping the temporary mapping
	* Use KUAP under non-radix, also manually dirty the PTE for patch
	  mapping on non-BOOK3S_64 platforms
	* Properly return any error from __patch_instruction
        * Do not use 'memcmp' where a simple comparison is appropriate
	* Simplify expression for patch address by removing pointer maths
	* Add LKDTM test

[0]: https://github.com/linuxppc/issues/issues/224
[1]: https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.amit@gmail.com/

Christopher M. Riedl (11):
  powerpc: Add LKDTM accessor for patching addr
  lkdtm/powerpc: Add test to hijack a patch mapping
  x86_64: Add LKDTM accessor for patching addr
  lkdtm/x86_64: Add test to hijack a patch mapping
  powerpc/64s: Add ability to skip SLB preload
  powerpc: Introduce temporary mm
  powerpc/64s: Make slb_allocate_user() non-static
  powerpc: Initialize and use a temporary mm for patching
  lkdtm/powerpc: Fix code patching hijack test
  powerpc: Protect patching_mm with a lock
  powerpc: Use patch_instruction_unlocked() in loops

 arch/powerpc/include/asm/book3s/64/mmu-hash.h |   1 +
 arch/powerpc/include/asm/book3s/64/mmu.h      |   3 +
 arch/powerpc/include/asm/code-patching.h      |   8 +
 arch/powerpc/include/asm/debug.h              |   1 +
 arch/powerpc/include/asm/mmu_context.h        |  13 +
 arch/powerpc/kernel/epapr_paravirt.c          |   9 +-
 arch/powerpc/kernel/optprobes.c               |  22 +-
 arch/powerpc/kernel/process.c                 |   5 +
 arch/powerpc/lib/code-patching.c              | 348 +++++++++++++-----
 arch/powerpc/lib/feature-fixups.c             | 114 ++++--
 arch/powerpc/mm/book3s64/mmu_context.c        |   2 +
 arch/powerpc/mm/book3s64/slb.c                |  60 +--
 arch/powerpc/xmon/xmon.c                      |  22 +-
 arch/x86/include/asm/text-patching.h          |   4 +
 arch/x86/kernel/alternative.c                 |   7 +
 drivers/misc/lkdtm/core.c                     |   1 +
 drivers/misc/lkdtm/lkdtm.h                    |   1 +
 drivers/misc/lkdtm/perms.c                    | 149 ++++++++
 18 files changed, 608 insertions(+), 162 deletions(-)

-- 
2.26.1



* [RESEND PATCH v4 01/11] powerpc: Add LKDTM accessor for patching addr
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 02/11] lkdtm/powerpc: Add test to hijack a patch mapping Christopher M. Riedl
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

When live patching with STRICT_KERNEL_RWX a mapping is installed at a
"patching address" with temporary write permissions. Provide an
LKDTM-only accessor function for this address in preparation for an
LKDTM test which attempts to "hijack" this mapping by writing to it from
another CPU.

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 arch/powerpc/include/asm/code-patching.h | 4 ++++
 arch/powerpc/lib/code-patching.c         | 7 +++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
index f1d029bf906e5..e51c81e4a9bda 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -188,4 +188,8 @@ static inline unsigned long ppc_kallsyms_lookup_name(const char *name)
 				 ___PPC_RA(__REG_R1) | PPC_LR_STKOFF)
 #endif /* CONFIG_PPC64 */
 
+#if IS_BUILTIN(CONFIG_LKDTM) && IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)
+unsigned long read_cpu_patching_addr(unsigned int cpu);
+#endif
+
 #endif /* _ASM_POWERPC_CODE_PATCHING_H */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 870b30d9be2f8..2b1b3e9043ade 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -48,6 +48,13 @@ int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+	return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
+}
+#endif
+
 static int text_area_cpu_up(unsigned int cpu)
 {
 	struct vm_struct *area;
-- 
2.26.1



* [RESEND PATCH v4 02/11] lkdtm/powerpc: Add test to hijack a patch mapping
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 01/11] powerpc: Add LKDTM accessor for patching addr Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 03/11] x86_64: Add LKDTM accessor for patching addr Christopher M. Riedl
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

When live patching with STRICT_KERNEL_RWX the CPU doing the patching
must temporarily remap the page(s) containing the patch site with +W
permissions. While this temporary mapping is in use, another CPU could
write to the same mapping and maliciously alter kernel text. Implement a
LKDTM test to attempt to exploit such an opening during code patching.
The test is implemented on powerpc and requires LKDTM built into the
kernel (building LKDTM as a module is insufficient).

The LKDTM "hijack" test works as follows:

  1. A CPU executes an infinite loop to patch an instruction. This is
     the "patching" CPU.
  2. Another CPU attempts to write to the address of the temporary
     mapping used by the "patching" CPU. This other CPU is the
     "hijacker" CPU. The hijack either fails with a fault/error or
     succeeds, in which case some kernel text is now overwritten.

The virtual address of the temporary patch mapping is provided to the
hijacker CPU via an LKDTM-specific accessor. This test assumes a
hypothetical situation where this address was leaked previously.

How to run the test:

	mount -t debugfs none /sys/kernel/debug
	(echo HIJACK_PATCH > /sys/kernel/debug/provoke-crash/DIRECT)

A passing test indicates that it is not possible to overwrite kernel
text from another CPU by using the temporary mapping established by
a CPU for patching.

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>

---

v4:  * Separate the powerpc and x86_64 bits into individual patches.
     * Use __put_kernel_nofault() when attempting to hijack the mapping
     * Use raw_smp_processor_id() to avoid triggering the BUG() when
       calling smp_processor_id() in preemptible code - the only thing
       that matters is that one of the threads is bound to a different
       CPU - we are not using smp_processor_id() to access any per-cpu
       data or similar where preemption should be disabled.
     * Rework the patching_cpu() kthread stop condition to avoid:
       https://lwn.net/Articles/628628/
---
 drivers/misc/lkdtm/core.c  |   1 +
 drivers/misc/lkdtm/lkdtm.h |   1 +
 drivers/misc/lkdtm/perms.c | 135 +++++++++++++++++++++++++++++++++++++
 3 files changed, 137 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index b2aff4d87c014..857d218840eb8 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -146,6 +146,7 @@ static const struct crashtype crashtypes[] = {
 	CRASHTYPE(WRITE_RO),
 	CRASHTYPE(WRITE_RO_AFTER_INIT),
 	CRASHTYPE(WRITE_KERN),
+	CRASHTYPE(HIJACK_PATCH),
 	CRASHTYPE(REFCOUNT_INC_OVERFLOW),
 	CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
 	CRASHTYPE(REFCOUNT_INC_NOT_ZERO_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 5ae48c64df24d..c8de54d189c27 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -61,6 +61,7 @@ void lkdtm_EXEC_USERSPACE(void);
 void lkdtm_EXEC_NULL(void);
 void lkdtm_ACCESS_USERSPACE(void);
 void lkdtm_ACCESS_NULL(void);
+void lkdtm_HIJACK_PATCH(void);
 
 /* refcount.c */
 void lkdtm_REFCOUNT_INC_OVERFLOW(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 2dede2ef658f3..c6f96ebffccfd 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mman.h>
 #include <linux/uaccess.h>
+#include <linux/kthread.h>
 #include <asm/cacheflush.h>
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -222,6 +223,140 @@ void lkdtm_ACCESS_NULL(void)
 	pr_err("FAIL: survived bad write\n");
 }
 
+#if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
+	defined(CONFIG_PPC))
+/*
+ * This is just a dummy location to patch-over.
+ */
+static void patching_target(void)
+{
+	return;
+}
+
+#include <asm/code-patching.h>
+struct ppc_inst * const patch_site = (struct ppc_inst *)&patching_target;
+
+static inline int lkdtm_do_patch(u32 data)
+{
+	return patch_instruction(patch_site, ppc_inst(data));
+}
+
+static inline u32 lkdtm_read_patch_site(void)
+{
+	struct ppc_inst inst = READ_ONCE(*patch_site);
+	return ppc_inst_val(ppc_inst_read(&inst));
+}
+
+/* Returns True if the write succeeds */
+static inline bool lkdtm_try_write(u32 data, u32 *addr)
+{
+	__put_kernel_nofault(addr, &data, u32, err);
+	return true;
+
+err:
+	return false;
+}
+
+static int lkdtm_patching_cpu(void *data)
+{
+	int err = 0;
+	u32 val = 0xdeadbeef;
+
+	pr_info("starting patching_cpu=%d\n", raw_smp_processor_id());
+
+	do {
+		err = lkdtm_do_patch(val);
+	} while (lkdtm_read_patch_site() == val && !err && !kthread_should_stop());
+
+	if (err)
+		pr_warn("XFAIL: patch_instruction returned error: %d\n", err);
+
+	while (!kthread_should_stop()) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		schedule();
+	}
+
+	return err;
+}
+
+void lkdtm_HIJACK_PATCH(void)
+{
+	struct task_struct *patching_kthrd;
+	int patching_cpu, hijacker_cpu, attempts;
+	unsigned long addr;
+	bool hijacked;
+	const u32 bad_data = 0xbad00bad;
+	const u32 original_insn = lkdtm_read_patch_site();
+
+	if (!IS_ENABLED(CONFIG_SMP)) {
+		pr_err("XFAIL: this test requires CONFIG_SMP\n");
+		return;
+	}
+
+	if (num_online_cpus() < 2) {
+		pr_warn("XFAIL: this test requires at least two cpus\n");
+		return;
+	}
+
+	hijacker_cpu = raw_smp_processor_id();
+	patching_cpu = cpumask_any_but(cpu_online_mask, hijacker_cpu);
+
+	patching_kthrd = kthread_create_on_node(&lkdtm_patching_cpu, NULL,
+						cpu_to_node(patching_cpu),
+						"lkdtm_patching_cpu");
+	kthread_bind(patching_kthrd, patching_cpu);
+	wake_up_process(patching_kthrd);
+
+	addr = offset_in_page(patch_site) | read_cpu_patching_addr(patching_cpu);
+
+	pr_info("starting hijacker_cpu=%d\n", hijacker_cpu);
+	for (attempts = 0; attempts < 100000; ++attempts) {
+		/* Try to write to the other CPU's temp patch mapping */
+		hijacked = lkdtm_try_write(bad_data, (u32 *)addr);
+
+		if (hijacked) {
+			if (kthread_stop(patching_kthrd)) {
+				pr_info("hijack attempts: %d\n", attempts);
+				pr_err("XFAIL: error stopping patching cpu\n");
+				return;
+			}
+			break;
+		}
+	}
+	pr_info("hijack attempts: %d\n", attempts);
+
+	if (hijacked) {
+		if (lkdtm_read_patch_site() == bad_data)
+			pr_err("overwrote kernel text\n");
+		/*
+		 * There are window conditions where the hijacker cpu manages to
+		 * write to the patch site but the site gets overwritten again by
+		 * the patching cpu. We still consider that a "successful" hijack
+		 * since the hijacker cpu did not fault on the write.
+		 */
+		pr_err("FAIL: wrote to another cpu's patching area\n");
+	} else {
+		kthread_stop(patching_kthrd);
+	}
+
+	/* Restore the original data to be able to run the test again */
+	lkdtm_do_patch(original_insn);
+}
+
+#else
+
+void lkdtm_HIJACK_PATCH(void)
+{
+	if (!IS_ENABLED(CONFIG_PPC))
+		pr_err("XFAIL: this test only runs on powerpc\n");
+	if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
+		pr_err("XFAIL: this test requires CONFIG_STRICT_KERNEL_RWX\n");
+	if (!IS_BUILTIN(CONFIG_LKDTM))
+		pr_err("XFAIL: this test requires CONFIG_LKDTM=y (not =m!)\n");
+}
+
+#endif
+
 void __init lkdtm_perms_init(void)
 {
 	/* Make sure we can write to __ro_after_init values during __init */
-- 
2.26.1



* [RESEND PATCH v4 03/11] x86_64: Add LKDTM accessor for patching addr
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 01/11] powerpc: Add LKDTM accessor for patching addr Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 02/11] lkdtm/powerpc: Add test to hijack a patch mapping Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 04/11] lkdtm/x86_64: Add test to hijack a patch mapping Christopher M. Riedl
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

When live patching with STRICT_KERNEL_RWX a mapping is installed at a
"patching address" with temporary write permissions. Provide an
LKDTM-only accessor function for this address in preparation for an
LKDTM test which attempts to "hijack" this mapping by writing to it from
another CPU.

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 arch/x86/include/asm/text-patching.h | 4 ++++
 arch/x86/kernel/alternative.c        | 7 +++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index b7421780e4e92..f0caf9ee13bd8 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -167,4 +167,8 @@ void int3_emulate_ret(struct pt_regs *regs)
 }
 #endif /* !CONFIG_UML_X86 */
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu);
+#endif
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8d778e46725d2..4c95fdd9b1965 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -852,6 +852,13 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
 __ro_after_init struct mm_struct *poking_mm;
 __ro_after_init unsigned long poking_addr;
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+	return poking_addr;
+}
+#endif
+
 static void *__text_poke(void *addr, const void *opcode, size_t len)
 {
 	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
-- 
2.26.1



* [RESEND PATCH v4 04/11] lkdtm/x86_64: Add test to hijack a patch mapping
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
                   ` (2 preceding siblings ...)
  2021-05-06  4:34 ` [RESEND PATCH v4 03/11] x86_64: Add LKDTM accessor for patching addr Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload Christopher M. Riedl
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

A previous commit implemented an LKDTM test on powerpc to exploit the
temporary mapping established when patching code with STRICT_KERNEL_RWX
enabled. Extend the test to work on x86_64 as well.

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 drivers/misc/lkdtm/perms.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index c6f96ebffccfd..55c3bec6d3b72 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -224,7 +224,7 @@ void lkdtm_ACCESS_NULL(void)
 }
 
 #if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
-	defined(CONFIG_PPC))
+	(defined(CONFIG_PPC) || defined(CONFIG_X86_64)))
 /*
  * This is just a dummy location to patch-over.
  */
@@ -233,28 +233,51 @@ static void patching_target(void)
 	return;
 }
 
+#ifdef CONFIG_PPC
 #include <asm/code-patching.h>
 struct ppc_inst * const patch_site = (struct ppc_inst *)&patching_target;
+#endif
+
+#ifdef CONFIG_X86_64
+#include <asm/text-patching.h>
+u32 * const patch_site = (u32 *)&patching_target;
+#endif
 
 static inline int lkdtm_do_patch(u32 data)
 {
+#ifdef CONFIG_PPC
 	return patch_instruction(patch_site, ppc_inst(data));
+#endif
+#ifdef CONFIG_X86_64
+	text_poke(patch_site, &data, sizeof(u32));
+	return 0;
+#endif
 }
 
 static inline u32 lkdtm_read_patch_site(void)
 {
+#ifdef CONFIG_PPC
 	struct ppc_inst inst = READ_ONCE(*patch_site);
 	return ppc_inst_val(ppc_inst_read(&inst));
+#endif
+#ifdef CONFIG_X86_64
+	return READ_ONCE(*patch_site);
+#endif
 }
 
 /* Returns True if the write succeeds */
 static inline bool lkdtm_try_write(u32 data, u32 *addr)
 {
+#ifdef CONFIG_PPC
 	__put_kernel_nofault(addr, &data, u32, err);
 	return true;
 
 err:
 	return false;
+#endif
+#ifdef CONFIG_X86_64
+	return !__put_user(data, addr);
+#endif
 }
 
 static int lkdtm_patching_cpu(void *data)
@@ -347,8 +370,8 @@ void lkdtm_HIJACK_PATCH(void)
 
 void lkdtm_HIJACK_PATCH(void)
 {
-	if (!IS_ENABLED(CONFIG_PPC))
-		pr_err("XFAIL: this test only runs on powerpc\n");
+	if (!IS_ENABLED(CONFIG_PPC) && !IS_ENABLED(CONFIG_X86_64))
+		pr_err("XFAIL: this test only runs on powerpc and x86_64\n");
 	if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
 		pr_err("XFAIL: this test requires CONFIG_STRICT_KERNEL_RWX\n");
 	if (!IS_BUILTIN(CONFIG_LKDTM))
-- 
2.26.1



* [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
                   ` (3 preceding siblings ...)
  2021-05-06  4:34 ` [RESEND PATCH v4 04/11] lkdtm/x86_64: Add test to hijack a patch mapping Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-06-21  3:13   ` Daniel Axtens
  2021-05-06  4:34 ` [RESEND PATCH v4 06/11] powerpc: Introduce temporary mm Christopher M. Riedl
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

Switching to a different mm with Hash translation causes SLB entries to
be preloaded from the current thread_info. This reduces SLB faults, for
example when threads share a common mm but operate on different address
ranges.

Preloading entries from the thread_info struct may not always be
appropriate - such as when switching to a temporary mm. Introduce a new
boolean in mm_context_t to skip the SLB preload entirely. Also move the
SLB preload code into a separate function since switch_slb() is already
quite long. The default behavior (preloading SLB entries from the
current thread_info struct) remains unchanged.
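
For reference, the intended caller-side usage is a one-time opt-out on
the mm in question; the temporary-mm code added later in this series
does exactly this:

	/* Skip thread_info SLB preloads when switching to this mm (Hash MMU only) */
	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
		skip_slb_preload_mm(mm);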

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>

---

v4:  * New to series.
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
 arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
 arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
 4 files changed, 50 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index eace8c3f7b0a1..b23a9dcdee5af 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -130,6 +130,9 @@ typedef struct {
 	u32 pkey_allocation_map;
 	s16 execute_only_pkey; /* key holding execute-only protection */
 #endif
+
+	/* Do not preload SLB entries from thread_info during switch_slb() */
+	bool skip_slb_preload;
 } mm_context_t;
 
 static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 4bc45d3ed8b0e..264787e90b1a1 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
 	return 0;
 }
 
+#ifdef CONFIG_PPC_BOOK3S_64
+
+static inline void skip_slb_preload_mm(struct mm_struct *mm)
+{
+	mm->context.skip_slb_preload = true;
+}
+
+#else
+
+static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
+
+#endif /* CONFIG_PPC_BOOK3S_64 */
+
 #include <asm-generic/mmu_context.h>
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
index c10fc8a72fb37..3479910264c59 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	atomic_set(&mm->context.active_cpus, 0);
 	atomic_set(&mm->context.copros, 0);
 
+	mm->context.skip_slb_preload = false;
+
 	return 0;
 }
 
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index c91bd85eb90e3..da0836cb855af 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
 	asm volatile("slbie %0" : : "r" (slbie_data));
 }
 
+static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
+{
+	struct thread_info *ti = task_thread_info(tsk);
+	unsigned char i;
+
+	/*
+	 * We gradually age out SLBs after a number of context switches to
+	 * reduce reload overhead of unused entries (like we do with FP/VEC
+	 * reload). Each time we wrap 256 switches, take an entry out of the
+	 * SLB preload cache.
+	 */
+	tsk->thread.load_slb++;
+	if (!tsk->thread.load_slb) {
+		unsigned long pc = KSTK_EIP(tsk);
+
+		preload_age(ti);
+		preload_add(ti, pc);
+	}
+
+	for (i = 0; i < ti->slb_preload_nr; i++) {
+		unsigned char idx;
+		unsigned long ea;
+
+		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
+		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
+
+		slb_allocate_user(mm, ea);
+	}
+}
+
 /* Flush all user entries from the segment table of the current processor. */
 void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 {
-	struct thread_info *ti = task_thread_info(tsk);
 	unsigned char i;
 
 	/*
@@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 
 	copy_mm_to_paca(mm);
 
-	/*
-	 * We gradually age out SLBs after a number of context switches to
-	 * reduce reload overhead of unused entries (like we do with FP/VEC
-	 * reload). Each time we wrap 256 switches, take an entry out of the
-	 * SLB preload cache.
-	 */
-	tsk->thread.load_slb++;
-	if (!tsk->thread.load_slb) {
-		unsigned long pc = KSTK_EIP(tsk);
-
-		preload_age(ti);
-		preload_add(ti, pc);
-	}
-
-	for (i = 0; i < ti->slb_preload_nr; i++) {
-		unsigned char idx;
-		unsigned long ea;
-
-		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
-		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
-
-		slb_allocate_user(mm, ea);
-	}
+	if (!mm->context.skip_slb_preload)
+		preload_slb_entries(tsk, mm);
 
 	/*
 	 * Synchronize slbmte preloads with possible subsequent user memory
-- 
2.26.1



* [RESEND PATCH v4 06/11] powerpc: Introduce temporary mm
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
                   ` (4 preceding siblings ...)
  2021-05-06  4:34 ` [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 07/11] powerpc/64s: Make slb_allocate_user() non-static Christopher M. Riedl
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. A side benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use.

With the Book3s64 Hash MMU the SLB is preloaded with entries from the
current thread_info struct during switch_slb(). This could cause a
Machine Check (MCE) due to an SLB Multihit when creating arbitrary
userspace mappings in the temporary mm later. Disable SLB preload from
the thread_info struct for any temporary mm to avoid this.
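
A minimal usage sketch, mirroring how the code-patching code later in
this series uses the new helpers ('patching_mm' is an mm prepared
beforehand, e.g. via copy_init_mm(); error handling omitted):

	struct temp_mm temp_mm;
	unsigned long flags;

	init_temp_mm(&temp_mm, patching_mm);	/* also disables SLB preload */

	local_irq_save(flags);
	use_temporary_mm(&temp_mm);		/* switch mm, save + clear HW breakpoints */

	/* ... use mappings visible only to this CPU ... */

	unuse_temporary_mm(&temp_mm);		/* switch back, restore HW breakpoints */
	local_irq_restore(flags);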

Based on x86 implementation:

commit cefa929c034e
("x86/mm: Introduce temporary mm structs")

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>

---

v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
       using/unusing the temp mm as suggested by Jann Horn to keep
       the context.active counter in-sync on mm/nohash.
     * Disable SLB preload in the temporary mm when initializing the
       temp_mm struct.
     * Include asm/debug.h header to fix build issue with
       ppc44x_defconfig.
---
 arch/powerpc/include/asm/debug.h |  1 +
 arch/powerpc/kernel/process.c    |  5 +++
 arch/powerpc/lib/code-patching.c | 67 ++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 86a14736c76c3..dfd82635ea8b3 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 89e34aa273e21..8e94cabaea3c3 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -864,6 +864,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
 	return 0;
 }
 
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+{
+	memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
+}
+
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 {
 	memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 2b1b3e9043ade..cbdfba8a39360 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -17,6 +17,8 @@
 #include <asm/code-patching.h>
 #include <asm/setup.h>
 #include <asm/inst.h>
+#include <asm/mmu_context.h>
+#include <asm/debug.h>
 
 static int __patch_instruction(struct ppc_inst *exec_addr, struct ppc_inst instr,
 			       struct ppc_inst *patch_addr)
@@ -46,6 +48,71 @@ int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+struct temp_mm {
+	struct mm_struct *temp;
+	struct mm_struct *prev;
+	struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+};
+
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+{
+	/* Do not preload SLB entries from the thread_info struct */
+	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
+		skip_slb_preload_mm(mm);
+
+	temp_mm->temp = mm;
+	temp_mm->prev = NULL;
+	memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
+}
+
+static inline void use_temporary_mm(struct temp_mm *temp_mm)
+{
+	lockdep_assert_irqs_disabled();
+
+	temp_mm->prev = current->active_mm;
+	switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
+
+	WARN_ON(!mm_is_thread_local(temp_mm->temp));
+
+	if (ppc_breakpoint_available()) {
+		struct arch_hw_breakpoint null_brk = {0};
+		int i = 0;
+
+		for (; i < nr_wp_slots(); ++i) {
+			__get_breakpoint(i, &temp_mm->brk[i]);
+			if (temp_mm->brk[i].type != 0)
+				__set_breakpoint(i, &null_brk);
+		}
+	}
+}
+
+static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
+{
+	lockdep_assert_irqs_disabled();
+
+	switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
+
+	/*
+	 * On book3s64 the active_cpus counter increments in
+	 * switch_mm_irqs_off(). With the Hash MMU this counter affects if TLB
+	 * flushes are local. We have to manually decrement that counter here
+	 * along with removing our current CPU from the mm's cpumask so that in
+	 * the future a different CPU can reuse the temporary mm and still rely
+	 * on local TLB flushes.
+	 */
+	dec_mm_active_cpus(temp_mm->temp);
+	cpumask_clear_cpu(smp_processor_id(), mm_cpumask(temp_mm->temp));
+
+	if (ppc_breakpoint_available()) {
+		int i = 0;
+
+		for (; i < nr_wp_slots(); ++i)
+			if (temp_mm->brk[i].type != 0)
+				__set_breakpoint(i, &temp_mm->brk[i]);
+	}
+}
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
 #if IS_BUILTIN(CONFIG_LKDTM)
-- 
2.26.1



* [RESEND PATCH v4 07/11] powerpc/64s: Make slb_allocate_user() non-static
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
                   ` (5 preceding siblings ...)
  2021-05-06  4:34 ` [RESEND PATCH v4 06/11] powerpc: Introduce temporary mm Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching Christopher M. Riedl
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

With Book3s64 Hash translation, manually inserting a PTE requires
updating the Linux PTE, inserting an SLB entry, and inserting the hashed
page. The first is handled via the usual kernel abstractions, the second
requires slb_allocate_user() which is currently 'static', and the third
is available via hash_page_mm() already.

Make slb_allocate_user() non-static and add a prototype so the next
patch can use it during code-patching.
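
Roughly, the next patch combines the three steps as follows (simplified
from the map_patch()/hash_prefault_mapping() code added there; mm, ea,
ptep, pte and pgprot stand in for the real variables):

	/* 1. Linux PTE via the usual abstractions */
	set_pte_at(mm, ea, ptep, pte);

	/* 2. SLB entry - needs slb_allocate_user(), made non-static here */
	slb_allocate_user(mm, ea);

	/* 3. Hashed page */
	hash_page_mm(mm, ea, pgprot_val(pgprot), 0, HPTE_USE_KERNEL_KEY);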

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>

---

v4:  * New to series.
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 1 +
 arch/powerpc/mm/book3s64/slb.c                | 4 +---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 3004f3323144d..189854eebba77 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -525,6 +525,7 @@ void slb_dump_contents(struct slb_entry *slb_ptr);
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
 void preload_new_slb_context(unsigned long start, unsigned long sp);
+long slb_allocate_user(struct mm_struct *mm, unsigned long ea);
 #endif /* __ASSEMBLY__ */
 
 /*
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index da0836cb855af..532eb51bc5211 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -29,8 +29,6 @@
 #include "internal.h"
 
 
-static long slb_allocate_user(struct mm_struct *mm, unsigned long ea);
-
 bool stress_slb_enabled __initdata;
 
 static int __init parse_stress_slb(char *p)
@@ -791,7 +789,7 @@ static long slb_allocate_kernel(unsigned long ea, unsigned long id)
 	return slb_insert_entry(ea, context, flags, ssize, true);
 }
 
-static long slb_allocate_user(struct mm_struct *mm, unsigned long ea)
+long slb_allocate_user(struct mm_struct *mm, unsigned long ea)
 {
 	unsigned long context;
 	unsigned long flags;
-- 
2.26.1



* [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
                   ` (6 preceding siblings ...)
  2021-05-06  4:34 ` [RESEND PATCH v4 07/11] powerpc/64s: Make slb_allocate_user() non-static Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-06-21  3:19   ` Daniel Axtens
  2021-07-01  6:12     ` Nicholas Piggin
  2021-05-06  4:34 ` [RESEND PATCH v4 09/11] lkdtm/powerpc: Fix code patching hijack test Christopher M. Riedl
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped as writeable. Currently, a
per-cpu vmalloc patch area is used for this purpose. While the patch
area is per-cpu, the temporary page mapping is inserted into the kernel
page tables for the duration of patching. The mapping is exposed to CPUs
other than the patching CPU - this is undesirable from a hardening
perspective. Use a temporary mm instead which keeps the mapping local to
the CPU doing the patching.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
the Book3s64 Hash MMU operates - by default the space above
DEFAULT_MAP_WINDOW is not available. For now, the patching address for
all platforms/MMUs is randomized inside this range.  The number of
possible random addresses is dependent on PAGE_SIZE and limited by
DEFAULT_MAP_WINDOW.

Bits of entropy with 64K page size on BOOK3S_64:

        bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

        PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
        bits of entropy = log2(128TB / 64K)
        bits of entropy = 31

Randomization occurs only once during initialization at boot.

Introduce two new functions, map_patch() and unmap_patch(), to
respectively create and remove the temporary mapping with write
permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
the page for patching with PAGE_SHARED since the kernel cannot access
userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.

Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
for the patching_addr when using the Hash MMU on Book3s64 to avoid
taking an SLB and Hash fault during patching.

Since patching_addr is now a userspace address, lock/unlock KUAP on
non-Book3s64 platforms. On Book3s64 with a Radix MMU, mapping the page
with PAGE_KERNEL sets EAA[0] for the PTE which ignores the AMR (KUAP)
according to PowerISA v3.0b Figure 35. On Book3s64 with a Hash MMU, the
hash PTE for the mapping is inserted with HPTE_USE_KERNEL_KEY which
similarly avoids the need for switching KUAP.

Finally, add a new WARN_ON() to check that the instruction was patched
as intended after the temporary mapping is torn down.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>

---

v4:  * In the previous series this was two separate patches: one to init
       the temporary mm in poking_init() (unused in powerpc at the time)
       and the other to use it for patching (which removed all the
       per-cpu vmalloc code). Now that we use poking_init() in the
       existing per-cpu vmalloc approach, that separation doesn't work
       as nicely anymore so I just merged the two patches into one.
     * Preload the SLB entry and hash the page for the patching_addr
       when using Hash on book3s64 to avoid taking an SLB and Hash fault
       during patching. The previous implementation was a hack which
       changed current->mm to allow the SLB and Hash fault handlers to
       work with the temporary mm since both of those code-paths always
       assume mm == current->mm.
     * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
       have to manage the mm->context.active_cpus counter and mm cpumask
       since they determine (via mm_is_thread_local()) if the TLB flush
       in pte_clear() is local or not - it should always be local when
       we're using the temporary mm. On book3s64's Radix MMU we can
       just call local_flush_tlb_mm().
     * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
       KUAP.
---
 arch/powerpc/lib/code-patching.c | 209 ++++++++++++++++++-------------
 1 file changed, 121 insertions(+), 88 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index cbdfba8a39360..7e15abc09ec04 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,8 @@
 #include <linux/cpuhotplug.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/sched/task.h>
+#include <linux/random.h>
 
 #include <asm/tlbflush.h>
 #include <asm/page.h>
@@ -19,6 +21,7 @@
 #include <asm/inst.h>
 #include <asm/mmu_context.h>
 #include <asm/debug.h>
+#include <asm/tlb.h>
 
 static int __patch_instruction(struct ppc_inst *exec_addr, struct ppc_inst instr,
 			       struct ppc_inst *patch_addr)
@@ -113,113 +116,142 @@ static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
 	}
 }
 
-static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static struct mm_struct *patching_mm __ro_after_init;
+static unsigned long patching_addr __ro_after_init;
+
+void __init poking_init(void)
+{
+	spinlock_t *ptl; /* for protecting pte table */
+	pte_t *ptep;
+
+	/*
+	 * Some parts of the kernel (static keys for example) depend on
+	 * successful code patching. Code patching under STRICT_KERNEL_RWX
+	 * requires this setup - otherwise we cannot patch at all. We use
+	 * BUG_ON() here and later since an early failure is preferred to
+	 * buggy behavior and/or strange crashes later.
+	 */
+	patching_mm = copy_init_mm();
+	BUG_ON(!patching_mm);
+
+	/*
+	 * Choose a randomized, page-aligned address from the range:
+	 * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE]
+	 * The lower address bound is PAGE_SIZE to avoid the zero-page.
+	 * The upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to stay
+	 * under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
+	 */
+	patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
+			% (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
+
+	/*
+	 * PTE allocation uses GFP_KERNEL which means we need to pre-allocate
+	 * the PTE here. We cannot do the allocation during patching with IRQs
+	 * disabled (ie. "atomic" context).
+	 */
+	ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
+	BUG_ON(!ptep);
+	pte_unmap_unlock(ptep, ptl);
+}
 
 #if IS_BUILTIN(CONFIG_LKDTM)
 unsigned long read_cpu_patching_addr(unsigned int cpu)
 {
-	return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
+	return patching_addr;
 }
 #endif
 
-static int text_area_cpu_up(unsigned int cpu)
+struct patch_mapping {
+	spinlock_t *ptl; /* for protecting pte table */
+	pte_t *ptep;
+	struct temp_mm temp_mm;
+};
+
+#ifdef CONFIG_PPC_BOOK3S_64
+
+static inline int hash_prefault_mapping(pgprot_t pgprot)
 {
-	struct vm_struct *area;
+	int err;
 
-	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
-	if (!area) {
-		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
-			cpu);
-		return -1;
-	}
-	this_cpu_write(text_poke_area, area);
+	if (radix_enabled())
+		return 0;
 
-	return 0;
-}
+	err = slb_allocate_user(patching_mm, patching_addr);
+	if (err)
+		pr_warn("map patch: failed to allocate slb entry\n");
 
-static int text_area_cpu_down(unsigned int cpu)
-{
-	free_vm_area(this_cpu_read(text_poke_area));
-	return 0;
+	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
+			   HPTE_USE_KERNEL_KEY);
+	if (err)
+		pr_warn("map patch: failed to insert hashed page\n");
+
+	/* See comment in switch_slb() in mm/book3s64/slb.c */
+	isync();
+
+	return err;
 }
 
-/*
- * Run as a late init call. This allows all the boot time patching to be done
- * simply by patching the code, and then we're called here prior to
- * mark_rodata_ro(), which happens after all init calls are run. Although
- * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we judge
- * it as being preferable to a kernel that will crash later when someone tries
- * to use patch_instruction().
- */
-static int __init setup_text_poke_area(void)
-{
-	BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
-		"powerpc/text_poke:online", text_area_cpu_up,
-		text_area_cpu_down));
+#else
 
+static inline int hash_prefault_mapping(pgprot_t pgprot)
+{
 	return 0;
 }
-late_initcall(setup_text_poke_area);
+
+#endif /* CONFIG_PPC_BOOK3S_64 */
 
 /*
  * This can be called for kernel text or a module.
  */
-static int map_patch_area(void *addr, unsigned long text_poke_addr)
+static int map_patch(const void *addr, struct patch_mapping *patch_mapping)
 {
-	unsigned long pfn;
-	int err;
+	struct page *page;
+	pte_t pte;
+	pgprot_t pgprot;
 
 	if (is_vmalloc_or_module_addr(addr))
-		pfn = vmalloc_to_pfn(addr);
+		page = vmalloc_to_page(addr);
 	else
-		pfn = __pa_symbol(addr) >> PAGE_SHIFT;
+		page = virt_to_page(addr);
 
-	err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
+	if (radix_enabled())
+		pgprot = PAGE_KERNEL;
+	else
+		pgprot = PAGE_SHARED;
 
-	pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
-	if (err)
+	patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
+					     &patch_mapping->ptl);
+	if (unlikely(!patch_mapping->ptep)) {
+		pr_warn("map patch: failed to allocate pte for patching\n");
 		return -1;
+	}
 
-	return 0;
-}
-
-static inline int unmap_patch_area(unsigned long addr)
-{
-	pte_t *ptep;
-	pmd_t *pmdp;
-	pud_t *pudp;
-	p4d_t *p4dp;
-	pgd_t *pgdp;
-
-	pgdp = pgd_offset_k(addr);
-	if (unlikely(!pgdp))
-		return -EINVAL;
-
-	p4dp = p4d_offset(pgdp, addr);
-	if (unlikely(!p4dp))
-		return -EINVAL;
+	pte = mk_pte(page, pgprot);
+	pte = pte_mkdirty(pte);
+	set_pte_at(patching_mm, patching_addr, patch_mapping->ptep, pte);
 
-	pudp = pud_offset(p4dp, addr);
-	if (unlikely(!pudp))
-		return -EINVAL;
+	init_temp_mm(&patch_mapping->temp_mm, patching_mm);
+	use_temporary_mm(&patch_mapping->temp_mm);
 
-	pmdp = pmd_offset(pudp, addr);
-	if (unlikely(!pmdp))
-		return -EINVAL;
+	/*
+	 * On Book3s64 with the Hash MMU we have to manually insert the SLB
+	 * entry and HPTE to prevent taking faults on the patching_addr later.
+	 */
+	return(hash_prefault_mapping(pgprot));
+}
 
-	ptep = pte_offset_kernel(pmdp, addr);
-	if (unlikely(!ptep))
-		return -EINVAL;
+static void unmap_patch(struct patch_mapping *patch_mapping)
+{
+	/* Book3s64 Hash MMU: pte_clear() flushes the TLB */
+	pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
 
-	pr_devel("clearing mm %p, pte %p, addr %lx\n", &init_mm, ptep, addr);
+	/* Book3s64 Radix MMU: explicitly flush the TLB (no-op in Hash MMU) */
+	local_flush_tlb_mm(patching_mm);
 
-	/*
-	 * In hash, pte_clear flushes the tlb, in radix, we have to
-	 */
-	pte_clear(&init_mm, addr, ptep);
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
 
-	return 0;
+	/* Book3s64 Hash MMU: switch_mm_irqs_off() invalidates the SLB */
+	unuse_temporary_mm(&patch_mapping->temp_mm);
 }
 
 static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
@@ -227,32 +259,33 @@ static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 	int err;
 	struct ppc_inst *patch_addr = NULL;
 	unsigned long flags;
-	unsigned long text_poke_addr;
-	unsigned long kaddr = (unsigned long)addr;
+	struct patch_mapping patch_mapping;
 
 	/*
-	 * During early early boot patch_instruction is called
-	 * when text_poke_area is not ready, but we still need
-	 * to allow patching. We just do the plain old patching
+	 * The patching_mm is initialized before calling mark_rodata_ro. Prior
+	 * to this, patch_instruction is called when we don't have (and don't
+	 * need) the patching_mm so just do plain old patching.
 	 */
-	if (!this_cpu_read(text_poke_area))
+	if (!patching_mm)
 		return raw_patch_instruction(addr, instr);
 
 	local_irq_save(flags);
 
-	text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
-	if (map_patch_area(addr, text_poke_addr)) {
-		err = -1;
+	err = map_patch(addr, &patch_mapping);
+	if (err)
 		goto out;
-	}
 
-	patch_addr = (struct ppc_inst *)(text_poke_addr + (kaddr & ~PAGE_MASK));
+	patch_addr = (struct ppc_inst *)(patching_addr | offset_in_page(addr));
 
-	__patch_instruction(addr, instr, patch_addr);
+	if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64))
+		allow_read_write_user(patch_addr, patch_addr, ppc_inst_len(instr));
+	err = __patch_instruction(addr, instr, patch_addr);
+	if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64))
+		prevent_read_write_user(patch_addr, patch_addr, ppc_inst_len(instr));
 
-	err = unmap_patch_area(text_poke_addr);
-	if (err)
-		pr_warn("failed to unmap %lx\n", text_poke_addr);
+	unmap_patch(&patch_mapping);
+
+	WARN_ON(!ppc_inst_equal(ppc_inst_read(addr), instr));
 
 out:
 	local_irq_restore(flags);
-- 
2.26.1



* [RESEND PATCH v4 09/11] lkdtm/powerpc: Fix code patching hijack test
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
                   ` (7 preceding siblings ...)
  2021-05-06  4:34 ` [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock Christopher M. Riedl
  2021-05-06  4:34 ` [RESEND PATCH v4 11/11] powerpc: Use patch_instruction_unlocked() in loops Christopher M. Riedl
  10 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

Code patching on powerpc with STRICT_KERNEL_RWX now uses a userspace
address in a temporary mm. Use __put_user() to avoid write failures
due to KUAP when attempting a "hijack" of the patching address.

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 drivers/misc/lkdtm/perms.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 55c3bec6d3b72..af9bf285fe326 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -268,16 +268,7 @@ static inline u32 lkdtm_read_patch_site(void)
 /* Returns True if the write succeeds */
 static inline bool lkdtm_try_write(u32 data, u32 *addr)
 {
-#ifdef CONFIG_PPC
-	__put_kernel_nofault(addr, &data, u32, err);
-	return true;
-
-err:
-	return false;
-#endif
-#ifdef CONFIG_X86_64
 	return !__put_user(data, addr);
-#endif
 }
 
 static int lkdtm_patching_cpu(void *data)
-- 
2.26.1



* [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
                   ` (8 preceding siblings ...)
  2021-05-06  4:34 ` [RESEND PATCH v4 09/11] lkdtm/powerpc: Fix code patching hijack test Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  2021-05-06 10:51     ` Peter Zijlstra
  2021-05-06  4:34 ` [RESEND PATCH v4 11/11] powerpc: Use patch_instruction_unlocked() in loops Christopher M. Riedl
  10 siblings, 1 reply; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

Powerpc allows for multiple CPUs to patch concurrently. When patching
with STRICT_KERNEL_RWX a single patching_mm is allocated for use by all
CPUs for the few times that patching occurs. Use a spinlock to protect
the patching_mm from concurrent use.

Modify patch_instruction() to acquire the lock, perform the patch op,
and then release the lock.

Also introduce {lock,unlock}_patching() along with
patch_instruction_unlocked() to avoid per-iteration lock overhead when
patch_instruction() is called in a loop. A follow-up patch converts some
uses of patch_instruction() to use patch_instruction_unlocked() instead.
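
The intended pattern for such loops looks roughly like this (an
illustrative sketch only - 'sites', 'instrs' and 'n' are made-up names;
the real call sites are converted in the next patch):

	unsigned long flags;
	int i;

	flags = lock_patching();
	for (i = 0; i < n; i++)
		patch_instruction_unlocked(sites[i], instrs[i]);
	unlock_patching(flags);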

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>

---

v4:  * New to series.
---
 arch/powerpc/include/asm/code-patching.h |  4 ++
 arch/powerpc/lib/code-patching.c         | 85 +++++++++++++++++++++---
 2 files changed, 79 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
index e51c81e4a9bda..2efa11b68cd8f 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -28,8 +28,12 @@ int create_branch(struct ppc_inst *instr, const struct ppc_inst *addr,
 int create_cond_branch(struct ppc_inst *instr, const struct ppc_inst *addr,
 		       unsigned long target, int flags);
 int patch_branch(struct ppc_inst *addr, unsigned long target, int flags);
+int patch_branch_unlocked(struct ppc_inst *addr, unsigned long target, int flags);
 int patch_instruction(struct ppc_inst *addr, struct ppc_inst instr);
+int patch_instruction_unlocked(struct ppc_inst *addr, struct ppc_inst instr);
 int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr);
+unsigned long lock_patching(void);
+void unlock_patching(unsigned long flags);
 
 static inline unsigned long patch_site_addr(s32 *site)
 {
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 7e15abc09ec04..0a496bb52bbf4 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -52,13 +52,17 @@ int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
 
+static DEFINE_SPINLOCK(patching_lock);
+
 struct temp_mm {
 	struct mm_struct *temp;
 	struct mm_struct *prev;
 	struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+	spinlock_t *lock; /* protect access to the temporary mm */
 };
 
-static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm,
+				spinlock_t *lock)
 {
 	/* Do not preload SLB entries from the thread_info struct */
 	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
@@ -66,12 +70,14 @@ static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
 
 	temp_mm->temp = mm;
 	temp_mm->prev = NULL;
+	temp_mm->lock = lock;
 	memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
 }
 
 static inline void use_temporary_mm(struct temp_mm *temp_mm)
 {
 	lockdep_assert_irqs_disabled();
+	lockdep_assert_held(temp_mm->lock);
 
 	temp_mm->prev = current->active_mm;
 	switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
@@ -93,11 +99,13 @@ static inline void use_temporary_mm(struct temp_mm *temp_mm)
 static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
 {
 	lockdep_assert_irqs_disabled();
+	lockdep_assert_held(temp_mm->lock);
 
 	switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
 
 	/*
-	 * On book3s64 the active_cpus counter increments in
+	 * The temporary mm can only be in use on a single CPU at a time due to
+	 * the temp_mm->lock. On book3s64 the active_cpus counter increments in
 	 * switch_mm_irqs_off(). With the Hash MMU this counter affects if TLB
 	 * flushes are local. We have to manually decrement that counter here
 	 * along with removing our current CPU from the mm's cpumask so that in
@@ -230,7 +238,7 @@ static int map_patch(const void *addr, struct patch_mapping *patch_mapping)
 	pte = pte_mkdirty(pte);
 	set_pte_at(patching_mm, patching_addr, patch_mapping->ptep, pte);
 
-	init_temp_mm(&patch_mapping->temp_mm, patching_mm);
+	init_temp_mm(&patch_mapping->temp_mm, patching_mm, &patching_lock);
 	use_temporary_mm(&patch_mapping->temp_mm);
 
 	/*
@@ -258,7 +266,6 @@ static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 {
 	int err;
 	struct ppc_inst *patch_addr = NULL;
-	unsigned long flags;
 	struct patch_mapping patch_mapping;
 
 	/*
@@ -269,11 +276,12 @@ static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 	if (!patching_mm)
 		return raw_patch_instruction(addr, instr);
 
-	local_irq_save(flags);
+	lockdep_assert_held(&patching_lock);
+	lockdep_assert_irqs_disabled();
 
 	err = map_patch(addr, &patch_mapping);
 	if (err)
-		goto out;
+		return err;
 
 	patch_addr = (struct ppc_inst *)(patching_addr | offset_in_page(addr));
 
@@ -287,11 +295,33 @@ static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 
 	WARN_ON(!ppc_inst_equal(ppc_inst_read(addr), instr));
 
-out:
-	local_irq_restore(flags);
-
 	return err;
 }
+
+unsigned long lock_patching(void)
+{
+	unsigned long flags;
+
+	/* We don't need the lock if we're not using the patching_mm. */
+	if (!patching_mm)
+		return 0;
+
+	spin_lock_irqsave(&patching_lock, flags);
+	return flags;
+}
+
+void unlock_patching(const unsigned long flags)
+{
+	/* We never held the lock if we're not using the patching_mm. */
+	if (!patching_mm)
+		return;
+
+	lockdep_assert_held(&patching_lock);
+	lockdep_assert_irqs_disabled();
+
+	spin_unlock_irqrestore(&patching_lock, flags);
+}
+
 #else /* !CONFIG_STRICT_KERNEL_RWX */
 
 static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
@@ -299,19 +329,46 @@ static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 	return raw_patch_instruction(addr, instr);
 }
 
+unsigned long lock_patching(void)
+{
+	return 0;
+}
+
+void unlock_patching(const unsigned long flags) {}
+
 #endif /* CONFIG_STRICT_KERNEL_RWX */
 
 int patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 {
+	int err;
+	unsigned long flags;
+
 	/* Make sure we aren't patching a freed init section */
 	if (init_mem_is_free && init_section_contains(addr, 4)) {
 		pr_debug("Skipping init section patching addr: 0x%px\n", addr);
 		return 0;
 	}
-	return do_patch_instruction(addr, instr);
+
+	flags = lock_patching();
+	err = do_patch_instruction(addr, instr);
+	unlock_patching(flags);
+
+	return err;
 }
 NOKPROBE_SYMBOL(patch_instruction);
 
+int patch_instruction_unlocked(struct ppc_inst *addr, struct ppc_inst instr)
+{
+	/* Make sure we aren't patching a freed init section */
+	if (init_mem_is_free && init_section_contains(addr, 4)) {
+		pr_debug("Skipping init section patching addr: 0x%px\n", addr);
+		return 0;
+	}
+
+	return do_patch_instruction(addr, instr);
+}
+NOKPROBE_SYMBOL(patch_instruction_unlocked);
+
 int patch_branch(struct ppc_inst *addr, unsigned long target, int flags)
 {
 	struct ppc_inst instr;
@@ -320,6 +377,14 @@ int patch_branch(struct ppc_inst *addr, unsigned long target, int flags)
 	return patch_instruction(addr, instr);
 }
 
+int patch_branch_unlocked(struct ppc_inst *addr, unsigned long target, int flags)
+{
+	struct ppc_inst instr;
+
+	create_branch(&instr, addr, target, flags);
+	return patch_instruction_unlocked(addr, instr);
+}
+
 bool is_offset_in_branch_range(long offset)
 {
 	/*
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RESEND PATCH v4 11/11] powerpc: Use patch_instruction_unlocked() in loops
  2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
                   ` (9 preceding siblings ...)
  2021-05-06  4:34 ` [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock Christopher M. Riedl
@ 2021-05-06  4:34 ` Christopher M. Riedl
  10 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-06  4:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

Now that patching requires a lock to prevent concurrent access to
patching_mm, every call to patch_instruction() acquires and releases a
spinlock. There are several places where patch_instruction() is called
in a loop. Convert these to acquire the lock once before the loop, call
patch_instruction_unlocked() in the loop body, and then release the lock
again after the loop terminates - as in:

	for (i = 0; i < n; ++i)
		patch_instruction(...); <-- lock/unlock every iteration

changes to:

	flags = lock_patching(); <-- lock once

	for (i = 0; i < n; ++i)
		patch_instruction_unlocked(...);

	unlock_patching(flags); <-- unlock once

Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>

---

v4:  * New to series.
---
 arch/powerpc/kernel/epapr_paravirt.c |   9 ++-
 arch/powerpc/kernel/optprobes.c      |  22 ++++--
 arch/powerpc/lib/feature-fixups.c    | 114 +++++++++++++++++++--------
 arch/powerpc/xmon/xmon.c             |  22 ++++--
 4 files changed, 120 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/kernel/epapr_paravirt.c b/arch/powerpc/kernel/epapr_paravirt.c
index 2ed14d4a47f59..b639e71cf9dec 100644
--- a/arch/powerpc/kernel/epapr_paravirt.c
+++ b/arch/powerpc/kernel/epapr_paravirt.c
@@ -28,6 +28,7 @@ static int __init early_init_dt_scan_epapr(unsigned long node,
 	const u32 *insts;
 	int len;
 	int i;
+	unsigned long flags;
 
 	insts = of_get_flat_dt_prop(node, "hcall-instructions", &len);
 	if (!insts)
@@ -36,14 +37,18 @@ static int __init early_init_dt_scan_epapr(unsigned long node,
 	if (len % 4 || len > (4 * 4))
 		return -1;
 
+	flags = lock_patching();
+
 	for (i = 0; i < (len / 4); i++) {
 		struct ppc_inst inst = ppc_inst(be32_to_cpu(insts[i]));
-		patch_instruction((struct ppc_inst *)(epapr_hypercall_start + i), inst);
+		patch_instruction_unlocked((struct ppc_inst *)(epapr_hypercall_start + i), inst);
 #if !defined(CONFIG_64BIT) || defined(CONFIG_PPC_BOOK3E_64)
-		patch_instruction((struct ppc_inst *)(epapr_ev_idle_start + i), inst);
+		patch_instruction_unlocked((struct ppc_inst *)(epapr_ev_idle_start + i), inst);
 #endif
 	}
 
+	unlock_patching(flags);
+
 #if !defined(CONFIG_64BIT) || defined(CONFIG_PPC_BOOK3E_64)
 	if (of_get_flat_dt_prop(node, "has-idle", NULL))
 		epapr_has_idle = true;
diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index cdf87086fa33a..deaeb6e8d1a00 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -200,7 +200,7 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe *p)
 	struct ppc_inst branch_op_callback, branch_emulate_step, temp;
 	kprobe_opcode_t *op_callback_addr, *emulate_step_addr, *buff;
 	long b_offset;
-	unsigned long nip, size;
+	unsigned long nip, size, flags;
 	int rc, i;
 
 	kprobe_ppc_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;
@@ -237,13 +237,20 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe *p)
 	/* We can optimize this via patch_instruction_window later */
 	size = (TMPL_END_IDX * sizeof(kprobe_opcode_t)) / sizeof(int);
 	pr_devel("Copying template to %p, size %lu\n", buff, size);
+
+	flags = lock_patching();
+
 	for (i = 0; i < size; i++) {
-		rc = patch_instruction((struct ppc_inst *)(buff + i),
-				       ppc_inst(*(optprobe_template_entry + i)));
-		if (rc < 0)
+		rc = patch_instruction_unlocked((struct ppc_inst *)(buff + i),
+						ppc_inst(*(optprobe_template_entry + i)));
+		if (rc < 0) {
+			unlock_patching(flags);
 			goto error;
+		}
 	}
 
+	unlock_patching(flags);
+
 	/*
 	 * Fixup the template with instructions to:
 	 * 1. load the address of the actual probepoint
@@ -322,6 +329,9 @@ void arch_optimize_kprobes(struct list_head *oplist)
 	struct ppc_inst instr;
 	struct optimized_kprobe *op;
 	struct optimized_kprobe *tmp;
+	unsigned long flags;
+
+	flags = lock_patching();
 
 	list_for_each_entry_safe(op, tmp, oplist, list) {
 		/*
@@ -333,9 +343,11 @@ void arch_optimize_kprobes(struct list_head *oplist)
 		create_branch(&instr,
 			      (struct ppc_inst *)op->kp.addr,
 			      (unsigned long)op->optinsn.insn, 0);
-		patch_instruction((struct ppc_inst *)op->kp.addr, instr);
+		patch_instruction_unlocked((struct ppc_inst *)op->kp.addr, instr);
 		list_del_init(&op->list);
 	}
+
+	unlock_patching(flags);
 }
 
 void arch_unoptimize_kprobe(struct optimized_kprobe *op)
diff --git a/arch/powerpc/lib/feature-fixups.c b/arch/powerpc/lib/feature-fixups.c
index 1fd31b4b0e139..2c3d413c9d9b3 100644
--- a/arch/powerpc/lib/feature-fixups.c
+++ b/arch/powerpc/lib/feature-fixups.c
@@ -123,6 +123,7 @@ static void do_stf_entry_barrier_fixups(enum stf_barrier_type types)
 	unsigned int instrs[3], *dest;
 	long *start, *end;
 	int i;
+	unsigned long flags;
 
 	start = PTRRELOC(&__start___stf_entry_barrier_fixup);
 	end = PTRRELOC(&__stop___stf_entry_barrier_fixup);
@@ -144,24 +145,29 @@ static void do_stf_entry_barrier_fixups(enum stf_barrier_type types)
 		instrs[i++] = 0x63ff0000; /* ori 31,31,0 speculation barrier */
 	}
 
+	flags = lock_patching();
+
 	for (i = 0; start < end; start++, i++) {
 		dest = (void *)start + *start;
 
 		pr_devel("patching dest %lx\n", (unsigned long)dest);
 
-		patch_instruction((struct ppc_inst *)dest, ppc_inst(instrs[0]));
+		patch_instruction_unlocked((struct ppc_inst *)dest, ppc_inst(instrs[0]));
 
 		if (types & STF_BARRIER_FALLBACK)
-			patch_branch((struct ppc_inst *)(dest + 1),
-				     (unsigned long)&stf_barrier_fallback,
-				     BRANCH_SET_LINK);
+			patch_branch_unlocked((struct ppc_inst *)(dest + 1),
+					      (unsigned long)&stf_barrier_fallback,
+					      BRANCH_SET_LINK);
 		else
-			patch_instruction((struct ppc_inst *)(dest + 1),
-					  ppc_inst(instrs[1]));
+			patch_instruction_unlocked((struct ppc_inst *)(dest + 1),
+						   ppc_inst(instrs[1]));
 
-		patch_instruction((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 2),
+					   ppc_inst(instrs[2]));
 	}
 
+	unlock_patching(flags);
+
 	printk(KERN_DEBUG "stf-barrier: patched %d entry locations (%s barrier)\n", i,
 		(types == STF_BARRIER_NONE)                  ? "no" :
 		(types == STF_BARRIER_FALLBACK)              ? "fallback" :
@@ -175,6 +181,7 @@ static void do_stf_exit_barrier_fixups(enum stf_barrier_type types)
 	unsigned int instrs[6], *dest;
 	long *start, *end;
 	int i;
+	unsigned long flags;
 
 	start = PTRRELOC(&__start___stf_exit_barrier_fixup);
 	end = PTRRELOC(&__stop___stf_exit_barrier_fixup);
@@ -207,18 +214,23 @@ static void do_stf_exit_barrier_fixups(enum stf_barrier_type types)
 		instrs[i++] = 0x7e0006ac; /* eieio + bit 6 hint */
 	}
 
+	flags = lock_patching();
+
 	for (i = 0; start < end; start++, i++) {
 		dest = (void *)start + *start;
 
 		pr_devel("patching dest %lx\n", (unsigned long)dest);
 
-		patch_instruction((struct ppc_inst *)dest, ppc_inst(instrs[0]));
-		patch_instruction((struct ppc_inst *)(dest + 1), ppc_inst(instrs[1]));
-		patch_instruction((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
-		patch_instruction((struct ppc_inst *)(dest + 3), ppc_inst(instrs[3]));
-		patch_instruction((struct ppc_inst *)(dest + 4), ppc_inst(instrs[4]));
-		patch_instruction((struct ppc_inst *)(dest + 5), ppc_inst(instrs[5]));
+		patch_instruction_unlocked((struct ppc_inst *)dest, ppc_inst(instrs[0]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 1), ppc_inst(instrs[1]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 3), ppc_inst(instrs[3]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 4), ppc_inst(instrs[4]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 5), ppc_inst(instrs[5]));
 	}
+
+	unlock_patching(flags);
+
 	printk(KERN_DEBUG "stf-barrier: patched %d exit locations (%s barrier)\n", i,
 		(types == STF_BARRIER_NONE)                  ? "no" :
 		(types == STF_BARRIER_FALLBACK)              ? "fallback" :
@@ -239,6 +251,7 @@ void do_uaccess_flush_fixups(enum l1d_flush_type types)
 	unsigned int instrs[4], *dest;
 	long *start, *end;
 	int i;
+	unsigned long flags;
 
 	start = PTRRELOC(&__start___uaccess_flush_fixup);
 	end = PTRRELOC(&__stop___uaccess_flush_fixup);
@@ -262,18 +275,22 @@ void do_uaccess_flush_fixups(enum l1d_flush_type types)
 	if (types & L1D_FLUSH_MTTRIG)
 		instrs[i++] = 0x7c12dba6; /* mtspr TRIG2,r0 (SPR #882) */
 
+	flags = lock_patching();
+
 	for (i = 0; start < end; start++, i++) {
 		dest = (void *)start + *start;
 
 		pr_devel("patching dest %lx\n", (unsigned long)dest);
 
-		patch_instruction((struct ppc_inst *)dest, ppc_inst(instrs[0]));
+		patch_instruction_unlocked((struct ppc_inst *)dest, ppc_inst(instrs[0]));
 
-		patch_instruction((struct ppc_inst *)(dest + 1), ppc_inst(instrs[1]));
-		patch_instruction((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
-		patch_instruction((struct ppc_inst *)(dest + 3), ppc_inst(instrs[3]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 1), ppc_inst(instrs[1]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 3), ppc_inst(instrs[3]));
 	}
 
+	unlock_patching(flags);
+
 	printk(KERN_DEBUG "uaccess-flush: patched %d locations (%s flush)\n", i,
 		(types == L1D_FLUSH_NONE)       ? "no" :
 		(types == L1D_FLUSH_FALLBACK)   ? "fallback displacement" :
@@ -289,6 +306,7 @@ void do_entry_flush_fixups(enum l1d_flush_type types)
 	unsigned int instrs[3], *dest;
 	long *start, *end;
 	int i;
+	unsigned long flags;
 
 	instrs[0] = 0x60000000; /* nop */
 	instrs[1] = 0x60000000; /* nop */
@@ -309,6 +327,8 @@ void do_entry_flush_fixups(enum l1d_flush_type types)
 	if (types & L1D_FLUSH_MTTRIG)
 		instrs[i++] = 0x7c12dba6; /* mtspr TRIG2,r0 (SPR #882) */
 
+	flags = lock_patching();
+
 	start = PTRRELOC(&__start___entry_flush_fixup);
 	end = PTRRELOC(&__stop___entry_flush_fixup);
 	for (i = 0; start < end; start++, i++) {
@@ -316,15 +336,17 @@ void do_entry_flush_fixups(enum l1d_flush_type types)
 
 		pr_devel("patching dest %lx\n", (unsigned long)dest);
 
-		patch_instruction((struct ppc_inst *)dest, ppc_inst(instrs[0]));
+		patch_instruction_unlocked((struct ppc_inst *)dest, ppc_inst(instrs[0]));
 
 		if (types == L1D_FLUSH_FALLBACK)
-			patch_branch((struct ppc_inst *)(dest + 1), (unsigned long)&entry_flush_fallback,
-				     BRANCH_SET_LINK);
+			patch_branch_unlocked((struct ppc_inst *)(dest + 1),
+					      (unsigned long)&entry_flush_fallback,
+					      BRANCH_SET_LINK);
 		else
-			patch_instruction((struct ppc_inst *)(dest + 1), ppc_inst(instrs[1]));
+			patch_instruction_unlocked((struct ppc_inst *)(dest + 1),
+						   ppc_inst(instrs[1]));
 
-		patch_instruction((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
 	}
 
 	start = PTRRELOC(&__start___scv_entry_flush_fixup);
@@ -334,17 +356,20 @@ void do_entry_flush_fixups(enum l1d_flush_type types)
 
 		pr_devel("patching dest %lx\n", (unsigned long)dest);
 
-		patch_instruction((struct ppc_inst *)dest, ppc_inst(instrs[0]));
+		patch_instruction_unlocked((struct ppc_inst *)dest, ppc_inst(instrs[0]));
 
 		if (types == L1D_FLUSH_FALLBACK)
-			patch_branch((struct ppc_inst *)(dest + 1), (unsigned long)&scv_entry_flush_fallback,
-				     BRANCH_SET_LINK);
+			patch_branch_unlocked((struct ppc_inst *)(dest + 1),
+					      (unsigned long)&scv_entry_flush_fallback,
+					      BRANCH_SET_LINK);
 		else
-			patch_instruction((struct ppc_inst *)(dest + 1), ppc_inst(instrs[1]));
+			patch_instruction_unlocked((struct ppc_inst *)(dest + 1),
+						   ppc_inst(instrs[1]));
 
-		patch_instruction((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
 	}
 
+	unlock_patching(flags);
 
 	printk(KERN_DEBUG "entry-flush: patched %d locations (%s flush)\n", i,
 		(types == L1D_FLUSH_NONE)       ? "no" :
@@ -361,6 +386,7 @@ void do_rfi_flush_fixups(enum l1d_flush_type types)
 	unsigned int instrs[3], *dest;
 	long *start, *end;
 	int i;
+	unsigned long flags;
 
 	start = PTRRELOC(&__start___rfi_flush_fixup);
 	end = PTRRELOC(&__stop___rfi_flush_fixup);
@@ -382,16 +408,20 @@ void do_rfi_flush_fixups(enum l1d_flush_type types)
 	if (types & L1D_FLUSH_MTTRIG)
 		instrs[i++] = 0x7c12dba6; /* mtspr TRIG2,r0 (SPR #882) */
 
+	flags = lock_patching();
+
 	for (i = 0; start < end; start++, i++) {
 		dest = (void *)start + *start;
 
 		pr_devel("patching dest %lx\n", (unsigned long)dest);
 
-		patch_instruction((struct ppc_inst *)dest, ppc_inst(instrs[0]));
-		patch_instruction((struct ppc_inst *)(dest + 1), ppc_inst(instrs[1]));
-		patch_instruction((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
+		patch_instruction_unlocked((struct ppc_inst *)dest, ppc_inst(instrs[0]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 1), ppc_inst(instrs[1]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 2), ppc_inst(instrs[2]));
 	}
 
+	unlock_patching(flags);
+
 	printk(KERN_DEBUG "rfi-flush: patched %d locations (%s flush)\n", i,
 		(types == L1D_FLUSH_NONE)       ? "no" :
 		(types == L1D_FLUSH_FALLBACK)   ? "fallback displacement" :
@@ -407,6 +437,7 @@ void do_barrier_nospec_fixups_range(bool enable, void *fixup_start, void *fixup_
 	unsigned int instr, *dest;
 	long *start, *end;
 	int i;
+	unsigned long flags;
 
 	start = fixup_start;
 	end = fixup_end;
@@ -418,13 +449,17 @@ void do_barrier_nospec_fixups_range(bool enable, void *fixup_start, void *fixup_
 		instr = 0x63ff0000; /* ori 31,31,0 speculation barrier */
 	}
 
+	flags = lock_patching();
+
 	for (i = 0; start < end; start++, i++) {
 		dest = (void *)start + *start;
 
 		pr_devel("patching dest %lx\n", (unsigned long)dest);
-		patch_instruction((struct ppc_inst *)dest, ppc_inst(instr));
+		patch_instruction_unlocked((struct ppc_inst *)dest, ppc_inst(instr));
 	}
 
+	unlock_patching(flags);
+
 	printk(KERN_DEBUG "barrier-nospec: patched %d locations\n", i);
 }
 
@@ -448,6 +483,7 @@ void do_barrier_nospec_fixups_range(bool enable, void *fixup_start, void *fixup_
 	unsigned int instr[2], *dest;
 	long *start, *end;
 	int i;
+	unsigned long flags;
 
 	start = fixup_start;
 	end = fixup_end;
@@ -461,27 +497,37 @@ void do_barrier_nospec_fixups_range(bool enable, void *fixup_start, void *fixup_
 		instr[1] = PPC_INST_SYNC;
 	}
 
+	flags = lock_patching();
+
 	for (i = 0; start < end; start++, i++) {
 		dest = (void *)start + *start;
 
 		pr_devel("patching dest %lx\n", (unsigned long)dest);
-		patch_instruction((struct ppc_inst *)dest, ppc_inst(instr[0]));
-		patch_instruction((struct ppc_inst *)(dest + 1), ppc_inst(instr[1]));
+		patch_instruction_unlocked((struct ppc_inst *)dest, ppc_inst(instr[0]));
+		patch_instruction_unlocked((struct ppc_inst *)(dest + 1), ppc_inst(instr[1]));
 	}
 
+	unlock_patching(flags);
+
 	printk(KERN_DEBUG "barrier-nospec: patched %d locations\n", i);
 }
 
 static void patch_btb_flush_section(long *curr)
 {
 	unsigned int *start, *end;
+	unsigned long flags;
 
 	start = (void *)curr + *curr;
 	end = (void *)curr + *(curr + 1);
+
+	flags = lock_patching();
+
 	for (; start < end; start++) {
 		pr_devel("patching dest %lx\n", (unsigned long)start);
-		patch_instruction((struct ppc_inst *)start, ppc_inst(PPC_INST_NOP));
+		patch_instruction_unlocked((struct ppc_inst *)start, ppc_inst(PPC_INST_NOP));
 	}
+
+	unlock_patching(flags);
 }
 
 void do_btb_flush_fixups(void)
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index ff2b92bfeedcc..e8a00041c04bf 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -905,6 +905,9 @@ static void insert_bpts(void)
 	int i;
 	struct ppc_inst instr, instr2;
 	struct bpt *bp, *bp2;
+	unsigned long flags;
+
+	flags = lock_patching();
 
 	bp = bpts;
 	for (i = 0; i < NBPTS; ++i, ++bp) {
@@ -945,19 +948,21 @@ static void insert_bpts(void)
 			continue;
 		}
 
-		patch_instruction(bp->instr, instr);
-		patch_instruction(ppc_inst_next(bp->instr, &instr),
-				  ppc_inst(bpinstr));
+		patch_instruction_unlocked(bp->instr, instr);
+		patch_instruction_unlocked(ppc_inst_next(bp->instr, &instr),
+					   ppc_inst(bpinstr));
 		if (bp->enabled & BP_CIABR)
 			continue;
-		if (patch_instruction((struct ppc_inst *)bp->address,
-				      ppc_inst(bpinstr)) != 0) {
+		if (patch_instruction_unlocked((struct ppc_inst *)bp->address,
+						ppc_inst(bpinstr)) != 0) {
 			printf("Couldn't write instruction at %lx, "
 			       "disabling breakpoint there\n", bp->address);
 			bp->enabled &= ~BP_TRAP;
 			continue;
 		}
 	}
+
+	unlock_patching(flags);
 }
 
 static void insert_cpu_bpts(void)
@@ -984,6 +989,9 @@ static void remove_bpts(void)
 	int i;
 	struct bpt *bp;
 	struct ppc_inst instr;
+	unsigned long flags;
+
+	flags = lock_patching();
 
 	bp = bpts;
 	for (i = 0; i < NBPTS; ++i, ++bp) {
@@ -991,11 +999,13 @@ static void remove_bpts(void)
 			continue;
 		if (mread_instr(bp->address, &instr)
 		    && ppc_inst_equal(instr, ppc_inst(bpinstr))
-		    && patch_instruction(
+		    && patch_instruction_unlocked(
 			(struct ppc_inst *)bp->address, ppc_inst_read(bp->instr)) != 0)
 			printf("Couldn't remove breakpoint at %lx\n",
 			       bp->address);
 	}
+
+	unlock_patching(flags);
 }
 
 static void remove_cpu_bpts(void)
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock
  2021-05-06  4:34 ` [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock Christopher M. Riedl
@ 2021-05-06 10:51     ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2021-05-06 10:51 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, tglx, x86, linux-hardening, keescook

On Wed, May 05, 2021 at 11:34:51PM -0500, Christopher M. Riedl wrote:
> Powerpc allows for multiple CPUs to patch concurrently. When patching
> with STRICT_KERNEL_RWX a single patching_mm is allocated for use by all
> CPUs for the few times that patching occurs. Use a spinlock to protect
> the patching_mm from concurrent use.
> 
> Modify patch_instruction() to acquire the lock, perform the patch op,
> and then release the lock.
> 
> Also introduce {lock,unlock}_patching() along with
> patch_instruction_unlocked() to avoid per-iteration lock overhead when
> patch_instruction() is called in a loop. A follow-up patch converts some
> uses of patch_instruction() to use patch_instruction_unlocked() instead.

x86 uses text_mutex for all this, why not do the same?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock
  2021-05-06 10:51     ` Peter Zijlstra
@ 2021-05-07 20:03       ` Christopher M. Riedl
  -1 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-05-07 20:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, x86, linuxppc-dev, linux-hardening, keescook

On Thu May 6, 2021 at 5:51 AM CDT, Peter Zijlstra wrote:
> On Wed, May 05, 2021 at 11:34:51PM -0500, Christopher M. Riedl wrote:
> > Powerpc allows for multiple CPUs to patch concurrently. When patching
> > with STRICT_KERNEL_RWX a single patching_mm is allocated for use by all
> > CPUs for the few times that patching occurs. Use a spinlock to protect
> > the patching_mm from concurrent use.
> > 
> > Modify patch_instruction() to acquire the lock, perform the patch op,
> > and then release the lock.
> > 
> > Also introduce {lock,unlock}_patching() along with
> > patch_instruction_unlocked() to avoid per-iteration lock overhead when
> > patch_instruction() is called in a loop. A follow-up patch converts some
> > uses of patch_instruction() to use patch_instruction_unlocked() instead.
>
> x86 uses text_mutex for all this, why not do the same?

I wasn't entirely sure if there is a problem with potentially going to
sleep in some of the places where patch_instruction() is called - the
spinlock avoids that (hypothetical) problem.

I just tried switching to text_mutex and at least on a P9 machine the
series boots w/ the Hash and Radix MMUs (with some lockdep errors). I
can rework this in the next version to use text_mutex if I don't find
any new problems with more extensive testing. It does mean more changes
to use patch_instruction_unlocked() in kprobe/optprobe/ftrace in
arch/powerpc since iirc those are called with text_mutex already held.

Thanks!
Chris R.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock
  2021-05-07 20:03       ` Christopher M. Riedl
  (?)
@ 2021-05-07 22:26       ` Peter Zijlstra
  -1 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2021-05-07 22:26 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: tglx, x86, linuxppc-dev, linux-hardening, keescook

On Fri, May 07, 2021 at 03:03:51PM -0500, Christopher M. Riedl wrote:
> On Thu May 6, 2021 at 5:51 AM CDT, Peter Zijlstra wrote:
> > On Wed, May 05, 2021 at 11:34:51PM -0500, Christopher M. Riedl wrote:
> > > Powerpc allows for multiple CPUs to patch concurrently. When patching
> > > with STRICT_KERNEL_RWX a single patching_mm is allocated for use by all
> > > CPUs for the few times that patching occurs. Use a spinlock to protect
> > > the patching_mm from concurrent use.
> > > 
> > > Modify patch_instruction() to acquire the lock, perform the patch op,
> > > and then release the lock.
> > > 
> > > Also introduce {lock,unlock}_patching() along with
> > > patch_instruction_unlocked() to avoid per-iteration lock overhead when
> > > patch_instruction() is called in a loop. A follow-up patch converts some
> > > uses of patch_instruction() to use patch_instruction_unlocked() instead.
> >
> > x86 uses text_mutex for all this, why not do the same?
> 
> I wasn't entirely sure if there is a problem with potentially going to
> sleep in some of the places where patch_instruction() is called - the
> spinlock avoids that (hypothetical) problem.

So I'm not saying you like have to do this; but I did wonder if there's
a reason not to, and given you didn't mention it, I had to ask.

> I just tried switching to text_mutex and at least on a P9 machine the
> series boots w/ the Hash and Radix MMUs (with some lockdep errors). I
> can rework this in the next version to use text_mutex if I don't find
> any new problems with more extensive testing. It does mean more changes
> to use patch_instruction_unlocked() in kprobe/optprobe/ftrace in
> arch/powerpc since iirc those are called with text_mutex already held.

The x86 text_poke() has a lockdep_assert_held(&text_mutex) in to make
sure nobody 'forgets' :-)
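
FWIW, the powerpc side of that guard could look roughly like this (a
sketch only - it uses raw_patch_instruction() as a stand-in for the real
temporary-mm patching path, and assumes the series is reworked to take
text_mutex):

	#include <linux/memory.h>	/* text_mutex */
	#include <linux/mutex.h>
	#include <linux/lockdep.h>

	int patch_instruction_unlocked(struct ppc_inst *addr, struct ppc_inst instr)
	{
		/* catch callers that 'forget' to take text_mutex */
		lockdep_assert_held(&text_mutex);
		return raw_patch_instruction(addr, instr);
	}

	int patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
	{
		int err;

		mutex_lock(&text_mutex);
		err = patch_instruction_unlocked(addr, instr);
		mutex_unlock(&text_mutex);

		return err;
	}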

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-05-06  4:34 ` [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload Christopher M. Riedl
@ 2021-06-21  3:13   ` Daniel Axtens
  2021-07-01  3:48       ` Christopher M. Riedl
  0 siblings, 1 reply; 45+ messages in thread
From: Daniel Axtens @ 2021-06-21  3:13 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

"Christopher M. Riedl" <cmr@linux.ibm.com> writes:

> Switching to a different mm with Hash translation causes SLB entries to
> be preloaded from the current thread_info. This reduces SLB faults, for
> example when threads share a common mm but operate on different address
> ranges.
>
> Preloading entries from the thread_info struct may not always be
> appropriate - such as when switching to a temporary mm. Introduce a new
> boolean in mm_context_t to skip the SLB preload entirely. Also move the
> SLB preload code into a separate function since switch_slb() is already
> quite long. The default behavior (preloading SLB entries from the
> current thread_info struct) remains unchanged.
>
> Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
>
> ---
>
> v4:  * New to series.
> ---
>  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
>  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
>  4 files changed, 50 insertions(+), 24 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> index eace8c3f7b0a1..b23a9dcdee5af 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -130,6 +130,9 @@ typedef struct {
>  	u32 pkey_allocation_map;
>  	s16 execute_only_pkey; /* key holding execute-only protection */
>  #endif
> +
> +	/* Do not preload SLB entries from thread_info during switch_slb() */
> +	bool skip_slb_preload;
>  } mm_context_t;
>  
>  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 4bc45d3ed8b0e..264787e90b1a1 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PPC_BOOK3S_64
> +
> +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> +{
> +	mm->context.skip_slb_preload = true;
> +}
> +
> +#else
> +
> +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> +
> +#endif /* CONFIG_PPC_BOOK3S_64 */
> +
>  #include <asm-generic/mmu_context.h>
>  
>  #endif /* __KERNEL__ */
> diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
> index c10fc8a72fb37..3479910264c59 100644
> --- a/arch/powerpc/mm/book3s64/mmu_context.c
> +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>  	atomic_set(&mm->context.active_cpus, 0);
>  	atomic_set(&mm->context.copros, 0);
>  
> +	mm->context.skip_slb_preload = false;
> +
>  	return 0;
>  }
>  
> diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
> index c91bd85eb90e3..da0836cb855af 100644
> --- a/arch/powerpc/mm/book3s64/slb.c
> +++ b/arch/powerpc/mm/book3s64/slb.c
> @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
>  	asm volatile("slbie %0" : : "r" (slbie_data));
>  }
>  
> +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
Should this be explicitly inline or even __always_inline? I'm thinking
switch_slb is probably a fairly hot path on hash?

> +{
> +	struct thread_info *ti = task_thread_info(tsk);
> +	unsigned char i;
> +
> +	/*
> +	 * We gradually age out SLBs after a number of context switches to
> +	 * reduce reload overhead of unused entries (like we do with FP/VEC
> +	 * reload). Each time we wrap 256 switches, take an entry out of the
> +	 * SLB preload cache.
> +	 */
> +	tsk->thread.load_slb++;
> +	if (!tsk->thread.load_slb) {
> +		unsigned long pc = KSTK_EIP(tsk);
> +
> +		preload_age(ti);
> +		preload_add(ti, pc);
> +	}
> +
> +	for (i = 0; i < ti->slb_preload_nr; i++) {
> +		unsigned char idx;
> +		unsigned long ea;
> +
> +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> +
> +		slb_allocate_user(mm, ea);
> +	}
> +}
> +
>  /* Flush all user entries from the segment table of the current processor. */
>  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>  {
> -	struct thread_info *ti = task_thread_info(tsk);
>  	unsigned char i;
>  
>  	/*
> @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>  
>  	copy_mm_to_paca(mm);
>  
> -	/*
> -	 * We gradually age out SLBs after a number of context switches to
> -	 * reduce reload overhead of unused entries (like we do with FP/VEC
> -	 * reload). Each time we wrap 256 switches, take an entry out of the
> -	 * SLB preload cache.
> -	 */
> -	tsk->thread.load_slb++;
> -	if (!tsk->thread.load_slb) {
> -		unsigned long pc = KSTK_EIP(tsk);
> -
> -		preload_age(ti);
> -		preload_add(ti, pc);
> -	}
> -
> -	for (i = 0; i < ti->slb_preload_nr; i++) {
> -		unsigned char idx;
> -		unsigned long ea;
> -
> -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> -
> -		slb_allocate_user(mm, ea);
> -	}
> +	if (!mm->context.skip_slb_preload)
> +		preload_slb_entries(tsk, mm);

Should this be wrapped in likely()?

>  
>  	/*
>  	 * Synchronize slbmte preloads with possible subsequent user memory

Right below this comment is the isync. It seems to be specifically
concerned with synchronising preloaded slbs. Do you need it if you are
skipping SLB preloads?

It's probably not a big deal to have an extra isync in the fairly rare
path when we're skipping preloads, but I thought I'd check.

Kind regards,
Daniel

> -- 
> 2.26.1

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching
  2021-05-06  4:34 ` [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching Christopher M. Riedl
@ 2021-06-21  3:19   ` Daniel Axtens
  2021-07-01  5:11       ` Christopher M. Riedl
  2021-07-01  6:12     ` Nicholas Piggin
  1 sibling, 1 reply; 45+ messages in thread
From: Daniel Axtens @ 2021-06-21  3:19 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

Hi Chris,

> +	/*
> +	 * Choose a randomized, page-aligned address from the range:
> +	 * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE]
> +	 * The lower address bound is PAGE_SIZE to avoid the zero-page.
> +	 * The upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to stay
> +	 * under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> +	 */
> +	patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> +			% (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));

I checked and poking_init() comes after the functions that init the RNG,
so this should be fine. The maths - while a bit fiddly to reason about -
does check out.
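
For anyone else following along, a throwaway 64-bit userspace check of
those bounds (the constants below are placeholders, not the real kernel
values):

	#include <assert.h>
	#include <stdlib.h>

	#define PAGE_SIZE	0x10000UL		/* placeholder: 64K pages */
	#define PAGE_MASK	(~(PAGE_SIZE - 1))
	#define MAP_WINDOW	(1UL << 47)		/* stand-in for DEFAULT_MAP_WINDOW */

	int main(void)
	{
		for (int i = 0; i < 1000000; i++) {
			unsigned long r = (((unsigned long)rand() << 32) | (unsigned long)rand()) & PAGE_MASK;
			unsigned long addr = PAGE_SIZE + (r % (MAP_WINDOW - 2 * PAGE_SIZE));

			assert(addr >= PAGE_SIZE);			/* avoids the zero page */
			assert(addr <= MAP_WINDOW - 2 * PAGE_SIZE);	/* stays under MAP_WINDOW - PAGE_SIZE */
			assert(!(addr & ~PAGE_MASK));			/* still page aligned */
		}
		return 0;
	}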

> +
> +	/*
> +	 * PTE allocation uses GFP_KERNEL which means we need to pre-allocate
> +	 * the PTE here. We cannot do the allocation during patching with IRQs
> +	 * disabled (ie. "atomic" context).
> +	 */
> +	ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> +	BUG_ON(!ptep);
> +	pte_unmap_unlock(ptep, ptl);
> +}
>  
>  #if IS_BUILTIN(CONFIG_LKDTM)
>  unsigned long read_cpu_patching_addr(unsigned int cpu)
>  {
> -	return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
> +	return patching_addr;
>  }
>  #endif
>  
> -static int text_area_cpu_up(unsigned int cpu)
> +struct patch_mapping {
> +	spinlock_t *ptl; /* for protecting pte table */
> +	pte_t *ptep;
> +	struct temp_mm temp_mm;
> +};
> +
> +#ifdef CONFIG_PPC_BOOK3S_64
> +
> +static inline int hash_prefault_mapping(pgprot_t pgprot)
>  {
> -	struct vm_struct *area;
> +	int err;
>  
> -	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> -	if (!area) {
> -		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> -			cpu);
> -		return -1;
> -	}
> -	this_cpu_write(text_poke_area, area);
> +	if (radix_enabled())
> +		return 0;
>  
> -	return 0;
> -}
> +	err = slb_allocate_user(patching_mm, patching_addr);
> +	if (err)
> +		pr_warn("map patch: failed to allocate slb entry\n");
>  

Here if slb_allocate_user() fails, you'll print a warning and then fall
through to the rest of the function. You do return err, but there's a
later call to hash_page_mm() that also sets err. Can slb_allocate_user()
fail while hash_page_mm() succeeds, and would that be a problem?

> -static int text_area_cpu_down(unsigned int cpu)
> -{
> -	free_vm_area(this_cpu_read(text_poke_area));
> -	return 0;
> +	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> +			   HPTE_USE_KERNEL_KEY);
> +	if (err)
> +		pr_warn("map patch: failed to insert hashed page\n");
> +
> +	/* See comment in switch_slb() in mm/book3s64/slb.c */
> +	isync();
> +

The comment reads:

	/*
	 * Synchronize slbmte preloads with possible subsequent user memory
	 * address accesses by the kernel (user mode won't happen until
	 * rfid, which is safe).
	 */
         isync();

I have to say having read the description of isync I'm not 100% sure why
that's enough (don't we also need stores to complete?) but I'm happy to
take commit 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
on trust here!

I think it does make sense for you to have that barrier here: you are
potentially about to start poking at the memory mapped through that SLB
entry so you should make sure you're fully synchronised.

> +	return err;
>  }
>  

> +	init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> +	use_temporary_mm(&patch_mapping->temp_mm);
>  
> -	pmdp = pmd_offset(pudp, addr);
> -	if (unlikely(!pmdp))
> -		return -EINVAL;
> +	/*
> +	 * On Book3s64 with the Hash MMU we have to manually insert the SLB
> +	 * entry and HPTE to prevent taking faults on the patching_addr later.
> +	 */
> +	return(hash_prefault_mapping(pgprot));

hmm, `return hash_prefault_mapping(pgprot);` or
`return (hash_prefault_mapping(pgprot));` maybe?

Kind regards,
Daniel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-06-21  3:13   ` Daniel Axtens
@ 2021-07-01  3:48       ` Christopher M. Riedl
  0 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-07-01  3:48 UTC (permalink / raw)
  To: Daniel Axtens, linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> "Christopher M. Riedl" <cmr@linux.ibm.com> writes:
>
> > Switching to a different mm with Hash translation causes SLB entries to
> > be preloaded from the current thread_info. This reduces SLB faults, for
> > example when threads share a common mm but operate on different address
> > ranges.
> >
> > Preloading entries from the thread_info struct may not always be
> > appropriate - such as when switching to a temporary mm. Introduce a new
> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
> > SLB preload code into a separate function since switch_slb() is already
> > quite long. The default behavior (preloading SLB entries from the
> > current thread_info struct) remains unchanged.
> >
> > Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
> >
> > ---
> >
> > v4:  * New to series.
> > ---
> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> > @@ -130,6 +130,9 @@ typedef struct {
> >  	u32 pkey_allocation_map;
> >  	s16 execute_only_pkey; /* key holding execute-only protection */
> >  #endif
> > +
> > +	/* Do not preload SLB entries from thread_info during switch_slb() */
> > +	bool skip_slb_preload;
> >  } mm_context_t;
> >  
> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
> >  	return 0;
> >  }
> >  
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +
> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> > +{
> > +	mm->context.skip_slb_preload = true;
> > +}
> > +
> > +#else
> > +
> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> > +
> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> > +
> >  #include <asm-generic/mmu_context.h>
> >  
> >  #endif /* __KERNEL__ */
> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
> > index c10fc8a72fb37..3479910264c59 100644
> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> >  	atomic_set(&mm->context.active_cpus, 0);
> >  	atomic_set(&mm->context.copros, 0);
> >  
> > +	mm->context.skip_slb_preload = false;
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
> > index c91bd85eb90e3..da0836cb855af 100644
> > --- a/arch/powerpc/mm/book3s64/slb.c
> > +++ b/arch/powerpc/mm/book3s64/slb.c
> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
> >  	asm volatile("slbie %0" : : "r" (slbie_data));
> >  }
> >  
> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
> Should this be explicitly inline or even __always_inline? I'm thinking
> switch_slb is probably a fairly hot path on hash?

Yes absolutely. I'll make this change in v5.

>
> > +{
> > +	struct thread_info *ti = task_thread_info(tsk);
> > +	unsigned char i;
> > +
> > +	/*
> > +	 * We gradually age out SLBs after a number of context switches to
> > +	 * reduce reload overhead of unused entries (like we do with FP/VEC
> > +	 * reload). Each time we wrap 256 switches, take an entry out of the
> > +	 * SLB preload cache.
> > +	 */
> > +	tsk->thread.load_slb++;
> > +	if (!tsk->thread.load_slb) {
> > +		unsigned long pc = KSTK_EIP(tsk);
> > +
> > +		preload_age(ti);
> > +		preload_add(ti, pc);
> > +	}
> > +
> > +	for (i = 0; i < ti->slb_preload_nr; i++) {
> > +		unsigned char idx;
> > +		unsigned long ea;
> > +
> > +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> > +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> > +
> > +		slb_allocate_user(mm, ea);
> > +	}
> > +}
> > +
> >  /* Flush all user entries from the segment table of the current processor. */
> >  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >  {
> > -	struct thread_info *ti = task_thread_info(tsk);
> >  	unsigned char i;
> >  
> >  	/*
> > @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >  
> >  	copy_mm_to_paca(mm);
> >  
> > -	/*
> > -	 * We gradually age out SLBs after a number of context switches to
> > -	 * reduce reload overhead of unused entries (like we do with FP/VEC
> > -	 * reload). Each time we wrap 256 switches, take an entry out of the
> > -	 * SLB preload cache.
> > -	 */
> > -	tsk->thread.load_slb++;
> > -	if (!tsk->thread.load_slb) {
> > -		unsigned long pc = KSTK_EIP(tsk);
> > -
> > -		preload_age(ti);
> > -		preload_add(ti, pc);
> > -	}
> > -
> > -	for (i = 0; i < ti->slb_preload_nr; i++) {
> > -		unsigned char idx;
> > -		unsigned long ea;
> > -
> > -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> > -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> > -
> > -		slb_allocate_user(mm, ea);
> > -	}
> > +	if (!mm->context.skip_slb_preload)
> > +		preload_slb_entries(tsk, mm);
>
> Should this be wrapped in likely()?

Seems like a good idea - yes.

>
> >  
> >  	/*
> >  	 * Synchronize slbmte preloads with possible subsequent user memory
>
> Right below this comment is the isync. It seems to be specifically
> concerned with synchronising preloaded slbs. Do you need it if you are
> skipping SLB preloads?
>
> It's probably not a big deal to have an extra isync in the fairly rare
> path when we're skipping preloads, but I thought I'd check.

I don't _think_ we need the `isync` if we are skipping the SLB preloads,
but then again it was always in the code-path before. If someone can
make a compelling argument to drop it when not preloading SLBs I will,
otherwise (considering some of the other non-obvious things I stepped
into with the Hash code) I will keep it here for now.

Thanks for the comments!

>
> Kind regards,
> Daniel
>
> > -- 
> > 2.26.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-07-01  3:48       ` Christopher M. Riedl
@ 2021-07-01  4:15         ` Nicholas Piggin
  -1 siblings, 0 replies; 45+ messages in thread
From: Nicholas Piggin @ 2021-07-01  4:15 UTC (permalink / raw)
  To: Christopher M. Riedl, Daniel Axtens, linuxppc-dev
  Cc: keescook, linux-hardening, tglx, x86

Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>> "Christopher M. Riedl" <cmr@linux.ibm.com> writes:
>>
>> > Switching to a different mm with Hash translation causes SLB entries to
>> > be preloaded from the current thread_info. This reduces SLB faults, for
>> > example when threads share a common mm but operate on different address
>> > ranges.
>> >
>> > Preloading entries from the thread_info struct may not always be
>> > appropriate - such as when switching to a temporary mm. Introduce a new
>> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
>> > SLB preload code into a separate function since switch_slb() is already
>> > quite long. The default behavior (preloading SLB entries from the
>> > current thread_info struct) remains unchanged.
>> >
>> > Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
>> >
>> > ---
>> >
>> > v4:  * New to series.
>> > ---
>> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
>> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>> >  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
>> >  4 files changed, 50 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
>> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>> > @@ -130,6 +130,9 @@ typedef struct {
>> >  	u32 pkey_allocation_map;
>> >  	s16 execute_only_pkey; /* key holding execute-only protection */
>> >  #endif
>> > +
>> > +	/* Do not preload SLB entries from thread_info during switch_slb() */
>> > +	bool skip_slb_preload;
>> >  } mm_context_t;
>> >  
>> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
>> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>> > --- a/arch/powerpc/include/asm/mmu_context.h
>> > +++ b/arch/powerpc/include/asm/mmu_context.h
>> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
>> >  	return 0;
>> >  }
>> >  
>> > +#ifdef CONFIG_PPC_BOOK3S_64
>> > +
>> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>> > +{
>> > +	mm->context.skip_slb_preload = true;
>> > +}
>> > +
>> > +#else
>> > +
>> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>> > +
>> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>> > +
>> >  #include <asm-generic/mmu_context.h>
>> >  
>> >  #endif /* __KERNEL__ */
>> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
>> > index c10fc8a72fb37..3479910264c59 100644
>> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>> >  	atomic_set(&mm->context.active_cpus, 0);
>> >  	atomic_set(&mm->context.copros, 0);
>> >  
>> > +	mm->context.skip_slb_preload = false;
>> > +
>> >  	return 0;
>> >  }
>> >  
>> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
>> > index c91bd85eb90e3..da0836cb855af 100644
>> > --- a/arch/powerpc/mm/book3s64/slb.c
>> > +++ b/arch/powerpc/mm/book3s64/slb.c
>> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
>> >  	asm volatile("slbie %0" : : "r" (slbie_data));
>> >  }
>> >  
>> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
>> Should this be explicitly inline or even __always_inline? I'm thinking
>> switch_slb is probably a fairly hot path on hash?
> 
> Yes absolutely. I'll make this change in v5.
> 
>>
>> > +{
>> > +	struct thread_info *ti = task_thread_info(tsk);
>> > +	unsigned char i;
>> > +
>> > +	/*
>> > +	 * We gradually age out SLBs after a number of context switches to
>> > +	 * reduce reload overhead of unused entries (like we do with FP/VEC
>> > +	 * reload). Each time we wrap 256 switches, take an entry out of the
>> > +	 * SLB preload cache.
>> > +	 */
>> > +	tsk->thread.load_slb++;
>> > +	if (!tsk->thread.load_slb) {
>> > +		unsigned long pc = KSTK_EIP(tsk);
>> > +
>> > +		preload_age(ti);
>> > +		preload_add(ti, pc);
>> > +	}
>> > +
>> > +	for (i = 0; i < ti->slb_preload_nr; i++) {
>> > +		unsigned char idx;
>> > +		unsigned long ea;
>> > +
>> > +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
>> > +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
>> > +
>> > +		slb_allocate_user(mm, ea);
>> > +	}
>> > +}
>> > +
>> >  /* Flush all user entries from the segment table of the current processor. */
>> >  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>> >  {
>> > -	struct thread_info *ti = task_thread_info(tsk);
>> >  	unsigned char i;
>> >  
>> >  	/*
>> > @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>> >  
>> >  	copy_mm_to_paca(mm);
>> >  
>> > -	/*
>> > -	 * We gradually age out SLBs after a number of context switches to
>> > -	 * reduce reload overhead of unused entries (like we do with FP/VEC
>> > -	 * reload). Each time we wrap 256 switches, take an entry out of the
>> > -	 * SLB preload cache.
>> > -	 */
>> > -	tsk->thread.load_slb++;
>> > -	if (!tsk->thread.load_slb) {
>> > -		unsigned long pc = KSTK_EIP(tsk);
>> > -
>> > -		preload_age(ti);
>> > -		preload_add(ti, pc);
>> > -	}
>> > -
>> > -	for (i = 0; i < ti->slb_preload_nr; i++) {
>> > -		unsigned char idx;
>> > -		unsigned long ea;
>> > -
>> > -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
>> > -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
>> > -
>> > -		slb_allocate_user(mm, ea);
>> > -	}
>> > +	if (!mm->context.skip_slb_preload)
>> > +		preload_slb_entries(tsk, mm);
>>
>> Should this be wrapped in likely()?
> 
> Seems like a good idea - yes.
> 
>>
>> >  
>> >  	/*
>> >  	 * Synchronize slbmte preloads with possible subsequent user memory
>>
>> Right below this comment is the isync. It seems to be specifically
>> concerned with synchronising preloaded slbs. Do you need it if you are
>> skipping SLB preloads?
>>
>> It's probably not a big deal to have an extra isync in the fairly rare
>> path when we're skipping preloads, but I thought I'd check.
> 
> I don't _think_ we need the `isync` if we are skipping the SLB preloads,
> but then again it was always in the code-path before. If someone can
> make a compelling argument to drop it when not preloading SLBs I will,
> otherwise (considering some of the other non-obvious things I stepped
> into with the Hash code) I will keep it here for now.

The ISA says slbia wants an isync afterward, so we probably should keep 
it. The comment is a bit misleading in that case.

Why isn't preloading appropriate for a temporary mm? 

Thanks,
Nick

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching
  2021-06-21  3:19   ` Daniel Axtens
@ 2021-07-01  5:11       ` Christopher M. Riedl
  0 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-07-01  5:11 UTC (permalink / raw)
  To: Daniel Axtens, linuxppc-dev; +Cc: tglx, x86, linux-hardening, keescook

On Sun Jun 20, 2021 at 10:19 PM CDT, Daniel Axtens wrote:
> Hi Chris,
>
> > +	/*
> > +	 * Choose a randomized, page-aligned address from the range:
> > +	 * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE]
> > +	 * The lower address bound is PAGE_SIZE to avoid the zero-page.
> > +	 * The upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to stay
> > +	 * under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> > +	 */
> > +	patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> > +			% (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
>
> I checked and poking_init() comes after the functions that init the RNG,
> so this should be fine. The maths - while a bit fiddly to reason about -
> does check out.

Thanks for double checking.
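
For the record, the reasoning: get_random_long() & PAGE_MASK is a multiple
of PAGE_SIZE, and so is DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE, so the modulus
yields a page-aligned offset in [0, DEFAULT_MAP_WINDOW - 3 * PAGE_SIZE].
Adding PAGE_SIZE then puts patching_addr, still page-aligned, in
[PAGE_SIZE, DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE], so the mapped page avoids
the zero-page and stays under DEFAULT_MAP_WINDOW.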

>
> > +
> > +	/*
> > +	 * PTE allocation uses GFP_KERNEL which means we need to pre-allocate
> > +	 * the PTE here. We cannot do the allocation during patching with IRQs
> > +	 * disabled (ie. "atomic" context).
> > +	 */
> > +	ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> > +	BUG_ON(!ptep);
> > +	pte_unmap_unlock(ptep, ptl);
> > +}
> >  
> >  #if IS_BUILTIN(CONFIG_LKDTM)
> >  unsigned long read_cpu_patching_addr(unsigned int cpu)
> >  {
> > -	return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
> > +	return patching_addr;
> >  }
> >  #endif
> >  
> > -static int text_area_cpu_up(unsigned int cpu)
> > +struct patch_mapping {
> > +	spinlock_t *ptl; /* for protecting pte table */
> > +	pte_t *ptep;
> > +	struct temp_mm temp_mm;
> > +};
> > +
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +
> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
> >  {
> > -	struct vm_struct *area;
> > +	int err;
> >  
> > -	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > -	if (!area) {
> > -		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > -			cpu);
> > -		return -1;
> > -	}
> > -	this_cpu_write(text_poke_area, area);
> > +	if (radix_enabled())
> > +		return 0;
> >  
> > -	return 0;
> > -}
> > +	err = slb_allocate_user(patching_mm, patching_addr);
> > +	if (err)
> > +		pr_warn("map patch: failed to allocate slb entry\n");
> >  
>
> Here if slb_allocate_user() fails, you'll print a warning and then fall
> through to the rest of the function. You do return err, but there's a
> later call to hash_page_mm() that also sets err. Can slb_allocate_user()
> fail while hash_page_mm() succeeds, and would that be a problem?

Hmm, yes I think this is a problem. If slb_allocate_user() fails then we
could potentially mask that error until the actual patching
fails/miscompares later (and that *will* certainly fail in this case). I
will return the error and exit the function early in v5 of the series.
Thanks!
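
Roughly, for v5 (sketch only):

	err = slb_allocate_user(patching_mm, patching_addr);
	if (err) {
		pr_warn("map patch: failed to allocate slb entry\n");
		return err;
	}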

>
> > -static int text_area_cpu_down(unsigned int cpu)
> > -{
> > -	free_vm_area(this_cpu_read(text_poke_area));
> > -	return 0;
> > +	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> > +			   HPTE_USE_KERNEL_KEY);
> > +	if (err)
> > +		pr_warn("map patch: failed to insert hashed page\n");
> > +
> > +	/* See comment in switch_slb() in mm/book3s64/slb.c */
> > +	isync();
> > +
>
> The comment reads:
>
> /*
> * Synchronize slbmte preloads with possible subsequent user memory
> * address accesses by the kernel (user mode won't happen until
> * rfid, which is safe).
> */
> isync();
>
> I have to say having read the description of isync I'm not 100% sure why
> that's enough (don't we also need stores to complete?) but I'm happy to
> take commit 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
> on trust here!
>
> I think it does make sense for you to have that barrier here: you are
> potentially about to start poking at the memory mapped through that SLB
> entry so you should make sure you're fully synchronised.
>
> > +	return err;
> >  }
> >  
>
> > +	init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> > +	use_temporary_mm(&patch_mapping->temp_mm);
> >  
> > -	pmdp = pmd_offset(pudp, addr);
> > -	if (unlikely(!pmdp))
> > -		return -EINVAL;
> > +	/*
> > +	 * On Book3s64 with the Hash MMU we have to manually insert the SLB
> > +	 * entry and HPTE to prevent taking faults on the patching_addr later.
> > +	 */
> > +	return(hash_prefault_mapping(pgprot));
>
> hmm, `return hash_prefault_mapping(pgprot);` or
> `return (hash_prefault_mapping(pgprot));` maybe?

Yeah, I noticed I left the extra parentheses here after the RESEND. I
think this is left-over when I had another wrapper here... anyway, I'll
clean it up for v5.

>
> Kind regards,
> Daniel


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-07-01  4:15         ` Nicholas Piggin
@ 2021-07-01  5:28           ` Christopher M. Riedl
  0 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-07-01  5:28 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Axtens, linuxppc-dev
  Cc: tglx, x86, keescook, linux-hardening

On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> >> "Christopher M. Riedl" <cmr@linux.ibm.com> writes:
> >>
> >> > Switching to a different mm with Hash translation causes SLB entries to
> >> > be preloaded from the current thread_info. This reduces SLB faults, for
> >> > example when threads share a common mm but operate on different address
> >> > ranges.
> >> >
> >> > Preloading entries from the thread_info struct may not always be
> >> > appropriate - such as when switching to a temporary mm. Introduce a new
> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
> >> > SLB preload code into a separate function since switch_slb() is already
> >> > quite long. The default behavior (preloading SLB entries from the
> >> > current thread_info struct) remains unchanged.
> >> >
> >> > Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
> >> >
> >> > ---
> >> >
> >> > v4:  * New to series.
> >> > ---
> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >> >  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >> >
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> > @@ -130,6 +130,9 @@ typedef struct {
> >> >  	u32 pkey_allocation_map;
> >> >  	s16 execute_only_pkey; /* key holding execute-only protection */
> >> >  #endif
> >> > +
> >> > +	/* Do not preload SLB entries from thread_info during switch_slb() */
> >> > +	bool skip_slb_preload;
> >> >  } mm_context_t;
> >> >  
> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> >> > --- a/arch/powerpc/include/asm/mmu_context.h
> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
> >> >  	return 0;
> >> >  }
> >> >  
> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> > +
> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> >> > +{
> >> > +	mm->context.skip_slb_preload = true;
> >> > +}
> >> > +
> >> > +#else
> >> > +
> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> >> > +
> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> >> > +
> >> >  #include <asm-generic/mmu_context.h>
> >> >  
> >> >  #endif /* __KERNEL__ */
> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
> >> > index c10fc8a72fb37..3479910264c59 100644
> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> >> >  	atomic_set(&mm->context.active_cpus, 0);
> >> >  	atomic_set(&mm->context.copros, 0);
> >> >  
> >> > +	mm->context.skip_slb_preload = false;
> >> > +
> >> >  	return 0;
> >> >  }
> >> >  
> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
> >> > index c91bd85eb90e3..da0836cb855af 100644
> >> > --- a/arch/powerpc/mm/book3s64/slb.c
> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
> >> >  	asm volatile("slbie %0" : : "r" (slbie_data));
> >> >  }
> >> >  
> >> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
> >> Should this be explicitly inline or even __always_inline? I'm thinking
> >> switch_slb is probably a fairly hot path on hash?
> > 
> > Yes absolutely. I'll make this change in v5.
> > 
> >>
> >> > +{
> >> > +	struct thread_info *ti = task_thread_info(tsk);
> >> > +	unsigned char i;
> >> > +
> >> > +	/*
> >> > +	 * We gradually age out SLBs after a number of context switches to
> >> > +	 * reduce reload overhead of unused entries (like we do with FP/VEC
> >> > +	 * reload). Each time we wrap 256 switches, take an entry out of the
> >> > +	 * SLB preload cache.
> >> > +	 */
> >> > +	tsk->thread.load_slb++;
> >> > +	if (!tsk->thread.load_slb) {
> >> > +		unsigned long pc = KSTK_EIP(tsk);
> >> > +
> >> > +		preload_age(ti);
> >> > +		preload_add(ti, pc);
> >> > +	}
> >> > +
> >> > +	for (i = 0; i < ti->slb_preload_nr; i++) {
> >> > +		unsigned char idx;
> >> > +		unsigned long ea;
> >> > +
> >> > +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> >> > +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> >> > +
> >> > +		slb_allocate_user(mm, ea);
> >> > +	}
> >> > +}
> >> > +
> >> >  /* Flush all user entries from the segment table of the current processor. */
> >> >  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >> >  {
> >> > -	struct thread_info *ti = task_thread_info(tsk);
> >> >  	unsigned char i;
> >> >  
> >> >  	/*
> >> > @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >> >  
> >> >  	copy_mm_to_paca(mm);
> >> >  
> >> > -	/*
> >> > -	 * We gradually age out SLBs after a number of context switches to
> >> > -	 * reduce reload overhead of unused entries (like we do with FP/VEC
> >> > -	 * reload). Each time we wrap 256 switches, take an entry out of the
> >> > -	 * SLB preload cache.
> >> > -	 */
> >> > -	tsk->thread.load_slb++;
> >> > -	if (!tsk->thread.load_slb) {
> >> > -		unsigned long pc = KSTK_EIP(tsk);
> >> > -
> >> > -		preload_age(ti);
> >> > -		preload_add(ti, pc);
> >> > -	}
> >> > -
> >> > -	for (i = 0; i < ti->slb_preload_nr; i++) {
> >> > -		unsigned char idx;
> >> > -		unsigned long ea;
> >> > -
> >> > -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> >> > -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> >> > -
> >> > -		slb_allocate_user(mm, ea);
> >> > -	}
> >> > +	if (!mm->context.skip_slb_preload)
> >> > +		preload_slb_entries(tsk, mm);
> >>
> >> Should this be wrapped in likely()?
> > 
> > Seems like a good idea - yes.
> > 
> >>
> >> >  
> >> >  	/*
> >> >  	 * Synchronize slbmte preloads with possible subsequent user memory
> >>
> >> Right below this comment is the isync. It seems to be specifically
> >> concerned with synchronising preloaded slbs. Do you need it if you are
> >> skipping SLB preloads?
> >>
> >> It's probably not a big deal to have an extra isync in the fairly rare
> >> path when we're skipping preloads, but I thought I'd check.
> > 
> > I don't _think_ we need the `isync` if we are skipping the SLB preloads,
> > but then again it was always in the code-path before. If someone can
> > make a compelling argument to drop it when not preloading SLBs I will,
> > otherwise (considering some of the other non-obvious things I stepped
> > into with the Hash code) I will keep it here for now.
>
> The ISA says slbia wants an isync afterward, so we probably should keep
> it. The comment is a bit misleading in that case.
>
> Why isn't preloading appropriate for a temporary mm?

The preloaded entries come from the thread_info struct which isn't
necessarily related to the temporary mm at all. I saw SLB multihits
while testing this series with my LKDTM test where the "patching
address" (userspace address for the temporary mapping w/
write-permissions) ends up in a thread's preload list and then we
explicitly insert it again in map_patch() when trying to patch. At that
point the SLB multihit triggers.
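
For reference, the temporary patching mm simply opts out via the new
helper - a rough sketch of the idea (the init_mm copy helper named here is
an assumption on my part; the real setup is in patch 08/11):

	/* during poking_init(): create the temporary mm and skip preloads */
	patching_mm = copy_init_mm();	/* assumed helper */
	skip_slb_preload_mm(patching_mm);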

>
> Thanks,
> Nick


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-07-01  5:28           ` Christopher M. Riedl
@ 2021-07-01  6:04             ` Nicholas Piggin
  0 siblings, 0 replies; 45+ messages in thread
From: Nicholas Piggin @ 2021-07-01  6:04 UTC (permalink / raw)
  To: Christopher M. Riedl, Daniel Axtens, linuxppc-dev
  Cc: keescook, linux-hardening, tglx, x86

Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
>> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>> >> "Christopher M. Riedl" <cmr@linux.ibm.com> writes:
>> >>
>> >> > Switching to a different mm with Hash translation causes SLB entries to
>> >> > be preloaded from the current thread_info. This reduces SLB faults, for
>> >> > example when threads share a common mm but operate on different address
>> >> > ranges.
>> >> >
>> >> > Preloading entries from the thread_info struct may not always be
>> >> > appropriate - such as when switching to a temporary mm. Introduce a new
>> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
>> >> > SLB preload code into a separate function since switch_slb() is already
>> >> > quite long. The default behavior (preloading SLB entries from the
>> >> > current thread_info struct) remains unchanged.
>> >> >
>> >> > Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
>> >> >
>> >> > ---
>> >> >
>> >> > v4:  * New to series.
>> >> > ---
>> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
>> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>> >> >  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
>> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
>> >> >
>> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> > @@ -130,6 +130,9 @@ typedef struct {
>> >> >  	u32 pkey_allocation_map;
>> >> >  	s16 execute_only_pkey; /* key holding execute-only protection */
>> >> >  #endif
>> >> > +
>> >> > +	/* Do not preload SLB entries from thread_info during switch_slb() */
>> >> > +	bool skip_slb_preload;
>> >> >  } mm_context_t;
>> >> >  
>> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
>> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>> >> > --- a/arch/powerpc/include/asm/mmu_context.h
>> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
>> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
>> >> >  	return 0;
>> >> >  }
>> >> >  
>> >> > +#ifdef CONFIG_PPC_BOOK3S_64
>> >> > +
>> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>> >> > +{
>> >> > +	mm->context.skip_slb_preload = true;
>> >> > +}
>> >> > +
>> >> > +#else
>> >> > +
>> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>> >> > +
>> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>> >> > +
>> >> >  #include <asm-generic/mmu_context.h>
>> >> >  
>> >> >  #endif /* __KERNEL__ */
>> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> > index c10fc8a72fb37..3479910264c59 100644
>> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>> >> >  	atomic_set(&mm->context.active_cpus, 0);
>> >> >  	atomic_set(&mm->context.copros, 0);
>> >> >  
>> >> > +	mm->context.skip_slb_preload = false;
>> >> > +
>> >> >  	return 0;
>> >> >  }
>> >> >  
>> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
>> >> > index c91bd85eb90e3..da0836cb855af 100644
>> >> > --- a/arch/powerpc/mm/book3s64/slb.c
>> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
>> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
>> >> >  	asm volatile("slbie %0" : : "r" (slbie_data));
>> >> >  }
>> >> >  
>> >> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
>> >> Should this be explicitly inline or even __always_inline? I'm thinking
>> >> switch_slb is probably a fairly hot path on hash?
>> > 
>> > Yes absolutely. I'll make this change in v5.
>> > 
>> >>
>> >> > +{
>> >> > +	struct thread_info *ti = task_thread_info(tsk);
>> >> > +	unsigned char i;
>> >> > +
>> >> > +	/*
>> >> > +	 * We gradually age out SLBs after a number of context switches to
>> >> > +	 * reduce reload overhead of unused entries (like we do with FP/VEC
>> >> > +	 * reload). Each time we wrap 256 switches, take an entry out of the
>> >> > +	 * SLB preload cache.
>> >> > +	 */
>> >> > +	tsk->thread.load_slb++;
>> >> > +	if (!tsk->thread.load_slb) {
>> >> > +		unsigned long pc = KSTK_EIP(tsk);
>> >> > +
>> >> > +		preload_age(ti);
>> >> > +		preload_add(ti, pc);
>> >> > +	}
>> >> > +
>> >> > +	for (i = 0; i < ti->slb_preload_nr; i++) {
>> >> > +		unsigned char idx;
>> >> > +		unsigned long ea;
>> >> > +
>> >> > +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
>> >> > +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
>> >> > +
>> >> > +		slb_allocate_user(mm, ea);
>> >> > +	}
>> >> > +}
>> >> > +
>> >> >  /* Flush all user entries from the segment table of the current processor. */
>> >> >  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>> >> >  {
>> >> > -	struct thread_info *ti = task_thread_info(tsk);
>> >> >  	unsigned char i;
>> >> >  
>> >> >  	/*
>> >> > @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>> >> >  
>> >> >  	copy_mm_to_paca(mm);
>> >> >  
>> >> > -	/*
>> >> > -	 * We gradually age out SLBs after a number of context switches to
>> >> > -	 * reduce reload overhead of unused entries (like we do with FP/VEC
>> >> > -	 * reload). Each time we wrap 256 switches, take an entry out of the
>> >> > -	 * SLB preload cache.
>> >> > -	 */
>> >> > -	tsk->thread.load_slb++;
>> >> > -	if (!tsk->thread.load_slb) {
>> >> > -		unsigned long pc = KSTK_EIP(tsk);
>> >> > -
>> >> > -		preload_age(ti);
>> >> > -		preload_add(ti, pc);
>> >> > -	}
>> >> > -
>> >> > -	for (i = 0; i < ti->slb_preload_nr; i++) {
>> >> > -		unsigned char idx;
>> >> > -		unsigned long ea;
>> >> > -
>> >> > -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
>> >> > -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
>> >> > -
>> >> > -		slb_allocate_user(mm, ea);
>> >> > -	}
>> >> > +	if (!mm->context.skip_slb_preload)
>> >> > +		preload_slb_entries(tsk, mm);
>> >>
>> >> Should this be wrapped in likely()?
>> > 
>> > Seems like a good idea - yes.
>> > 
>> >>
>> >> >  
>> >> >  	/*
>> >> >  	 * Synchronize slbmte preloads with possible subsequent user memory
>> >>
>> >> Right below this comment is the isync. It seems to be specifically
>> >> concerned with synchronising preloaded slbs. Do you need it if you are
>> >> skipping SLB preloads?
>> >>
>> >> It's probably not a big deal to have an extra isync in the fairly rare
>> >> path when we're skipping preloads, but I thought I'd check.
>> > 
>> > I don't _think_ we need the `isync` if we are skipping the SLB preloads,
>> > but then again it was always in the code-path before. If someone can
>> > make a compelling argument to drop it when not preloading SLBs I will,
>> > otherwise (considering some of the other non-obvious things I stepped
>> > into with the Hash code) I will keep it here for now.
>>
>> The ISA says slbia wants an isync afterward, so we probably should keep
>> it. The comment is a bit misleading in that case.
>>
>> Why isn't preloading appropriate for a temporary mm?
> 
> The preloaded entries come from the thread_info struct which isn't
> necessarily related to the temporary mm at all. I saw SLB multihits
> while testing this series with my LKDTM test where the "patching
> address" (userspace address for the temporary mapping w/
> write-permissions) ends up in a thread's preload list and then we
> explicitly insert it again in map_patch() when trying to patch. At that
> point the SLB multihit triggers.

Hmm, so what if we use a mm, take some SLB faults then unuse it and
use a different one? I wonder if kthread_use_mm has existing problems
with this incorrect SLB preloading. Quite possibly. We should clear
the preload whenever mm changes I think. That should cover this as
well.
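
Something like the below maybe (rough sketch only; slb_preload_mm would be
a new thread_info field remembering which mm the cache was built for):

	/* hypothetical: drop stale preloads when the mm changes */
	if (ti->slb_preload_mm != mm) {
		ti->slb_preload_nr = 0;
		ti->slb_preload_tail = 0;
		ti->slb_preload_mm = mm;
	}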

Thanks,
Nick

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching
  2021-05-06  4:34 ` [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching Christopher M. Riedl
@ 2021-07-01  6:12     ` Nicholas Piggin
  2021-07-01  6:12     ` Nicholas Piggin
  1 sibling, 0 replies; 45+ messages in thread
From: Nicholas Piggin @ 2021-07-01  6:12 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev; +Cc: keescook, linux-hardening, tglx, x86

Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> When code patching a STRICT_KERNEL_RWX kernel the page containing the
> address to be patched is temporarily mapped as writeable. Currently, a
> per-cpu vmalloc patch area is used for this purpose. While the patch
> area is per-cpu, the temporary page mapping is inserted into the kernel
> page tables for the duration of patching. The mapping is exposed to CPUs
> other than the patching CPU - this is undesirable from a hardening
> perspective. Use a temporary mm instead which keeps the mapping local to
> the CPU doing the patching.
> 
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> space. The patching address is randomized between PAGE_SIZE and
> DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> the Book3s64 Hash MMU operates - by default the space above
> DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> all platforms/MMUs is randomized inside this range.  The number of
> possible random addresses is dependent on PAGE_SIZE and limited by
> DEFAULT_MAP_WINDOW.
> 
> Bits of entropy with 64K page size on BOOK3S_64:
> 
>         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> 
>         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
>         bits of entropy = log2(128TB / 64K)
>         bits of entropy = 31
> 
> Randomization occurs only once during initialization at boot.
> 
> Introduce two new functions, map_patch() and unmap_patch(), to
> respectively create and remove the temporary mapping with write
> permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> the page for patching with PAGE_SHARED since the kernel cannot access
> userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> 
> Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> for the patching_addr when using the Hash MMU on Book3s64 to avoid
> taking an SLB and Hash fault during patching.

What prevents the SLBE or HPTE from being removed before the last
access?


> +#ifdef CONFIG_PPC_BOOK3S_64
> +
> +static inline int hash_prefault_mapping(pgprot_t pgprot)
>  {
> -	struct vm_struct *area;
> +	int err;
>  
> -	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> -	if (!area) {
> -		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> -			cpu);
> -		return -1;
> -	}
> -	this_cpu_write(text_poke_area, area);
> +	if (radix_enabled())
> +		return 0;
>  
> -	return 0;
> -}
> +	err = slb_allocate_user(patching_mm, patching_addr);
> +	if (err)
> +		pr_warn("map patch: failed to allocate slb entry\n");
>  
> -static int text_area_cpu_down(unsigned int cpu)
> -{
> -	free_vm_area(this_cpu_read(text_poke_area));
> -	return 0;
> +	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> +			   HPTE_USE_KERNEL_KEY);
> +	if (err)
> +		pr_warn("map patch: failed to insert hashed page\n");
> +
> +	/* See comment in switch_slb() in mm/book3s64/slb.c */
> +	isync();

I'm not sure if this is enough. Could we context switch here? You've
got the PTL, so no with a normal kernel, but maybe yes with an RT kernel.
How about taking a machine check that clears the SLB? Could the HPTE
get removed by something else here?

You want to prevent faults because you might be patching a fault 
handler?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-07-01  6:04             ` Nicholas Piggin
@ 2021-07-01  6:53               ` Christopher M. Riedl
  -1 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-07-01  6:53 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Axtens, linuxppc-dev
  Cc: tglx, x86, keescook, linux-hardening

On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> >> >> "Christopher M. Riedl" <cmr@linux.ibm.com> writes:
> >> >>
> >> >> > Switching to a different mm with Hash translation causes SLB entries to
> >> >> > be preloaded from the current thread_info. This reduces SLB faults, for
> >> >> > example when threads share a common mm but operate on different address
> >> >> > ranges.
> >> >> >
> >> >> > Preloading entries from the thread_info struct may not always be
> >> >> > appropriate - such as when switching to a temporary mm. Introduce a new
> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
> >> >> > SLB preload code into a separate function since switch_slb() is already
> >> >> > quite long. The default behavior (preloading SLB entries from the
> >> >> > current thread_info struct) remains unchanged.
> >> >> >
> >> >> > Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
> >> >> >
> >> >> > ---
> >> >> >
> >> >> > v4:  * New to series.
> >> >> > ---
> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >> >> >  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >> >> >
> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > @@ -130,6 +130,9 @@ typedef struct {
> >> >> >  	u32 pkey_allocation_map;
> >> >> >  	s16 execute_only_pkey; /* key holding execute-only protection */
> >> >> >  #endif
> >> >> > +
> >> >> > +	/* Do not preload SLB entries from thread_info during switch_slb() */
> >> >> > +	bool skip_slb_preload;
> >> >> >  } mm_context_t;
> >> >> >  
> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
> >> >> >  	return 0;
> >> >> >  }
> >> >> >  
> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> >> > +
> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> >> >> > +{
> >> >> > +	mm->context.skip_slb_preload = true;
> >> >> > +}
> >> >> > +
> >> >> > +#else
> >> >> > +
> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> >> >> > +
> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> >> >> > +
> >> >> >  #include <asm-generic/mmu_context.h>
> >> >> >  
> >> >> >  #endif /* __KERNEL__ */
> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > index c10fc8a72fb37..3479910264c59 100644
> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >  	atomic_set(&mm->context.active_cpus, 0);
> >> >> >  	atomic_set(&mm->context.copros, 0);
> >> >> >  
> >> >> > +	mm->context.skip_slb_preload = false;
> >> >> > +
> >> >> >  	return 0;
> >> >> >  }
> >> >> >  
> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
> >> >> > index c91bd85eb90e3..da0836cb855af 100644
> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
> >> >> >  	asm volatile("slbie %0" : : "r" (slbie_data));
> >> >> >  }
> >> >> >  
> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
> >> >> switch_slb is probably a fairly hot path on hash?
> >> > 
> >> > Yes absolutely. I'll make this change in v5.
> >> > 
> >> >>
> >> >> > +{
> >> >> > +	struct thread_info *ti = task_thread_info(tsk);
> >> >> > +	unsigned char i;
> >> >> > +
> >> >> > +	/*
> >> >> > +	 * We gradually age out SLBs after a number of context switches to
> >> >> > +	 * reduce reload overhead of unused entries (like we do with FP/VEC
> >> >> > +	 * reload). Each time we wrap 256 switches, take an entry out of the
> >> >> > +	 * SLB preload cache.
> >> >> > +	 */
> >> >> > +	tsk->thread.load_slb++;
> >> >> > +	if (!tsk->thread.load_slb) {
> >> >> > +		unsigned long pc = KSTK_EIP(tsk);
> >> >> > +
> >> >> > +		preload_age(ti);
> >> >> > +		preload_add(ti, pc);
> >> >> > +	}
> >> >> > +
> >> >> > +	for (i = 0; i < ti->slb_preload_nr; i++) {
> >> >> > +		unsigned char idx;
> >> >> > +		unsigned long ea;
> >> >> > +
> >> >> > +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> >> >> > +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> >> >> > +
> >> >> > +		slb_allocate_user(mm, ea);
> >> >> > +	}
> >> >> > +}
> >> >> > +
> >> >> >  /* Flush all user entries from the segment table of the current processor. */
> >> >> >  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >  {
> >> >> > -	struct thread_info *ti = task_thread_info(tsk);
> >> >> >  	unsigned char i;
> >> >> >  
> >> >> >  	/*
> >> >> > @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >  
> >> >> >  	copy_mm_to_paca(mm);
> >> >> >  
> >> >> > -	/*
> >> >> > -	 * We gradually age out SLBs after a number of context switches to
> >> >> > -	 * reduce reload overhead of unused entries (like we do with FP/VEC
> >> >> > -	 * reload). Each time we wrap 256 switches, take an entry out of the
> >> >> > -	 * SLB preload cache.
> >> >> > -	 */
> >> >> > -	tsk->thread.load_slb++;
> >> >> > -	if (!tsk->thread.load_slb) {
> >> >> > -		unsigned long pc = KSTK_EIP(tsk);
> >> >> > -
> >> >> > -		preload_age(ti);
> >> >> > -		preload_add(ti, pc);
> >> >> > -	}
> >> >> > -
> >> >> > -	for (i = 0; i < ti->slb_preload_nr; i++) {
> >> >> > -		unsigned char idx;
> >> >> > -		unsigned long ea;
> >> >> > -
> >> >> > -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> >> >> > -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> >> >> > -
> >> >> > -		slb_allocate_user(mm, ea);
> >> >> > -	}
> >> >> > +	if (!mm->context.skip_slb_preload)
> >> >> > +		preload_slb_entries(tsk, mm);
> >> >>
> >> >> Should this be wrapped in likely()?
> >> > 
> >> > Seems like a good idea - yes.
> >> > 
> >> >>
> >> >> >  
> >> >> >  	/*
> >> >> >  	 * Synchronize slbmte preloads with possible subsequent user memory
> >> >>
> >> >> Right below this comment is the isync. It seems to be specifically
> >> >> concerned with synchronising preloaded slbs. Do you need it if you are
> >> >> skipping SLB preloads?
> >> >>
> >> >> It's probably not a big deal to have an extra isync in the fairly rare
> >> >> path when we're skipping preloads, but I thought I'd check.
> >> > 
> >> > I don't _think_ we need the `isync` if we are skipping the SLB preloads,
> >> > but then again it was always in the code-path before. If someone can
> >> > make a compelling argument to drop it when not preloading SLBs I will,
> >> > otherwise (considering some of the other non-obvious things I stepped
> >> > into with the Hash code) I will keep it here for now.
> >>
> >> The ISA says slbia wants an isync afterward, so we probably should keep
> >> it. The comment is a bit misleading in that case.
> >>
> >> Why isn't preloading appropriate for a temporary mm?
> > 
> > The preloaded entries come from the thread_info struct which isn't
> > necessarily related to the temporary mm at all. I saw SLB multihits
> > while testing this series with my LKDTM test where the "patching
> > address" (userspace address for the temporary mapping w/
> > write-permissions) ends up in a thread's preload list and then we
> > explicitly insert it again in map_patch() when trying to patch. At that
> > point the SLB multihit triggers.
>
> Hmm, so what if we use a mm, take some SLB faults then unuse it and
> use a different one? I wonder if kthread_use_mm has existing problems
> with this incorrect SLB preloading. Quite possibly. We should clear
> the preload whenever mm changes I think. That should cover this as
> well.

I actually did this initially but thought it was a bit too intrusive to
include as part of this series and that it hurt performance. I agree that
preloading the SLB from the thread may be a problem in general when
switching in/out an mm.

kthread_use_mm may not be affected unless we explicitly insert some SLB
entries which could collide with an existing preload (which I don't
think we do anywhere until this series).
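
For reference, the way this series limits the problem is narrower: only the
temporary patching mm opts out, via the skip_slb_preload_mm() helper added
in the quoted hunk. A usage sketch (the init code is paraphrased from the
patch 08/11 commit message; create_patching_mm() is only a placeholder name
for however that patch actually copies init_mm):

static struct mm_struct *patching_mm;

static void __init poking_init_sketch(void)
{
        /*
         * create_patching_mm() is a placeholder for however patch 08/11
         * builds the temporary mm from init_mm during poking_init().
         */
        patching_mm = create_patching_mm();
        /* never preload thread ESIDs while this mm is switched in */
        skip_slb_preload_mm(patching_mm);
}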

>
> Thanks,
> Nick


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching
  2021-07-01  6:12     ` Nicholas Piggin
@ 2021-07-01  7:02       ` Christopher M. Riedl
  -1 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-07-01  7:02 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: tglx, x86, keescook, linux-hardening

On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped as writeable. Currently, a
> > per-cpu vmalloc patch area is used for this purpose. While the patch
> > area is per-cpu, the temporary page mapping is inserted into the kernel
> > page tables for the duration of patching. The mapping is exposed to CPUs
> > other than the patching CPU - this is undesirable from a hardening
> > perspective. Use a temporary mm instead which keeps the mapping local to
> > the CPU doing the patching.
> > 
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> > the Book3s64 Hash MMU operates - by default the space above
> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> > all platforms/MMUs is randomized inside this range.  The number of
> > possible random addresses is dependent on PAGE_SIZE and limited by
> > DEFAULT_MAP_WINDOW.
> > 
> > Bits of entropy with 64K page size on BOOK3S_64:
> > 
> >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> > 
> >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> >         bits of entropy = log2(128TB / 64K)
> >         bits of entropy = 31
> > 
> > Randomization occurs only once during initialization at boot.
> > 
> > Introduce two new functions, map_patch() and unmap_patch(), to
> > respectively create and remove the temporary mapping with write
> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> > the page for patching with PAGE_SHARED since the kernel cannot access
> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> > 
> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
> > taking an SLB and Hash fault during patching.
>
> What prevents the SLBE or HPTE from being removed before the last
> access?

This code runs with local IRQs disabled - we also don't access anything
else in userspace so I'm not sure what else could cause the entries to
be removed TBH.

>
>
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +
> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
> >  {
> > -	struct vm_struct *area;
> > +	int err;
> >  
> > -	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > -	if (!area) {
> > -		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > -			cpu);
> > -		return -1;
> > -	}
> > -	this_cpu_write(text_poke_area, area);
> > +	if (radix_enabled())
> > +		return 0;
> >  
> > -	return 0;
> > -}
> > +	err = slb_allocate_user(patching_mm, patching_addr);
> > +	if (err)
> > +		pr_warn("map patch: failed to allocate slb entry\n");
> >  
> > -static int text_area_cpu_down(unsigned int cpu)
> > -{
> > -	free_vm_area(this_cpu_read(text_poke_area));
> > -	return 0;
> > +	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> > +			   HPTE_USE_KERNEL_KEY);
> > +	if (err)
> > +		pr_warn("map patch: failed to insert hashed page\n");
> > +
> > +	/* See comment in switch_slb() in mm/book3s64/slb.c */
> > +	isync();
>
> I'm not sure if this is enough. Could we context switch here? You've
> got the PTL, so no with a normal kernel, but maybe yes with an RT kernel.
> How about taking a machine check that clears the SLB? Could the HPTE
> get removed by something else here?

All of this happens after a local_irq_save() which should at least
prevent context switches IIUC. I am not sure what else could cause the
HPTE to get removed here.
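
To make the window concrete, the shape being discussed is roughly this
(sketch only: map_patch()/unmap_patch() are the names from the commit
message, but their signatures, the mm switch they hide, and the
__put_user() store are guesses for illustration):

static int do_patch_sketch(u32 *addr, u32 instr)
{
        u32 __user *dst;
        unsigned long flags;
        int err;

        local_irq_save(flags);          /* no context switch on this CPU */
        err = map_patch(addr);          /* write-mapping at patching_addr,
                                           SLBE/HPTE prefault on hash */
        if (!err) {
                dst = (u32 __user *)(patching_addr + offset_in_page(addr));
                err = __put_user(instr, dst);   /* the only userspace access */
                unmap_patch(addr);      /* tear the temporary mapping down */
        }
        local_irq_restore(flags);
        return err;
}

So between the prefault in map_patch() and the store, the only things that
can intervene are interrupt-level events like the machine check mentioned
above.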

>
> You want to prevent faults because you might be patching a fault
> handler?

In a more general sense: I don't think we want to take page faults every
time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
fault handler codepath also checks `current->mm` in some places which
won't match the temporary mm. Also `current->mm` can be NULL which
caused problems in my earlier revisions of this series.

>
> Thanks,
> Nick


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-07-01  6:53               ` Christopher M. Riedl
@ 2021-07-01  7:37                 ` Nicholas Piggin
  -1 siblings, 0 replies; 45+ messages in thread
From: Nicholas Piggin @ 2021-07-01  7:37 UTC (permalink / raw)
  To: Christopher M. Riedl, Daniel Axtens, linuxppc-dev
  Cc: keescook, linux-hardening, tglx, x86

Excerpts from Christopher M. Riedl's message of July 1, 2021 4:53 pm:
> On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
>> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
>> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
>> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>> >> >> "Christopher M. Riedl" <cmr@linux.ibm.com> writes:
>> >> >>
>> >> >> > Switching to a different mm with Hash translation causes SLB entries to
>> >> >> > be preloaded from the current thread_info. This reduces SLB faults, for
>> >> >> > example when threads share a common mm but operate on different address
>> >> >> > ranges.
>> >> >> >
>> >> >> > Preloading entries from the thread_info struct may not always be
>> >> >> > appropriate - such as when switching to a temporary mm. Introduce a new
>> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
>> >> >> > SLB preload code into a separate function since switch_slb() is already
>> >> >> > quite long. The default behavior (preloading SLB entries from the
>> >> >> > current thread_info struct) remains unchanged.
>> >> >> >
>> >> >> > Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
>> >> >> >
>> >> >> > ---
>> >> >> >
>> >> >> > v4:  * New to series.
>> >> >> > ---
>> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
>> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>> >> >> >  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
>> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
>> >> >> >
>> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> >> > @@ -130,6 +130,9 @@ typedef struct {
>> >> >> >  	u32 pkey_allocation_map;
>> >> >> >  	s16 execute_only_pkey; /* key holding execute-only protection */
>> >> >> >  #endif
>> >> >> > +
>> >> >> > +	/* Do not preload SLB entries from thread_info during switch_slb() */
>> >> >> > +	bool skip_slb_preload;
>> >> >> >  } mm_context_t;
>> >> >> >  
>> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
>> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
>> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
>> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
>> >> >> >  	return 0;
>> >> >> >  }
>> >> >> >  
>> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
>> >> >> > +
>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>> >> >> > +{
>> >> >> > +	mm->context.skip_slb_preload = true;
>> >> >> > +}
>> >> >> > +
>> >> >> > +#else
>> >> >> > +
>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>> >> >> > +
>> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>> >> >> > +
>> >> >> >  #include <asm-generic/mmu_context.h>
>> >> >> >  
>> >> >> >  #endif /* __KERNEL__ */
>> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> >> > index c10fc8a72fb37..3479910264c59 100644
>> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>> >> >> >  	atomic_set(&mm->context.active_cpus, 0);
>> >> >> >  	atomic_set(&mm->context.copros, 0);
>> >> >> >  
>> >> >> > +	mm->context.skip_slb_preload = false;
>> >> >> > +
>> >> >> >  	return 0;
>> >> >> >  }
>> >> >> >  
>> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
>> >> >> > index c91bd85eb90e3..da0836cb855af 100644
>> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
>> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
>> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
>> >> >> >  	asm volatile("slbie %0" : : "r" (slbie_data));
>> >> >> >  }
>> >> >> >  
>> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
>> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
>> >> >> switch_slb is probably a fairly hot path on hash?
>> >> > 
>> >> > Yes absolutely. I'll make this change in v5.
>> >> > 
>> >> >>
>> >> >> > +{
>> >> >> > +	struct thread_info *ti = task_thread_info(tsk);
>> >> >> > +	unsigned char i;
>> >> >> > +
>> >> >> > +	/*
>> >> >> > +	 * We gradually age out SLBs after a number of context switches to
>> >> >> > +	 * reduce reload overhead of unused entries (like we do with FP/VEC
>> >> >> > +	 * reload). Each time we wrap 256 switches, take an entry out of the
>> >> >> > +	 * SLB preload cache.
>> >> >> > +	 */
>> >> >> > +	tsk->thread.load_slb++;
>> >> >> > +	if (!tsk->thread.load_slb) {
>> >> >> > +		unsigned long pc = KSTK_EIP(tsk);
>> >> >> > +
>> >> >> > +		preload_age(ti);
>> >> >> > +		preload_add(ti, pc);
>> >> >> > +	}
>> >> >> > +
>> >> >> > +	for (i = 0; i < ti->slb_preload_nr; i++) {
>> >> >> > +		unsigned char idx;
>> >> >> > +		unsigned long ea;
>> >> >> > +
>> >> >> > +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
>> >> >> > +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
>> >> >> > +
>> >> >> > +		slb_allocate_user(mm, ea);
>> >> >> > +	}
>> >> >> > +}
>> >> >> > +
>> >> >> >  /* Flush all user entries from the segment table of the current processor. */
>> >> >> >  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>> >> >> >  {
>> >> >> > -	struct thread_info *ti = task_thread_info(tsk);
>> >> >> >  	unsigned char i;
>> >> >> >  
>> >> >> >  	/*
>> >> >> > @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>> >> >> >  
>> >> >> >  	copy_mm_to_paca(mm);
>> >> >> >  
>> >> >> > -	/*
>> >> >> > -	 * We gradually age out SLBs after a number of context switches to
>> >> >> > -	 * reduce reload overhead of unused entries (like we do with FP/VEC
>> >> >> > -	 * reload). Each time we wrap 256 switches, take an entry out of the
>> >> >> > -	 * SLB preload cache.
>> >> >> > -	 */
>> >> >> > -	tsk->thread.load_slb++;
>> >> >> > -	if (!tsk->thread.load_slb) {
>> >> >> > -		unsigned long pc = KSTK_EIP(tsk);
>> >> >> > -
>> >> >> > -		preload_age(ti);
>> >> >> > -		preload_add(ti, pc);
>> >> >> > -	}
>> >> >> > -
>> >> >> > -	for (i = 0; i < ti->slb_preload_nr; i++) {
>> >> >> > -		unsigned char idx;
>> >> >> > -		unsigned long ea;
>> >> >> > -
>> >> >> > -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
>> >> >> > -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
>> >> >> > -
>> >> >> > -		slb_allocate_user(mm, ea);
>> >> >> > -	}
>> >> >> > +	if (!mm->context.skip_slb_preload)
>> >> >> > +		preload_slb_entries(tsk, mm);
>> >> >>
>> >> >> Should this be wrapped in likely()?
>> >> > 
>> >> > Seems like a good idea - yes.
>> >> > 
>> >> >>
>> >> >> >  
>> >> >> >  	/*
>> >> >> >  	 * Synchronize slbmte preloads with possible subsequent user memory
>> >> >>
>> >> >> Right below this comment is the isync. It seems to be specifically
>> >> >> concerned with synchronising preloaded slbs. Do you need it if you are
>> >> >> skipping SLB preloads?
>> >> >>
>> >> >> It's probably not a big deal to have an extra isync in the fairly rare
>> >> >> path when we're skipping preloads, but I thought I'd check.
>> >> > 
>> >> > I don't _think_ we need the `isync` if we are skipping the SLB preloads,
>> >> > but then again it was always in the code-path before. If someone can
>> >> > make a compelling argument to drop it when not preloading SLBs I will,
>> >> > otherwise (considering some of the other non-obvious things I stepped
>> >> > into with the Hash code) I will keep it here for now.
>> >>
>> >> The ISA says slbia wants an isync afterward, so we probably should keep
>> >> it. The comment is a bit misleading in that case.
>> >>
>> >> Why isn't preloading appropriate for a temporary mm?
>> > 
>> > The preloaded entries come from the thread_info struct which isn't
>> > necessarily related to the temporary mm at all. I saw SLB multihits
>> > while testing this series with my LKDTM test where the "patching
>> > address" (userspace address for the temporary mapping w/
>> > write-permissions) ends up in a thread's preload list and then we
>> > explicitly insert it again in map_patch() when trying to patch. At that
>> > point the SLB multihit triggers.
>>
>> Hmm, so what if we use a mm, take some SLB faults then unuse it and
>> use a different one? I wonder if kthread_use_mm has existing problems
>> with this incorrect SLB preloading. Quite possibly. We should clear
>> the preload whenever mm changes I think. That should cover this as
>> well.
> 
> I actually did this initially but thought it was a bit too intrusive to
> include as part of this series and that it hurt performance. I agree that
> preloading the SLB from the thread may be a problem in general when
> switching in/out an mm.
> 
> kthread_use_mm may not be affected unless we explicitly insert some SLB
> entries which could collide with an existing preload (which I don't
> think we do anywhere until this series).

kthread_use_mm(mm1);
*ea = blah; /* slb preload[n++][ea] = va */
kthread_unuse_mm(mm1);

kthread_use_mm(mm2);
  switch_slb();
schedule();
  /* preload ea=va? */
x = *ea;
kthread_unuse_mm(mm2);

? I'm sure we'd have a bug in existing code if you're hitting a bug 
there.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching
  2021-07-01  7:02       ` Christopher M. Riedl
@ 2021-07-01  7:51         ` Nicholas Piggin
  -1 siblings, 0 replies; 45+ messages in thread
From: Nicholas Piggin @ 2021-07-01  7:51 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev; +Cc: keescook, linux-hardening, tglx, x86

Excerpts from Christopher M. Riedl's message of July 1, 2021 5:02 pm:
> On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
>> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
>> > address to be patched is temporarily mapped as writeable. Currently, a
>> > per-cpu vmalloc patch area is used for this purpose. While the patch
>> > area is per-cpu, the temporary page mapping is inserted into the kernel
>> > page tables for the duration of patching. The mapping is exposed to CPUs
>> > other than the patching CPU - this is undesirable from a hardening
>> > perspective. Use a temporary mm instead which keeps the mapping local to
>> > the CPU doing the patching.
>> > 
>> > Use the `poking_init` init hook to prepare a temporary mm and patching
>> > address. Initialize the temporary mm by copying the init mm. Choose a
>> > randomized patching address inside the temporary mm userspace address
>> > space. The patching address is randomized between PAGE_SIZE and
>> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
>> > the Book3s64 Hash MMU operates - by default the space above
>> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
>> > all platforms/MMUs is randomized inside this range.  The number of
>> > possible random addresses is dependent on PAGE_SIZE and limited by
>> > DEFAULT_MAP_WINDOW.
>> > 
>> > Bits of entropy with 64K page size on BOOK3S_64:
>> > 
>> >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
>> > 
>> >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
>> >         bits of entropy = log2(128TB / 64K)
>> >         bits of entropy = 31
>> > 
>> > Randomization occurs only once during initialization at boot.
>> > 
>> > Introduce two new functions, map_patch() and unmap_patch(), to
>> > respectively create and remove the temporary mapping with write
>> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
>> > the page for patching with PAGE_SHARED since the kernel cannot access
>> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
>> > 
>> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
>> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
>> > taking an SLB and Hash fault during patching.
>>
>> What prevents the SLBE or HPTE from being removed before the last
>> access?
> 
> This code runs with local IRQs disabled - we also don't access anything
> else in userspace so I'm not sure what else could cause the entries to
> be removed TBH.
> 
>>
>>
>> > +#ifdef CONFIG_PPC_BOOK3S_64
>> > +
>> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
>> >  {
>> > -	struct vm_struct *area;
>> > +	int err;
>> >  
>> > -	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
>> > -	if (!area) {
>> > -		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
>> > -			cpu);
>> > -		return -1;
>> > -	}
>> > -	this_cpu_write(text_poke_area, area);
>> > +	if (radix_enabled())
>> > +		return 0;
>> >  
>> > -	return 0;
>> > -}
>> > +	err = slb_allocate_user(patching_mm, patching_addr);
>> > +	if (err)
>> > +		pr_warn("map patch: failed to allocate slb entry\n");
>> >  
>> > -static int text_area_cpu_down(unsigned int cpu)
>> > -{
>> > -	free_vm_area(this_cpu_read(text_poke_area));
>> > -	return 0;
>> > +	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
>> > +			   HPTE_USE_KERNEL_KEY);
>> > +	if (err)
>> > +		pr_warn("map patch: failed to insert hashed page\n");
>> > +
>> > +	/* See comment in switch_slb() in mm/book3s64/slb.c */
>> > +	isync();
>>
>> I'm not sure if this is enough. Could we context switch here? You've
>> got the PTL so no with a normal kernel but maybe yes with an RT kernel.
>> How about taking a machine check that clears the SLB? Could the HPTE
>> get removed by something else here?
> 
> All of this happens after a local_irq_save() which should at least
> prevent context switches IIUC.

Ah yeah I didn't look that far back. A machine check can take out SLB
entries.

> I am not sure what else could cause the
> HPTE to get removed here.

Other CPUs?

>> You want to prevent faults because you might be patching a fault
>> handler?
> 
> In a more general sense: I don't think we want to take page faults every
> time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
> fault handler codepath also checks `current->mm` in some places which
> won't match the temporary mm. Also `current->mm` can be NULL which
> caused problems in my earlier revisions of this series.

Hmm, that's a bit of a hack then. Maybe doing an actual mm switch and 
setting current->mm properly would explode too much. Maybe that's okayish.
But I can't see how the HPT code is up to the job of this in general 
(even if that current->mm issue was fixed).

To do it without holes you would either have to get the SLB MCE handler 
to restore that particular SLB if it flushed it, or restart the patch
code from a fixup location if it took an MCE after installing the SLB.
And bolt a hash table entry.
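
For reference, a rough sketch of the window I mean, written against the
map_patch()/hash_prefault_mapping() hunks quoted above (illustrative only --
flags/val/patch_addr/failed are placeholders, the MCE annotation is the
point, not the exact code):

	local_irq_save(flags);          /* stops preemption, not MCEs */

	slb_allocate_user(patching_mm, patching_addr);            /* SLBE */
	hash_page_mm(patching_mm, patching_addr,
		     pgprot_val(pgprot), 0, HPTE_USE_KERNEL_KEY); /* HPTE */
	isync();

	/*
	 * <-- an MCE here can flush the SLB with nothing to re-install the
	 * entry, and another CPU can evict the HPTE; the store below then
	 * takes an SLB/Hash fault the fault paths can't sanely handle for
	 * this mm (the current->mm problem above). Closing the hole means
	 * an MCE fixup/restart plus a bolted hash table entry.
	 */

	__put_kernel_nofault(patch_addr, &val, u32, failed);      /* patch */

	local_irq_restore(flags);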

Thanks,
Nick

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-07-01  7:37                 ` Nicholas Piggin
@ 2021-07-01 11:30                   ` Nicholas Piggin
  -1 siblings, 0 replies; 45+ messages in thread
From: Nicholas Piggin @ 2021-07-01 11:30 UTC (permalink / raw)
  To: Christopher M. Riedl, Daniel Axtens, linuxppc-dev
  Cc: keescook, linux-hardening, tglx, x86

Excerpts from Nicholas Piggin's message of July 1, 2021 5:37 pm:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 4:53 pm:
>> On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
>>> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
>>> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
>>> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
>>> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>>> >> >> "Christopher M. Riedl" <cmr@linux.ibm.com> writes:
>>> >> >>
>>> >> >> > Switching to a different mm with Hash translation causes SLB entries to
>>> >> >> > be preloaded from the current thread_info. This reduces SLB faults, for
>>> >> >> > example when threads share a common mm but operate on different address
>>> >> >> > ranges.
>>> >> >> >
>>> >> >> > Preloading entries from the thread_info struct may not always be
>>> >> >> > appropriate - such as when switching to a temporary mm. Introduce a new
>>> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
>>> >> >> > SLB preload code into a separate function since switch_slb() is already
>>> >> >> > quite long. The default behavior (preloading SLB entries from the
>>> >> >> > current thread_info struct) remains unchanged.
>>> >> >> >
>>> >> >> > Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
>>> >> >> >
>>> >> >> > ---
>>> >> >> >
>>> >> >> > v4:  * New to series.
>>> >> >> > ---
>>> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>>> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
>>> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>>> >> >> >  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
>>> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
>>> >> >> >
>>> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
>>> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>>> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>>> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>>> >> >> > @@ -130,6 +130,9 @@ typedef struct {
>>> >> >> >  	u32 pkey_allocation_map;
>>> >> >> >  	s16 execute_only_pkey; /* key holding execute-only protection */
>>> >> >> >  #endif
>>> >> >> > +
>>> >> >> > +	/* Do not preload SLB entries from thread_info during switch_slb() */
>>> >> >> > +	bool skip_slb_preload;
>>> >> >> >  } mm_context_t;
>>> >> >> >  
>>> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>>> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
>>> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>>> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
>>> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
>>> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
>>> >> >> >  	return 0;
>>> >> >> >  }
>>> >> >> >  
>>> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
>>> >> >> > +
>>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>>> >> >> > +{
>>> >> >> > +	mm->context.skip_slb_preload = true;
>>> >> >> > +}
>>> >> >> > +
>>> >> >> > +#else
>>> >> >> > +
>>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>>> >> >> > +
>>> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>>> >> >> > +
>>> >> >> >  #include <asm-generic/mmu_context.h>
>>> >> >> >  
>>> >> >> >  #endif /* __KERNEL__ */
>>> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
>>> >> >> > index c10fc8a72fb37..3479910264c59 100644
>>> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>>> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>>> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>>> >> >> >  	atomic_set(&mm->context.active_cpus, 0);
>>> >> >> >  	atomic_set(&mm->context.copros, 0);
>>> >> >> >  
>>> >> >> > +	mm->context.skip_slb_preload = false;
>>> >> >> > +
>>> >> >> >  	return 0;
>>> >> >> >  }
>>> >> >> >  
>>> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
>>> >> >> > index c91bd85eb90e3..da0836cb855af 100644
>>> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
>>> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
>>> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
>>> >> >> >  	asm volatile("slbie %0" : : "r" (slbie_data));
>>> >> >> >  }
>>> >> >> >  
>>> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
>>> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
>>> >> >> switch_slb is probably a fairly hot path on hash?
>>> >> > 
>>> >> > Yes absolutely. I'll make this change in v5.
>>> >> > 
>>> >> >>
>>> >> >> > +{
>>> >> >> > +	struct thread_info *ti = task_thread_info(tsk);
>>> >> >> > +	unsigned char i;
>>> >> >> > +
>>> >> >> > +	/*
>>> >> >> > +	 * We gradually age out SLBs after a number of context switches to
>>> >> >> > +	 * reduce reload overhead of unused entries (like we do with FP/VEC
>>> >> >> > +	 * reload). Each time we wrap 256 switches, take an entry out of the
>>> >> >> > +	 * SLB preload cache.
>>> >> >> > +	 */
>>> >> >> > +	tsk->thread.load_slb++;
>>> >> >> > +	if (!tsk->thread.load_slb) {
>>> >> >> > +		unsigned long pc = KSTK_EIP(tsk);
>>> >> >> > +
>>> >> >> > +		preload_age(ti);
>>> >> >> > +		preload_add(ti, pc);
>>> >> >> > +	}
>>> >> >> > +
>>> >> >> > +	for (i = 0; i < ti->slb_preload_nr; i++) {
>>> >> >> > +		unsigned char idx;
>>> >> >> > +		unsigned long ea;
>>> >> >> > +
>>> >> >> > +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
>>> >> >> > +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
>>> >> >> > +
>>> >> >> > +		slb_allocate_user(mm, ea);
>>> >> >> > +	}
>>> >> >> > +}
>>> >> >> > +
>>> >> >> >  /* Flush all user entries from the segment table of the current processor. */
>>> >> >> >  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>>> >> >> >  {
>>> >> >> > -	struct thread_info *ti = task_thread_info(tsk);
>>> >> >> >  	unsigned char i;
>>> >> >> >  
>>> >> >> >  	/*
>>> >> >> > @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>>> >> >> >  
>>> >> >> >  	copy_mm_to_paca(mm);
>>> >> >> >  
>>> >> >> > -	/*
>>> >> >> > -	 * We gradually age out SLBs after a number of context switches to
>>> >> >> > -	 * reduce reload overhead of unused entries (like we do with FP/VEC
>>> >> >> > -	 * reload). Each time we wrap 256 switches, take an entry out of the
>>> >> >> > -	 * SLB preload cache.
>>> >> >> > -	 */
>>> >> >> > -	tsk->thread.load_slb++;
>>> >> >> > -	if (!tsk->thread.load_slb) {
>>> >> >> > -		unsigned long pc = KSTK_EIP(tsk);
>>> >> >> > -
>>> >> >> > -		preload_age(ti);
>>> >> >> > -		preload_add(ti, pc);
>>> >> >> > -	}
>>> >> >> > -
>>> >> >> > -	for (i = 0; i < ti->slb_preload_nr; i++) {
>>> >> >> > -		unsigned char idx;
>>> >> >> > -		unsigned long ea;
>>> >> >> > -
>>> >> >> > -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
>>> >> >> > -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
>>> >> >> > -
>>> >> >> > -		slb_allocate_user(mm, ea);
>>> >> >> > -	}
>>> >> >> > +	if (!mm->context.skip_slb_preload)
>>> >> >> > +		preload_slb_entries(tsk, mm);
>>> >> >>
>>> >> >> Should this be wrapped in likely()?
>>> >> > 
>>> >> > Seems like a good idea - yes.
>>> >> > 
>>> >> >>
>>> >> >> >  
>>> >> >> >  	/*
>>> >> >> >  	 * Synchronize slbmte preloads with possible subsequent user memory
>>> >> >>
>>> >> >> Right below this comment is the isync. It seems to be specifically
>>> >> >> concerned with synchronising preloaded slbs. Do you need it if you are
>>> >> >> skipping SLB preloads?
>>> >> >>
>>> >> >> It's probably not a big deal to have an extra isync in the fairly rare
>>> >> >> path when we're skipping preloads, but I thought I'd check.
>>> >> > 
>>> >> > I don't _think_ we need the `isync` if we are skipping the SLB preloads,
>>> >> > but then again it was always in the code-path before. If someone can
>>> >> > make a compelling argument to drop it when not preloading SLBs I will,
>>> >> > otherwise (considering some of the other non-obvious things I stepped
>>> >> > into with the Hash code) I will keep it here for now.
>>> >>
>>> >> The ISA says slbia wants an isync afterward, so we probably should keep
>>> >> it. The comment is a bit misleading in that case.
>>> >>
>>> >> Why isn't preloading appropriate for a temporary mm?
>>> > 
>>> > The preloaded entries come from the thread_info struct which isn't
>>> > necessarily related to the temporary mm at all. I saw SLB multihits
>>> > while testing this series with my LKDTM test where the "patching
>>> > address" (userspace address for the temporary mapping w/
>>> > write-permissions) ends up in a thread's preload list and then we
>>> > explicitly insert it again in map_patch() when trying to patch. At that
>>> > point the SLB multihit triggers.
>>>
>>> Hmm, so what if we use a mm, take some SLB faults then unuse it and
>>> use a different one? I wonder if kthread_use_mm has existing problems
>>> with this incorrect SLB preloading. Quite possibly. We should clear
>>> the preload whenever mm changes I think. That should cover this as
>>> well.
>> 
>> I actually did this initially but thought it was a bit too intrusive to
>> include as part of this series and hurt performance. I agree that
>> preloading the SLB from the thread may be a problem in general when
>> switching in/out an mm.
>> 
>> kthread_use_mm may not be affected unless we explicitly insert some SLB
>> entries which could collide with an existing preload (which I don't
>> think we do anywhere until this series).
> 
> kthread_use_mm(mm1);
> *ea = blah; /* slb preload[n++][ea] = va */
> kthread_unuse_mm(mm1);
> 
> kthread_use_mm(mm2);
>   switch_slb();
> schedule();
>   /* preload ea=va? */
> x = *ea;
> kthread_unuse_mm(mm2);
> 
> ? I'm sure we'd have a bug in existing code if you're hitting a bug 
> there.

Something like this I think should prevent it. I thought there was a 
better arch hook for it, but doesn't seem so. I have an unexplained
SLB crash bug somewhere too, better check if it matches...

Thanks,
Nick

diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index c91bd85eb90e..cb8c8a5d861e 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -502,6 +502,9 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 
        copy_mm_to_paca(mm);
 
+       if (unlikely(tsk->flags & PF_KTHREAD))
+               goto no_preload;
+
        /*
         * We gradually age out SLBs after a number of context switches to
         * reduce reload overhead of unused entries (like we do with FP/VEC
@@ -526,10 +529,11 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
                slb_allocate_user(mm, ea);
        }
 
+no_preload:
        /*
-        * Synchronize slbmte preloads with possible subsequent user memory
-        * address accesses by the kernel (user mode won't happen until
-        * rfid, which is safe).
+        * Synchronize slbias and slbmte preloads with possible subsequent user
+        * memory address accesses by the kernel (user mode won't happen until
+        * rfid, which is synchronizing).
         */
        isync();
 }
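
To spell out the caller-side effect against the scenario earlier in the
thread (illustrative only; mm1/mm2/ea are the same placeholders):

kthread_use_mm(mm1);
*ea = blah;             /* may leave an ea=va entry in the kthread's
                           preload cache */
kthread_unuse_mm(mm1);

kthread_use_mm(mm2);    /* switch_slb(): PF_KTHREAD -> goto no_preload,
                           so the stale mm1-era entry is not replayed */
x = *ea;                /* plain SLB fault resolved against mm2 */
kthread_unuse_mm(mm2);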


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
  2021-07-01  7:37                 ` Nicholas Piggin
@ 2021-07-09  4:55                   ` Christopher M. Riedl
  -1 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-07-09  4:55 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Axtens, linuxppc-dev
  Cc: tglx, x86, keescook, linux-hardening

On Thu Jul 1, 2021 at 2:37 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 4:53 pm:
> > On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> >> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
> >> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> >> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> >> >> >> "Christopher M. Riedl" <cmr@linux.ibm.com> writes:
> >> >> >>
> >> >> >> > Switching to a different mm with Hash translation causes SLB entries to
> >> >> >> > be preloaded from the current thread_info. This reduces SLB faults, for
> >> >> >> > example when threads share a common mm but operate on different address
> >> >> >> > ranges.
> >> >> >> >
> >> >> >> > Preloading entries from the thread_info struct may not always be
> >> >> >> > appropriate - such as when switching to a temporary mm. Introduce a new
> >> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
> >> >> >> > SLB preload code into a separate function since switch_slb() is already
> >> >> >> > quite long. The default behavior (preloading SLB entries from the
> >> >> >> > current thread_info struct) remains unchanged.
> >> >> >> >
> >> >> >> > Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
> >> >> >> >
> >> >> >> > ---
> >> >> >> >
> >> >> >> > v4:  * New to series.
> >> >> >> > ---
> >> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
> >> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >> >> >> >  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
> >> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >> >> >> >
> >> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> >> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> >> > @@ -130,6 +130,9 @@ typedef struct {
> >> >> >> >  	u32 pkey_allocation_map;
> >> >> >> >  	s16 execute_only_pkey; /* key holding execute-only protection */
> >> >> >> >  #endif
> >> >> >> > +
> >> >> >> > +	/* Do not preload SLB entries from thread_info during switch_slb() */
> >> >> >> > +	bool skip_slb_preload;
> >> >> >> >  } mm_context_t;
> >> >> >> >  
> >> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> >> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> >> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> >> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
> >> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
> >> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
> >> >> >> >  	return 0;
> >> >> >> >  }
> >> >> >> >  
> >> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> >> >> > +
> >> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> >> >> >> > +{
> >> >> >> > +	mm->context.skip_slb_preload = true;
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +#else
> >> >> >> > +
> >> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> >> >> >> > +
> >> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> >> >> >> > +
> >> >> >> >  #include <asm-generic/mmu_context.h>
> >> >> >> >  
> >> >> >> >  #endif /* __KERNEL__ */
> >> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> >> > index c10fc8a72fb37..3479910264c59 100644
> >> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >> >  	atomic_set(&mm->context.active_cpus, 0);
> >> >> >> >  	atomic_set(&mm->context.copros, 0);
> >> >> >> >  
> >> >> >> > +	mm->context.skip_slb_preload = false;
> >> >> >> > +
> >> >> >> >  	return 0;
> >> >> >> >  }
> >> >> >> >  
> >> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
> >> >> >> > index c91bd85eb90e3..da0836cb855af 100644
> >> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
> >> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
> >> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
> >> >> >> >  	asm volatile("slbie %0" : : "r" (slbie_data));
> >> >> >> >  }
> >> >> >> >  
> >> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
> >> >> >> switch_slb is probably a fairly hot path on hash?
> >> >> > 
> >> >> > Yes absolutely. I'll make this change in v5.
> >> >> > 
> >> >> >>
> >> >> >> > +{
> >> >> >> > +	struct thread_info *ti = task_thread_info(tsk);
> >> >> >> > +	unsigned char i;
> >> >> >> > +
> >> >> >> > +	/*
> >> >> >> > +	 * We gradually age out SLBs after a number of context switches to
> >> >> >> > +	 * reduce reload overhead of unused entries (like we do with FP/VEC
> >> >> >> > +	 * reload). Each time we wrap 256 switches, take an entry out of the
> >> >> >> > +	 * SLB preload cache.
> >> >> >> > +	 */
> >> >> >> > +	tsk->thread.load_slb++;
> >> >> >> > +	if (!tsk->thread.load_slb) {
> >> >> >> > +		unsigned long pc = KSTK_EIP(tsk);
> >> >> >> > +
> >> >> >> > +		preload_age(ti);
> >> >> >> > +		preload_add(ti, pc);
> >> >> >> > +	}
> >> >> >> > +
> >> >> >> > +	for (i = 0; i < ti->slb_preload_nr; i++) {
> >> >> >> > +		unsigned char idx;
> >> >> >> > +		unsigned long ea;
> >> >> >> > +
> >> >> >> > +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> >> >> >> > +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> >> >> >> > +
> >> >> >> > +		slb_allocate_user(mm, ea);
> >> >> >> > +	}
> >> >> >> > +}
> >> >> >> > +
> >> >> >> >  /* Flush all user entries from the segment table of the current processor. */
> >> >> >> >  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >> >  {
> >> >> >> > -	struct thread_info *ti = task_thread_info(tsk);
> >> >> >> >  	unsigned char i;
> >> >> >> >  
> >> >> >> >  	/*
> >> >> >> > @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >> >  
> >> >> >> >  	copy_mm_to_paca(mm);
> >> >> >> >  
> >> >> >> > -	/*
> >> >> >> > -	 * We gradually age out SLBs after a number of context switches to
> >> >> >> > -	 * reduce reload overhead of unused entries (like we do with FP/VEC
> >> >> >> > -	 * reload). Each time we wrap 256 switches, take an entry out of the
> >> >> >> > -	 * SLB preload cache.
> >> >> >> > -	 */
> >> >> >> > -	tsk->thread.load_slb++;
> >> >> >> > -	if (!tsk->thread.load_slb) {
> >> >> >> > -		unsigned long pc = KSTK_EIP(tsk);
> >> >> >> > -
> >> >> >> > -		preload_age(ti);
> >> >> >> > -		preload_add(ti, pc);
> >> >> >> > -	}
> >> >> >> > -
> >> >> >> > -	for (i = 0; i < ti->slb_preload_nr; i++) {
> >> >> >> > -		unsigned char idx;
> >> >> >> > -		unsigned long ea;
> >> >> >> > -
> >> >> >> > -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> >> >> >> > -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> >> >> >> > -
> >> >> >> > -		slb_allocate_user(mm, ea);
> >> >> >> > -	}
> >> >> >> > +	if (!mm->context.skip_slb_preload)
> >> >> >> > +		preload_slb_entries(tsk, mm);
> >> >> >>
> >> >> >> Should this be wrapped in likely()?
> >> >> > 
> >> >> > Seems like a good idea - yes.
> >> >> > 
> >> >> >>
> >> >> >> >  
> >> >> >> >  	/*
> >> >> >> >  	 * Synchronize slbmte preloads with possible subsequent user memory
> >> >> >>
> >> >> >> Right below this comment is the isync. It seems to be specifically
> >> >> >> concerned with synchronising preloaded slbs. Do you need it if you are
> >> >> >> skipping SLB preloads?
> >> >> >>
> >> >> >> It's probably not a big deal to have an extra isync in the fairly rare
> >> >> >> path when we're skipping preloads, but I thought I'd check.
> >> >> > 
> >> >> > I don't _think_ we need the `isync` if we are skipping the SLB preloads,
> >> >> > but then again it was always in the code-path before. If someone can
> >> >> > make a compelling argument to drop it when not preloading SLBs I will,
> >> >> > otherwise (considering some of the other non-obvious things I stepped
> >> >> > into with the Hash code) I will keep it here for now.
> >> >>
> >> >> The ISA says slbia wants an isync afterward, so we probably should keep
> >> >> it. The comment is a bit misleading in that case.
> >> >>
> >> >> Why isn't preloading appropriate for a temporary mm?
> >> > 
> >> > The preloaded entries come from the thread_info struct which isn't
> >> > necessarily related to the temporary mm at all. I saw SLB multihits
> >> > while testing this series with my LKDTM test where the "patching
> >> > address" (userspace address for the temporary mapping w/
> >> > write-permissions) ends up in a thread's preload list and then we
> >> > explicitly insert it again in map_patch() when trying to patch. At that
> >> > point the SLB multihit triggers.
> >>
> >> Hmm, so what if we use a mm, take some SLB faults then unuse it and
> >> use a different one? I wonder if kthread_use_mm has existing problems
> >> with this incorrect SLB preloading. Quite possibly. We should clear
> >> the preload whenever mm changes I think. That should cover this as
> >> well.
> > 
> > I actually did this initially but thought it was a bit too intrusive to
> > include as part of this series and hurt performance. I agree that
> > preloading the SLB from the thread may be a problem in general when
> > switching in/out an mm.
> > 
> > kthread_use_mm may not be affected unless we explicitly insert some SLB
> > entries which could collide with an existing preload (which I don't
> > think we do anywhere until this series).
>
> kthread_use_mm(mm1);
> *ea = blah; /* slb preload[n++][ea] = va */
> kthread_unuse_mm(mm1);
>
> kthread_use_mm(mm2);
> switch_slb();
> schedule();
> /* preload ea=va? */
> x = *ea;
> kthread_unuse_mm(mm2);
>
> ? I'm sure we'd have a bug in existing code if you're hitting a bug
> there.

Not exactly - the SLB multihit happens because of the new code in this
series - specifically the slb_allocate_user() call during patching:

put_user(..., ea); /* insert ea into thread's preload list */
...
patch_instruction(..., ea);
  map_patch()
    switch_slb(); /* preload slb entry for ea from thread_info */
    ...
    slb_allocate_user(..., ea); /* insert slb entry for ea */
    __put_kernel_nofault(..., ea); /* ie. a 'stw' to patch */
    >>> SLB Multihit since we have an SLBE from the preload and the
        explicit slb_allocate_user()

Based on your other comments on this series I am dropping the Hash
support for percpu temp mm altogether for now so this is moot. But, I
still think it doesn't make much sense to preload SLB entries from a
thread_info struct when switching to a completely different mm.
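
For completeness, a rough sketch of how a temporary mm would opt out with
the helper from this patch -- poking_init()/patching_mm follow patch 08/11,
and copy_init_mm() is only my assumption of how the mm gets created:

	void __init poking_init(void)
	{
		patching_mm = copy_init_mm();
		if (WARN_ON(!patching_mm))
			return;

		/* never replay a thread's preload cache into this context */
		skip_slb_preload_mm(patching_mm);

		/* ... pick the randomized patching_addr, etc. ... */
	}

so switch_slb() into patching_mm would skip the thread_info preloads and
only map_patch() would install an SLBE explicitly.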

>
> Thanks,
> Nick


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload
@ 2021-07-09  4:55                   ` Christopher M. Riedl
  0 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-07-09  4:55 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel Axtens, linuxppc-dev
  Cc: tglx, x86, keescook, linux-hardening

On Thu Jul 1, 2021 at 2:37 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 4:53 pm:
> > On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> >> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
> >> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> >> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> >> >> >> "Christopher M. Riedl" <cmr@linux.ibm.com> writes:
> >> >> >>
> >> >> >> > Switching to a different mm with Hash translation causes SLB entries to
> >> >> >> > be preloaded from the current thread_info. This reduces SLB faults, for
> >> >> >> > example when threads share a common mm but operate on different address
> >> >> >> > ranges.
> >> >> >> >
> >> >> >> > Preloading entries from the thread_info struct may not always be
> >> >> >> > appropriate - such as when switching to a temporary mm. Introduce a new
> >> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
> >> >> >> > SLB preload code into a separate function since switch_slb() is already
> >> >> >> > quite long. The default behavior (preloading SLB entries from the
> >> >> >> > current thread_info struct) remains unchanged.
> >> >> >> >
> >> >> >> > Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
> >> >> >> >
> >> >> >> > ---
> >> >> >> >
> >> >> >> > v4:  * New to series.
> >> >> >> > ---
> >> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++++++
> >> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >> >> >> >  arch/powerpc/mm/book3s64/slb.c           | 56 ++++++++++++++----------
> >> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >> >> >> >
> >> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> >> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> >> > @@ -130,6 +130,9 @@ typedef struct {
> >> >> >> >  	u32 pkey_allocation_map;
> >> >> >> >  	s16 execute_only_pkey; /* key holding execute-only protection */
> >> >> >> >  #endif
> >> >> >> > +
> >> >> >> > +	/* Do not preload SLB entries from thread_info during switch_slb() */
> >> >> >> > +	bool skip_slb_preload;
> >> >> >> >  } mm_context_t;
> >> >> >> >  
> >> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> >> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> >> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> >> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
> >> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
> >> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
> >> >> >> >  	return 0;
> >> >> >> >  }
> >> >> >> >  
> >> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> >> >> > +
> >> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> >> >> >> > +{
> >> >> >> > +	mm->context.skip_slb_preload = true;
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +#else
> >> >> >> > +
> >> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> >> >> >> > +
> >> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> >> >> >> > +
> >> >> >> >  #include <asm-generic/mmu_context.h>
> >> >> >> >  
> >> >> >> >  #endif /* __KERNEL__ */
> >> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> >> > index c10fc8a72fb37..3479910264c59 100644
> >> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >> >  	atomic_set(&mm->context.active_cpus, 0);
> >> >> >> >  	atomic_set(&mm->context.copros, 0);
> >> >> >> >  
> >> >> >> > +	mm->context.skip_slb_preload = false;
> >> >> >> > +
> >> >> >> >  	return 0;
> >> >> >> >  }
> >> >> >> >  
> >> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
> >> >> >> > index c91bd85eb90e3..da0836cb855af 100644
> >> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
> >> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
> >> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
> >> >> >> >  	asm volatile("slbie %0" : : "r" (slbie_data));
> >> >> >> >  }
> >> >> >> >  
> >> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
> >> >> >> switch_slb is probably a fairly hot path on hash?
> >> >> > 
> >> >> > Yes absolutely. I'll make this change in v5.
> >> >> > 
> >> >> >>
> >> >> >> > +{
> >> >> >> > +	struct thread_info *ti = task_thread_info(tsk);
> >> >> >> > +	unsigned char i;
> >> >> >> > +
> >> >> >> > +	/*
> >> >> >> > +	 * We gradually age out SLBs after a number of context switches to
> >> >> >> > +	 * reduce reload overhead of unused entries (like we do with FP/VEC
> >> >> >> > +	 * reload). Each time we wrap 256 switches, take an entry out of the
> >> >> >> > +	 * SLB preload cache.
> >> >> >> > +	 */
> >> >> >> > +	tsk->thread.load_slb++;
> >> >> >> > +	if (!tsk->thread.load_slb) {
> >> >> >> > +		unsigned long pc = KSTK_EIP(tsk);
> >> >> >> > +
> >> >> >> > +		preload_age(ti);
> >> >> >> > +		preload_add(ti, pc);
> >> >> >> > +	}
> >> >> >> > +
> >> >> >> > +	for (i = 0; i < ti->slb_preload_nr; i++) {
> >> >> >> > +		unsigned char idx;
> >> >> >> > +		unsigned long ea;
> >> >> >> > +
> >> >> >> > +		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> >> >> >> > +		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> >> >> >> > +
> >> >> >> > +		slb_allocate_user(mm, ea);
> >> >> >> > +	}
> >> >> >> > +}
> >> >> >> > +
> >> >> >> >  /* Flush all user entries from the segment table of the current processor. */
> >> >> >> >  void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >> >  {
> >> >> >> > -	struct thread_info *ti = task_thread_info(tsk);
> >> >> >> >  	unsigned char i;
> >> >> >> >  
> >> >> >> >  	/*
> >> >> >> > @@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
> >> >> >> >  
> >> >> >> >  	copy_mm_to_paca(mm);
> >> >> >> >  
> >> >> >> > -	/*
> >> >> >> > -	 * We gradually age out SLBs after a number of context switches to
> >> >> >> > -	 * reduce reload overhead of unused entries (like we do with FP/VEC
> >> >> >> > -	 * reload). Each time we wrap 256 switches, take an entry out of the
> >> >> >> > -	 * SLB preload cache.
> >> >> >> > -	 */
> >> >> >> > -	tsk->thread.load_slb++;
> >> >> >> > -	if (!tsk->thread.load_slb) {
> >> >> >> > -		unsigned long pc = KSTK_EIP(tsk);
> >> >> >> > -
> >> >> >> > -		preload_age(ti);
> >> >> >> > -		preload_add(ti, pc);
> >> >> >> > -	}
> >> >> >> > -
> >> >> >> > -	for (i = 0; i < ti->slb_preload_nr; i++) {
> >> >> >> > -		unsigned char idx;
> >> >> >> > -		unsigned long ea;
> >> >> >> > -
> >> >> >> > -		idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
> >> >> >> > -		ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
> >> >> >> > -
> >> >> >> > -		slb_allocate_user(mm, ea);
> >> >> >> > -	}
> >> >> >> > +	if (!mm->context.skip_slb_preload)
> >> >> >> > +		preload_slb_entries(tsk, mm);
> >> >> >>
> >> >> >> Should this be wrapped in likely()?
> >> >> > 
> >> >> > Seems like a good idea - yes.
> >> >> > 
> >> >> >>
> >> >> >> >  
> >> >> >> >  	/*
> >> >> >> >  	 * Synchronize slbmte preloads with possible subsequent user memory
> >> >> >>
> >> >> >> Right below this comment is the isync. It seems to be specifically
> >> >> >> concerned with synchronising preloaded slbs. Do you need it if you are
> >> >> >> skipping SLB preloads?
> >> >> >>
> >> >> >> It's probably not a big deal to have an extra isync in the fairly rare
> >> >> >> path when we're skipping preloads, but I thought I'd check.
> >> >> > 
> >> >> > I don't _think_ we need the `isync` if we are skipping the SLB preloads,
> >> >> > but then again it was always in the code-path before. If someone can
> >> >> > make a compelling argument to drop it when not preloading SLBs I will,
> >> >> > otherwise (considering some of the other non-obvious things I stepped
> >> >> > into with the Hash code) I will keep it here for now.
> >> >>
> >> >> The ISA says slbia wants an isync afterward, so we probably should keep
> >> >> it. The comment is a bit misleading in that case.
> >> >>
> >> >> Why isn't preloading appropriate for a temporary mm?
> >> > 
> >> > The preloaded entries come from the thread_info struct which isn't
> >> > necessarily related to the temporary mm at all. I saw SLB multihits
> >> > while testing this series with my LKDTM test where the "patching
> >> > address" (userspace address for the temporary mapping w/
> >> > write-permissions) ends up in a thread's preload list and then we
> >> > explicitly insert it again in map_patch() when trying to patch. At that
> >> > point the SLB multihit triggers.
> >>
> >> Hmm, so what if we use a mm, take some SLB faults then unuse it and
> >> use a different one? I wonder if kthread_use_mm has existing problems
> >> with this incorrect SLB preloading. Quite possibly. We should clear
> >> the preload whenever mm changes I think. That should cover this as
> >> well.
> > 
> > I actually did this initially but thought it was a bit too intrusive to
> > include as part of this series and hurt performance. I agree that
> > preloading the SLB from the thread may be a problem in general when
> > switching in/out an mm.
> > 
> > kthread_use_mm may not be affected unless we explicitly insert some SLB
> > entries which could collide with an existing preload (which I don't
> > think we do anywhere until this series).
>
> kthread_use_mm(mm1);
> *ea = blah; /* slb preload[n++][ea] = va */
> kthread_unuse_mm(mm1);
>
> kthread_use_mm(mm2);
> switch_slb();
> schedule();
> /* preload ea=va? */
> x = *ea;
> kthread_unuse_mm(mm2);
>
> ? I'm sure we'd have a bug in existing code if you're hitting a bug
> there.

Not exactly - the SLB multihit happens because of the new code in this
series - specifically the slb_allocate_user() call during patching:

put_user(..., ea); /* insert ea into thread's preload list */
...
patch_instruction(..., ea);
  map_patch()
    switch_slb(); /* preload slb entry for ea from thread_info */
    ...
    slb_allocate_user(..., ea); /* insert slb entry for ea */
    __put_kernel_nofault(..., ea); /* ie. a 'stw' to patch */
    >>> SLB Multihit since we have an SLBE from the preload and the
        explicit slb_allocate_user()

Based on your other comments on this series I am dropping the Hash
support for the percpu temp mm altogether for now, so this is moot. But I
still think it doesn't make much sense to preload SLB entries from a
thread_info struct when switching to a completely different mm.
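
For reference, a minimal sketch of the likely()-wrapped check discussed
above, as it could look in switch_slb() - the names come from the quoted
hunk, so this is illustrative only rather than the final patch:

	if (likely(!mm->context.skip_slb_preload))
		preload_slb_entries(tsk, mm);

	/*
	 * Synchronize slbmte preloads with possible subsequent user
	 * memory accesses; kept even when preloads are skipped, per
	 * the slbia discussion above.
	 */
	asm volatile("isync" : : : "memory");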

>
> Thanks,
> Nick


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching
  2021-07-01  7:51         ` Nicholas Piggin
@ 2021-07-09  5:03           ` Christopher M. Riedl
  0 siblings, 0 replies; 45+ messages in thread
From: Christopher M. Riedl @ 2021-07-09  5:03 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: tglx, x86, keescook, linux-hardening

On Thu Jul 1, 2021 at 2:51 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 5:02 pm:
> > On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> >> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> >> > address to be patched is temporarily mapped as writeable. Currently, a
> >> > per-cpu vmalloc patch area is used for this purpose. While the patch
> >> > area is per-cpu, the temporary page mapping is inserted into the kernel
> >> > page tables for the duration of patching. The mapping is exposed to CPUs
> >> > other than the patching CPU - this is undesirable from a hardening
> >> > perspective. Use a temporary mm instead which keeps the mapping local to
> >> > the CPU doing the patching.
> >> > 
> >> > Use the `poking_init` init hook to prepare a temporary mm and patching
> >> > address. Initialize the temporary mm by copying the init mm. Choose a
> >> > randomized patching address inside the temporary mm userspace address
> >> > space. The patching address is randomized between PAGE_SIZE and
> >> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> >> > the Book3s64 Hash MMU operates - by default the space above
> >> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> >> > all platforms/MMUs is randomized inside this range.  The number of
> >> > possible random addresses is dependent on PAGE_SIZE and limited by
> >> > DEFAULT_MAP_WINDOW.
> >> > 
> >> > Bits of entropy with 64K page size on BOOK3S_64:
> >> > 
> >> >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> >> > 
> >> >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> >> >         bits of entropy = log2(128TB / 64K)
> >> >         bits of entropy = 31
> >> > 
> >> > Randomization occurs only once during initialization at boot.
> >> > 
> >> > Introduce two new functions, map_patch() and unmap_patch(), to
> >> > respectively create and remove the temporary mapping with write
> >> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> >> > the page for patching with PAGE_SHARED since the kernel cannot access
> >> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> >> > 
> >> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> >> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
> >> > taking an SLB and Hash fault during patching.
> >>
> >> What prevents the SLBE or HPTE from being removed before the last
> >> access?
> > 
> > This code runs with local IRQs disabled - we also don't access anything
> > else in userspace so I'm not sure what else could cause the entries to
> > be removed TBH.
> > 
> >>
> >>
> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> > +
> >> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
> >> >  {
> >> > -	struct vm_struct *area;
> >> > +	int err;
> >> >  
> >> > -	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> >> > -	if (!area) {
> >> > -		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> >> > -			cpu);
> >> > -		return -1;
> >> > -	}
> >> > -	this_cpu_write(text_poke_area, area);
> >> > +	if (radix_enabled())
> >> > +		return 0;
> >> >  
> >> > -	return 0;
> >> > -}
> >> > +	err = slb_allocate_user(patching_mm, patching_addr);
> >> > +	if (err)
> >> > +		pr_warn("map patch: failed to allocate slb entry\n");
> >> >  
> >> > -static int text_area_cpu_down(unsigned int cpu)
> >> > -{
> >> > -	free_vm_area(this_cpu_read(text_poke_area));
> >> > -	return 0;
> >> > +	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> >> > +			   HPTE_USE_KERNEL_KEY);
> >> > +	if (err)
> >> > +		pr_warn("map patch: failed to insert hashed page\n");
> >> > +
> >> > +	/* See comment in switch_slb() in mm/book3s64/slb.c */
> >> > +	isync();
> >>
> >> I'm not sure if this is enough. Could we context switch here? You've
> >> got the PTL, so no with a normal kernel, but maybe yes with an RT kernel.
> >> How about taking a machine check that clears the SLB? Could the HPTE
> >> get removed by something else here?
> > 
> > All of this happens after a local_irq_save() which should at least
> > prevent context switches IIUC.
>
> Ah yeah I didn't look that far back. A machine check can take out SLB
> entries.
>
> > I am not sure what else could cause the
> > HPTE to get removed here.
>
> Other CPUs?
>

Right because the HPTEs are "global".

> >> You want to prevent faults because you might be patching a fault
> >> handler?
> > 
> > In a more general sense: I don't think we want to take page faults every
> > time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
> > fault handler codepath also checks `current->mm` in some places which
> > won't match the temporary mm. Also `current->mm` can be NULL which
> > caused problems in my earlier revisions of this series.
>
> Hmm, that's a bit of a hack then. Maybe doing an actual mm switch and
> setting current->mm properly would explode too much. Maybe that's
> okayish.
> But I can't see how the HPT code is up to the job of this in general
> (even if that current->mm issue was fixed).
>
> To do it without holes you would either have to get the SLB MCE handler
> to restore that particular SLB if it flushed it, or restart the patch
> code from a fixup location if it took an MCE after installing the SLB.
> And bolt a hash table entry.

We discussed this a bit off list and decided that it's not worth the
trouble implementing percpu temp mm support for Hash at this time.
Instead, I will post a new version of this series where we drop into
realmode to patch with the Hash MMU. This avoids the W+X mapping
altogether and so doesn't expose anything to other CPUs during patching.
We will keep the Radix support for a percpu temp mm since 1) it doesn't
require hacks like Hash and 2) it's overall preferable to dropping into
realmode.
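
To make that concrete, a purely illustrative sketch of the split described
above - the helper names and the simplified argument types are placeholders,
not code from this series or from upstream:

static int do_patch(u32 *addr, u32 instr)
{
	/*
	 * Radix: keep the per-CPU temporary mm so the writable alias
	 * stays local to the patching CPU.
	 */
	if (radix_enabled())
		return patch_via_temp_mm(addr, instr);

	/*
	 * Hash: patch with translation off (realmode), so no W+X
	 * mapping is ever visible to other CPUs.
	 */
	return patch_in_real_mode(addr, instr);
}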

>
> Thanks,
> Nick


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2021-07-09  5:03 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 01/11] powerpc: Add LKDTM accessor for patching addr Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 02/11] lkdtm/powerpc: Add test to hijack a patch mapping Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 03/11] x86_64: Add LKDTM accessor for patching addr Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 04/11] lkdtm/x86_64: Add test to hijack a patch mapping Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload Christopher M. Riedl
2021-06-21  3:13   ` Daniel Axtens
2021-07-01  3:48     ` Christopher M. Riedl
2021-07-01  3:48       ` Christopher M. Riedl
2021-07-01  4:15       ` Nicholas Piggin
2021-07-01  4:15         ` Nicholas Piggin
2021-07-01  5:28         ` Christopher M. Riedl
2021-07-01  5:28           ` Christopher M. Riedl
2021-07-01  6:04           ` Nicholas Piggin
2021-07-01  6:04             ` Nicholas Piggin
2021-07-01  6:53             ` Christopher M. Riedl
2021-07-01  6:53               ` Christopher M. Riedl
2021-07-01  7:37               ` Nicholas Piggin
2021-07-01  7:37                 ` Nicholas Piggin
2021-07-01 11:30                 ` Nicholas Piggin
2021-07-01 11:30                   ` Nicholas Piggin
2021-07-09  4:55                 ` Christopher M. Riedl
2021-07-09  4:55                   ` Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 06/11] powerpc: Introduce temporary mm Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 07/11] powerpc/64s: Make slb_allocate_user() non-static Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching Christopher M. Riedl
2021-06-21  3:19   ` Daniel Axtens
2021-07-01  5:11     ` Christopher M. Riedl
2021-07-01  5:11       ` Christopher M. Riedl
2021-07-01  6:12   ` Nicholas Piggin
2021-07-01  6:12     ` Nicholas Piggin
2021-07-01  7:02     ` Christopher M. Riedl
2021-07-01  7:02       ` Christopher M. Riedl
2021-07-01  7:51       ` Nicholas Piggin
2021-07-01  7:51         ` Nicholas Piggin
2021-07-09  5:03         ` Christopher M. Riedl
2021-07-09  5:03           ` Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 09/11] lkdtm/powerpc: Fix code patching hijack test Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock Christopher M. Riedl
2021-05-06 10:51   ` Peter Zijlstra
2021-05-06 10:51     ` Peter Zijlstra
2021-05-07 20:03     ` Christopher M. Riedl
2021-05-07 20:03       ` Christopher M. Riedl
2021-05-07 22:26       ` Peter Zijlstra
2021-05-06  4:34 ` [RESEND PATCH v4 11/11] powerpc: Use patch_instruction_unlocked() in loops Christopher M. Riedl
