linux-hardening.vger.kernel.org archive mirror
* [PATCH v6 0/4] Use per-CPU temporary mappings for patching on Radix MMU
@ 2021-09-11  2:29 Christopher M. Riedl
  2021-09-11  2:29 ` [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for " Christopher M. Riedl
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening

When compiled with CONFIG_STRICT_KERNEL_RWX, the kernel must create
temporary mappings when patching itself. These mappings temporarily
override the strict RWX text protections to permit a write. Currently,
powerpc allocates a per-CPU VM area for patching. Patching occurs as
follows:

	1. Map page in per-CPU VM area w/ PAGE_KERNEL protection
	2. Patch text
	3. Remove the temporary mapping
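
A minimal sketch of this flow, simplified from the current implementation
in arch/powerpc/lib/code-patching.c (variable setup, address calculation,
and error handling omitted):

	local_irq_save(flags);
	/* 1. Map the target page into the per-CPU VM area w/ PAGE_KERNEL */
	map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
	/* 2. Patch text through the temporary writable alias */
	__patch_instruction(addr, instr, patch_addr);
	/* 3. Remove the temporary mapping again */
	unmap_patch_area(text_poke_addr);
	local_irq_restore(flags);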

While the VM area is per-CPU, the mapping is actually inserted into the
kernel page tables. Presumably, this could allow another CPU to access
the normally write-protected text - either maliciously or accidentally -
via this same mapping if the address of the VM area is known. Ideally,
the mapping should be kept local to the CPU doing the patching [0].

x86 introduced "temporary mm" structs which allow the creation of mappings
local to a particular CPU [1]. This series intends to bring the notion of a
temporary mm to powerpc's Book3s64 Radix MMU and harden it by using such a
mapping for patching a kernel with strict RWX permissions.

Tested boot and ftrace:
	- QEMU+KVM (host: POWER9 Blackbird): Radix MMU w/ KUAP
	- QEMU+KVM (host: POWER9 Blackbird): Hash MMU

Tested boot:
	- QEMU+TCG: ppc44x (bamboo)
	- QEMU+TCG: g5 (mac99)

I also tested with various extra config options enabled as suggested in
section 12) of Documentation/process/submit-checklist.rst.

v6:	* Split the series to separate the powerpc per-cpu temporary mm
	  implementation from the LKDTM test (still working on that one),
	  and implement some of Christophe Leroy's feedback.
	* Rebase on linuxppc/next: powerpc-5.15-1

v5:	* Only support Book3s64 Radix MMU for now. There are some issues with
	  the previous implementation on the Hash MMU as pointed out by Nick
	  Piggin. Fixing these is not trivial so we only support the Radix MMU
	  for now.

v4:	* It's time to revisit this series since @jpn and @mpe fixed
	  our known STRICT_*_RWX bugs on powerpc/64s.
	* Rebase on linuxppc/next:
          commit ee1bc694fbaec ("powerpc/kvm: Fix build error when PPC_MEM_KEYS/PPC_PSERIES=n")
	* Completely rework how map_patch() works on book3s64 Hash MMU
	* Split the LKDTM x86_64 and powerpc bits into separate patches
	* Annotate commit messages with changes from v3 instead of
	  listing them here completely out of context...

v3:	* Rebase on linuxppc/next: commit 9123e3a74ec7 ("Linux 5.9-rc1")
	* Move temporary mm implementation into code-patching.c where it
	  belongs
	* Implement LKDTM hijacker test on x86_64 (on IBM time oof)
	* Do not use address zero for the patching address in the
	  temporary mm (thanks @dja for pointing this out!)
	* Wrap the LKDTM test w/ CONFIG_SMP as suggested by Christophe
	  Leroy
	* Comments to clarify PTE pre-allocation and patching addr
	  selection

v2:	* Rebase on linuxppc/next:
	  commit 105fb38124a4 ("powerpc/8xx: Modify ptep_get()")
	* Always dirty pte when mapping patch
	* Use `ppc_inst_len` instead of `sizeof` on instructions
	* Declare LKDTM patching addr accessor in header where it belongs	

v1:	* Rebase on linuxppc/next (4336b9337824)
	* Save and restore second hw watchpoint
	* Use new ppc_inst_* functions for patching check and in LKDTM test

rfc-v2:	* Many fixes and improvements mostly based on extensive feedback
          and testing by Christophe Leroy (thanks!).
	* Make patching_mm and patching_addr static and move
	  '__ro_after_init' to after the variable name (more common in
	  other parts of the kernel)
	* Use 'asm/debug.h' header instead of 'asm/hw_breakpoint.h' to
	  fix PPC64e compile
	* Add comment explaining why we use BUG_ON() during the init
	  call to setup for patching later
	* Move ptep into patch_mapping to avoid walking page tables a
	  second time when unmapping the temporary mapping
	* Use KUAP under non-radix, also manually dirty the PTE for patch
	  mapping on non-BOOK3S_64 platforms
	* Properly return any error from __patch_instruction
        * Do not use 'memcmp' where a simple comparison is appropriate
	* Simplify expression for patch address by removing pointer maths
	* Add LKDTM test

[0]: https://github.com/linuxppc/issues/issues/224
[1]: https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.amit@gmail.com/

Christopher M. Riedl (4):
  powerpc/64s: Introduce temporary mm for Radix MMU
  powerpc: Rework and improve STRICT_KERNEL_RWX patching
  powerpc: Use WARN_ON and fix check in poking_init
  powerpc/64s: Initialize and use a temporary mm for patching on Radix

 arch/powerpc/include/asm/debug.h |   1 +
 arch/powerpc/kernel/process.c    |   5 +
 arch/powerpc/lib/code-patching.c | 215 ++++++++++++++++++++++++++-----
 3 files changed, 191 insertions(+), 30 deletions(-)

-- 
2.32.0



* [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for Radix MMU
  2021-09-11  2:29 [PATCH v6 0/4] Use per-CPU temporary mappings for patching on Radix MMU Christopher M. Riedl
@ 2021-09-11  2:29 ` Christopher M. Riedl
  2021-09-11  8:26   ` Jordan Niethe
  2021-09-11  2:29 ` [PATCH v6 2/4] powerpc: Rework and improve STRICT_KERNEL_RWX patching Christopher M. Riedl
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening

x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. Another benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use - this may include breakpoints set by perf.
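
A minimal sketch of the intended usage pattern (a later patch in this
series wires it up for code patching; patching_mm stands for whichever mm
the caller has prepared):

	struct temp_mm temp_mm;

	/* interrupts must already be disabled by the caller */
	init_temp_mm(&temp_mm, patching_mm);
	start_using_temporary_mm(&temp_mm);	/* saves + disables HW breakpoints */

	/* ... perform the sensitive operation via the temporary mapping ... */

	stop_using_temporary_mm(&temp_mm);	/* restores HW breakpoints */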

Based on x86 implementation:

commit cefa929c034e
("x86/mm: Introduce temporary mm structs")

Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>

---

v6:  * Use {start,stop}_using_temporary_mm() instead of
       {use,unuse}_temporary_mm() as suggested by Christophe.

v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.

v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
       using/unusing the temp mm as suggested by Jann Horn to keep
       the context.active counter in-sync on mm/nohash.
     * Disable SLB preload in the temporary mm when initializing the
       temp_mm struct.
     * Include asm/debug.h header to fix build issue with
       ppc44x_defconfig.
---
 arch/powerpc/include/asm/debug.h |  1 +
 arch/powerpc/kernel/process.c    |  5 +++
 arch/powerpc/lib/code-patching.c | 56 ++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 86a14736c76c..dfd82635ea8b 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 50436b52c213..6aa1f5c4d520 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
 	return 0;
 }
 
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+{
+	memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
+}
+
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 {
 	memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index f9a3019e37b4..8d61a7d35b89 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -17,6 +17,9 @@
 #include <asm/code-patching.h>
 #include <asm/setup.h>
 #include <asm/inst.h>
+#include <asm/mmu_context.h>
+#include <asm/debug.h>
+#include <asm/tlb.h>
 
 static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 *patch_addr)
 {
@@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+struct temp_mm {
+	struct mm_struct *temp;
+	struct mm_struct *prev;
+	struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+};
+
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+{
+	/* We currently only support temporary mm on the Book3s64 Radix MMU */
+	WARN_ON(!radix_enabled());
+
+	temp_mm->temp = mm;
+	temp_mm->prev = NULL;
+	memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
+}
+
+static inline void start_using_temporary_mm(struct temp_mm *temp_mm)
+{
+	lockdep_assert_irqs_disabled();
+
+	temp_mm->prev = current->active_mm;
+	switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
+
+	WARN_ON(!mm_is_thread_local(temp_mm->temp));
+
+	if (ppc_breakpoint_available()) {
+		struct arch_hw_breakpoint null_brk = {0};
+		int i = 0;
+
+		for (; i < nr_wp_slots(); ++i) {
+			__get_breakpoint(i, &temp_mm->brk[i]);
+			if (temp_mm->brk[i].type != 0)
+				__set_breakpoint(i, &null_brk);
+		}
+	}
+}
+
+static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
+{
+	lockdep_assert_irqs_disabled();
+
+	switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
+
+	if (ppc_breakpoint_available()) {
+		int i = 0;
+
+		for (; i < nr_wp_slots(); ++i)
+			if (temp_mm->brk[i].type != 0)
+				__set_breakpoint(i, &temp_mm->brk[i]);
+	}
+}
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
 static int text_area_cpu_up(unsigned int cpu)
-- 
2.32.0



* [PATCH v6 2/4] powerpc: Rework and improve STRICT_KERNEL_RWX patching
  2021-09-11  2:29 [PATCH v6 0/4] Use per-CPU temporary mappings for patching on Radix MMU Christopher M. Riedl
  2021-09-11  2:29 ` [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for " Christopher M. Riedl
@ 2021-09-11  2:29 ` Christopher M. Riedl
  2021-09-11  2:29 ` [PATCH v6 3/4] powerpc: Use WARN_ON and fix check in poking_init Christopher M. Riedl
  2021-09-11  2:29 ` [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix Christopher M. Riedl
  3 siblings, 0 replies; 13+ messages in thread
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening

Rework code-patching with STRICT_KERNEL_RWX to prepare for a later patch
which uses a temporary mm for patching under the Book3s64 Radix MMU.
Also improve the code by adding a WARN_ON() when the patch site does not
match the expected instruction after patching, and by properly returning
the error from __patch_instruction().

Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>

---

v6:  * Remove the pr_warn() message from unmap_patch_area().

v5:  * New to series.
---
 arch/powerpc/lib/code-patching.c | 35 ++++++++++++++++----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 8d61a7d35b89..8d0bb86125d5 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -102,6 +102,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
 }
 
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
 
 static int text_area_cpu_up(unsigned int cpu)
 {
@@ -114,6 +115,7 @@ static int text_area_cpu_up(unsigned int cpu)
 		return -1;
 	}
 	this_cpu_write(text_poke_area, area);
+	this_cpu_write(cpu_patching_addr, (unsigned long)area->addr);
 
 	return 0;
 }
@@ -139,7 +141,7 @@ void __init poking_init(void)
 /*
  * This can be called for kernel text or a module.
  */
-static int map_patch_area(void *addr, unsigned long text_poke_addr)
+static int map_patch_area(void *addr)
 {
 	unsigned long pfn;
 	int err;
@@ -149,17 +151,20 @@ static int map_patch_area(void *addr, unsigned long text_poke_addr)
 	else
 		pfn = __pa_symbol(addr) >> PAGE_SHIFT;
 
-	err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
+	err = map_kernel_page(__this_cpu_read(cpu_patching_addr),
+			      (pfn << PAGE_SHIFT), PAGE_KERNEL);
 
-	pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
+	pr_devel("Mapped addr %lx with pfn %lx:%d\n",
+		 __this_cpu_read(cpu_patching_addr), pfn, err);
 	if (err)
 		return -1;
 
 	return 0;
 }
 
-static inline int unmap_patch_area(unsigned long addr)
+static inline int unmap_patch_area(void)
 {
+	unsigned long addr = __this_cpu_read(cpu_patching_addr);
 	pte_t *ptep;
 	pmd_t *pmdp;
 	pud_t *pudp;
@@ -199,11 +204,9 @@ static inline int unmap_patch_area(unsigned long addr)
 
 static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
 {
-	int err;
+	int err, rc = 0;
 	u32 *patch_addr = NULL;
 	unsigned long flags;
-	unsigned long text_poke_addr;
-	unsigned long kaddr = (unsigned long)addr;
 
 	/*
 	 * During early early boot patch_instruction is called
@@ -215,24 +218,20 @@ static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
 
 	local_irq_save(flags);
 
-	text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
-	if (map_patch_area(addr, text_poke_addr)) {
-		err = -1;
+	err = map_patch_area(addr);
+	if (err)
 		goto out;
-	}
 
-	patch_addr = (u32 *)(text_poke_addr + (kaddr & ~PAGE_MASK));
+	patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
+	rc = __patch_instruction(addr, instr, patch_addr);
 
-	__patch_instruction(addr, instr, patch_addr);
-
-	err = unmap_patch_area(text_poke_addr);
-	if (err)
-		pr_warn("failed to unmap %lx\n", text_poke_addr);
+	err = unmap_patch_area();
 
 out:
 	local_irq_restore(flags);
+	WARN_ON(!ppc_inst_equal(ppc_inst_read(addr), instr));
 
-	return err;
+	return rc ? rc : err;
 }
 #else /* !CONFIG_STRICT_KERNEL_RWX */
 
-- 
2.32.0



* [PATCH v6 3/4] powerpc: Use WARN_ON and fix check in poking_init
  2021-09-11  2:29 [PATCH v6 0/4] Use per-CPU temporary mappings for patching on Radix MMU Christopher M. Riedl
  2021-09-11  2:29 ` [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for " Christopher M. Riedl
  2021-09-11  2:29 ` [PATCH v6 2/4] powerpc: Rework and improve STRICT_KERNEL_RWX patching Christopher M. Riedl
@ 2021-09-11  2:29 ` Christopher M. Riedl
  2021-09-11  2:29 ` [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix Christopher M. Riedl
  3 siblings, 0 replies; 13+ messages in thread
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening

The latest kernel docs list BUG_ON() as 'deprecated' and state that it
should be replaced with WARN_ON() (or pr_warn()) when possible. The
BUG_ON() in poking_init() warrants a WARN_ON() rather than a pr_warn()
since the error condition is deemed "unreachable".

Also take this opportunity to fix the failure check in the WARN_ON():
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ...) returns a positive integer
on success and a negative integer on failure.

Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>

---

v6:  * New to series - based on Christophe's relentless feedback in the
       crusade against BUG_ON()s :)
---
 arch/powerpc/lib/code-patching.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 8d0bb86125d5..e802e42c2789 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -126,16 +126,11 @@ static int text_area_cpu_down(unsigned int cpu)
 	return 0;
 }
 
-/*
- * Although BUG_ON() is rude, in this case it should only happen if ENOMEM, and
- * we judge it as being preferable to a kernel that will crash later when
- * someone tries to use patch_instruction().
- */
 void __init poking_init(void)
 {
-	BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+	WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
 		"powerpc/text_poke:online", text_area_cpu_up,
-		text_area_cpu_down));
+		text_area_cpu_down) < 0);
 }
 
 /*
-- 
2.32.0



* [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
  2021-09-11  2:29 [PATCH v6 0/4] Use per-CPU temporary mappings for patching on Radix MMU Christopher M. Riedl
                   ` (2 preceding siblings ...)
  2021-09-11  2:29 ` [PATCH v6 3/4] powerpc: Use WARN_ON and fix check in poking_init Christopher M. Riedl
@ 2021-09-11  2:29 ` Christopher M. Riedl
  2021-09-11  9:14   ` Jordan Niethe
  2021-09-15  4:24   ` Jordan Niethe
  3 siblings, 2 replies; 13+ messages in thread
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening

When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped as writeable. Currently, a
per-cpu vmalloc patch area is used for this purpose. While the patch
area is per-cpu, the temporary page mapping is inserted into the kernel
page tables for the duration of patching. The mapping is exposed to CPUs
other than the patching CPU - this is undesirable from a hardening
perspective. Use a temporary mm instead which keeps the mapping local to
the CPU doing the patching.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE.

Bits of entropy with 64K page size on BOOK3S_64:

        bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

        PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
        bits of entropy = log2(128TB / 64K)
        bits of entropy = 31
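
For comparison, a 4K page size under the same window would give:

        bits of entropy = log2(128TB / 4K)
        bits of entropy = 35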

The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
operates - by default the space above DEFAULT_MAP_WINDOW is not
available. Currently the Hash MMU does not use a temporary mm so
technically this upper limit isn't necessary; however, a larger
randomization range does not further "harden" this overall approach and
future work may introduce patching with a temporary mm on Hash as well.

Randomization occurs only once during initialization at boot for each
possible CPU in the system.

Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
respectively create and remove the temporary mapping with write
permissions at patching_addr. Map the page with PAGE_KERNEL to set
EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
KUAP) according to PowerISA v3.0b Figure 35 on Radix.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>

---

v6:  * Small clean-ups (naming, formatting, style, etc).
     * Call stop_using_temporary_mm() before pte_unmap_unlock() after
       patching.
     * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.

v5:  * Only support Book3s64 Radix MMU for now.
     * Use a per-cpu datastructure to hold the patching_addr and
       patching_mm to avoid the need for a synchronization lock/mutex.

v4:  * In the previous series this was two separate patches: one to init
       the temporary mm in poking_init() (unused in powerpc at the time)
       and the other to use it for patching (which removed all the
       per-cpu vmalloc code). Now that we use poking_init() in the
       existing per-cpu vmalloc approach, that separation doesn't work
       as nicely anymore so I just merged the two patches into one.
     * Preload the SLB entry and hash the page for the patching_addr
       when using Hash on book3s64 to avoid taking an SLB and Hash fault
       during patching. The previous implementation was a hack which
       changed current->mm to allow the SLB and Hash fault handlers to
       work with the temporary mm since both of those code-paths always
       assume mm == current->mm.
     * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
       have to manage the mm->context.active_cpus counter and mm cpumask
       since they determine (via mm_is_thread_local()) if the TLB flush
       in pte_clear() is local or not - it should always be local when
       we're using the temporary mm. On book3s64's Radix MMU we can
       just call local_flush_tlb_mm().
     * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
       KUAP.
---
 arch/powerpc/lib/code-patching.c | 119 +++++++++++++++++++++++++++++--
 1 file changed, 112 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index e802e42c2789..af8e2a02a9dd 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,7 @@
 #include <linux/cpuhotplug.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/random.h>
 
 #include <asm/tlbflush.h>
 #include <asm/page.h>
@@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
 
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
+static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
 
 static int text_area_cpu_up(unsigned int cpu)
 {
@@ -126,8 +128,48 @@ static int text_area_cpu_down(unsigned int cpu)
 	return 0;
 }
 
+static __always_inline void __poking_init_temp_mm(void)
+{
+	int cpu;
+	spinlock_t *ptl; /* for protecting pte table */
+	pte_t *ptep;
+	struct mm_struct *patching_mm;
+	unsigned long patching_addr;
+
+	for_each_possible_cpu(cpu) {
+		patching_mm = copy_init_mm();
+		WARN_ON(!patching_mm);
+		per_cpu(cpu_patching_mm, cpu) = patching_mm;
+
+		/*
+		 * Choose a randomized, page-aligned address from the range:
+		 * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
+		 * address bound is PAGE_SIZE to avoid the zero-page.  The
+		 * upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
+		 * stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
+		 */
+		patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
+				% (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
+		per_cpu(cpu_patching_addr, cpu) = patching_addr;
+
+		/*
+		 * PTE allocation uses GFP_KERNEL which means we need to
+		 * pre-allocate the PTE here because we cannot do the
+		 * allocation during patching when IRQs are disabled.
+		 */
+		ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
+		WARN_ON(!ptep);
+		pte_unmap_unlock(ptep, ptl);
+	}
+}
+
 void __init poking_init(void)
 {
+	if (radix_enabled()) {
+		__poking_init_temp_mm();
+		return;
+	}
+
 	WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
 		"powerpc/text_poke:online", text_area_cpu_up,
 		text_area_cpu_down) < 0);
@@ -197,30 +239,93 @@ static inline int unmap_patch_area(void)
 	return 0;
 }
 
+struct patch_mapping {
+	spinlock_t *ptl; /* for protecting pte table */
+	pte_t *ptep;
+	struct temp_mm temp_mm;
+};
+
+/*
+ * This can be called for kernel text or a module.
+ */
+static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
+{
+	struct page *page;
+	struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
+	unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
+
+	if (is_vmalloc_or_module_addr(addr))
+		page = vmalloc_to_page(addr);
+	else
+		page = virt_to_page(addr);
+
+	patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
+					     &patch_mapping->ptl);
+	if (unlikely(!patch_mapping->ptep)) {
+		pr_warn("map patch: failed to allocate pte for patching\n");
+		return -1;
+	}
+
+	set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
+		   pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
+
+	init_temp_mm(&patch_mapping->temp_mm, patching_mm);
+	start_using_temporary_mm(&patch_mapping->temp_mm);
+
+	return 0;
+}
+
+static int unmap_patch_mm(struct patch_mapping *patch_mapping)
+{
+	struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
+	unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
+
+	pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
+
+	local_flush_tlb_mm(patching_mm);
+	stop_using_temporary_mm(&patch_mapping->temp_mm);
+
+	pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
+
+	return 0;
+}
+
 static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
 {
 	int err, rc = 0;
 	u32 *patch_addr = NULL;
 	unsigned long flags;
+	struct patch_mapping patch_mapping;
 
 	/*
-	 * During early early boot patch_instruction is called
-	 * when text_poke_area is not ready, but we still need
-	 * to allow patching. We just do the plain old patching
+	 * During early early boot patch_instruction is called when the
+	 * patching_mm/text_poke_area is not ready, but we still need to allow
+	 * patching. We just do the plain old patching.
 	 */
-	if (!this_cpu_read(text_poke_area))
-		return raw_patch_instruction(addr, instr);
+	if (radix_enabled()) {
+		if (!this_cpu_read(cpu_patching_mm))
+			return raw_patch_instruction(addr, instr);
+	} else {
+		if (!this_cpu_read(text_poke_area))
+			return raw_patch_instruction(addr, instr);
+	}
 
 	local_irq_save(flags);
 
-	err = map_patch_area(addr);
+	if (radix_enabled())
+		err = map_patch_mm(addr, &patch_mapping);
+	else
+		err = map_patch_area(addr);
 	if (err)
 		goto out;
 
 	patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
 	rc = __patch_instruction(addr, instr, patch_addr);
 
-	err = unmap_patch_area();
+	if (radix_enabled())
+		err = unmap_patch_mm(&patch_mapping);
+	else
+		err = unmap_patch_area();
 
 out:
 	local_irq_restore(flags);
-- 
2.32.0



* Re: [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for Radix MMU
  2021-09-11  2:29 ` [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for " Christopher M. Riedl
@ 2021-09-11  8:26   ` Jordan Niethe
  2021-09-16  0:24     ` Christopher M. Riedl
  0 siblings, 1 reply; 13+ messages in thread
From: Jordan Niethe @ 2021-09-11  8:26 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, linux-hardening

On Sat, Sep 11, 2021 at 12:35 PM Christopher M. Riedl
<cmr@bluescreens.de> wrote:
>
> x86 supports the notion of a temporary mm which restricts access to
> temporary PTEs to a single CPU. A temporary mm is useful for situations
> where a CPU needs to perform sensitive operations (such as patching a
> STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> said mappings to other CPUs. Another benefit is that other CPU TLBs do
> not need to be flushed when the temporary mm is torn down.
>
> Mappings in the temporary mm can be set in the userspace portion of the
> address-space.
>
> Interrupts must be disabled while the temporary mm is in use. HW
> breakpoints, which may have been set by userspace as watchpoints on
> addresses now within the temporary mm, are saved and disabled when
> loading the temporary mm. The HW breakpoints are restored when unloading
> the temporary mm. All HW breakpoints are indiscriminately disabled while
> the temporary mm is in use - this may include breakpoints set by perf.

I had thought CPUs with a DAWR might not need to do this because the
privilege level that breakpoints trigger on can be configured. But it
turns out that in ptrace, etc. we use HW_BRK_TYPE_PRIV_ALL.

>
> Based on x86 implementation:
>
> commit cefa929c034e
> ("x86/mm: Introduce temporary mm structs")
>
> Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
>
> ---
>
> v6:  * Use {start,stop}_using_temporary_mm() instead of
>        {use,unuse}_temporary_mm() as suggested by Christophe.
>
> v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.
>
> v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
>        using/unusing the temp mm as suggested by Jann Horn to keep
>        the context.active counter in-sync on mm/nohash.
>      * Disable SLB preload in the temporary mm when initializing the
>        temp_mm struct.
>      * Include asm/debug.h header to fix build issue with
>        ppc44x_defconfig.
> ---
>  arch/powerpc/include/asm/debug.h |  1 +
>  arch/powerpc/kernel/process.c    |  5 +++
>  arch/powerpc/lib/code-patching.c | 56 ++++++++++++++++++++++++++++++++
>  3 files changed, 62 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
> index 86a14736c76c..dfd82635ea8b 100644
> --- a/arch/powerpc/include/asm/debug.h
> +++ b/arch/powerpc/include/asm/debug.h
> @@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
>  #endif
>
>  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
>  bool ppc_breakpoint_available(void);
>  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
>  extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 50436b52c213..6aa1f5c4d520 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
>         return 0;
>  }
>
> +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> +{
> +       memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
> +}

The breakpoint code is already a little hard to follow. I'm worried
doing this might spread breakpoint handling into more places in the
future.
What about something like having a breakpoint_pause() function which
clears the hardware registers only and then a breakpoint_resume()
function that copies from current_brk[] back to the hardware
registers?
Then we don't have to make another copy of the breakpoint state.
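
Roughly something like this (just a sketch; it assumes __set_breakpoint()
is split so that a hardware-only helper - called set_hw_breakpoint() here,
a made-up name - programs the registers without touching the saved
current_brk[] state):

	void breakpoint_pause(void)
	{
		struct arch_hw_breakpoint null_brk = {0};
		int i;

		/* clear the hardware registers only; current_brk[] keeps the state */
		for (i = 0; i < nr_wp_slots(); i++)
			set_hw_breakpoint(i, &null_brk);
	}

	void breakpoint_resume(void)
	{
		int i;

		/* reprogram the hardware from the saved per-CPU state */
		for (i = 0; i < nr_wp_slots(); i++)
			set_hw_breakpoint(i, this_cpu_ptr(&current_brk[i]));
	}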

> +
>  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
>  {
>         memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index f9a3019e37b4..8d61a7d35b89 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c

Sorry I might have missed it, but what was the reason for not putting
this stuff in mmu_context.h?

> @@ -17,6 +17,9 @@
>  #include <asm/code-patching.h>
>  #include <asm/setup.h>
>  #include <asm/inst.h>
> +#include <asm/mmu_context.h>
> +#include <asm/debug.h>
> +#include <asm/tlb.h>
>
>  static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 *patch_addr)
>  {
> @@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
>  }
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +struct temp_mm {
> +       struct mm_struct *temp;
> +       struct mm_struct *prev;
> +       struct arch_hw_breakpoint brk[HBP_NUM_MAX];
^ Then we wouldn't need this.
> +};
> +
> +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
> +{
> +       /* We currently only support temporary mm on the Book3s64 Radix MMU */
> +       WARN_ON(!radix_enabled());
> +
> +       temp_mm->temp = mm;
> +       temp_mm->prev = NULL;
> +       memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> +}
> +
> +static inline void start_using_temporary_mm(struct temp_mm *temp_mm)
> +{
> +       lockdep_assert_irqs_disabled();
> +
> +       temp_mm->prev = current->active_mm;
> +       switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);

Now that we only support radix it should be fine again to have it like this:
switch_mm_irqs_off(NULL, temp_mm->temp, current);?
It was changed from that because it would cause issues on nohash I thought.

> +
> +       WARN_ON(!mm_is_thread_local(temp_mm->temp));
> +
> +       if (ppc_breakpoint_available()) {
> +               struct arch_hw_breakpoint null_brk = {0};
> +               int i = 0;
> +
> +               for (; i < nr_wp_slots(); ++i) {
> +                       __get_breakpoint(i, &temp_mm->brk[i]);
> +                       if (temp_mm->brk[i].type != 0)
> +                               __set_breakpoint(i, &null_brk);
> +               }
> +       }
> +}
> +
> +static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
> +{
> +       lockdep_assert_irqs_disabled();
> +
> +       switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
> +
> +       if (ppc_breakpoint_available()) {
> +               int i = 0;
> +
> +               for (; i < nr_wp_slots(); ++i)
> +                       if (temp_mm->brk[i].type != 0)
> +                               __set_breakpoint(i, &temp_mm->brk[i]);
> +       }
> +}
> +
>  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
>
>  static int text_area_cpu_up(unsigned int cpu)
> --
> 2.32.0
>
Thanks,
Jordan


* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
  2021-09-11  2:29 ` [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix Christopher M. Riedl
@ 2021-09-11  9:14   ` Jordan Niethe
  2021-09-16  0:29     ` Christopher M. Riedl
  2021-09-15  4:24   ` Jordan Niethe
  1 sibling, 1 reply; 13+ messages in thread
From: Jordan Niethe @ 2021-09-11  9:14 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, linux-hardening

On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
<cmr@bluescreens.de> wrote:
>
> When code patching a STRICT_KERNEL_RWX kernel the page containing the
> address to be patched is temporarily mapped as writeable. Currently, a
> per-cpu vmalloc patch area is used for this purpose. While the patch
> area is per-cpu, the temporary page mapping is inserted into the kernel
> page tables for the duration of patching. The mapping is exposed to CPUs
> other than the patching CPU - this is undesirable from a hardening
> perspective. Use a temporary mm instead which keeps the mapping local to
> the CPU doing the patching.
>
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> space. The patching address is randomized between PAGE_SIZE and
> DEFAULT_MAP_WINDOW-PAGE_SIZE.
>
> Bits of entropy with 64K page size on BOOK3S_64:
>
>         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
>
>         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
>         bits of entropy = log2(128TB / 64K)
>         bits of entropy = 31
>
> The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> operates - by default the space above DEFAULT_MAP_WINDOW is not
> available. Currently the Hash MMU does not use a temporary mm so
> technically this upper limit isn't necessary; however, a larger
> randomization range does not further "harden" this overall approach and
> future work may introduce patching with a temporary mm on Hash as well.
>
> Randomization occurs only once during initialization at boot for each
> possible CPU in the system.
>
> Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> respectively create and remove the temporary mapping with write
> permissions at patching_addr. Map the page with PAGE_KERNEL to set
> EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> KUAP) according to PowerISA v3.0b Figure 35 on Radix.
>
> Based on x86 implementation:
>
> commit 4fc19708b165
> ("x86/alternatives: Initialize temporary mm for patching")
>
> and:
>
> commit b3fd8e83ada0
> ("x86/alternatives: Use temporary mm for text poking")
>
> Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
>
> ---
>
> v6:  * Small clean-ups (naming, formatting, style, etc).
>      * Call stop_using_temporary_mm() before pte_unmap_unlock() after
>        patching.
>      * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
>
> v5:  * Only support Book3s64 Radix MMU for now.
>      * Use a per-cpu datastructure to hold the patching_addr and
>        patching_mm to avoid the need for a synchronization lock/mutex.
>
> v4:  * In the previous series this was two separate patches: one to init
>        the temporary mm in poking_init() (unused in powerpc at the time)
>        and the other to use it for patching (which removed all the
>        per-cpu vmalloc code). Now that we use poking_init() in the
>        existing per-cpu vmalloc approach, that separation doesn't work
>        as nicely anymore so I just merged the two patches into one.
>      * Preload the SLB entry and hash the page for the patching_addr
>        when using Hash on book3s64 to avoid taking an SLB and Hash fault
>        during patching. The previous implementation was a hack which
>        changed current->mm to allow the SLB and Hash fault handlers to
>        work with the temporary mm since both of those code-paths always
>        assume mm == current->mm.
>      * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
>        have to manage the mm->context.active_cpus counter and mm cpumask
>        since they determine (via mm_is_thread_local()) if the TLB flush
>        in pte_clear() is local or not - it should always be local when
>        we're using the temporary mm. On book3s64's Radix MMU we can
>        just call local_flush_tlb_mm().
>      * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
>        KUAP.
> ---
>  arch/powerpc/lib/code-patching.c | 119 +++++++++++++++++++++++++++++--
>  1 file changed, 112 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index e802e42c2789..af8e2a02a9dd 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -11,6 +11,7 @@
>  #include <linux/cpuhotplug.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> +#include <linux/random.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/page.h>
> @@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
>
>  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
>  static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> +static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
>
>  static int text_area_cpu_up(unsigned int cpu)
>  {
> @@ -126,8 +128,48 @@ static int text_area_cpu_down(unsigned int cpu)
>         return 0;
>  }
>
> +static __always_inline void __poking_init_temp_mm(void)
> +{
> +       int cpu;
> +       spinlock_t *ptl; /* for protecting pte table */

ptl is just used so we don't have to open-code allocating a pte in
patching_mm, isn't it?

> +       pte_t *ptep;
> +       struct mm_struct *patching_mm;
> +       unsigned long patching_addr;
> +
> +       for_each_possible_cpu(cpu) {
> +               patching_mm = copy_init_mm();
> +               WARN_ON(!patching_mm);

Would it be okay to just let the MMU handle null pointer dereferences?

> +               per_cpu(cpu_patching_mm, cpu) = patching_mm;
> +
> +               /*
> +                * Choose a randomized, page-aligned address from the range:
> +                * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
> +                * address bound is PAGE_SIZE to avoid the zero-page.  The
> +                * upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
> +                * stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> +                */
> +               patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> +                               % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
> +               per_cpu(cpu_patching_addr, cpu) = patching_addr;

On x86 the randomization depends on CONFIG_RANDOMIZE_BASE. Should it
be controllable here too?
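
i.e. something like this (just a sketch; CONFIG_RANDOMIZE_BASE stands in
for whatever Kconfig knob would actually be appropriate on powerpc):

	if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
		patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
				% (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
	else
		patching_addr = PAGE_SIZE;	/* fixed, non-randomized address */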

> +
> +               /*
> +                * PTE allocation uses GFP_KERNEL which means we need to
> +                * pre-allocate the PTE here because we cannot do the
> +                * allocation during patching when IRQs are disabled.
> +                */
> +               ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> +               WARN_ON(!ptep);
> +               pte_unmap_unlock(ptep, ptl);
> +       }
> +}
> +
>  void __init poking_init(void)
>  {
> +       if (radix_enabled()) {
> +               __poking_init_temp_mm();

Should this also be done with cpuhp_setup_state()?

> +               return;
> +       }
> +
>         WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
>                 "powerpc/text_poke:online", text_area_cpu_up,
>                 text_area_cpu_down) < 0);
> @@ -197,30 +239,93 @@ static inline int unmap_patch_area(void)
>         return 0;
>  }
>
> +struct patch_mapping {
> +       spinlock_t *ptl; /* for protecting pte table */
> +       pte_t *ptep;
> +       struct temp_mm temp_mm;
> +};
> +
> +/*
> + * This can be called for kernel text or a module.
> + */
> +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> +{
> +       struct page *page;
> +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> +
> +       if (is_vmalloc_or_module_addr(addr))
> +               page = vmalloc_to_page(addr);
> +       else
> +               page = virt_to_page(addr);
> +
> +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> +                                            &patch_mapping->ptl);
> +       if (unlikely(!patch_mapping->ptep)) {
> +               pr_warn("map patch: failed to allocate pte for patching\n");
> +               return -1;
> +       }
> +
> +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
> +
> +       init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> +       start_using_temporary_mm(&patch_mapping->temp_mm);
> +
> +       return 0;
> +}
> +
> +static int unmap_patch_mm(struct patch_mapping *patch_mapping)
> +{
> +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> +
> +       pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
> +
> +       local_flush_tlb_mm(patching_mm);
> +       stop_using_temporary_mm(&patch_mapping->temp_mm);
> +
> +       pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
> +
> +       return 0;
> +}
> +
>  static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
>  {
>         int err, rc = 0;
>         u32 *patch_addr = NULL;
>         unsigned long flags;
> +       struct patch_mapping patch_mapping;
>
>         /*
> -        * During early early boot patch_instruction is called
> -        * when text_poke_area is not ready, but we still need
> -        * to allow patching. We just do the plain old patching
> +        * During early early boot patch_instruction is called when the
> +        * patching_mm/text_poke_area is not ready, but we still need to allow
> +        * patching. We just do the plain old patching.
>          */
> -       if (!this_cpu_read(text_poke_area))
> -               return raw_patch_instruction(addr, instr);
> +       if (radix_enabled()) {
> +               if (!this_cpu_read(cpu_patching_mm))
> +                       return raw_patch_instruction(addr, instr);
> +       } else {
> +               if (!this_cpu_read(text_poke_area))
> +                       return raw_patch_instruction(addr, instr);
> +       }

Would testing cpu_patching_addr handle both of these cases?

Then I think it might be clearer to do something like this:
if (radix_enabled()) {
      return patch_instruction_mm(addr, instr);
}

patch_instruction_mm() would combine map_patch_mm(), then patching and
unmap_patch_mm() into one function.

IMO, a bit of code duplication would be cleaner than checking
radix_enabled() multiple times and having struct patch_mapping just for
maintaining state.
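
For instance, a sketch of that combined helper (names and error handling
taken from this series, not actual tested code):

	static int patch_instruction_mm(u32 *addr, struct ppc_inst instr)
	{
		struct patch_mapping patch_mapping;
		unsigned long flags;
		u32 *patch_addr;
		int err, rc = 0;

		local_irq_save(flags);

		err = map_patch_mm(addr, &patch_mapping);
		if (err)
			goto out;

		patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) |
				     offset_in_page(addr));
		rc = __patch_instruction(addr, instr, patch_addr);

		err = unmap_patch_mm(&patch_mapping);

	out:
		local_irq_restore(flags);
		return rc ? rc : err;
	}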

>
>         local_irq_save(flags);
>
> -       err = map_patch_area(addr);
> +       if (radix_enabled())
> +               err = map_patch_mm(addr, &patch_mapping);
> +       else
> +               err = map_patch_area(addr);
>         if (err)
>                 goto out;
>
>         patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
>         rc = __patch_instruction(addr, instr, patch_addr);
>
> -       err = unmap_patch_area();
> +       if (radix_enabled())
> +               err = unmap_patch_mm(&patch_mapping);
> +       else
> +               err = unmap_patch_area();
>
>  out:
>         local_irq_restore(flags);
> --
> 2.32.0
>
Thanks,
Jordan


* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
  2021-09-11  2:29 ` [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix Christopher M. Riedl
  2021-09-11  9:14   ` Jordan Niethe
@ 2021-09-15  4:24   ` Jordan Niethe
  2021-09-16  0:45     ` Christopher M. Riedl
  1 sibling, 1 reply; 13+ messages in thread
From: Jordan Niethe @ 2021-09-15  4:24 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, linux-hardening

On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
<cmr@bluescreens.de> wrote:
>
> When code patching a STRICT_KERNEL_RWX kernel the page containing the
> address to be patched is temporarily mapped as writeable. Currently, a
> per-cpu vmalloc patch area is used for this purpose. While the patch
> area is per-cpu, the temporary page mapping is inserted into the kernel
> page tables for the duration of patching. The mapping is exposed to CPUs
> other than the patching CPU - this is undesirable from a hardening
> perspective. Use a temporary mm instead which keeps the mapping local to
> the CPU doing the patching.
>
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> space. The patching address is randomized between PAGE_SIZE and
> DEFAULT_MAP_WINDOW-PAGE_SIZE.
>
> Bits of entropy with 64K page size on BOOK3S_64:
>
>         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
>
>         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
>         bits of entropy = log2(128TB / 64K)
>         bits of entropy = 31
>
> The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> operates - by default the space above DEFAULT_MAP_WINDOW is not
> available. Currently the Hash MMU does not use a temporary mm so
> technically this upper limit isn't necessary; however, a larger
> randomization range does not further "harden" this overall approach and
> future work may introduce patching with a temporary mm on Hash as well.
>
> Randomization occurs only once during initialization at boot for each
> possible CPU in the system.
>
> Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> respectively create and remove the temporary mapping with write
> permissions at patching_addr. Map the page with PAGE_KERNEL to set
> EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> KUAP) according to PowerISA v3.0b Figure 35 on Radix.
>
> Based on x86 implementation:
>
> commit 4fc19708b165
> ("x86/alternatives: Initialize temporary mm for patching")
>
> and:
>
> commit b3fd8e83ada0
> ("x86/alternatives: Use temporary mm for text poking")
>
> Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
>
> ---
>
> v6:  * Small clean-ups (naming, formatting, style, etc).
>      * Call stop_using_temporary_mm() before pte_unmap_unlock() after
>        patching.
>      * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
>
> v5:  * Only support Book3s64 Radix MMU for now.
>      * Use a per-cpu datastructure to hold the patching_addr and
>        patching_mm to avoid the need for a synchronization lock/mutex.
>
> v4:  * In the previous series this was two separate patches: one to init
>        the temporary mm in poking_init() (unused in powerpc at the time)
>        and the other to use it for patching (which removed all the
>        per-cpu vmalloc code). Now that we use poking_init() in the
>        existing per-cpu vmalloc approach, that separation doesn't work
>        as nicely anymore so I just merged the two patches into one.
>      * Preload the SLB entry and hash the page for the patching_addr
>        when using Hash on book3s64 to avoid taking an SLB and Hash fault
>        during patching. The previous implementation was a hack which
>        changed current->mm to allow the SLB and Hash fault handlers to
>        work with the temporary mm since both of those code-paths always
>        assume mm == current->mm.
>      * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
>        have to manage the mm->context.active_cpus counter and mm cpumask
>        since they determine (via mm_is_thread_local()) if the TLB flush
>        in pte_clear() is local or not - it should always be local when
>        we're using the temporary mm. On book3s64's Radix MMU we can
>        just call local_flush_tlb_mm().
>      * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
>        KUAP.
> ---
>  arch/powerpc/lib/code-patching.c | 119 +++++++++++++++++++++++++++++--
>  1 file changed, 112 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index e802e42c2789..af8e2a02a9dd 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -11,6 +11,7 @@
>  #include <linux/cpuhotplug.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> +#include <linux/random.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/page.h>
> @@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
>
>  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
>  static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> +static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
>
>  static int text_area_cpu_up(unsigned int cpu)
>  {
> @@ -126,8 +128,48 @@ static int text_area_cpu_down(unsigned int cpu)
>         return 0;
>  }
>
> +static __always_inline void __poking_init_temp_mm(void)
> +{
> +       int cpu;
> +       spinlock_t *ptl; /* for protecting pte table */
> +       pte_t *ptep;
> +       struct mm_struct *patching_mm;
> +       unsigned long patching_addr;
> +
> +       for_each_possible_cpu(cpu) {
> +               patching_mm = copy_init_mm();
> +               WARN_ON(!patching_mm);
> +               per_cpu(cpu_patching_mm, cpu) = patching_mm;
> +
> +               /*
> +                * Choose a randomized, page-aligned address from the range:
> +                * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
> +                * address bound is PAGE_SIZE to avoid the zero-page.  The
> +                * upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
> +                * stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> +                */
> +               patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> +                               % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
> +               per_cpu(cpu_patching_addr, cpu) = patching_addr;
> +
> +               /*
> +                * PTE allocation uses GFP_KERNEL which means we need to
> +                * pre-allocate the PTE here because we cannot do the
> +                * allocation during patching when IRQs are disabled.
> +                */
> +               ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> +               WARN_ON(!ptep);
> +               pte_unmap_unlock(ptep, ptl);
> +       }
> +}
> +
>  void __init poking_init(void)
>  {
> +       if (radix_enabled()) {
> +               __poking_init_temp_mm();
> +               return;
> +       }
> +
>         WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
>                 "powerpc/text_poke:online", text_area_cpu_up,
>                 text_area_cpu_down) < 0);
> @@ -197,30 +239,93 @@ static inline int unmap_patch_area(void)
>         return 0;
>  }
>
> +struct patch_mapping {
> +       spinlock_t *ptl; /* for protecting pte table */
> +       pte_t *ptep;
> +       struct temp_mm temp_mm;
> +};
> +
> +/*
> + * This can be called for kernel text or a module.
> + */
> +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> +{
> +       struct page *page;
> +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> +
> +       if (is_vmalloc_or_module_addr(addr))
> +               page = vmalloc_to_page(addr);
> +       else
> +               page = virt_to_page(addr);
> +
> +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> +                                            &patch_mapping->ptl);
> +       if (unlikely(!patch_mapping->ptep)) {
> +               pr_warn("map patch: failed to allocate pte for patching\n");
> +               return -1;
> +       }
> +
> +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));

I think because switch_mm_irqs_off() will not necessarily have a
barrier, a ptesync would be needed here.
A spurious fault from __patch_instruction() would not be handled correctly.
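
For example, a sketch of where such a barrier could go:

	set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
		   pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
	/* order the PTE update before any access through the new mapping */
	asm volatile("ptesync" ::: "memory");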

> +
> +       init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> +       start_using_temporary_mm(&patch_mapping->temp_mm);
> +
> +       return 0;
> +}
> +
> +static int unmap_patch_mm(struct patch_mapping *patch_mapping)
> +{
> +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> +
> +       pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
> +
> +       local_flush_tlb_mm(patching_mm);
> +       stop_using_temporary_mm(&patch_mapping->temp_mm);
> +
> +       pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
> +
> +       return 0;
> +}
> +
>  static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
>  {
>         int err, rc = 0;
>         u32 *patch_addr = NULL;
>         unsigned long flags;
> +       struct patch_mapping patch_mapping;
>
>         /*
> -        * During early early boot patch_instruction is called
> -        * when text_poke_area is not ready, but we still need
> -        * to allow patching. We just do the plain old patching
> +        * During early early boot patch_instruction is called when the
> +        * patching_mm/text_poke_area is not ready, but we still need to allow
> +        * patching. We just do the plain old patching.
>          */
> -       if (!this_cpu_read(text_poke_area))
> -               return raw_patch_instruction(addr, instr);
> +       if (radix_enabled()) {
> +               if (!this_cpu_read(cpu_patching_mm))
> +                       return raw_patch_instruction(addr, instr);
> +       } else {
> +               if (!this_cpu_read(text_poke_area))
> +                       return raw_patch_instruction(addr, instr);
> +       }
>
>         local_irq_save(flags);
>
> -       err = map_patch_area(addr);
> +       if (radix_enabled())
> +               err = map_patch_mm(addr, &patch_mapping);
> +       else
> +               err = map_patch_area(addr);
>         if (err)
>                 goto out;
>
>         patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
>         rc = __patch_instruction(addr, instr, patch_addr);
>
> -       err = unmap_patch_area();
> +       if (radix_enabled())
> +               err = unmap_patch_mm(&patch_mapping);
> +       else
> +               err = unmap_patch_area();
>
>  out:
>         local_irq_restore(flags);
> --
> 2.32.0
>


* Re: [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for Radix MMU
  2021-09-11  8:26   ` Jordan Niethe
@ 2021-09-16  0:24     ` Christopher M. Riedl
  0 siblings, 0 replies; 13+ messages in thread
From: Christopher M. Riedl @ 2021-09-16  0:24 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev, linux-hardening

On Sat Sep 11, 2021 at 3:26 AM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:35 PM Christopher M. Riedl
> <cmr@bluescreens.de> wrote:
> >
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. Another benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> >
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> >
> > Interrupts must be disabled while the temporary mm is in use. HW
> > breakpoints, which may have been set by userspace as watchpoints on
> > addresses now within the temporary mm, are saved and disabled when
> > loading the temporary mm. The HW breakpoints are restored when unloading
> > the temporary mm. All HW breakpoints are indiscriminately disabled while
> > the temporary mm is in use - this may include breakpoints set by perf.
>
> I had thought CPUs with a DAWR might not need to do this because the
> privilege level that breakpoints trigger on can be configured. But it
> turns out in ptrace, etc we use HW_BRK_TYPE_PRIV_ALL.

Thanks for double checking :)

>
> >
> > Based on x86 implementation:
> >
> > commit cefa929c034e
> > ("x86/mm: Introduce temporary mm structs")
> >
> > Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
> >
> > ---
> >
> > v6:  * Use {start,stop}_using_temporary_mm() instead of
> >        {use,unuse}_temporary_mm() as suggested by Christophe.
> >
> > v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.
> >
> > v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
> >        using/unusing the temp mm as suggested by Jann Horn to keep
> >        the context.active counter in-sync on mm/nohash.
> >      * Disable SLB preload in the temporary mm when initializing the
> >        temp_mm struct.
> >      * Include asm/debug.h header to fix build issue with
> >        ppc44x_defconfig.
> > ---
> >  arch/powerpc/include/asm/debug.h |  1 +
> >  arch/powerpc/kernel/process.c    |  5 +++
> >  arch/powerpc/lib/code-patching.c | 56 ++++++++++++++++++++++++++++++++
> >  3 files changed, 62 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
> > index 86a14736c76c..dfd82635ea8b 100644
> > --- a/arch/powerpc/include/asm/debug.h
> > +++ b/arch/powerpc/include/asm/debug.h
> > @@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
> >  #endif
> >
> >  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> >  bool ppc_breakpoint_available(void);
> >  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> >  extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 50436b52c213..6aa1f5c4d520 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
> >         return 0;
> >  }
> >
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> > +{
> > +       memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
> > +}
>
> The breakpoint code is already a little hard to follow. I'm worried
> doing this might spread breakpoint handling into more places in the
> future.
> What about something like having a breakpoint_pause() function which
> clears the hardware registers only and then a breakpoint_resume()
> function that copies from current_brk[] back to the hardware
> registers?
> Then we don't have to make another copy of the breakpoint state.

I think that sounds reasonable - I'll add those functions instead with
the next spin.
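
Roughly something like the following, I think - just a sketch for now,
where hw_breakpoint_write() is a placeholder for whatever low-level
register write process.c actually uses, so that current_brk[] itself is
left untouched:

static void breakpoint_pause(void)
{
	struct arch_hw_breakpoint null_brk = {0};
	int i;

	/* Clear the hardware registers only; current_brk[] keeps the state. */
	for (i = 0; i < nr_wp_slots(); ++i)
		hw_breakpoint_write(i, &null_brk);
}

static void breakpoint_resume(void)
{
	int i;

	/* Re-arm the hardware registers from the saved per-CPU state. */
	for (i = 0; i < nr_wp_slots(); ++i)
		hw_breakpoint_write(i, this_cpu_ptr(&current_brk[i]));
}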

>
> > +
> >  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> >  {
> >         memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
> > diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> > index f9a3019e37b4..8d61a7d35b89 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
>
> Sorry I might have missed it, but what was the reason for not putting
> this stuff in mmu_context.h?

x86 ended up moving this code into their code-patching file as well. I
suppose nobody has thought of another use for a temporary mm like this
yet :)

>
> > @@ -17,6 +17,9 @@
> >  #include <asm/code-patching.h>
> >  #include <asm/setup.h>
> >  #include <asm/inst.h>
> > +#include <asm/mmu_context.h>
> > +#include <asm/debug.h>
> > +#include <asm/tlb.h>
> >
> >  static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 *patch_addr)
> >  {
> > @@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
> >  }
> >
> >  #ifdef CONFIG_STRICT_KERNEL_RWX
> > +
> > +struct temp_mm {
> > +       struct mm_struct *temp;
> > +       struct mm_struct *prev;
> > +       struct arch_hw_breakpoint brk[HBP_NUM_MAX];
> ^ Then we wouldn't need this.
> > +};
> > +
> > +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
> > +{
> > +       /* We currently only support temporary mm on the Book3s64 Radix MMU */
> > +       WARN_ON(!radix_enabled());
> > +
> > +       temp_mm->temp = mm;
> > +       temp_mm->prev = NULL;
> > +       memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> > +}
> > +
> > +static inline void start_using_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > +       lockdep_assert_irqs_disabled();
> > +
> > +       temp_mm->prev = current->active_mm;
> > +       switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
>
> Now that we only support radix it should be fine again to have it like
> this:
> switch_mm_irqs_off(NULL, temp_mm->temp, current);?
> It was changed from that because it would cause issues on nohash I
> thought.

That's true - but if we want to support the other MMUs in the future
I'd rather just keep it as-is. AFAICS there's no harm in passing
temp_mm->prev here instead of NULL.

>
> > +
> > +       WARN_ON(!mm_is_thread_local(temp_mm->temp));
> > +
> > +       if (ppc_breakpoint_available()) {
> > +               struct arch_hw_breakpoint null_brk = {0};
> > +               int i = 0;
> > +
> > +               for (; i < nr_wp_slots(); ++i) {
> > +                       __get_breakpoint(i, &temp_mm->brk[i]);
> > +                       if (temp_mm->brk[i].type != 0)
> > +                               __set_breakpoint(i, &null_brk);
> > +               }
> > +       }
> > +}
> > +
> > +static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > +       lockdep_assert_irqs_disabled();
> > +
> > +       switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
> > +
> > +       if (ppc_breakpoint_available()) {
> > +               int i = 0;
> > +
> > +               for (; i < nr_wp_slots(); ++i)
> > +                       if (temp_mm->brk[i].type != 0)
> > +                               __set_breakpoint(i, &temp_mm->brk[i]);
> > +       }
> > +}
> > +
> >  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >
> >  static int text_area_cpu_up(unsigned int cpu)
> > --
> > 2.32.0
> >
> Thanks,
> Jordan



* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
  2021-09-11  9:14   ` Jordan Niethe
@ 2021-09-16  0:29     ` Christopher M. Riedl
  2021-09-16  1:52       ` Jordan Niethe
  0 siblings, 1 reply; 13+ messages in thread
From: Christopher M. Riedl @ 2021-09-16  0:29 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev, linux-hardening

On Sat Sep 11, 2021 at 4:14 AM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> <cmr@bluescreens.de> wrote:
> >
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped as writeable. Currently, a
> > per-cpu vmalloc patch area is used for this purpose. While the patch
> > area is per-cpu, the temporary page mapping is inserted into the kernel
> > page tables for the duration of patching. The mapping is exposed to CPUs
> > other than the patching CPU - this is undesirable from a hardening
> > perspective. Use a temporary mm instead which keeps the mapping local to
> > the CPU doing the patching.
> >
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE.
> >
> > Bits of entropy with 64K page size on BOOK3S_64:
> >
> >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> >
> >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> >         bits of entropy = log2(128TB / 64K)
> >         bits of entropy = 31
> >
> > The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> > operates - by default the space above DEFAULT_MAP_WINDOW is not
> > available. Currently the Hash MMU does not use a temporary mm so
> > technically this upper limit isn't necessary; however, a larger
> > randomization range does not further "harden" this overall approach and
> > future work may introduce patching with a temporary mm on Hash as well.
> >
> > Randomization occurs only once during initialization at boot for each
> > possible CPU in the system.
> >
> > Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> > respectively create and remove the temporary mapping with write
> > permissions at patching_addr. Map the page with PAGE_KERNEL to set
> > EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> > KUAP) according to PowerISA v3.0b Figure 35 on Radix.
> >
> > Based on x86 implementation:
> >
> > commit 4fc19708b165
> > ("x86/alternatives: Initialize temporary mm for patching")
> >
> > and:
> >
> > commit b3fd8e83ada0
> > ("x86/alternatives: Use temporary mm for text poking")
> >
> > Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
> >
> > ---
> >
> > v6:  * Small clean-ups (naming, formatting, style, etc).
> >      * Call stop_using_temporary_mm() before pte_unmap_unlock() after
> >        patching.
> >      * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
> >
> > v5:  * Only support Book3s64 Radix MMU for now.
> >      * Use a per-cpu datastructure to hold the patching_addr and
> >        patching_mm to avoid the need for a synchronization lock/mutex.
> >
> > v4:  * In the previous series this was two separate patches: one to init
> >        the temporary mm in poking_init() (unused in powerpc at the time)
> >        and the other to use it for patching (which removed all the
> >        per-cpu vmalloc code). Now that we use poking_init() in the
> >        existing per-cpu vmalloc approach, that separation doesn't work
> >        as nicely anymore so I just merged the two patches into one.
> >      * Preload the SLB entry and hash the page for the patching_addr
> >        when using Hash on book3s64 to avoid taking an SLB and Hash fault
> >        during patching. The previous implementation was a hack which
> >        changed current->mm to allow the SLB and Hash fault handlers to
> >        work with the temporary mm since both of those code-paths always
> >        assume mm == current->mm.
> >      * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
> >        have to manage the mm->context.active_cpus counter and mm cpumask
> >        since they determine (via mm_is_thread_local()) if the TLB flush
> >        in pte_clear() is local or not - it should always be local when
> >        we're using the temporary mm. On book3s64's Radix MMU we can
> >        just call local_flush_tlb_mm().
> >      * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
> >        KUAP.
> > ---
> >  arch/powerpc/lib/code-patching.c | 119 +++++++++++++++++++++++++++++--
> >  1 file changed, 112 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> > index e802e42c2789..af8e2a02a9dd 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/cpuhotplug.h>
> >  #include <linux/slab.h>
> >  #include <linux/uaccess.h>
> > +#include <linux/random.h>
> >
> >  #include <asm/tlbflush.h>
> >  #include <asm/page.h>
> > @@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
> >
> >  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >  static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> > +static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
> >
> >  static int text_area_cpu_up(unsigned int cpu)
> >  {
> > @@ -126,8 +128,48 @@ static int text_area_cpu_down(unsigned int cpu)
> >         return 0;
> >  }
> >
> > +static __always_inline void __poking_init_temp_mm(void)
> > +{
> > +       int cpu;
> > +       spinlock_t *ptl; /* for protecting pte table */
>
> ptl is just used so we don't have to open code allocating a pte in
> patching_mm isn't it?

Yup - I think that comment was a copy-pasta... I'll improve it.

>
> > +       pte_t *ptep;
> > +       struct mm_struct *patching_mm;
> > +       unsigned long patching_addr;
> > +
> > +       for_each_possible_cpu(cpu) {
> > +               patching_mm = copy_init_mm();
> > +               WARN_ON(!patching_mm);
>
> Would it be okay to just let the mmu handle null pointer dereferences?

In general I think yes; however, the NULL dereference wouldn't occur
until later during actual patching so I thought an early WARN here is
appropriate. 

>
> > +               per_cpu(cpu_patching_mm, cpu) = patching_mm;
> > +
> > +               /*
> > +                * Choose a randomized, page-aligned address from the range:
> > +                * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
> > +                * address bound is PAGE_SIZE to avoid the zero-page.  The
> > +                * upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
> > +                * stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> > +                */
> > +               patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> > +                               % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
> > +               per_cpu(cpu_patching_addr, cpu) = patching_addr;
>
> On x86 the randomization depends on CONFIG_RANDOMIZE_BASE. Should it
> be controllable here too?

IIRC CONFIG_RANDOMIZE_BASE is for KASLR which IMO doesn't really have
much to do with this.

>
> > +
> > +               /*
> > +                * PTE allocation uses GFP_KERNEL which means we need to
> > +                * pre-allocate the PTE here because we cannot do the
> > +                * allocation during patching when IRQs are disabled.
> > +                */
> > +               ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> > +               WARN_ON(!ptep);
> > +               pte_unmap_unlock(ptep, ptl);
> > +       }
> > +}
> > +
> >  void __init poking_init(void)
> >  {
> > +       if (radix_enabled()) {
> > +               __poking_init_temp_mm();
>
> Should this also be done with cpuhp_setup_state()?

I think I prefer doing the setup ahead of time during boot.

>
> > +               return;
> > +       }
> > +
> >         WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> >                 "powerpc/text_poke:online", text_area_cpu_up,
> >                 text_area_cpu_down) < 0);
> > @@ -197,30 +239,93 @@ static inline int unmap_patch_area(void)
> >         return 0;
> >  }
> >
> > +struct patch_mapping {
> > +       spinlock_t *ptl; /* for protecting pte table */
> > +       pte_t *ptep;
> > +       struct temp_mm temp_mm;
> > +};
> > +
> > +/*
> > + * This can be called for kernel text or a module.
> > + */
> > +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> > +{
> > +       struct page *page;
> > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > +
> > +       if (is_vmalloc_or_module_addr(addr))
> > +               page = vmalloc_to_page(addr);
> > +       else
> > +               page = virt_to_page(addr);
> > +
> > +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > +                                            &patch_mapping->ptl);
> > +       if (unlikely(!patch_mapping->ptep)) {
> > +               pr_warn("map patch: failed to allocate pte for patching\n");
> > +               return -1;
> > +       }
> > +
> > +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
> > +
> > +       init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> > +       start_using_temporary_mm(&patch_mapping->temp_mm);
> > +
> > +       return 0;
> > +}
> > +
> > +static int unmap_patch_mm(struct patch_mapping *patch_mapping)
> > +{
> > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > +
> > +       pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
> > +
> > +       local_flush_tlb_mm(patching_mm);
> > +       stop_using_temporary_mm(&patch_mapping->temp_mm);
> > +
> > +       pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
> > +
> > +       return 0;
> > +}
> > +
> >  static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
> >  {
> >         int err, rc = 0;
> >         u32 *patch_addr = NULL;
> >         unsigned long flags;
> > +       struct patch_mapping patch_mapping;
> >
> >         /*
> > -        * During early early boot patch_instruction is called
> > -        * when text_poke_area is not ready, but we still need
> > -        * to allow patching. We just do the plain old patching
> > +        * During early early boot patch_instruction is called when the
> > +        * patching_mm/text_poke_area is not ready, but we still need to allow
> > +        * patching. We just do the plain old patching.
> >          */
> > -       if (!this_cpu_read(text_poke_area))
> > -               return raw_patch_instruction(addr, instr);
> > +       if (radix_enabled()) {
> > +               if (!this_cpu_read(cpu_patching_mm))
> > +                       return raw_patch_instruction(addr, instr);
> > +       } else {
> > +               if (!this_cpu_read(text_poke_area))
> > +                       return raw_patch_instruction(addr, instr);
> > +       }
>
> Would testing cpu_patching_addr handle both of these cases?
>
> Then I think it might be clearer to do something like this:
> if (radix_enabled()) {
> return patch_instruction_mm(addr, instr);
> }
>
> patch_instruction_mm() would combine map_patch_mm(), then patching and
> unmap_patch_mm() into one function.
>
> IMO, a bit of code duplication would be cleaner than checking multiple
> times for radix_enabled() and having struct patch_mapping especially
> for maintaining state.

Hmm, I think it's a good idea - I'll give it a go for the next version.
Thanks for the suggestion!
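
As a first stab, the combined helper could look roughly like this - a
sketch only, reusing the helpers from this patch, so the actual shape in
v7 may differ:

static int patch_instruction_mm(u32 *addr, struct ppc_inst instr)
{
	int err, rc = 0;
	u32 *patch_addr;
	unsigned long flags;
	struct patch_mapping patch_mapping;

	local_irq_save(flags);

	/* Map the target page at the per-CPU patching address. */
	err = map_patch_mm(addr, &patch_mapping);
	if (err)
		goto out;

	patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
	rc = __patch_instruction(addr, instr, patch_addr);

	/* Tear down the temporary mapping and flush the local TLB. */
	err = unmap_patch_mm(&patch_mapping);

out:
	local_irq_restore(flags);
	return rc ? rc : err;
}

do_patch_instruction() would then just early-return
patch_instruction_mm(addr, instr) when radix_enabled(), keeping struct
patch_mapping entirely out of the text_poke_area path.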

>
> >
> >         local_irq_save(flags);
> >
> > -       err = map_patch_area(addr);
> > +       if (radix_enabled())
> > +               err = map_patch_mm(addr, &patch_mapping);
> > +       else
> > +               err = map_patch_area(addr);
> >         if (err)
> >                 goto out;
> >
> >         patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
> >         rc = __patch_instruction(addr, instr, patch_addr);
> >
> > -       err = unmap_patch_area();
> > +       if (radix_enabled())
> > +               err = unmap_patch_mm(&patch_mapping);
> > +       else
> > +               err = unmap_patch_area();
> >
> >  out:
> >         local_irq_restore(flags);
> > --
> > 2.32.0
> >
> Thanks,
> Jordan



* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
  2021-09-15  4:24   ` Jordan Niethe
@ 2021-09-16  0:45     ` Christopher M. Riedl
  2021-09-16  2:04       ` Jordan Niethe
  0 siblings, 1 reply; 13+ messages in thread
From: Christopher M. Riedl @ 2021-09-16  0:45 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev, linux-hardening

On Tue Sep 14, 2021 at 11:24 PM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> <cmr@bluescreens.de> wrote:
> > ... 
> > +/*
> > + * This can be called for kernel text or a module.
> > + */
> > +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> > +{
> > +       struct page *page;
> > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > +
> > +       if (is_vmalloc_or_module_addr(addr))
> > +               page = vmalloc_to_page(addr);
> > +       else
> > +               page = virt_to_page(addr);
> > +
> > +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > +                                            &patch_mapping->ptl);
> > +       if (unlikely(!patch_mapping->ptep)) {
> > +               pr_warn("map patch: failed to allocate pte for patching\n");
> > +               return -1;
> > +       }
> > +
> > +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
>
> I think a ptesync would be needed here because switch_mm_irqs_off()
> will not necessarily have a barrier.
> A spurious fault here from __patch_instruction() would not be handled
> correctly.

Sorry I don't quite follow - can you explain this to me in a bit more
detail?


* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
  2021-09-16  0:29     ` Christopher M. Riedl
@ 2021-09-16  1:52       ` Jordan Niethe
  0 siblings, 0 replies; 13+ messages in thread
From: Jordan Niethe @ 2021-09-16  1:52 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, linux-hardening

On Thu, Sep 16, 2021 at 10:38 AM Christopher M. Riedl
<cmr@bluescreens.de> wrote:
>
> On Sat Sep 11, 2021 at 4:14 AM CDT, Jordan Niethe wrote:
> > On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> > <cmr@bluescreens.de> wrote:
> > >
> > > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > > address to be patched is temporarily mapped as writeable. Currently, a
> > > per-cpu vmalloc patch area is used for this purpose. While the patch
> > > area is per-cpu, the temporary page mapping is inserted into the kernel
> > > page tables for the duration of patching. The mapping is exposed to CPUs
> > > other than the patching CPU - this is undesirable from a hardening
> > > perspective. Use a temporary mm instead which keeps the mapping local to
> > > the CPU doing the patching.
> > >
> > > Use the `poking_init` init hook to prepare a temporary mm and patching
> > > address. Initialize the temporary mm by copying the init mm. Choose a
> > > randomized patching address inside the temporary mm userspace address
> > > space. The patching address is randomized between PAGE_SIZE and
> > > DEFAULT_MAP_WINDOW-PAGE_SIZE.
> > >
> > > Bits of entropy with 64K page size on BOOK3S_64:
> > >
> > >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> > >
> > >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> > >         bits of entropy = log2(128TB / 64K)
> > >         bits of entropy = 31
> > >
> > > The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> > > operates - by default the space above DEFAULT_MAP_WINDOW is not
> > > available. Currently the Hash MMU does not use a temporary mm so
> > > technically this upper limit isn't necessary; however, a larger
> > > randomization range does not further "harden" this overall approach and
> > > future work may introduce patching with a temporary mm on Hash as well.
> > >
> > > Randomization occurs only once during initialization at boot for each
> > > possible CPU in the system.
> > >
> > > Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> > > respectively create and remove the temporary mapping with write
> > > permissions at patching_addr. Map the page with PAGE_KERNEL to set
> > > EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> > > KUAP) according to PowerISA v3.0b Figure 35 on Radix.
> > >
> > > Based on x86 implementation:
> > >
> > > commit 4fc19708b165
> > > ("x86/alternatives: Initialize temporary mm for patching")
> > >
> > > and:
> > >
> > > commit b3fd8e83ada0
> > > ("x86/alternatives: Use temporary mm for text poking")
> > >
> > > Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
> > >
> > > ---
> > >
> > > v6:  * Small clean-ups (naming, formatting, style, etc).
> > >      * Call stop_using_temporary_mm() before pte_unmap_unlock() after
> > >        patching.
> > >      * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
> > >
> > > v5:  * Only support Book3s64 Radix MMU for now.
> > >      * Use a per-cpu datastructure to hold the patching_addr and
> > >        patching_mm to avoid the need for a synchronization lock/mutex.
> > >
> > > v4:  * In the previous series this was two separate patches: one to init
> > >        the temporary mm in poking_init() (unused in powerpc at the time)
> > >        and the other to use it for patching (which removed all the
> > >        per-cpu vmalloc code). Now that we use poking_init() in the
> > >        existing per-cpu vmalloc approach, that separation doesn't work
> > >        as nicely anymore so I just merged the two patches into one.
> > >      * Preload the SLB entry and hash the page for the patching_addr
> > >        when using Hash on book3s64 to avoid taking an SLB and Hash fault
> > >        during patching. The previous implementation was a hack which
> > >        changed current->mm to allow the SLB and Hash fault handlers to
> > >        work with the temporary mm since both of those code-paths always
> > >        assume mm == current->mm.
> > >      * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
> > >        have to manage the mm->context.active_cpus counter and mm cpumask
> > >        since they determine (via mm_is_thread_local()) if the TLB flush
> > >        in pte_clear() is local or not - it should always be local when
> > >        we're using the temporary mm. On book3s64's Radix MMU we can
> > >        just call local_flush_tlb_mm().
> > >      * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
> > >        KUAP.
> > > ---
> > >  arch/powerpc/lib/code-patching.c | 119 +++++++++++++++++++++++++++++--
> > >  1 file changed, 112 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> > > index e802e42c2789..af8e2a02a9dd 100644
> > > --- a/arch/powerpc/lib/code-patching.c
> > > +++ b/arch/powerpc/lib/code-patching.c
> > > @@ -11,6 +11,7 @@
> > >  #include <linux/cpuhotplug.h>
> > >  #include <linux/slab.h>
> > >  #include <linux/uaccess.h>
> > > +#include <linux/random.h>
> > >
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/page.h>
> > > @@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
> > >
> > >  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> > >  static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> > > +static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
> > >
> > >  static int text_area_cpu_up(unsigned int cpu)
> > >  {
> > > @@ -126,8 +128,48 @@ static int text_area_cpu_down(unsigned int cpu)
> > >         return 0;
> > >  }
> > >
> > > +static __always_inline void __poking_init_temp_mm(void)
> > > +{
> > > +       int cpu;
> > > +       spinlock_t *ptl; /* for protecting pte table */
> >
> > ptl is just used so we don't have to open code allocating a pte in
> > patching_mm isn't it?
>
> Yup - I think that comment was a copy-pasta... I'll improve it.
>
> >
> > > +       pte_t *ptep;
> > > +       struct mm_struct *patching_mm;
> > > +       unsigned long patching_addr;
> > > +
> > > +       for_each_possible_cpu(cpu) {
> > > +               patching_mm = copy_init_mm();
> > > +               WARN_ON(!patching_mm);
> >
> > Would it be okay to just let the mmu handle null pointer dereferences?
>
> In general I think yes; however, the NULL dereference wouldn't occur
> until later during actual patching so I thought an early WARN here is
> appropriate.
>
> >
> > > +               per_cpu(cpu_patching_mm, cpu) = patching_mm;
> > > +
> > > +               /*
> > > +                * Choose a randomized, page-aligned address from the range:
> > > +                * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
> > > +                * address bound is PAGE_SIZE to avoid the zero-page.  The
> > > +                * upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
> > > +                * stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> > > +                */
> > > +               patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> > > +                               % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
> > > +               per_cpu(cpu_patching_addr, cpu) = patching_addr;
> >
> > On x86 the randomization depends on CONFIG_RANDOMIZE_BASE. Should it
> > be controllable here too?
>
> IIRC CONFIG_RANDOMIZE_BASE is for KASLR which IMO doesn't really have
> much to do with this.
>
> >
> > > +
> > > +               /*
> > > +                * PTE allocation uses GFP_KERNEL which means we need to
> > > +                * pre-allocate the PTE here because we cannot do the
> > > +                * allocation during patching when IRQs are disabled.
> > > +                */
> > > +               ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> > > +               WARN_ON(!ptep);
> > > +               pte_unmap_unlock(ptep, ptl);
> > > +       }
> > > +}
> > > +
> > >  void __init poking_init(void)
> > >  {
> > > +       if (radix_enabled()) {
> > > +               __poking_init_temp_mm();
> >
> > Should this also be done with cpuhp_setup_state()?
>
> I think I prefer doing the setup ahead of time during boot.

It does lose the ability to free up memory after a CPU is hot
unplugged, but I'm not sure if that's a big problem.
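
If it ever did move to hotplug, the callbacks could be shaped roughly
like the existing text_area_cpu_up/down ones and registered with
cpuhp_setup_state() - a sketch only, assuming the reference returned by
copy_init_mm() can simply be dropped with mmput() on the down path:

static int patching_mm_cpu_up(unsigned int cpu)
{
	struct mm_struct *mm = copy_init_mm();

	if (WARN_ON(!mm))
		return -ENOMEM;
	per_cpu(cpu_patching_mm, cpu) = mm;
	/* ... randomize cpu_patching_addr and pre-allocate the PTE here ... */
	return 0;
}

static int patching_mm_cpu_down(unsigned int cpu)
{
	struct mm_struct *mm = per_cpu(cpu_patching_mm, cpu);

	per_cpu(cpu_patching_mm, cpu) = NULL;
	if (mm)
		mmput(mm); /* assumes the copy_init_mm() reference is ours to drop */
	return 0;
}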

>
> >
> > > +               return;
> > > +       }
> > > +
> > >         WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> > >                 "powerpc/text_poke:online", text_area_cpu_up,
> > >                 text_area_cpu_down) < 0);
> > > @@ -197,30 +239,93 @@ static inline int unmap_patch_area(void)
> > >         return 0;
> > >  }
> > >
> > > +struct patch_mapping {
> > > +       spinlock_t *ptl; /* for protecting pte table */
> > > +       pte_t *ptep;
> > > +       struct temp_mm temp_mm;
> > > +};
> > > +
> > > +/*
> > > + * This can be called for kernel text or a module.
> > > + */
> > > +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> > > +{
> > > +       struct page *page;
> > > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > > +
> > > +       if (is_vmalloc_or_module_addr(addr))
> > > +               page = vmalloc_to_page(addr);
> > > +       else
> > > +               page = virt_to_page(addr);
> > > +
> > > +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > > +                                            &patch_mapping->ptl);
> > > +       if (unlikely(!patch_mapping->ptep)) {
> > > +               pr_warn("map patch: failed to allocate pte for patching\n");
> > > +               return -1;
> > > +       }
> > > +
> > > +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > > +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
> > > +
> > > +       init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> > > +       start_using_temporary_mm(&patch_mapping->temp_mm);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int unmap_patch_mm(struct patch_mapping *patch_mapping)
> > > +{
> > > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > > +
> > > +       pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
> > > +
> > > +       local_flush_tlb_mm(patching_mm);
> > > +       stop_using_temporary_mm(&patch_mapping->temp_mm);
> > > +
> > > +       pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > >  static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
> > >  {
> > >         int err, rc = 0;
> > >         u32 *patch_addr = NULL;
> > >         unsigned long flags;
> > > +       struct patch_mapping patch_mapping;
> > >
> > >         /*
> > > -        * During early early boot patch_instruction is called
> > > -        * when text_poke_area is not ready, but we still need
> > > -        * to allow patching. We just do the plain old patching
> > > +        * During early early boot patch_instruction is called when the
> > > +        * patching_mm/text_poke_area is not ready, but we still need to allow
> > > +        * patching. We just do the plain old patching.
> > >          */
> > > -       if (!this_cpu_read(text_poke_area))
> > > -               return raw_patch_instruction(addr, instr);
> > > +       if (radix_enabled()) {
> > > +               if (!this_cpu_read(cpu_patching_mm))
> > > +                       return raw_patch_instruction(addr, instr);
> > > +       } else {
> > > +               if (!this_cpu_read(text_poke_area))
> > > +                       return raw_patch_instruction(addr, instr);
> > > +       }
> >
> > Would testing cpu_patching_addr handle both of these cases?
> >
> > Then I think it might be clearer to do something like this:
> > if (radix_enabled()) {
> > return patch_instruction_mm(addr, instr);
> > }
> >
> > patch_instruction_mm() would combine map_patch_mm(), then patching and
> > unmap_patch_mm() into one function.
> >
> > IMO, a bit of code duplication would be cleaner than checking multiple
> > times for radix_enabled() and having struct patch_mapping especially
> > for maintaining state.
>
> Hmm, I think it's a good idea - I'll give it a go for the next version.
> Thanks for the suggestion!
>
> >
> > >
> > >         local_irq_save(flags);
> > >
> > > -       err = map_patch_area(addr);
> > > +       if (radix_enabled())
> > > +               err = map_patch_mm(addr, &patch_mapping);
> > > +       else
> > > +               err = map_patch_area(addr);
> > >         if (err)
> > >                 goto out;
> > >
> > >         patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
> > >         rc = __patch_instruction(addr, instr, patch_addr);
> > >
> > > -       err = unmap_patch_area();
> > > +       if (radix_enabled())
> > > +               err = unmap_patch_mm(&patch_mapping);
> > > +       else
> > > +               err = unmap_patch_area();
> > >
> > >  out:
> > >         local_irq_restore(flags);
> > > --
> > > 2.32.0
> > >
> > Thanks,
> > Jordan
>


* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
  2021-09-16  0:45     ` Christopher M. Riedl
@ 2021-09-16  2:04       ` Jordan Niethe
  0 siblings, 0 replies; 13+ messages in thread
From: Jordan Niethe @ 2021-09-16  2:04 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, linux-hardening

On Thu, Sep 16, 2021 at 10:40 AM Christopher M. Riedl
<cmr@bluescreens.de> wrote:
>
> On Tue Sep 14, 2021 at 11:24 PM CDT, Jordan Niethe wrote:
> > On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> > <cmr@bluescreens.de> wrote:
> > > ...
> > > +/*
> > > + * This can be called for kernel text or a module.
> > > + */
> > > +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> > > +{
> > > +       struct page *page;
> > > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > > +
> > > +       if (is_vmalloc_or_module_addr(addr))
> > > +               page = vmalloc_to_page(addr);
> > > +       else
> > > +               page = virt_to_page(addr);
> > > +
> > > +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > > +                                            &patch_mapping->ptl);
> > > +       if (unlikely(!patch_mapping->ptep)) {
> > > +               pr_warn("map patch: failed to allocate pte for patching\n");
> > > +               return -1;
> > > +       }
> > > +
> > > +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > > +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
> >
> > I think a ptesync would be needed here because switch_mm_irqs_off()
> > will not necessarily have a barrier.
> > A spurious fault here from __patch_instruction() would not be handled
> > correctly.
>
> Sorry I don't quite follow - can you explain this to me in a bit more
> detail?

radix__set_pte_at() skips calling ptesync as an optimization.
If there is no ordering between changing the pte and then accessing
the page with __patch_instruction(), a spurious fault could be raised.
I think such a fault would end up causing bad_kernel_fault() to return
true and would not be fixed up.

I thought there might be a barrier in switch_mm_irqs_off() that would
provide this ordering but afaics that is not always the case.

So I think that we need to have a ptesync after set_pte_at().
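
Concretely it would be something like this after the set_pte_at() in
map_patch_mm() - a sketch; whether an open-coded ptesync or one of the
existing barrier helpers is preferable can be sorted out in the next
spin:

	set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
		   pte_mkdirty(mk_pte(page, PAGE_KERNEL)));

	/*
	 * radix__set_pte_at() skips the ptesync as an optimization, so
	 * explicitly order the PTE update before the store done by
	 * __patch_instruction() to avoid a spurious fault that would
	 * not be fixed up.
	 */
	asm volatile("ptesync" ::: "memory");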


end of thread

Thread overview: 13+ messages
2021-09-11  2:29 [PATCH v6 0/4] Use per-CPU temporary mappings for patching on Radix MMU Christopher M. Riedl
2021-09-11  2:29 ` [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for " Christopher M. Riedl
2021-09-11  8:26   ` Jordan Niethe
2021-09-16  0:24     ` Christopher M. Riedl
2021-09-11  2:29 ` [PATCH v6 2/4] powerpc: Rework and improve STRICT_KERNEL_RWX patching Christopher M. Riedl
2021-09-11  2:29 ` [PATCH v6 3/4] powerpc: Use WARN_ON and fix check in poking_init Christopher M. Riedl
2021-09-11  2:29 ` [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix Christopher M. Riedl
2021-09-11  9:14   ` Jordan Niethe
2021-09-16  0:29     ` Christopher M. Riedl
2021-09-16  1:52       ` Jordan Niethe
2021-09-15  4:24   ` Jordan Niethe
2021-09-16  0:45     ` Christopher M. Riedl
2021-09-16  2:04       ` Jordan Niethe
