* [PATCH v9 0/7] Use per-CPU temporary mappings for patching on Radix MMU
@ 2022-10-25  4:44 Benjamin Gray
  2022-10-25  4:44 ` [PATCH v9 1/7] powerpc: Allow clearing and restoring registers independent of saved breakpoint state Benjamin Gray
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-10-25  4:44 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: ajd, jniethe5, Benjamin Gray, npiggin, cmr

This is a revision of Chris and Jordan's series to introduce a per-CPU temporary
mm to be used for patching with strict RWX on Radix MMUs.

v9:	* Fixed patch series name to include "on Radix MMU" again
	* Renamed breakpoint functions
	* Introduce patch to gracefully return when patching not possible
	* Make book3s/32/tlbflush.h TLB page flush implementation a warning
	* Removed temp_mm_state
	* Consolidate patching context into single struct shared by both paths

Previous versions:
v8: https://lore.kernel.org/all/20221021052238.580986-1-bgray@linux.ibm.com/
v7: https://lore.kernel.org/all/20211110003717.1150965-1-jniethe5@gmail.com/
v6: https://lore.kernel.org/all/20210911022904.30962-1-cmr@bluescreens.de/
v5: https://lore.kernel.org/all/20210713053113.4632-1-cmr@linux.ibm.com/
v4: https://lore.kernel.org/all/20210429072057.8870-1-cmr@bluescreens.de/
v3: https://lore.kernel.org/all/20200827052659.24922-1-cmr@codefail.de/
v2: https://lore.kernel.org/all/20200709040316.12789-1-cmr@informatik.wtf/
v1: https://lore.kernel.org/all/20200603051912.23296-1-cmr@informatik.wtf/
RFC: https://lore.kernel.org/all/20200323045205.20314-1-cmr@informatik.wtf/
x86: https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.amit@gmail.com/


Benjamin Gray (7):
  powerpc: Allow clearing and restoring registers independent of saved
    breakpoint state
  powerpc/code-patching: Handle RWX patching initialisation error
  powerpc/code-patching: Use WARN_ON and fix check in poking_init
  powerpc/code-patching: Verify instruction patch succeeded
  powerpc/tlb: Add local flush for page given mm_struct and psize
  powerpc/code-patching: Use temporary mm for Radix MMU
  powerpc/code-patching: Consolidate and cache per-cpu patching context

 arch/powerpc/include/asm/book3s/32/tlbflush.h |   9 +
 .../include/asm/book3s/64/tlbflush-hash.h     |   5 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/include/asm/debug.h              |   2 +
 arch/powerpc/include/asm/nohash/tlbflush.h    |   8 +
 arch/powerpc/kernel/process.c                 |  38 ++-
 arch/powerpc/lib/code-patching.c              | 252 +++++++++++++++++-
 7 files changed, 305 insertions(+), 17 deletions(-)


base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
--
2.37.3


* [PATCH v9 1/7] powerpc: Allow clearing and restoring registers independent of saved breakpoint state
  2022-10-25  4:44 [PATCH v9 0/7] Use per-CPU temporary mappings for patching on Radix MMU Benjamin Gray
@ 2022-10-25  4:44 ` Benjamin Gray
  2022-10-25  4:44 ` [PATCH v9 2/7] powerpc/code-patching: Handle RWX patching initialisation error Benjamin Gray
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-10-25  4:44 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: ajd, jniethe5, Benjamin Gray, npiggin, cmr

From: Jordan Niethe <jniethe5@gmail.com>

For the coming temporary mm used for instruction patching, the
breakpoint registers need to be cleared to prevent them from
accidentally being triggered. As soon as the patching is done, the
breakpoints will be restored.

The breakpoint state is stored in the per-cpu variable current_brk[].
Add a suspend_breakpoints() function which will clear the breakpoint
registers without touching the state in current_brk[]. Add a paired
restore_breakpoints() function which will move the state in
current_brk[] back into the registers.
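
A minimal sketch of the intended usage around patching (for illustration
only; the actual call sites are added later in this series with the
temporary mm support):

	suspend_breakpoints();	/* clear the hardware registers only */
	/* ... patch the instruction through the temporary mapping ... */
	restore_breakpoints();	/* reload the hardware from current_brk[] */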

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
---
v9:	* Renamed ____set_breakpoint to set_hw_breakpoint
	* Renamed pause/unpause to suspend/restore
	* Removed unrelated whitespace change
---
 arch/powerpc/include/asm/debug.h |  2 ++
 arch/powerpc/kernel/process.c    | 38 +++++++++++++++++++++++++++++---
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 86a14736c76c..51c744608f37 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,6 +46,8 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
+void suspend_breakpoints(void);
+void restore_breakpoints(void);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 67da147fe34d..5d5109ed01a5 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -862,10 +862,8 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
 	return 0;
 }
 
-void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+static void set_hw_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 {
-	memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
-
 	if (dawr_enabled())
 		// Power8 or later
 		set_dawr(nr, brk);
@@ -879,6 +877,12 @@ void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 		WARN_ON_ONCE(1);
 }
 
+void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+{
+	memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
+	set_hw_breakpoint(nr, brk);
+}
+
 /* Check if we have DAWR or DABR hardware */
 bool ppc_breakpoint_available(void)
 {
@@ -891,6 +895,34 @@ bool ppc_breakpoint_available(void)
 }
 EXPORT_SYMBOL_GPL(ppc_breakpoint_available);
 
+/* Disable the breakpoint in hardware without touching current_brk[] */
+void suspend_breakpoints(void)
+{
+	struct arch_hw_breakpoint brk = {0};
+	int i;
+
+	if (!ppc_breakpoint_available())
+		return;
+
+	for (i = 0; i < nr_wp_slots(); i++)
+		set_hw_breakpoint(i, &brk);
+}
+
+/*
+ * Re-enable breakpoints suspended by suspend_breakpoints() in hardware
+ * from current_brk[]
+ */
+void restore_breakpoints(void)
+{
+	int i;
+
+	if (!ppc_breakpoint_available())
+		return;
+
+	for (i = 0; i < nr_wp_slots(); i++)
+		set_hw_breakpoint(i, this_cpu_ptr(&current_brk[i]));
+}
+
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 
 static inline bool tm_enabled(struct task_struct *tsk)
-- 
2.37.3



* [PATCH v9 2/7] powerpc/code-patching: Handle RWX patching initialisation error
  2022-10-25  4:44 [PATCH v9 0/7] Use per-CPU temporary mappings for patching on Radix MMU Benjamin Gray
  2022-10-25  4:44 ` [PATCH v9 1/7] powerpc: Allow clearing and restoring registers independent of saved breakpoint state Benjamin Gray
@ 2022-10-25  4:44 ` Benjamin Gray
  2022-11-02  9:36   ` Christophe Leroy
  2022-10-25  4:44 ` [PATCH v9 3/7] powerpc/code-patching: Use WARN_ON and fix check in poking_init Benjamin Gray
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Benjamin Gray @ 2022-10-25  4:44 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: ajd, jniethe5, Benjamin Gray, npiggin, cmr

Detect and abort __do_patch_instruction() when there is no text_poke_area,
which implies there is no patching address. This allows patch_instruction()
to fail gracefully and let the caller decide what to do, as opposed to
the current behaviour of the kernel panicking when the NULL pointer is
dereferenced.
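
As an illustration only (a hypothetical caller sketch, not part of this
patch; how the error is handled will depend on the call site):

	int err = patch_instruction(addr, ppc_inst(PPC_RAW_NOP()));

	if (err) {
		pr_warn("code patching failed at %px (%d)\n", addr, err);
		return err;
	}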

Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
---
v9:	* New in v9
---
 arch/powerpc/lib/code-patching.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index ad0cf3108dd0..54e145247643 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -76,6 +76,7 @@ static int text_area_cpu_up(unsigned int cpu)
 static int text_area_cpu_down(unsigned int cpu)
 {
 	free_vm_area(this_cpu_read(text_poke_area));
+	this_cpu_write(text_poke_area, NULL);
 	return 0;
 }
 
@@ -151,11 +152,16 @@ static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
 {
 	int err;
 	u32 *patch_addr;
+	struct vm_struct *area;
 	unsigned long text_poke_addr;
 	pte_t *pte;
 	unsigned long pfn = get_patch_pfn(addr);
 
-	text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr & PAGE_MASK;
+	area = __this_cpu_read(text_poke_area);
+	if (unlikely(!area))
+		return -ENOMEM;
+
+	text_poke_addr = (unsigned long)area->addr & PAGE_MASK;
 	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
 
 	pte = virt_to_kpte(text_poke_addr);
-- 
2.37.3



* [PATCH v9 3/7] powerpc/code-patching: Use WARN_ON and fix check in poking_init
  2022-10-25  4:44 [PATCH v9 0/7] Use per-CPU temporary mappings for patching on Radix MMU Benjamin Gray
  2022-10-25  4:44 ` [PATCH v9 1/7] powerpc: Allow clearing and restoring registers independent of saved breakpoint state Benjamin Gray
  2022-10-25  4:44 ` [PATCH v9 2/7] powerpc/code-patching: Handle RWX patching initialisation error Benjamin Gray
@ 2022-10-25  4:44 ` Benjamin Gray
  2022-11-02  9:38   ` Christophe Leroy
  2022-10-25  4:44 ` [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded Benjamin Gray
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Benjamin Gray @ 2022-10-25  4:44 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: ajd, jniethe5, Benjamin Gray, npiggin, cmr

BUG_ON() when failing to initialise the code patching window is
excessive, as most critical patching happens during boot before strict
RWX control is enabled. Failure to patch after boot is not inherently
fatal, so the decision to abort is better left to the caller.

The return value of cpuhp_setup_state() is also >= 0 on success,
so check for < 0.

Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
---
v9:	* Reword commit message to explain why init failure is not fatal
---
 arch/powerpc/lib/code-patching.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 54e145247643..3b3b09d5d2e1 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -82,16 +82,13 @@ static int text_area_cpu_down(unsigned int cpu)
 
 static __ro_after_init DEFINE_STATIC_KEY_FALSE(poking_init_done);
 
-/*
- * Although BUG_ON() is rude, in this case it should only happen if ENOMEM, and
- * we judge it as being preferable to a kernel that will crash later when
- * someone tries to use patch_instruction().
- */
 void __init poking_init(void)
 {
-	BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
-		"powerpc/text_poke:online", text_area_cpu_up,
-		text_area_cpu_down));
+	WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+				  "powerpc/text_poke:online",
+				  text_area_cpu_up,
+				  text_area_cpu_down) < 0);
+
 	static_branch_enable(&poking_init_done);
 }
 
-- 
2.37.3



* [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded
  2022-10-25  4:44 [PATCH v9 0/7] Use per-CPU temporary mappings for patching on Radix MMU Benjamin Gray
                   ` (2 preceding siblings ...)
  2022-10-25  4:44 ` [PATCH v9 3/7] powerpc/code-patching: Use WARN_ON and fix check in poking_init Benjamin Gray
@ 2022-10-25  4:44 ` Benjamin Gray
  2022-10-26  0:47   ` Benjamin Gray
  2022-11-02  9:43   ` Christophe Leroy
  2022-10-25  4:44 ` [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize Benjamin Gray
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-10-25  4:44 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: ajd, jniethe5, Benjamin Gray, npiggin, cmr

Verify that, if the instruction patching did not return an error, the
value stored at the given address now equals the instruction it was
patched to.

Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
---
 arch/powerpc/lib/code-patching.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 3b3b09d5d2e1..b0a12b2d5a9b 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -192,6 +192,8 @@ static int do_patch_instruction(u32 *addr, ppc_inst_t instr)
 	err = __do_patch_instruction(addr, instr);
 	local_irq_restore(flags);
 
+	WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));
+
 	return err;
 }
 #else /* !CONFIG_STRICT_KERNEL_RWX */
-- 
2.37.3



* [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize
  2022-10-25  4:44 [PATCH v9 0/7] Use per-CPU temporary mappings for patching on Radix MMU Benjamin Gray
                   ` (3 preceding siblings ...)
  2022-10-25  4:44 ` [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded Benjamin Gray
@ 2022-10-25  4:44 ` Benjamin Gray
  2022-11-02  9:56   ` Christophe Leroy
  2022-10-25  4:44 ` [PATCH v9 6/7] powerpc/code-patching: Use temporary mm for Radix MMU Benjamin Gray
  2022-10-25  4:44 ` [PATCH v9 7/7] powerpc/code-patching: Consolidate and cache per-cpu patching context Benjamin Gray
  6 siblings, 1 reply; 26+ messages in thread
From: Benjamin Gray @ 2022-10-25  4:44 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: ajd, jniethe5, Benjamin Gray, npiggin, cmr

Adds a local TLB flush operation that works given an mm_struct, VA to
flush, and page size representation. Most implementations mirror the
surrounding code. The book3s/32/tlbflush.h implementation is left as
a WARN_ONCE_ON because it is more complicated and not required for
anything as yet.

This removes the need to create a vm_area_struct, which the temporary
patching mm work does not need.
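
For illustration, the temporary-mm patching path added later in this
series is expected to use it roughly as follows (a sketch of the later
patch, shown here only for context):

	/* after clearing the temporary PTE used for the writable alias */
	pte_clear(patching_mm, text_poke_addr, pte);
	local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);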

Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
---
v9:	* Replace book3s/32/tlbflush.h implementation with warning
---
 arch/powerpc/include/asm/book3s/32/tlbflush.h      | 9 +++++++++
 arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 5 +++++
 arch/powerpc/include/asm/book3s/64/tlbflush.h      | 8 ++++++++
 arch/powerpc/include/asm/nohash/tlbflush.h         | 8 ++++++++
 4 files changed, 30 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/32/tlbflush.h b/arch/powerpc/include/asm/book3s/32/tlbflush.h
index ba1743c52b56..14d989d41f75 100644
--- a/arch/powerpc/include/asm/book3s/32/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/32/tlbflush.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_POWERPC_BOOK3S_32_TLBFLUSH_H
 #define _ASM_POWERPC_BOOK3S_32_TLBFLUSH_H
 
+#include <asm/bug.h>
+
 #define MMU_NO_CONTEXT      (0)
 /*
  * TLB flushing for "classic" hash-MMU 32-bit CPUs, 6xx, 7xx, 7xxx
@@ -74,6 +76,13 @@ static inline void local_flush_tlb_page(struct vm_area_struct *vma,
 {
 	flush_tlb_page(vma, vmaddr);
 }
+
+static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
+					      unsigned long vmaddr, int psize)
+{
+	WARN_ONCE(true, "local TLB flush not implemented");
+}
+
 static inline void local_flush_tlb_mm(struct mm_struct *mm)
 {
 	flush_tlb_mm(mm);
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index fab8332fe1ad..8fd9dc49b2a1 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -94,6 +94,11 @@ static inline void hash__local_flush_tlb_page(struct vm_area_struct *vma,
 {
 }
 
+static inline void hash__local_flush_tlb_page_psize(struct mm_struct *mm,
+						    unsigned long vmaddr, int psize)
+{
+}
+
 static inline void hash__flush_tlb_page(struct vm_area_struct *vma,
 				    unsigned long vmaddr)
 {
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 67655cd60545..2d839dd5c08c 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -92,6 +92,14 @@ static inline void local_flush_tlb_page(struct vm_area_struct *vma,
 	return hash__local_flush_tlb_page(vma, vmaddr);
 }
 
+static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
+					      unsigned long vmaddr, int psize)
+{
+	if (radix_enabled())
+		return radix__local_flush_tlb_page_psize(mm, vmaddr, psize);
+	return hash__local_flush_tlb_page_psize(mm, vmaddr, psize);
+}
+
 static inline void local_flush_all_mm(struct mm_struct *mm)
 {
 	if (radix_enabled())
diff --git a/arch/powerpc/include/asm/nohash/tlbflush.h b/arch/powerpc/include/asm/nohash/tlbflush.h
index bdaf34ad41ea..432bca4cac62 100644
--- a/arch/powerpc/include/asm/nohash/tlbflush.h
+++ b/arch/powerpc/include/asm/nohash/tlbflush.h
@@ -45,6 +45,12 @@ static inline void local_flush_tlb_page(struct vm_area_struct *vma, unsigned lon
 	asm volatile ("tlbie %0; sync" : : "r" (vmaddr) : "memory");
 }
 
+static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
+					      unsigned long vmaddr, int psize)
+{
+	asm volatile ("tlbie %0; sync" : : "r" (vmaddr) : "memory");
+}
+
 static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
 	start &= PAGE_MASK;
@@ -58,6 +64,8 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 extern void local_flush_tlb_mm(struct mm_struct *mm);
 extern void local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
+extern void local_flush_tlb_page_psize(struct mm_struct *mm,
+				       unsigned long vmaddr, int psize);
 
 extern void __local_flush_tlb_page(struct mm_struct *mm, unsigned long vmaddr,
 				   int tsize, int ind);
-- 
2.37.3



* [PATCH v9 6/7] powerpc/code-patching: Use temporary mm for Radix MMU
  2022-10-25  4:44 [PATCH v9 0/7] Use per-CPU temporary mappings for patching on Radix MMU Benjamin Gray
                   ` (4 preceding siblings ...)
  2022-10-25  4:44 ` [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize Benjamin Gray
@ 2022-10-25  4:44 ` Benjamin Gray
  2022-11-02 10:11   ` Christophe Leroy
  2022-10-25  4:44 ` [PATCH v9 7/7] powerpc/code-patching: Consolidate and cache per-cpu patching context Benjamin Gray
  6 siblings, 1 reply; 26+ messages in thread
From: Benjamin Gray @ 2022-10-25  4:44 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: ajd, jniethe5, Benjamin Gray, npiggin, cmr

From: "Christopher M. Riedl" <cmr@bluescreens.de>

x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. Another benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use - this may include breakpoints set by perf.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE.

Bits of entropy with 64K page size on BOOK3S_64:

	bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

	PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
	bits of entropy = log2(128TB / 64K)
	bits of entropy = 31

The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
operates - by default the space above DEFAULT_MAP_WINDOW is not
available. Currently the Hash MMU does not use a temporary mm so
technically this upper limit isn't necessary; however, a larger
randomization range does not further "harden" this overall approach and
future work may introduce patching with a temporary mm on Hash as well.

Randomization occurs only once during initialization for each CPU as it
comes online.

The patching page is mapped with PAGE_KERNEL to set EAA[0] for the PTE
which ignores the AMR (so no need to unlock/lock KUAP) according to
PowerISA v3.0b Figure 35 on Radix.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

From: Benjamin Gray <bgray@linux.ibm.com>

Synchronisation is done according to ISA 3.1B Book 3 Chapter 13
"Synchronization Requirements for Context Alterations". Switching the mm
is a change to the PID, which requires a CSI before and after the change,
and a hwsync between the last instruction that performs address
translation for an associated storage access and the change itself.

Instruction fetch is an associated storage access, but the instruction
address mappings are not being changed, so it should not matter which
context they use. We must still perform a hwsync to guard arbitrary
prior code that may have accessed a userspace address.

TLB invalidation is local and VA specific. Local because only this core
used the patching mm, and VA specific because we only care that the
writable mapping is purged. Leaving the other mappings intact is more
efficient, especially when performing many code patches in a row (e.g.,
as ftrace would).
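
Condensed, the ordering argued for above looks like this (a summary of
the sequence implemented by __do_patch_instruction_mm() below, not
additional code; the error path is omitted):

	set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL));
	asm volatile("ptesync" ::: "memory");	/* order PTE update; also the hwsync */
	isync();				/* CSI before the PID change */
	orig_mm = start_using_temp_mm(patching_mm);	/* switch_mm_irqs_off() + isync */
	err = __patch_instruction(addr, instr, patch_addr);	/* sync + isync on success */
	stop_using_temp_mm(patching_mm, orig_mm);
	pte_clear(patching_mm, text_poke_addr, pte);
	local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);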

Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
---
v9:	* Add back Christopher M. Riedl signed-off-by
	* Remove temp_mm_state
---
 arch/powerpc/lib/code-patching.c | 221 ++++++++++++++++++++++++++++++-
 1 file changed, 216 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index b0a12b2d5a9b..3fe99d0086fc 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -4,12 +4,17 @@
  */
 
 #include <linux/kprobes.h>
+#include <linux/mmu_context.h>
+#include <linux/random.h>
 #include <linux/vmalloc.h>
 #include <linux/init.h>
 #include <linux/cpuhotplug.h>
 #include <linux/uaccess.h>
 #include <linux/jump_label.h>
 
+#include <asm/debug.h>
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/page.h>
 #include <asm/code-patching.h>
@@ -42,11 +47,54 @@ int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
+static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
+static DEFINE_PER_CPU(pte_t *, cpu_patching_pte);
 
 static int map_patch_area(void *addr, unsigned long text_poke_addr);
 static void unmap_patch_area(unsigned long addr);
 
+static bool mm_patch_enabled(void)
+{
+	return IS_ENABLED(CONFIG_SMP) && radix_enabled();
+}
+
+/*
+ * The following applies for Radix MMU. Hash MMU has different requirements,
+ * and so is not supported.
+ *
+ * Changing mm requires context synchronising instructions on both sides of
+ * the context switch, as well as a hwsync between the last instruction for
+ * which the address of an associated storage access was translated using
+ * the current context, and the switch itself.
+ *
+ * switch_mm_irqs_off() performs an isync after the context switch. It is
+ * the responsibility of the caller to perform the CSI and hwsync before
+ * starting/stopping the temp mm.
+ */
+static struct mm_struct *start_using_temp_mm(struct mm_struct *temp_mm)
+{
+	struct mm_struct *orig_mm = current->active_mm;
+
+	lockdep_assert_irqs_disabled();
+	switch_mm_irqs_off(orig_mm, temp_mm, current);
+
+	WARN_ON(!mm_is_thread_local(temp_mm));
+
+	suspend_breakpoints();
+	return orig_mm;
+}
+
+static void stop_using_temp_mm(struct mm_struct *temp_mm,
+			       struct mm_struct *orig_mm)
+{
+	lockdep_assert_irqs_disabled();
+	switch_mm_irqs_off(temp_mm, orig_mm, current);
+	restore_breakpoints();
+}
+
 static int text_area_cpu_up(unsigned int cpu)
 {
 	struct vm_struct *area;
@@ -80,14 +128,127 @@ static int text_area_cpu_down(unsigned int cpu)
 	return 0;
 }
 
+static int text_area_cpu_up_mm(unsigned int cpu)
+{
+	struct mm_struct *mm;
+	unsigned long addr;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep;
+
+	mm = copy_init_mm();
+	if (WARN_ON(!mm))
+		goto fail_no_mm;
+
+	/*
+	 * Choose a random page-aligned address from the interval
+	 * [PAGE_SIZE .. DEFAULT_MAP_WINDOW - PAGE_SIZE].
+	 * The lower address bound is PAGE_SIZE to avoid the zero-page.
+	 */
+	addr = (1 + (get_random_long() % (DEFAULT_MAP_WINDOW / PAGE_SIZE - 2))) << PAGE_SHIFT;
+
+	/*
+	 * PTE allocation uses GFP_KERNEL which means we need to
+	 * pre-allocate the PTE here because we cannot do the
+	 * allocation during patching when IRQs are disabled.
+	 */
+	pgdp = pgd_offset(mm, addr);
+
+	p4dp = p4d_alloc(mm, pgdp, addr);
+	if (WARN_ON(!p4dp))
+		goto fail_no_p4d;
+
+	pudp = pud_alloc(mm, p4dp, addr);
+	if (WARN_ON(!pudp))
+		goto fail_no_pud;
+
+	pmdp = pmd_alloc(mm, pudp, addr);
+	if (WARN_ON(!pmdp))
+		goto fail_no_pmd;
+
+	ptep = pte_alloc_map(mm, pmdp, addr);
+	if (WARN_ON(!ptep))
+		goto fail_no_pte;
+
+	this_cpu_write(cpu_patching_mm, mm);
+	this_cpu_write(cpu_patching_addr, addr);
+	this_cpu_write(cpu_patching_pte, ptep);
+
+	return 0;
+
+fail_no_pte:
+	pmd_free(mm, pmdp);
+	mm_dec_nr_pmds(mm);
+fail_no_pmd:
+	pud_free(mm, pudp);
+	mm_dec_nr_puds(mm);
+fail_no_pud:
+	p4d_free(mm, p4dp);
+fail_no_p4d:
+	mmput(mm);
+fail_no_mm:
+	return -ENOMEM;
+}
+
+static int text_area_cpu_down_mm(unsigned int cpu)
+{
+	struct mm_struct *mm;
+	unsigned long addr;
+	pte_t *ptep;
+	pmd_t *pmdp;
+	pud_t *pudp;
+	p4d_t *p4dp;
+	pgd_t *pgdp;
+
+	mm = this_cpu_read(cpu_patching_mm);
+	addr = this_cpu_read(cpu_patching_addr);
+
+	pgdp = pgd_offset(mm, addr);
+	p4dp = p4d_offset(pgdp, addr);
+	pudp = pud_offset(p4dp, addr);
+	pmdp = pmd_offset(pudp, addr);
+	ptep = pte_offset_map(pmdp, addr);
+
+	pte_free(mm, ptep);
+	pmd_free(mm, pmdp);
+	pud_free(mm, pudp);
+	p4d_free(mm, p4dp);
+	/* pgd is dropped in mmput */
+
+	mm_dec_nr_ptes(mm);
+	mm_dec_nr_pmds(mm);
+	mm_dec_nr_puds(mm);
+
+	mmput(mm);
+
+	this_cpu_write(cpu_patching_mm, NULL);
+	this_cpu_write(cpu_patching_addr, 0);
+	this_cpu_write(cpu_patching_pte, NULL);
+
+	return 0;
+}
+
 static __ro_after_init DEFINE_STATIC_KEY_FALSE(poking_init_done);
 
 void __init poking_init(void)
 {
-	WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
-				  "powerpc/text_poke:online",
-				  text_area_cpu_up,
-				  text_area_cpu_down) < 0);
+	int ret;
+
+	if (mm_patch_enabled())
+		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+					"powerpc/text_poke_mm:online",
+					text_area_cpu_up_mm,
+					text_area_cpu_down_mm);
+	else
+		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+					"powerpc/text_poke:online",
+					text_area_cpu_up,
+					text_area_cpu_down);
+
+	/* cpuhp_setup_state returns >= 0 on success */
+	WARN_ON(ret < 0);
 
 	static_branch_enable(&poking_init_done);
 }
@@ -145,6 +306,53 @@ static void unmap_patch_area(unsigned long addr)
 	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
 }
 
+static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
+{
+	int err;
+	u32 *patch_addr;
+	unsigned long text_poke_addr;
+	pte_t *pte;
+	unsigned long pfn = get_patch_pfn(addr);
+	struct mm_struct *patching_mm;
+	struct mm_struct *orig_mm;
+
+	patching_mm = __this_cpu_read(cpu_patching_mm);
+	pte = __this_cpu_read(cpu_patching_pte);
+	text_poke_addr = __this_cpu_read(cpu_patching_addr);
+	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
+
+	if (unlikely(!patching_mm))
+		return -ENOMEM;
+
+	set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL));
+
+	/* order PTE update before use, also serves as the hwsync */
+	asm volatile("ptesync": : :"memory");
+
+	/* order context switch after arbitrary prior code */
+	isync();
+
+	orig_mm = start_using_temp_mm(patching_mm);
+
+	err = __patch_instruction(addr, instr, patch_addr);
+
+	/* hwsync performed by __patch_instruction (sync) if successful */
+	if (err)
+		mb();  /* sync */
+
+	/* context synchronisation performed by __patch_instruction (isync or exception) */
+	stop_using_temp_mm(patching_mm, orig_mm);
+
+	pte_clear(patching_mm, text_poke_addr, pte);
+	/*
+	 * ptesync to order PTE update before TLB invalidation done
+	 * by radix__local_flush_tlb_page_psize (in _tlbiel_va)
+	 */
+	local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);
+
+	return err;
+}
+
 static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
 {
 	int err;
@@ -189,7 +397,10 @@ static int do_patch_instruction(u32 *addr, ppc_inst_t instr)
 		return raw_patch_instruction(addr, instr);
 
 	local_irq_save(flags);
-	err = __do_patch_instruction(addr, instr);
+	if (mm_patch_enabled())
+		err = __do_patch_instruction_mm(addr, instr);
+	else
+		err = __do_patch_instruction(addr, instr);
 	local_irq_restore(flags);
 
 	WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));
-- 
2.37.3



* [PATCH v9 7/7] powerpc/code-patching: Consolidate and cache per-cpu patching context
  2022-10-25  4:44 [PATCH v9 0/7] Use per-CPU temporary mappings for patching on Radix MMU Benjamin Gray
                   ` (5 preceding siblings ...)
  2022-10-25  4:44 ` [PATCH v9 6/7] powerpc/code-patching: Use temporary mm for Radix MMU Benjamin Gray
@ 2022-10-25  4:44 ` Benjamin Gray
  2022-11-02 10:17   ` Christophe Leroy
  6 siblings, 1 reply; 26+ messages in thread
From: Benjamin Gray @ 2022-10-25  4:44 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: ajd, jniethe5, Benjamin Gray, npiggin, cmr

With the temp mm context support, there are CPU local variables to hold
the patch address and pte. Use these in the non-temp mm path as well
instead of adding a level of indirection through the text_poke_area
vm_struct and pointer chasing the pte.

As both paths use these fields now, there is no need to let unreferenced
variables be dropped by the compiler, so it is cleaner to merge them into
a single context struct. This has the additional benefit of removing a
redundant CPU local pointer, as only one of cpu_patching_mm /
text_poke_area is ever used, while remaining well-typed.

Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
---
v9:	* Consolidate patching context into single struct
---
 arch/powerpc/lib/code-patching.c | 58 ++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 3fe99d0086fc..cefb938f7217 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -48,10 +48,16 @@ int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
 
-static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
-static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
-static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
-static DEFINE_PER_CPU(pte_t *, cpu_patching_pte);
+struct patch_context {
+	union {
+		struct vm_struct *text_poke_area;
+		struct mm_struct *mm;
+	};
+	unsigned long addr;
+	pte_t *pte;
+};
+
+static DEFINE_PER_CPU(struct patch_context, cpu_patching_context);
 
 static int map_patch_area(void *addr, unsigned long text_poke_addr);
 static void unmap_patch_area(unsigned long addr);
@@ -116,15 +122,19 @@ static int text_area_cpu_up(unsigned int cpu)
 
 	unmap_patch_area(addr);
 
-	this_cpu_write(text_poke_area, area);
+	this_cpu_write(cpu_patching_context.text_poke_area, area);
+	this_cpu_write(cpu_patching_context.addr, addr);
+	this_cpu_write(cpu_patching_context.pte, virt_to_kpte(addr));
 
 	return 0;
 }
 
 static int text_area_cpu_down(unsigned int cpu)
 {
-	free_vm_area(this_cpu_read(text_poke_area));
-	this_cpu_write(text_poke_area, NULL);
+	free_vm_area(this_cpu_read(cpu_patching_context.text_poke_area));
+	this_cpu_write(cpu_patching_context.text_poke_area, NULL);
+	this_cpu_write(cpu_patching_context.addr, 0);
+	this_cpu_write(cpu_patching_context.pte, NULL);
 	return 0;
 }
 
@@ -172,9 +182,9 @@ static int text_area_cpu_up_mm(unsigned int cpu)
 	if (WARN_ON(!ptep))
 		goto fail_no_pte;
 
-	this_cpu_write(cpu_patching_mm, mm);
-	this_cpu_write(cpu_patching_addr, addr);
-	this_cpu_write(cpu_patching_pte, ptep);
+	this_cpu_write(cpu_patching_context.mm, mm);
+	this_cpu_write(cpu_patching_context.addr, addr);
+	this_cpu_write(cpu_patching_context.pte, ptep);
 
 	return 0;
 
@@ -202,8 +212,8 @@ static int text_area_cpu_down_mm(unsigned int cpu)
 	p4d_t *p4dp;
 	pgd_t *pgdp;
 
-	mm = this_cpu_read(cpu_patching_mm);
-	addr = this_cpu_read(cpu_patching_addr);
+	mm = this_cpu_read(cpu_patching_context.mm);
+	addr = this_cpu_read(cpu_patching_context.addr);
 
 	pgdp = pgd_offset(mm, addr);
 	p4dp = p4d_offset(pgdp, addr);
@@ -223,9 +233,9 @@ static int text_area_cpu_down_mm(unsigned int cpu)
 
 	mmput(mm);
 
-	this_cpu_write(cpu_patching_mm, NULL);
-	this_cpu_write(cpu_patching_addr, 0);
-	this_cpu_write(cpu_patching_pte, NULL);
+	this_cpu_write(cpu_patching_context.mm, NULL);
+	this_cpu_write(cpu_patching_context.addr, 0);
+	this_cpu_write(cpu_patching_context.pte, NULL);
 
 	return 0;
 }
@@ -316,9 +326,9 @@ static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
 	struct mm_struct *patching_mm;
 	struct mm_struct *orig_mm;
 
-	patching_mm = __this_cpu_read(cpu_patching_mm);
-	pte = __this_cpu_read(cpu_patching_pte);
-	text_poke_addr = __this_cpu_read(cpu_patching_addr);
+	patching_mm = __this_cpu_read(cpu_patching_context.mm);
+	pte = __this_cpu_read(cpu_patching_context.pte);
+	text_poke_addr = __this_cpu_read(cpu_patching_context.addr);
 	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
 
 	if (unlikely(!patching_mm))
@@ -357,19 +367,17 @@ static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
 {
 	int err;
 	u32 *patch_addr;
-	struct vm_struct *area;
 	unsigned long text_poke_addr;
 	pte_t *pte;
 	unsigned long pfn = get_patch_pfn(addr);
 
-	area = __this_cpu_read(text_poke_area);
-	if (unlikely(!area))
-		return -ENOMEM;
-
-	text_poke_addr = (unsigned long)area->addr & PAGE_MASK;
+	text_poke_addr = (unsigned long)__this_cpu_read(cpu_patching_context.addr) & PAGE_MASK;
 	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
 
-	pte = virt_to_kpte(text_poke_addr);
+	if (unlikely(!text_poke_addr))
+		return -ENOMEM;
+
+	pte = __this_cpu_read(cpu_patching_context.pte);
 	__set_pte_at(&init_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL), 0);
 	/* See ptesync comment in radix__set_pte_at() */
 	if (radix_enabled())
-- 
2.37.3



* Re: [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded
  2022-10-25  4:44 ` [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded Benjamin Gray
@ 2022-10-26  0:47   ` Benjamin Gray
  2022-11-02  9:43   ` Christophe Leroy
  1 sibling, 0 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-10-26  0:47 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin

It occurred to me that we don't require holding a lock when patching
text. Many use cases do hold text_mutex, but it's not required, so it's
possible for this warning to show false positives.

If we do want text_mutex to be held, then lockdep_assert_held(&text_mutex)
ought to be added too.
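
Roughly (a sketch assuming we did make the locking rule mandatory, which
this patch does not do):

	lockdep_assert_held(&text_mutex);
	WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));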


* Re: [PATCH v9 2/7] powerpc/code-patching: Handle RWX patching initialisation error
  2022-10-25  4:44 ` [PATCH v9 2/7] powerpc/code-patching: Handle RWX patching initialisation error Benjamin Gray
@ 2022-11-02  9:36   ` Christophe Leroy
  2022-11-02 22:37     ` Benjamin Gray
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Leroy @ 2022-11-02  9:36 UTC (permalink / raw)
  To: Benjamin Gray, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin



On 25/10/2022 at 06:44, Benjamin Gray wrote:
> Detect and abort __do_patch_instruction() when there is no text_poke_area,
> which implies there is no patching address. This allows patch_instruction()
> to fail gracefully and let the caller decide what to do, as opposed to
> the current behaviour of the kernel panicking when the NULL pointer is
> dereferenced.

Is there any reason at all to have no text_poke_area?

If that's the boot CPU, then it means we are really early in the boot 
process and will use raw_patch_instruction() directly. Or it means we 
don't have enough memory for the boot CPU, in which case you are going 
to have so many problems that it is not worth any effort.

If it is not the boot CPU, isn't there a way to not add a CPU for which
the allocation has failed? If text_area_cpu_up() returns an error,
isn't the CPU deactivated?



> 
> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
> ---
> v9:	* New in v9
> ---
>   arch/powerpc/lib/code-patching.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index ad0cf3108dd0..54e145247643 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -76,6 +76,7 @@ static int text_area_cpu_up(unsigned int cpu)
>   static int text_area_cpu_down(unsigned int cpu)
>   {
>   	free_vm_area(this_cpu_read(text_poke_area));
> +	this_cpu_write(text_poke_area, NULL);
>   	return 0;
>   }
>   
> @@ -151,11 +152,16 @@ static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
>   {
>   	int err;
>   	u32 *patch_addr;
> +	struct vm_struct *area;
>   	unsigned long text_poke_addr;
>   	pte_t *pte;
>   	unsigned long pfn = get_patch_pfn(addr);
>   
> -	text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr & PAGE_MASK;
> +	area = __this_cpu_read(text_poke_area);
> +	if (unlikely(!area))
> +		return -ENOMEM;
> +
> +	text_poke_addr = (unsigned long)area->addr & PAGE_MASK;
>   	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
>   
>   	pte = virt_to_kpte(text_poke_addr);


* Re: [PATCH v9 3/7] powerpc/code-patching: Use WARN_ON and fix check in poking_init
  2022-10-25  4:44 ` [PATCH v9 3/7] powerpc/code-patching: Use WARN_ON and fix check in poking_init Benjamin Gray
@ 2022-11-02  9:38   ` Christophe Leroy
  2022-11-02 22:42     ` Benjamin Gray
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Leroy @ 2022-11-02  9:38 UTC (permalink / raw)
  To: Benjamin Gray, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin



On 25/10/2022 at 06:44, Benjamin Gray wrote:
> BUG_ON() when failing to initialise the code patching window is
> excessive, as most critical patching happens during boot before strict
> RWX control is enabled. Failure to patch after boot is not inherently
> fatal, so the decision to abort is better left to the caller.
> 
> The return value of cpuhp_setup_state() is also >= 0 on success,
> so check for < 0.
> 
> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
> Reviewed-by: Russell Currey <ruscur@russell.cc>
> ---
> v9:	* Reword commit message to explain why init failure is not fatal
> ---
>   arch/powerpc/lib/code-patching.c | 13 +++++--------
>   1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index 54e145247643..3b3b09d5d2e1 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -82,16 +82,13 @@ static int text_area_cpu_down(unsigned int cpu)
>   
>   static __ro_after_init DEFINE_STATIC_KEY_FALSE(poking_init_done);
>   
> -/*
> - * Although BUG_ON() is rude, in this case it should only happen if ENOMEM, and
> - * we judge it as being preferable to a kernel that will crash later when
> - * someone tries to use patch_instruction().
> - */
>   void __init poking_init(void)
>   {
> -	BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> -		"powerpc/text_poke:online", text_area_cpu_up,
> -		text_area_cpu_down));
> +	WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> +				  "powerpc/text_poke:online",
> +				  text_area_cpu_up,
> +				  text_area_cpu_down) < 0);
> +
>   	static_branch_enable(&poking_init_done);

Wouldn't it be better to not enable the poking_init_done branch if the
allocation failed?
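
Something along these lines, perhaps (only a sketch of the suggestion,
untested):

	void __init poking_init(void)
	{
		int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
					    "powerpc/text_poke:online",
					    text_area_cpu_up,
					    text_area_cpu_down);

		/* cpuhp_setup_state() returns >= 0 on success */
		if (WARN_ON(ret < 0))
			return;

		static_branch_enable(&poking_init_done);
	}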

>   }
>   


* Re: [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded
  2022-10-25  4:44 ` [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded Benjamin Gray
  2022-10-26  0:47   ` Benjamin Gray
@ 2022-11-02  9:43   ` Christophe Leroy
  2022-11-02 10:13     ` Christophe Leroy
  2022-11-02 22:58     ` Benjamin Gray
  1 sibling, 2 replies; 26+ messages in thread
From: Christophe Leroy @ 2022-11-02  9:43 UTC (permalink / raw)
  To: Benjamin Gray, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin



On 25/10/2022 at 06:44, Benjamin Gray wrote:
> Verify that, if the instruction patching did not return an error, the
> value stored at the given address now equals the instruction it was
> patched to.

Why do we need that verification? Until now it wasn't necessary; can
you describe a failure that occurs because we don't have this
verification? Or is it that until now it was reliable, but the new method
you are adding will not be as reliable as before?

What worries me with that new verification is that you are reading a
memory address which is mostly only used for code execution. That means:
- You will almost always take a DATA TLB Miss, hence a performance impact.
- If one day we want Exec-only text mappings, it will become problematic.

So really the question is: is that patch really required?

> 
> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
> ---
>   arch/powerpc/lib/code-patching.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index 3b3b09d5d2e1..b0a12b2d5a9b 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -192,6 +192,8 @@ static int do_patch_instruction(u32 *addr, ppc_inst_t instr)
>   	err = __do_patch_instruction(addr, instr);
>   	local_irq_restore(flags);
>   
> +	WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));
> +
>   	return err;
>   }
>   #else /* !CONFIG_STRICT_KERNEL_RWX */


* Re: [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize
  2022-10-25  4:44 ` [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize Benjamin Gray
@ 2022-11-02  9:56   ` Christophe Leroy
  2022-11-03  0:39     ` Benjamin Gray
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Leroy @ 2022-11-02  9:56 UTC (permalink / raw)
  To: Benjamin Gray, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin



On 25/10/2022 at 06:44, Benjamin Gray wrote:
> Adds a local TLB flush operation that works given an mm_struct, VA to
> flush, and page size representation. Most implementations mirror the
> surrounding code. The book3s/32/tlbflush.h implementation is left as
> a WARN_ONCE_ON because it is more complicated and not required for

s/ONCE_ON/ONCE

> anything as yet.

Is a WARN_ONCE() really needed? Can't a BUILD_BUG() be used instead?
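
That is, something like the following sketch (would need <linux/build_bug.h>
or an equivalent include):

	static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
						      unsigned long vmaddr, int psize)
	{
		BUILD_BUG();
	}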


> 
> This removes the need to create a vm_area_struct, which the temporary
> patching mm work does not need.
> 
> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
> ---
> v9:	* Replace book3s/32/tlbflush.h implementation with warning
> ---
>   arch/powerpc/include/asm/book3s/32/tlbflush.h      | 9 +++++++++
>   arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 5 +++++
>   arch/powerpc/include/asm/book3s/64/tlbflush.h      | 8 ++++++++
>   arch/powerpc/include/asm/nohash/tlbflush.h         | 8 ++++++++
>   4 files changed, 30 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/32/tlbflush.h b/arch/powerpc/include/asm/book3s/32/tlbflush.h
> index ba1743c52b56..14d989d41f75 100644
> --- a/arch/powerpc/include/asm/book3s/32/tlbflush.h
> +++ b/arch/powerpc/include/asm/book3s/32/tlbflush.h
> @@ -2,6 +2,8 @@
>   #ifndef _ASM_POWERPC_BOOK3S_32_TLBFLUSH_H
>   #define _ASM_POWERPC_BOOK3S_32_TLBFLUSH_H
>   
> +#include <asm/bug.h>
> +
>   #define MMU_NO_CONTEXT      (0)
>   /*
>    * TLB flushing for "classic" hash-MMU 32-bit CPUs, 6xx, 7xx, 7xxx
> @@ -74,6 +76,13 @@ static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>   {
>   	flush_tlb_page(vma, vmaddr);
>   }
> +
> +static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
> +					      unsigned long vmaddr, int psize)
> +{
> +	WARN_ONCE(true, "local TLB flush not implemented");

Is it possible to use BUILD_BUG() instead?

> +}
> +
>   static inline void local_flush_tlb_mm(struct mm_struct *mm)
>   {
>   	flush_tlb_mm(mm);
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> index fab8332fe1ad..8fd9dc49b2a1 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> @@ -94,6 +94,11 @@ static inline void hash__local_flush_tlb_page(struct vm_area_struct *vma,
>   {
>   }
>   
> +static inline void hash__local_flush_tlb_page_psize(struct mm_struct *mm,
> +						    unsigned long vmaddr, int psize)
> +{
> +}
> +

Is it worth an empty function?

>   static inline void hash__flush_tlb_page(struct vm_area_struct *vma,
>   				    unsigned long vmaddr)
>   {
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> index 67655cd60545..2d839dd5c08c 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> @@ -92,6 +92,14 @@ static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>   	return hash__local_flush_tlb_page(vma, vmaddr);
>   }
>   
> +static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
> +					      unsigned long vmaddr, int psize)
> +{
> +	if (radix_enabled())
> +		return radix__local_flush_tlb_page_psize(mm, vmaddr, psize);
> +	return hash__local_flush_tlb_page_psize(mm, vmaddr, psize);

Those functions are 'void', so they shouldn't need the "return".

Could just be:

+static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
+					      unsigned long vmaddr, int psize)
+{
+	if (radix_enabled())
+		radix__local_flush_tlb_page_psize(mm, vmaddr, psize);
+}


> +}
> +
>   static inline void local_flush_all_mm(struct mm_struct *mm)
>   {
>   	if (radix_enabled())
> diff --git a/arch/powerpc/include/asm/nohash/tlbflush.h b/arch/powerpc/include/asm/nohash/tlbflush.h
> index bdaf34ad41ea..432bca4cac62 100644
> --- a/arch/powerpc/include/asm/nohash/tlbflush.h
> +++ b/arch/powerpc/include/asm/nohash/tlbflush.h
> @@ -45,6 +45,12 @@ static inline void local_flush_tlb_page(struct vm_area_struct *vma, unsigned lon
>   	asm volatile ("tlbie %0; sync" : : "r" (vmaddr) : "memory");
>   }
>   
> +static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
> +					      unsigned long vmaddr, int psize)
> +{
> +	asm volatile ("tlbie %0; sync" : : "r" (vmaddr) : "memory");
> +}
> +
>   static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>   {
>   	start &= PAGE_MASK;
> @@ -58,6 +64,8 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
>   extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
>   extern void local_flush_tlb_mm(struct mm_struct *mm);
>   extern void local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
> +extern void local_flush_tlb_page_psize(struct mm_struct *mm,
> +				       unsigned long vmaddr, int psize);

That function doesn't seem to be defined anywhere. Is this prototype
just to make the compiler happy? If so, a static inline with a BUILD_BUG
would likely be better, as it would allow detection of a build problem at
compile time instead of link time.

By the way, the 'extern' keyword is pointless and deprecated for function
prototypes; please don't add new ones, even if other historical
prototypes have one.

>   
>   extern void __local_flush_tlb_page(struct mm_struct *mm, unsigned long vmaddr,
>   				   int tsize, int ind);


* Re: [PATCH v9 6/7] powerpc/code-patching: Use temporary mm for Radix MMU
  2022-10-25  4:44 ` [PATCH v9 6/7] powerpc/code-patching: Use temporary mm for Radix MMU Benjamin Gray
@ 2022-11-02 10:11   ` Christophe Leroy
  2022-11-03  3:10     ` Benjamin Gray
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Leroy @ 2022-11-02 10:11 UTC (permalink / raw)
  To: Benjamin Gray, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin



On 25/10/2022 at 06:44, Benjamin Gray wrote:
> From: "Christopher M. Riedl" <cmr@bluescreens.de>
> 
> x86 supports the notion of a temporary mm which restricts access to
> temporary PTEs to a single CPU. A temporary mm is useful for situations
> where a CPU needs to perform sensitive operations (such as patching a
> STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> said mappings to other CPUs. Another benefit is that other CPU TLBs do
> not need to be flushed when the temporary mm is torn down.
> 
> Mappings in the temporary mm can be set in the userspace portion of the
> address-space.
> 
> Interrupts must be disabled while the temporary mm is in use. HW
> breakpoints, which may have been set by userspace as watchpoints on
> addresses now within the temporary mm, are saved and disabled when
> loading the temporary mm. The HW breakpoints are restored when unloading
> the temporary mm. All HW breakpoints are indiscriminately disabled while
> the temporary mm is in use - this may include breakpoints set by perf.
> 
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> space. The patching address is randomized between PAGE_SIZE and
> DEFAULT_MAP_WINDOW-PAGE_SIZE.
> 
> Bits of entropy with 64K page size on BOOK3S_64:
> 
> 	bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> 
> 	PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> 	bits of entropy = log2(128TB / 64K)
> 	bits of entropy = 31
> 
> The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> operates - by default the space above DEFAULT_MAP_WINDOW is not
> available. Currently the Hash MMU does not use a temporary mm so
> technically this upper limit isn't necessary; however, a larger
> randomization range does not further "harden" this overall approach and
> future work may introduce patching with a temporary mm on Hash as well.
> 
> Randomization occurs only once during initialization for each CPU as it
> comes online.
> 
> The patching page is mapped with PAGE_KERNEL to set EAA[0] for the PTE
> which ignores the AMR (so no need to unlock/lock KUAP) according to
> PowerISA v3.0b Figure 35 on Radix.
> 
> Based on x86 implementation:
> 
> commit 4fc19708b165
> ("x86/alternatives: Initialize temporary mm for patching")
> 
> and:
> 
> commit b3fd8e83ada0
> ("x86/alternatives: Use temporary mm for text poking")
> 
> From: Benjamin Gray <bgray@linux.ibm.com>
> 
> Synchronisation is done according to ISA 3.1B Book 3 Chapter 13
> "Synchronization Requirements for Context Alterations". Switching the mm
> is a change to the PID, which requires a CSI before and after the change,
> and a hwsync between the last instruction that performs address
> translation for an associated storage access and the change itself.
> 
> Instruction fetch is an associated storage access, but the instruction
> address mappings are not being changed, so it should not matter which
> context they use. We must still perform a hwsync to guard arbitrary
> prior code that may have accessed a userspace address.
> 
> TLB invalidation is local and VA specific. Local because only this core
> used the patching mm, and VA specific because we only care that the
> writable mapping is purged. Leaving the other mappings intact is more
> efficient, especially when performing many code patches in a row (e.g.,
> as ftrace would).
> 
> Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
> ---
> v9:	* Add back Christopher M. Riedl signed-off-by
> 	* Remove temp_mm_state
> ---
>   arch/powerpc/lib/code-patching.c | 221 ++++++++++++++++++++++++++++++-
>   1 file changed, 216 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index b0a12b2d5a9b..3fe99d0086fc 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -4,12 +4,17 @@
>    */
>   
>   #include <linux/kprobes.h>
> +#include <linux/mmu_context.h>
> +#include <linux/random.h>
>   #include <linux/vmalloc.h>
>   #include <linux/init.h>
>   #include <linux/cpuhotplug.h>
>   #include <linux/uaccess.h>
>   #include <linux/jump_label.h>
>   
> +#include <asm/debug.h>
> +#include <asm/pgalloc.h>
> +#include <asm/tlb.h>
>   #include <asm/tlbflush.h>
>   #include <asm/page.h>
>   #include <asm/code-patching.h>
> @@ -42,11 +47,54 @@ int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
>   }
>   
>   #ifdef CONFIG_STRICT_KERNEL_RWX
> +
>   static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> +static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
> +static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> +static DEFINE_PER_CPU(pte_t *, cpu_patching_pte);
>   
>   static int map_patch_area(void *addr, unsigned long text_poke_addr);
>   static void unmap_patch_area(unsigned long addr);
>   
> +static bool mm_patch_enabled(void)
> +{
> +	return IS_ENABLED(CONFIG_SMP) && radix_enabled();
> +}
> +
> +/*
> + * The following applies for Radix MMU. Hash MMU has different requirements,
> + * and so is not supported.
> + *
> + * Changing mm requires context synchronising instructions on both sides of
> + * the context switch, as well as a hwsync between the last instruction for
> + * which the address of an associated storage access was translated using
> + * the current context, and the switch itself.
> + *
> + * switch_mm_irqs_off() performs an isync after the context switch. It is
> + * the responsibility of the caller to perform the CSI and hwsync before
> + * starting/stopping the temp mm.
> + */
> +static struct mm_struct *start_using_temp_mm(struct mm_struct *temp_mm)
> +{
> +	struct mm_struct *orig_mm = current->active_mm;
> +
> +	lockdep_assert_irqs_disabled();
> +	switch_mm_irqs_off(orig_mm, temp_mm, current);
> +
> +	WARN_ON(!mm_is_thread_local(temp_mm));
> +
> +	suspend_breakpoints();
> +	return orig_mm;
> +}
> +
> +static void stop_using_temp_mm(struct mm_struct *temp_mm,
> +			       struct mm_struct *orig_mm)
> +{
> +	lockdep_assert_irqs_disabled();
> +	switch_mm_irqs_off(temp_mm, orig_mm, current);
> +	restore_breakpoints();
> +}
> +
>   static int text_area_cpu_up(unsigned int cpu)
>   {
>   	struct vm_struct *area;
> @@ -80,14 +128,127 @@ static int text_area_cpu_down(unsigned int cpu)
>   	return 0;
>   }
>   
> +static int text_area_cpu_up_mm(unsigned int cpu)
> +{
> +	struct mm_struct *mm;
> +	unsigned long addr;
> +	pgd_t *pgdp;
> +	p4d_t *p4dp;
> +	pud_t *pudp;
> +	pmd_t *pmdp;
> +	pte_t *ptep;
> +
> +	mm = copy_init_mm();
> +	if (WARN_ON(!mm))
> +		goto fail_no_mm;
> +
> +	/*
> +	 * Choose a random page-aligned address from the interval
> +	 * [PAGE_SIZE .. DEFAULT_MAP_WINDOW - PAGE_SIZE].
> +	 * The lower address bound is PAGE_SIZE to avoid the zero-page.
> +	 */
> +	addr = (1 + (get_random_long() % (DEFAULT_MAP_WINDOW / PAGE_SIZE - 2))) << PAGE_SHIFT;

There is some work in progress to get rid of (get_random_long() % 
something), see 
https://patchwork.kernel.org/project/linux-media/cover/20221010230613.1076905-1-Jason@zx2c4.com/

> +
> +	/*
> +	 * PTE allocation uses GFP_KERNEL which means we need to
> +	 * pre-allocate the PTE here because we cannot do the
> +	 * allocation during patching when IRQs are disabled.
> +	 */
> +	pgdp = pgd_offset(mm, addr);
> +
> +	p4dp = p4d_alloc(mm, pgdp, addr);
> +	if (WARN_ON(!p4dp))
> +		goto fail_no_p4d;
> +
> +	pudp = pud_alloc(mm, p4dp, addr);
> +	if (WARN_ON(!pudp))
> +		goto fail_no_pud;
> +
> +	pmdp = pmd_alloc(mm, pudp, addr);
> +	if (WARN_ON(!pmdp))
> +		goto fail_no_pmd;
> +
> +	ptep = pte_alloc_map(mm, pmdp, addr);
> +	if (WARN_ON(!ptep))
> +		goto fail_no_pte;

Isn't there a standard generic function to do that?

For instance, __get_locked_pte() seems to do more or less the same.
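
A rough sketch of what the pre-allocation might look like with that helper
(only an illustration; whether get_locked_pte() is acceptable here is
discussed further down the thread):

  	pte_t *ptep;
  	spinlock_t *ptl;

  	/* allocates any missing p4d/pud/pmd levels and returns the mapped, locked PTE */
  	ptep = get_locked_pte(mm, addr, &ptl);
  	if (WARN_ON(!ptep)) {
  		mmput(mm);
  		return -ENOMEM;
  	}
  	/* only the pre-allocation is wanted here, so drop the lock straight away */
  	pte_unmap_unlock(ptep, ptl);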

> +
> +	this_cpu_write(cpu_patching_mm, mm);
> +	this_cpu_write(cpu_patching_addr, addr);
> +	this_cpu_write(cpu_patching_pte, ptep);
> +
> +	return 0;
> +
> +fail_no_pte:
> +	pmd_free(mm, pmdp);
> +	mm_dec_nr_pmds(mm);
> +fail_no_pmd:
> +	pud_free(mm, pudp);
> +	mm_dec_nr_puds(mm);
> +fail_no_pud:
> +	p4d_free(mm, p4dp);
> +fail_no_p4d:
> +	mmput(mm);
> +fail_no_mm:
> +	return -ENOMEM;
> +}
> +
> +static int text_area_cpu_down_mm(unsigned int cpu)
> +{
> +	struct mm_struct *mm;
> +	unsigned long addr;
> +	pte_t *ptep;
> +	pmd_t *pmdp;
> +	pud_t *pudp;
> +	p4d_t *p4dp;
> +	pgd_t *pgdp;
> +
> +	mm = this_cpu_read(cpu_patching_mm);
> +	addr = this_cpu_read(cpu_patching_addr);
> +
> +	pgdp = pgd_offset(mm, addr);
> +	p4dp = p4d_offset(pgdp, addr);
> +	pudp = pud_offset(p4dp, addr);
> +	pmdp = pmd_offset(pudp, addr);
> +	ptep = pte_offset_map(pmdp, addr);
> +
> +	pte_free(mm, ptep);
> +	pmd_free(mm, pmdp);
> +	pud_free(mm, pudp);
> +	p4d_free(mm, p4dp);
> +	/* pgd is dropped in mmput */
> +
> +	mm_dec_nr_ptes(mm);
> +	mm_dec_nr_pmds(mm);
> +	mm_dec_nr_puds(mm);

Same question: can't something generic be used, something like 
free_pgd_range()?

> +
> +	mmput(mm);
> +
> +	this_cpu_write(cpu_patching_mm, NULL);
> +	this_cpu_write(cpu_patching_addr, 0);
> +	this_cpu_write(cpu_patching_pte, NULL);
> +
> +	return 0;
> +}
> +
>   static __ro_after_init DEFINE_STATIC_KEY_FALSE(poking_init_done);
>   
>   void __init poking_init(void)
>   {
> -	WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> -				  "powerpc/text_poke:online",
> -				  text_area_cpu_up,
> -				  text_area_cpu_down) < 0);
> +	int ret;
> +
> +	if (mm_patch_enabled())
> +		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> +					"powerpc/text_poke_mm:online",
> +					text_area_cpu_up_mm,
> +					text_area_cpu_down_mm);
> +	else
> +		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> +					"powerpc/text_poke:online",
> +					text_area_cpu_up,
> +					text_area_cpu_down);
> +
> +	/* cpuhp_setup_state returns >= 0 on success */
> +	WARN_ON(ret < 0);
>   
>   	static_branch_enable(&poking_init_done);
>   }
> @@ -145,6 +306,53 @@ static void unmap_patch_area(unsigned long addr)
>   	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>   }
>   
> +static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
> +{
> +	int err;
> +	u32 *patch_addr;
> +	unsigned long text_poke_addr;
> +	pte_t *pte;
> +	unsigned long pfn = get_patch_pfn(addr);
> +	struct mm_struct *patching_mm;
> +	struct mm_struct *orig_mm;
> +
> +	patching_mm = __this_cpu_read(cpu_patching_mm);
> +	pte = __this_cpu_read(cpu_patching_pte);
> +	text_poke_addr = __this_cpu_read(cpu_patching_addr);
> +	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
> +
> +	if (unlikely(!patching_mm))
> +		return -ENOMEM;
> +
> +	set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL));
> +
> +	/* order PTE update before use, also serves as the hwsync */
> +	asm volatile("ptesync": : :"memory");

You assume it is radix only ?

> +
> +	/* order context switch after arbitrary prior code */
> +	isync();
> +
> +	orig_mm = start_using_temp_mm(patching_mm);
> +
> +	err = __patch_instruction(addr, instr, patch_addr);
> +
> +	/* hwsync performed by __patch_instruction (sync) if successful */
> +	if (err)
> +		mb();  /* sync */
> +
> +	/* context synchronisation performed by __patch_instruction (isync or exception) */
> +	stop_using_temp_mm(patching_mm, orig_mm);
> +
> +	pte_clear(patching_mm, text_poke_addr, pte);
> +	/*
> +	 * ptesync to order PTE update before TLB invalidation done
> +	 * by radix__local_flush_tlb_page_psize (in _tlbiel_va)
> +	 */
> +	local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);
> +
> +	return err;
> +}
> +
>   static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
>   {
>   	int err;
> @@ -189,7 +397,10 @@ static int do_patch_instruction(u32 *addr, ppc_inst_t instr)
>   		return raw_patch_instruction(addr, instr);
>   
>   	local_irq_save(flags);
> -	err = __do_patch_instruction(addr, instr);
> +	if (mm_patch_enabled())
> +		err = __do_patch_instruction_mm(addr, instr);
> +	else
> +		err = __do_patch_instruction(addr, instr);
>   	local_irq_restore(flags);
>   
>   	WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded
  2022-11-02  9:43   ` Christophe Leroy
@ 2022-11-02 10:13     ` Christophe Leroy
  2022-11-02 23:02       ` Benjamin Gray
  2022-11-02 22:58     ` Benjamin Gray
  1 sibling, 1 reply; 26+ messages in thread
From: Christophe Leroy @ 2022-11-02 10:13 UTC (permalink / raw)
  To: Benjamin Gray, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin



On 02/11/2022 at 10:43, Christophe Leroy wrote:
> 
> 
> On 25/10/2022 at 06:44, Benjamin Gray wrote:
>> Verifies that if the instruction patching did not return an error then
>> the value stored at the given address to patch is now equal to the
>> instruction we patched it to.
> 
> Why do we need that verification? Until now it wasn't necessary. Can 
> you describe a failure that occurs because we don't have this 
> verification? Or is it that until now it was reliable, but the new method 
> you are adding will not be as reliable as before?
> 
> What worries me with that new verification is that you are reading a 
> memory address which is mostly only used for code execution. That means:
> - You will almost always take a DATA TLB Miss, hence performance impact.
> - If one day we want Exec-only text mappings, it will become problematic.
> 
> So really the question is, is that patch really required ?
> 
>>
>> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
>> ---
>>   arch/powerpc/lib/code-patching.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/powerpc/lib/code-patching.c 
>> b/arch/powerpc/lib/code-patching.c
>> index 3b3b09d5d2e1..b0a12b2d5a9b 100644
>> --- a/arch/powerpc/lib/code-patching.c
>> +++ b/arch/powerpc/lib/code-patching.c
>> @@ -192,6 +192,8 @@ static int do_patch_instruction(u32 *addr, 
>> ppc_inst_t instr)
>>       err = __do_patch_instruction(addr, instr);
>>       local_irq_restore(flags);
>> +    WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));
>> +

Another point: you are doing the check outside of the IRQ-disabled section; 
is that correct? What if an interrupt changed it in between?

Or isn't that possible? In that case, what's the real purpose of 
disabling IRQs here?

>>       return err;
>>   }
>>   #else /* !CONFIG_STRICT_KERNEL_RWX */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 7/7] powerpc/code-patching: Consolidate and cache per-cpu patching context
  2022-10-25  4:44 ` [PATCH v9 7/7] powerpc/code-patching: Consolidate and cache per-cpu patching context Benjamin Gray
@ 2022-11-02 10:17   ` Christophe Leroy
  0 siblings, 0 replies; 26+ messages in thread
From: Christophe Leroy @ 2022-11-02 10:17 UTC (permalink / raw)
  To: Benjamin Gray, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin



On 25/10/2022 at 06:44, Benjamin Gray wrote:
> With the temp mm context support, there are CPU local variables to hold
> the patch address and pte. Use these in the non-temp mm path as well
> instead of adding a level of indirection through the text_poke_area
> vm_struct and pointer chasing the pte.
> 
> As both paths use these fields now, there is no need to let unreferenced
> variables be dropped by the compiler, so it is cleaner to merge them into
> a single context struct. This has the additional benefit of removing a
> redundant CPU local pointer, as only one of cpu_patching_mm /
> text_poke_area is ever used, while remaining well-typed.

Another advantage is to group all data for a given CPU in the same 
cacheline.
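
If grouping on one cacheline is part of the intent, the per-CPU definition
could also say so explicitly. A minimal sketch (the alignment annotation is
an assumption, not something the patch currently does):

  	struct patch_context {
  		union {
  			struct vm_struct *text_poke_area;
  			struct mm_struct *mm;
  		};
  		unsigned long addr;
  		pte_t *pte;
  	};

  	/* cacheline-aligned variant of the per-CPU definition */
  	static DEFINE_PER_CPU_ALIGNED(struct patch_context, cpu_patching_context);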

> 
> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
> ---
> v9:	* Consolidate patching context into single struct
> ---
>   arch/powerpc/lib/code-patching.c | 58 ++++++++++++++++++--------------
>   1 file changed, 33 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index 3fe99d0086fc..cefb938f7217 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -48,10 +48,16 @@ int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
>   
>   #ifdef CONFIG_STRICT_KERNEL_RWX
>   
> -static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> -static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
> -static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> -static DEFINE_PER_CPU(pte_t *, cpu_patching_pte);
> +struct patch_context {
> +	union {
> +		struct vm_struct *text_poke_area;
> +		struct mm_struct *mm;
> +	};
> +	unsigned long addr;
> +	pte_t * pte;
> +};
> +
> +static DEFINE_PER_CPU(struct patch_context, cpu_patching_context);
>   
>   static int map_patch_area(void *addr, unsigned long text_poke_addr);
>   static void unmap_patch_area(unsigned long addr);
> @@ -116,15 +122,19 @@ static int text_area_cpu_up(unsigned int cpu)
>   
>   	unmap_patch_area(addr);
>   
> -	this_cpu_write(text_poke_area, area);
> +	this_cpu_write(cpu_patching_context.text_poke_area, area);
> +	this_cpu_write(cpu_patching_context.addr, addr);
> +	this_cpu_write(cpu_patching_context.pte, virt_to_kpte(addr));
>   
>   	return 0;
>   }
>   
>   static int text_area_cpu_down(unsigned int cpu)
>   {
> -	free_vm_area(this_cpu_read(text_poke_area));
> -	this_cpu_write(text_poke_area, NULL);
> +	free_vm_area(this_cpu_read(cpu_patching_context.text_poke_area));
> +	this_cpu_write(cpu_patching_context.text_poke_area, NULL);
> +	this_cpu_write(cpu_patching_context.addr, 0);
> +	this_cpu_write(cpu_patching_context.pte, NULL);
>   	return 0;
>   }
>   
> @@ -172,9 +182,9 @@ static int text_area_cpu_up_mm(unsigned int cpu)
>   	if (WARN_ON(!ptep))
>   		goto fail_no_pte;
>   
> -	this_cpu_write(cpu_patching_mm, mm);
> -	this_cpu_write(cpu_patching_addr, addr);
> -	this_cpu_write(cpu_patching_pte, ptep);
> +	this_cpu_write(cpu_patching_context.mm, mm);
> +	this_cpu_write(cpu_patching_context.addr, addr);
> +	this_cpu_write(cpu_patching_context.pte, ptep);
>   
>   	return 0;
>   
> @@ -202,8 +212,8 @@ static int text_area_cpu_down_mm(unsigned int cpu)
>   	p4d_t *p4dp;
>   	pgd_t *pgdp;
>   
> -	mm = this_cpu_read(cpu_patching_mm);
> -	addr = this_cpu_read(cpu_patching_addr);
> +	mm = this_cpu_read(cpu_patching_context.mm);
> +	addr = this_cpu_read(cpu_patching_context.addr);
>   
>   	pgdp = pgd_offset(mm, addr);
>   	p4dp = p4d_offset(pgdp, addr);
> @@ -223,9 +233,9 @@ static int text_area_cpu_down_mm(unsigned int cpu)
>   
>   	mmput(mm);
>   
> -	this_cpu_write(cpu_patching_mm, NULL);
> -	this_cpu_write(cpu_patching_addr, 0);
> -	this_cpu_write(cpu_patching_pte, NULL);
> +	this_cpu_write(cpu_patching_context.mm, NULL);
> +	this_cpu_write(cpu_patching_context.addr, 0);
> +	this_cpu_write(cpu_patching_context.pte, NULL);
>   
>   	return 0;
>   }
> @@ -316,9 +326,9 @@ static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
>   	struct mm_struct *patching_mm;
>   	struct mm_struct *orig_mm;
>   
> -	patching_mm = __this_cpu_read(cpu_patching_mm);
> -	pte = __this_cpu_read(cpu_patching_pte);
> -	text_poke_addr = __this_cpu_read(cpu_patching_addr);
> +	patching_mm = __this_cpu_read(cpu_patching_context.mm);
> +	pte = __this_cpu_read(cpu_patching_context.pte);
> +	text_poke_addr = __this_cpu_read(cpu_patching_context.addr);
>   	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
>   
>   	if (unlikely(!patching_mm))
> @@ -357,19 +367,17 @@ static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
>   {
>   	int err;
>   	u32 *patch_addr;
> -	struct vm_struct *area;
>   	unsigned long text_poke_addr;
>   	pte_t *pte;
>   	unsigned long pfn = get_patch_pfn(addr);
>   
> -	area = __this_cpu_read(text_poke_area);
> -	if (unlikely(!area))
> -		return -ENOMEM;
> -
> -	text_poke_addr = (unsigned long)area->addr & PAGE_MASK;
> +	text_poke_addr = (unsigned long)__this_cpu_read(cpu_patching_context.addr) & PAGE_MASK;
>   	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
>   
> -	pte = virt_to_kpte(text_poke_addr);
> +	if (unlikely(!text_poke_addr))
> +		return -ENOMEM;
> +
> +	pte = __this_cpu_read(cpu_patching_context.pte);
>   	__set_pte_at(&init_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL), 0);
>   	/* See ptesync comment in radix__set_pte_at() */
>   	if (radix_enabled())

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 2/7] powerpc/code-patching: Handle RWX patching initialisation error
  2022-11-02  9:36   ` Christophe Leroy
@ 2022-11-02 22:37     ` Benjamin Gray
  0 siblings, 0 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-11-02 22:37 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin

On Wed, 2022-11-02 at 09:36 +0000, Christophe Leroy wrote:
> On 25/10/2022 at 06:44, Benjamin Gray wrote:
> > Detect and abort __do_patch_instruction() when there is no
> > text_poke_area,
> > which implies there is no patching address. This allows
> > patch_instruction()
> > to fail gracefully and let the caller decide what to do, as opposed
> > to
> > the current behaviour of kernel panicking when the null pointer is
> > dereferenced.
> 
> Is there any reason at all to have no text_poke_area ?
> 
> If that's the boot CPU, then it means we are really early in the boot
> process and will use raw_patch_instruction() directly. Or it means we
> don't have enough memory for the boot CPU, in which case you are going 
> to have so many problems that it is not worth any effort.
> 
> If it is not the boot CPU, isn't there a way to not add a CPU for which 
> the allocation has failed? If text_area_cpu_up() returns an error, 
> isn't the CPU deactivated?

Right, I hadn't seen that the CPU isn't onlined if the startup fails.
That would make the check redundant then.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 3/7] powerpc/code-patching: Use WARN_ON and fix check in poking_init
  2022-11-02  9:38   ` Christophe Leroy
@ 2022-11-02 22:42     ` Benjamin Gray
  0 siblings, 0 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-11-02 22:42 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin

On Wed, 2022-11-02 at 09:38 +0000, Christophe Leroy wrote:
> On 25/10/2022 at 06:44, Benjamin Gray wrote:
> > diff --git a/arch/powerpc/lib/code-patching.c
> > b/arch/powerpc/lib/code-patching.c
> > index 54e145247643..3b3b09d5d2e1 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -82,16 +82,13 @@ static int text_area_cpu_down(unsigned int cpu)
> >   
> >   static __ro_after_init DEFINE_STATIC_KEY_FALSE(poking_init_done);
> >   
> > -/*
> > - * Although BUG_ON() is rude, in this case it should only happen
> > if ENOMEM, and
> > - * we judge it as being preferable to a kernel that will crash
> > later when
> > - * someone tries to use patch_instruction().
> > - */
> >   void __init poking_init(void)
> >   {
> > -       BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> > -               "powerpc/text_poke:online", text_area_cpu_up,
> > -               text_area_cpu_down));
> > +       WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> > +                                 "powerpc/text_poke:online",
> > +                                 text_area_cpu_up,
> > +                                 text_area_cpu_down) < 0);
> > +
> >         static_branch_enable(&poking_init_done);
> 
> Wouldn't it be better to not enable the poking_init_done branch if the 
> allocation failed?

Yeah that probably works better. If it manages to reach a patch attempt
after that it should fail anyway with strict RWX.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded
  2022-11-02  9:43   ` Christophe Leroy
  2022-11-02 10:13     ` Christophe Leroy
@ 2022-11-02 22:58     ` Benjamin Gray
  1 sibling, 0 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-11-02 22:58 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin

On Wed, 2022-11-02 at 09:43 +0000, Christophe Leroy wrote:
> On 25/10/2022 at 06:44, Benjamin Gray wrote:
> > Verifies that if the instruction patching did not return an error
> > then
> > the value stored at the given address to patch is now equal to the
> > instruction we patched it to.
> 
> Why do we need that verification? Until now it wasn't necessary. Can
> you describe a failure that occurs because we don't have this 
> verification? Or is it that until now it was reliable, but the new
> method you are adding will not be as reliable as before?
> 
> What worries me with that new verification is that you are reading a 
> memory address which is mostly only used for code execution. That means:
> - You will almost always take a DATA TLB Miss, hence a performance impact.
> - If one day we want Exec-only text mappings, it will become problematic.
> 
> So really the question is, is that patch really required ?

It's required as much as any sanity check in the kernel. I agree
running it all the time is not great though, I would prefer to make it
a debug-only check. I've seen VM_WARN_ON be used for this purpose I
think? It's especially useful now with churn on the code-patching code.
I don't expect the new method to be unreliable—I wouldn't be submitting
it if I did—but I'd much prefer to have an obvious tell if it does turn
out so.

But the above is all moot because we allow parallel patching, so the
check is just plain incorrect.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded
  2022-11-02 10:13     ` Christophe Leroy
@ 2022-11-02 23:02       ` Benjamin Gray
  0 siblings, 0 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-11-02 23:02 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin

On Wed, 2022-11-02 at 11:13 +0100, Christophe Leroy wrote:
> On 02/11/2022 at 10:43, Christophe Leroy wrote:
> > On 25/10/2022 at 06:44, Benjamin Gray wrote:
> > > Verifies that if the instruction patching did not return an error
> > > then
> > > the value stored at the given address to patch is now equal to
> > > the
> > > instruction we patched it to.
> > 
> > Why do we need that verification? Until now it wasn't necessary. Can 
> > you describe a failure that occurs because we don't have this 
> > verification? Or is it that until now it was reliable, but the new 
> > method you are adding will not be as reliable as before?
> > 
> > What worries me with that new verification is that you are reading a 
> > memory address which is mostly only used for code execution. That means:
> > - You will almost always take a DATA TLB Miss, hence a performance impact.
> > - If one day we want Exec-only text mappings, it will become problematic.
> > 
> > So really the question is, is that patch really required ?
> > 
> > > 
> > > Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
> > > ---
> > >   arch/powerpc/lib/code-patching.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/arch/powerpc/lib/code-patching.c 
> > > b/arch/powerpc/lib/code-patching.c
> > > index 3b3b09d5d2e1..b0a12b2d5a9b 100644
> > > --- a/arch/powerpc/lib/code-patching.c
> > > +++ b/arch/powerpc/lib/code-patching.c
> > > @@ -192,6 +192,8 @@ static int do_patch_instruction(u32 *addr, 
> > > ppc_inst_t instr)
> > >       err = __do_patch_instruction(addr, instr);
> > >       local_irq_restore(flags);
> > > +    WARN_ON(!err && !ppc_inst_equal(instr,
> > > ppc_inst_read(addr)));
> > > +
> 
> Another point: you are doing the check outside of the IRQ-disabled section; 
> is that correct? What if an interrupt changed it in between?
> 
> Or isn't that possible? In that case, what's the real purpose of 
> disabling IRQs here?

From my understanding, disabling IRQs also serves to prevent the writable
mapping from being visible outside of the patching function. But I would
move the check inside the disabled section if I were keeping it.
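
A minimal sketch of that placement, keeping the existing check but moving it
before local_irq_restore() (illustration only, given the check may be dropped
entirely):

  	local_irq_save(flags);
  	err = __do_patch_instruction(addr, instr);
  	/* verify the patched word before this CPU re-enables interrupts */
  	WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));
  	local_irq_restore(flags);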

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize
  2022-11-02  9:56   ` Christophe Leroy
@ 2022-11-03  0:39     ` Benjamin Gray
  2022-11-03  0:45       ` Andrew Donnellan
  2022-11-07  6:58       ` Benjamin Gray
  0 siblings, 2 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-11-03  0:39 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin

On Wed, 2022-11-02 at 09:56 +0000, Christophe Leroy wrote:
> 
> 
> On 25/10/2022 at 06:44, Benjamin Gray wrote:
> > Adds a local TLB flush operation that works given an mm_struct, VA to
> > flush, and page size representation. Most implementations mirror the
> > surrounding code. The book3s/32/tlbflush.h implementation is left as
> > a WARN_ONCE_ON because it is more complicated and not required for
> 
> s/ONCE_ON/ONCE
> 
> > anything as yet.
> 
> Is a WARN_ONCE() really needed? Can't a BUILD_BUG() be used instead?

Looks like BUILD_BUG is safe. Tested with 83xx/mpc834x_mds_defconfig;
it doesn't break the build.
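
So the book3s/32 stub under discussion would reduce to something like this
(a sketch of the change being discussed, assuming the header providing
BUILD_BUG() is available there):

  	static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
  						      unsigned long vmaddr, int psize)
  	{
  		BUILD_BUG();
  	}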

> > 
> > This removes the need to create a vm_area_struct, which the
> > temporary
> > patching mm work does not need.
> > 
> > Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
> > ---
> > v9:     * Replace book3s/32/tlbflush.h implementation with warning
> > ---
> >   arch/powerpc/include/asm/book3s/32/tlbflush.h      | 9 +++++++++
> >   arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 5 +++++
> >   arch/powerpc/include/asm/book3s/64/tlbflush.h      | 8 ++++++++
> >   arch/powerpc/include/asm/nohash/tlbflush.h         | 8 ++++++++
> >   4 files changed, 30 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/book3s/32/tlbflush.h
> > b/arch/powerpc/include/asm/book3s/32/tlbflush.h
> > index ba1743c52b56..14d989d41f75 100644
> > --- a/arch/powerpc/include/asm/book3s/32/tlbflush.h
> > +++ b/arch/powerpc/include/asm/book3s/32/tlbflush.h
> > @@ -2,6 +2,8 @@
> >   #ifndef _ASM_POWERPC_BOOK3S_32_TLBFLUSH_H
> >   #define _ASM_POWERPC_BOOK3S_32_TLBFLUSH_H
> >   
> > +#include <asm/bug.h>
> > +
> >   #define MMU_NO_CONTEXT      (0)
> >   /*
> >    * TLB flushing for "classic" hash-MMU 32-bit CPUs, 6xx, 7xx,
> > 7xxx
> > @@ -74,6 +76,13 @@ static inline void local_flush_tlb_page(struct
> > vm_area_struct *vma,
> >   {
> >         flush_tlb_page(vma, vmaddr);
> >   }
> > +
> > +static inline void local_flush_tlb_page_psize(struct mm_struct
> > *mm,
> > +                                             unsigned long vmaddr,
> > int psize)
> > +{
> > +       WARN_ONCE(true, "local TLB flush not implemented");
> 
> Is it possible to use BUILD_BUG() instead ?

Yep, above

> > +}
> > +
> >   static inline void local_flush_tlb_mm(struct mm_struct *mm)
> >   {
> >         flush_tlb_mm(mm);
> > diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> > b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> > index fab8332fe1ad..8fd9dc49b2a1 100644
> > --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> > @@ -94,6 +94,11 @@ static inline void
> > hash__local_flush_tlb_page(struct vm_area_struct *vma,
> >   {
> >   }
> >   
> > +static inline void hash__local_flush_tlb_page_psize(struct
> > mm_struct *mm,
> > +                                                   unsigned long
> > vmaddr, int psize)
> > +{
> > +}
> > +
> 
> Is it worth an empty function ?

See end

> >   static inline void hash__flush_tlb_page(struct vm_area_struct
> > *vma,
> >                                     unsigned long vmaddr)
> >   {
> > diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h
> > b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> > index 67655cd60545..2d839dd5c08c 100644
> > --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
> > +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> > @@ -92,6 +92,14 @@ static inline void local_flush_tlb_page(struct
> > vm_area_struct *vma,
> >         return hash__local_flush_tlb_page(vma, vmaddr);
> >   }
> >   
> > +static inline void local_flush_tlb_page_psize(struct mm_struct
> > *mm,
> > +                                             unsigned long vmaddr,
> > int psize)
> > +{
> > +       if (radix_enabled())
> > +               return radix__local_flush_tlb_page_psize(mm,
> > vmaddr, psize);
> > +       return hash__local_flush_tlb_page_psize(mm, vmaddr, psize);
> 
> Those functions are 'void', shouldn't need the "return".
> 
> Could just be:
> 
> +static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
> +                                             unsigned long vmaddr,
> int psize)
> +{
> +       if (radix_enabled())
> +               radix__local_flush_tlb_page_psize(mm, vmaddr, psize);
> +}

See end

> > +}
> > +
> >   static inline void local_flush_all_mm(struct mm_struct *mm)
> >   {
> >         if (radix_enabled())
> > diff --git a/arch/powerpc/include/asm/nohash/tlbflush.h
> > b/arch/powerpc/include/asm/nohash/tlbflush.h
> > index bdaf34ad41ea..432bca4cac62 100644
> > --- a/arch/powerpc/include/asm/nohash/tlbflush.h
> > +++ b/arch/powerpc/include/asm/nohash/tlbflush.h
> > @@ -45,6 +45,12 @@ static inline void local_flush_tlb_page(struct
> > vm_area_struct *vma, unsigned lon
> >         asm volatile ("tlbie %0; sync" : : "r" (vmaddr) :
> > "memory");
> >   }
> >   
> > +static inline void local_flush_tlb_page_psize(struct mm_struct
> > *mm,
> > +                                             unsigned long vmaddr,
> > int psize)
> > +{
> > +       asm volatile ("tlbie %0; sync" : : "r" (vmaddr) :
> > "memory");
> > +}
> > +
> >   static inline void flush_tlb_kernel_range(unsigned long start,
> > unsigned long end)
> >   {
> >         start &= PAGE_MASK;
> > @@ -58,6 +64,8 @@ static inline void
> > flush_tlb_kernel_range(unsigned long start, unsigned long end
> >   extern void flush_tlb_kernel_range(unsigned long start, unsigned
> > long end);
> >   extern void local_flush_tlb_mm(struct mm_struct *mm);
> >   extern void local_flush_tlb_page(struct vm_area_struct *vma,
> > unsigned long vmaddr);
> > +extern void local_flush_tlb_page_psize(struct mm_struct *mm,
> > +                                      unsigned long vmaddr, int
> > psize);
> 
> That function doesn't seem to be defined anywhere. Is this prototype 
> just to make the compiler happy? If so, a static inline with a BUILD_BUG 
> would likely be better; it would allow detection of a build problem at 
> compile time instead of link time.

I think I just missed the implementation. It's easy to implement, though,
based on local_flush_tlb_page in mm/nohash/tlb.c, so I'll just add it:

  void local_flush_tlb_page_psize(struct mm_struct *mm,
  				unsigned long vmaddr, int psize)
  {
  	__local_flush_tlb_page(mm, vmaddr, mmu_get_tsize(psize), 0);
  }
  EXPORT_SYMBOL(local_flush_tlb_page_psize);

> 
> By the way, the 'extern' keyword is pointless and deprecated for
> function prototypes; please don't add new ones, even if other historical 
> prototypes have one.

This and the above commented parts match the style of the surrounding
implementations. For example,

	static inline void local_flush_tlb_mm(struct mm_struct *mm)
	{
		if (radix_enabled())
			return radix__local_flush_tlb_mm(mm);
		return hash__local_flush_tlb_mm(mm);
	}

I am not going to add code that is inconsistent with the surrounding
code. That just causes confusion later down the line when readers
wonder why this function is special compared to the others. If it needs
to use modern style, then I would be happy to include a patch that
modernises the surrounding code first.

Though why the hash__* functions are empty I'm not sure... If they were
not implemented I would have expected a BUILD_BUG(). If they are
unnecessary due to hash itself, it's odd that they exist (as you point
out for the new one).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize
  2022-11-03  0:39     ` Benjamin Gray
@ 2022-11-03  0:45       ` Andrew Donnellan
  2022-11-07  6:58       ` Benjamin Gray
  1 sibling, 0 replies; 26+ messages in thread
From: Andrew Donnellan @ 2022-11-03  0:45 UTC (permalink / raw)
  To: Benjamin Gray, Christophe Leroy, linuxppc-dev; +Cc: jniethe5, cmr, npiggin

On Thu, 2022-11-03 at 11:39 +1100, Benjamin Gray wrote:
> > By the way, the 'extern' keyword is pointless and deprecated for
> > function prototypes; please don't add new ones, even if other historical 
> > prototypes have one.
> 
> This and the above commented parts match the style of the surrounding
> implementations. For example,
> 
>         static inline void local_flush_tlb_mm(struct mm_struct *mm)
>         {
>                 if (radix_enabled())
>                         return radix__local_flush_tlb_mm(mm);
>                 return hash__local_flush_tlb_mm(mm);
>         }
> 
> I am not going to add code that is inconsistent with the surrounding
> code. That just causes confusion later down the line when readers
> wonder why this function is special compared to the others. If it
> needs
> to use modern style, then I would be happy to include a patch that
> modernises the surrounding code first.

This series would be a good opportunity to clean the rest of that file
up; either way, as Christophe says we should avoid adding new uses
given that people have started actively trying to clean up the use of
the extern keyword.

-- 
Andrew Donnellan    OzLabs, ADL Canberra
ajd@linux.ibm.com   IBM Australia Limited

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 6/7] powerpc/code-patching: Use temporary mm for Radix MMU
  2022-11-02 10:11   ` Christophe Leroy
@ 2022-11-03  3:10     ` Benjamin Gray
  2022-11-08  5:16       ` Benjamin Gray
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin Gray @ 2022-11-03  3:10 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: jniethe5, cmr, ajd, npiggin

On Wed, 2022-11-02 at 10:11 +0000, Christophe Leroy wrote:
> On 25/10/2022 at 06:44, Benjamin Gray wrote:
> > +static int text_area_cpu_up_mm(unsigned int cpu)
> > +{
> > +       struct mm_struct *mm;
> > +       unsigned long addr;
> > +       pgd_t *pgdp;
> > +       p4d_t *p4dp;
> > +       pud_t *pudp;
> > +       pmd_t *pmdp;
> > +       pte_t *ptep;
> > +
> > +       mm = copy_init_mm();
> > +       if (WARN_ON(!mm))
> > +               goto fail_no_mm;
> > +
> > +       /*
> > +        * Choose a random page-aligned address from the interval
> > +        * [PAGE_SIZE .. DEFAULT_MAP_WINDOW - PAGE_SIZE].
> > +        * The lower address bound is PAGE_SIZE to avoid the zero-
> > page.
> > +        */
> > +       addr = (1 + (get_random_long() % (DEFAULT_MAP_WINDOW /
> > PAGE_SIZE - 2))) << PAGE_SHIFT;
> 
> There is some work in progress to get rid of (get_random_long() % 
> something), see 
> https://patchwork.kernel.org/project/linux-media/cover/20221010230613.1076905-1-Jason@zx2c4.com/

I had read that series before sending. get_random_int is removed,
and the guidelines give instructions for fixed value sizes (u32, u64),
but get_random_long() doesn't have a corresponding
get_random_long_max().
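
One bias-free alternative that needs no upper-bounded helper is rejection
sampling against a power-of-two mask; sketched here purely as an
illustration, not as part of the series:

  	unsigned long npages = DEFAULT_MAP_WINDOW / PAGE_SIZE - 2;
  	unsigned long mask = roundup_pow_of_two(npages) - 1;
  	unsigned long page_idx;

  	/* retry until the masked value lands in [0, npages); expected < 2 iterations */
  	do {
  		page_idx = get_random_long() & mask;
  	} while (page_idx >= npages);

  	addr = (1 + page_idx) << PAGE_SHIFT;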

> > +
> > +       /*
> > +        * PTE allocation uses GFP_KERNEL which means we need to
> > +        * pre-allocate the PTE here because we cannot do the
> > +        * allocation during patching when IRQs are disabled.
> > +        */
> > +       pgdp = pgd_offset(mm, addr);
> > +
> > +       p4dp = p4d_alloc(mm, pgdp, addr);
> > +       if (WARN_ON(!p4dp))
> > +               goto fail_no_p4d;
> > +
> > +       pudp = pud_alloc(mm, p4dp, addr);
> > +       if (WARN_ON(!pudp))
> > +               goto fail_no_pud;
> > +
> > +       pmdp = pmd_alloc(mm, pudp, addr);
> > +       if (WARN_ON(!pmdp))
> > +               goto fail_no_pmd;
> > +
> > +       ptep = pte_alloc_map(mm, pmdp, addr);
> > +       if (WARN_ON(!ptep))
> > +               goto fail_no_pte;
> 
> Isn't there a standard generic function to do that?
> 
> For instance, __get_locked_pte() seems to do more or less the same.

__get_locked_pte invokes walk_to_pmd, which leaks memory if the
allocation fails. This may not necessarily be a concern at boot (though
I still don't like it), but startup is run every time a CPU comes
online, so the leak is theoretically unbounded.

There's no need to leak it in this context, because we know that each
page is exclusively used by the corresponding patching mm.

> > +
> > +       this_cpu_write(cpu_patching_mm, mm);
> > +       this_cpu_write(cpu_patching_addr, addr);
> > +       this_cpu_write(cpu_patching_pte, ptep);
> > +
> > +       return 0;
> > +
> > +fail_no_pte:
> > +       pmd_free(mm, pmdp);
> > +       mm_dec_nr_pmds(mm);
> > +fail_no_pmd:
> > +       pud_free(mm, pudp);
> > +       mm_dec_nr_puds(mm);
> > +fail_no_pud:
> > +       p4d_free(mm, p4dp);
> > +fail_no_p4d:
> > +       mmput(mm);
> > +fail_no_mm:
> > +       return -ENOMEM;
> > +}
> > +
> > +static int text_area_cpu_down_mm(unsigned int cpu)
> > +{
> > +       struct mm_struct *mm;
> > +       unsigned long addr;
> > +       pte_t *ptep;
> > +       pmd_t *pmdp;
> > +       pud_t *pudp;
> > +       p4d_t *p4dp;
> > +       pgd_t *pgdp;
> > +
> > +       mm = this_cpu_read(cpu_patching_mm);
> > +       addr = this_cpu_read(cpu_patching_addr);
> > +
> > +       pgdp = pgd_offset(mm, addr);
> > +       p4dp = p4d_offset(pgdp, addr);
> > +       pudp = pud_offset(p4dp, addr);
> > +       pmdp = pmd_offset(pudp, addr);
> > +       ptep = pte_offset_map(pmdp, addr);
> > +
> > +       pte_free(mm, ptep);
> > +       pmd_free(mm, pmdp);
> > +       pud_free(mm, pudp);
> > +       p4d_free(mm, p4dp);
> > +       /* pgd is dropped in mmput */
> > +
> > +       mm_dec_nr_ptes(mm);
> > +       mm_dec_nr_pmds(mm);
> > +       mm_dec_nr_puds(mm);
> 
> Same question: can't something generic be used, something like 
> free_pgd_range()?

Possibly, but I don't have a struct mmu_gather to give it. If there's a
way to make one temporarily then it might work.

> > +static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
> > +{
> > +       int err;
> > +       u32 *patch_addr;
> > +       unsigned long text_poke_addr;
> > +       pte_t *pte;
> > +       unsigned long pfn = get_patch_pfn(addr);
> > +       struct mm_struct *patching_mm;
> > +       struct mm_struct *orig_mm;
> > +
> > +       patching_mm = __this_cpu_read(cpu_patching_mm);
> > +       pte = __this_cpu_read(cpu_patching_pte);
> > +       text_poke_addr = __this_cpu_read(cpu_patching_addr);
> > +       patch_addr = (u32 *)(text_poke_addr +
> > offset_in_page(addr));
> > +
> > +       if (unlikely(!patching_mm))
> > +               return -ENOMEM;
> > +
> > +       set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn,
> > PAGE_KERNEL));
> > +
> > +       /* order PTE update before use, also serves as the hwsync
> > */
> > +       asm volatile("ptesync": : :"memory");
> 
> You assume it is radix only ?

I enforce it in mm_patch_enabled().

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize
  2022-11-03  0:39     ` Benjamin Gray
  2022-11-03  0:45       ` Andrew Donnellan
@ 2022-11-07  6:58       ` Benjamin Gray
  2022-11-07 12:28         ` Nicholas Piggin
  1 sibling, 1 reply; 26+ messages in thread
From: Benjamin Gray @ 2022-11-07  6:58 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: jniethe5, npiggin, ajd, cmr

On Thu, 2022-11-03 at 11:39 +1100, Benjamin Gray wrote:
> On Wed, 2022-11-02 at 09:56 +0000, Christophe Leroy wrote:
> > By the way, the 'extern' keyword is pointless and deprecated for
> > function prototypes; please don't add new ones, even if other historical 
> > prototypes have one.
> 
> This and the above commented parts match the style of the surrounding
> implementations. For example,
> 
>         static inline void local_flush_tlb_mm(struct mm_struct *mm)
>         {
>                 if (radix_enabled())
>                         return radix__local_flush_tlb_mm(mm);
>                 return hash__local_flush_tlb_mm(mm);
>         }
> 
> I am not going to add code that is inconsistent with the surrounding
> code. That just causes confusion later down the line when readers
> wonder why this function is special compared to the others. If it
> needs
> to use modern style, then I would be happy to include a patch that
> modernises the surrounding code first.
> 
> Though why the hash__* functions are empty I'm not sure... If they
> were
> not implemented I would have expected a BUILD_BUG(). If they are
> unnecessary due to hash itself, it's odd that they exist (as you
> point
> out for the new one).

From what I can see in the history, the empty hash functions were
originally introduced as 64-bit compatibility definitions for the hash
MMU (which I guess doesn't require any action). These empty functions
were shuffled around, then eventually prefixed with hash__* to make way
for the radix variants, which are hidden behind a generic
'local_flush_tlb_mm' etc. implementation as we see today. So basically,
the empty hash__* functions no longer (never, really) served a purpose
once the generic wrapper was added. I'll add a patch to delete them and
clean up the return voids.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize
  2022-11-07  6:58       ` Benjamin Gray
@ 2022-11-07 12:28         ` Nicholas Piggin
  0 siblings, 0 replies; 26+ messages in thread
From: Nicholas Piggin @ 2022-11-07 12:28 UTC (permalink / raw)
  To: Benjamin Gray, Christophe Leroy, linuxppc-dev; +Cc: jniethe5, ajd, cmr

On Mon Nov 7, 2022 at 4:58 PM AEST, Benjamin Gray wrote:
> On Thu, 2022-11-03 at 11:39 +1100, Benjamin Gray wrote:
> > On Wed, 2022-11-02 at 09:56 +0000, Christophe Leroy wrote:
> > > By the way, the 'extern' keyword is pointless and deprecated for
> > > function prototypes; please don't add new ones, even if other historical 
> > > prototypes have one.
> > 
> > This and the above commented parts match the style of the surrounding
> > implementations. For example,
> > 
> >         static inline void local_flush_tlb_mm(struct mm_struct *mm)
> >         {
> >                 if (radix_enabled())
> >                         return radix__local_flush_tlb_mm(mm);
> >                 return hash__local_flush_tlb_mm(mm);
> >         }
> > 
> > I am not going to add code that is inconsistent with the surrounding
> > code. That just causes confusion later down the line when readers
> > wonder why this function is special compared to the others. If it
> > needs
> > to use modern style, then I would be happy to include a patch that
> > modernises the surrounding code first.
> > 
> > Though why the hash__* functions are empty I'm not sure... If they
> > were
> > not implemented I would have expected a BUILD_BUG(). If they are
> > unnecessary due to hash itself, it's odd that they exist (as you
> > point
> > out for the new one).
>
> From what I can see in the history, the empty hash functions were
> originally introduced as 64-bit compatibility definitions for the hash
> MMU (which I guess doesn't require any action).

Yeah the hash MMU does hash PTE update and TLB flushing in the Linux pte
update APIs (which end up at hash__pte_update()). By the time Linux
calls flush_tlb_xxx(), the powerpc code has already done the necessary
TLB flushing.

> These empty functions
> were shuffled around, then eventually prefixed with hash__* to make way
> for the radix variants, which are hidden behind a generic
> 'local_flush_tlb_mm' etc. implementation as we see today. So basically,
> the empty hash__* functions no longer (never, really) served a purpose
> once the generic wrapper was added. I'll add a patch to delete them and
> clean up the return voids.

Yeah I think you got it - the functions had to be there pre-radix
because they were required by core code, and when radix was added
it probably just followed a template. Removing empty hash__ functions
should be fine I think.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 6/7] powerpc/code-patching: Use temporary mm for Radix MMU
  2022-11-03  3:10     ` Benjamin Gray
@ 2022-11-08  5:16       ` Benjamin Gray
  0 siblings, 0 replies; 26+ messages in thread
From: Benjamin Gray @ 2022-11-08  5:16 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: jniethe5, npiggin, ajd, cmr

On Thu, 2022-11-03 at 14:10 +1100, Benjamin Gray wrote:
> On Wed, 2022-11-02 at 10:11 +0000, Christophe Leroy wrote:
> > On 25/10/2022 at 06:44, Benjamin Gray wrote:
> > > +       /*
> > > +        * PTE allocation uses GFP_KERNEL which means we need to
> > > +        * pre-allocate the PTE here because we cannot do the
> > > +        * allocation during patching when IRQs are disabled.
> > > +        */
> > > +       pgdp = pgd_offset(mm, addr);
> > > +
> > > +       p4dp = p4d_alloc(mm, pgdp, addr);
> > > +       if (WARN_ON(!p4dp))
> > > +               goto fail_no_p4d;
> > > +
> > > +       pudp = pud_alloc(mm, p4dp, addr);
> > > +       if (WARN_ON(!pudp))
> > > +               goto fail_no_pud;
> > > +
> > > +       pmdp = pmd_alloc(mm, pudp, addr);
> > > +       if (WARN_ON(!pmdp))
> > > +               goto fail_no_pmd;
> > > +
> > > +       ptep = pte_alloc_map(mm, pmdp, addr);
> > > +       if (WARN_ON(!ptep))
> > > +               goto fail_no_pte;
> > 
> > Isn't there a standard generic function to do that?
> > 
> > For instance, __get_locked_pte() seems to do more or less the same.
> 
> __get_locked_pte invokes walk_to_pmd, which leaks memory if the
> allocation fails. This may not be a concern necessarily at boot
> (though
> I still don't like it), but startup is run every time a CPU comes
> online, so the leak is theoretically unbounded.
> 
> There's no need to leak it in this context, because we know that each
> page is exclusively used by the corresponding patching mm.

I found tlb_gather_mmu() to initialise a struct mmu_gather, so I've
removed all the open coding (it should free any partial page tables if
get_locked_pte fails). I'm currently running it through CI before posting,
and will probably get the v10 out tomorrow.
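
For reference, a rough sketch of what the teardown side might look like with
an mmu_gather (an assumption about the v10 shape, not the code that was
posted):

  	struct mmu_gather tlb;

  	tlb_gather_mmu(&tlb, mm);
  	/* free every page-table level backing the single patching page */
  	free_pgd_range(&tlb, addr, addr + PAGE_SIZE, 0, 0);
  	tlb_finish_mmu(&tlb);
  	mmput(mm);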

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2022-11-08  5:17 UTC | newest]

Thread overview: 26+ messages
2022-10-25  4:44 [PATCH v9 0/7] Use per-CPU temporary mappings for patching on Radix MMU Benjamin Gray
2022-10-25  4:44 ` [PATCH v9 1/7] powerpc: Allow clearing and restoring registers independent of saved breakpoint state Benjamin Gray
2022-10-25  4:44 ` [PATCH v9 2/7] powerpc/code-patching: Handle RWX patching initialisation error Benjamin Gray
2022-11-02  9:36   ` Christophe Leroy
2022-11-02 22:37     ` Benjamin Gray
2022-10-25  4:44 ` [PATCH v9 3/7] powerpc/code-patching: Use WARN_ON and fix check in poking_init Benjamin Gray
2022-11-02  9:38   ` Christophe Leroy
2022-11-02 22:42     ` Benjamin Gray
2022-10-25  4:44 ` [PATCH v9 4/7] powerpc/code-patching: Verify instruction patch succeeded Benjamin Gray
2022-10-26  0:47   ` Benjamin Gray
2022-11-02  9:43   ` Christophe Leroy
2022-11-02 10:13     ` Christophe Leroy
2022-11-02 23:02       ` Benjamin Gray
2022-11-02 22:58     ` Benjamin Gray
2022-10-25  4:44 ` [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize Benjamin Gray
2022-11-02  9:56   ` Christophe Leroy
2022-11-03  0:39     ` Benjamin Gray
2022-11-03  0:45       ` Andrew Donnellan
2022-11-07  6:58       ` Benjamin Gray
2022-11-07 12:28         ` Nicholas Piggin
2022-10-25  4:44 ` [PATCH v9 6/7] powerpc/code-patching: Use temporary mm for Radix MMU Benjamin Gray
2022-11-02 10:11   ` Christophe Leroy
2022-11-03  3:10     ` Benjamin Gray
2022-11-08  5:16       ` Benjamin Gray
2022-10-25  4:44 ` [PATCH v9 7/7] powerpc/code-patching: Consolidate and cache per-cpu patching context Benjamin Gray
2022-11-02 10:17   ` Christophe Leroy
