* [PATCH 0/4] arm64: Support dynamic preemption
@ 2021-09-20 23:32 Frederic Weisbecker
  2021-09-20 23:32 ` [PATCH 1/4] sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY " Frederic Weisbecker
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2021-09-20 23:32 UTC (permalink / raw)
  To: Peter Zijlstra, Catalin Marinas, Will Deacon
  Cc: LKML, Frederic Weisbecker, Ard Biesheuvel, James Morse,
	Quentin Perret, Mark Rutland

Traditionally the preemption flavour was defined at Kconfig time and then
fixed in stone. Now with CONFIG_PREEMPT_DYNAMIC, users can override that
at boot with the "preempt=" option (and also through debugfs, but
that's a secret).
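
For illustration, the override looks roughly like this (exact debugfs path
may vary by kernel version):

	preempt=none | preempt=voluntary | preempt=full    # on the kernel command line
	/sys/kernel/debug/sched/preempt                    # the (not so secret) debugfs knob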

Linux distros can be particularly fond of this because it allows them
to rely on a single kernel image for all preemption flavours.

x86 has been the only supported architecture so far, but interest is
broader.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	preempt/arm

HEAD: 351eaa68b5304b8b0e7c6e7b4470dd917475e65e

Thanks,
	Frederic
---

Frederic Weisbecker (3):
      sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY dynamic preemption
      arm64: Implement IRQ exit preemption static call for dynamic preemption
      arm64: Implement HAVE_PREEMPT_DYNAMIC

Ard Biesheuvel (1):
      arm64: implement support for static call trampolines


 arch/Kconfig                         |  1 -
 arch/arm64/Kconfig                   |  2 ++
 arch/arm64/include/asm/insn.h        |  2 ++
 arch/arm64/include/asm/preempt.h     | 23 ++++++++++++++++++++++-
 arch/arm64/include/asm/static_call.h | 28 ++++++++++++++++++++++++++++
 arch/arm64/kernel/Makefile           |  4 ++--
 arch/arm64/kernel/entry-common.c     | 15 ++++++++++++---
 arch/arm64/kernel/patching.c         | 14 +++++++++++---
 arch/arm64/kernel/vmlinux.lds.S      |  1 +
 include/linux/entry-common.h         |  3 ++-
 kernel/sched/core.c                  |  6 ++++--
 11 files changed, 86 insertions(+), 13 deletions(-)


* [PATCH 1/4] sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY dynamic preemption
  2021-09-20 23:32 [PATCH 0/4] arm64: Support dynamic preemption Frederic Weisbecker
@ 2021-09-20 23:32 ` Frederic Weisbecker
  2021-09-21  7:10   ` Peter Zijlstra
  2021-09-20 23:32 ` [PATCH 2/4] arm64: implement support for static call trampolines Frederic Weisbecker
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2021-09-20 23:32 UTC (permalink / raw)
  To: Peter Zijlstra, Catalin Marinas, Will Deacon
  Cc: LKML, Frederic Weisbecker, Ard Biesheuvel, James Morse,
	Quentin Perret, Mark Rutland

There is no need to force dynamic preemption to depend on the generic
entry code. The latter is convenient but not mandatory. An architecture
that doesn't support it just needs to provide a static call on its
kernel IRQ exit preemption path.

Prepare the preempt dynamic code to handle that.
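
Concretely, an architecture taking this route is expected to provide
something along these lines (a condensed sketch of what the arm64 patches
later in this series do; see patch 3 for the real thing):

	/* asm/preempt.h */
	void arm64_preempt_schedule_irq(void);
	#define __irqentry_exit_cond_resched_func arm64_preempt_schedule_irq
	DECLARE_STATIC_CALL(irqentry_exit_cond_resched, __irqentry_exit_cond_resched_func);

	/* arch kernel IRQ exit path */
	DEFINE_STATIC_CALL(irqentry_exit_cond_resched, arm64_preempt_schedule_irq);
	...
	static_call(irqentry_exit_cond_resched)();	/* instead of a direct call */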

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
---
 arch/Kconfig                 | 1 -
 include/linux/entry-common.h | 3 ++-
 kernel/sched/core.c          | 6 ++++--
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8df1c7102643..9af493999d43 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1255,7 +1255,6 @@ config HAVE_STATIC_CALL_INLINE
 config HAVE_PREEMPT_DYNAMIC
 	bool
 	depends on HAVE_STATIC_CALL
-	depends on GENERIC_ENTRY
 	help
 	   Select this if the architecture support boot time preempt setting
 	   on top of static calls. It is strongly advised to support inline
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2e2b8d6140ed..81166bbc0f22 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -456,7 +456,8 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
  */
 void irqentry_exit_cond_resched(void);
 #ifdef CONFIG_PREEMPT_DYNAMIC
-DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+#define __irqentry_exit_cond_resched_func irqentry_exit_cond_resched
+DECLARE_STATIC_CALL(irqentry_exit_cond_resched, __irqentry_exit_cond_resched_func);
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af2ca7ea7dda..51c81da33f23 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6553,7 +6553,9 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
 
+#ifdef CONFIG_GENERIC_ENTRY
 #include <linux/entry-common.h>
+#endif
 
 /*
  * SC:cond_resched
@@ -6618,7 +6620,7 @@ void sched_dynamic_update(int mode)
 	static_call_update(might_resched, __cond_resched);
 	static_call_update(preempt_schedule, __preempt_schedule_func);
 	static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-	static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+	static_call_update(irqentry_exit_cond_resched, __irqentry_exit_cond_resched_func);
 
 	switch (mode) {
 	case preempt_dynamic_none:
@@ -6644,7 +6646,7 @@ void sched_dynamic_update(int mode)
 		static_call_update(might_resched, (void *)&__static_call_return0);
 		static_call_update(preempt_schedule, __preempt_schedule_func);
 		static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-		static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+		static_call_update(irqentry_exit_cond_resched, __irqentry_exit_cond_resched_func);
 		pr_info("Dynamic Preempt: full\n");
 		break;
 	}
-- 
2.25.1



* [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-20 23:32 [PATCH 0/4] arm64: Support dynamic preemption Frederic Weisbecker
  2021-09-20 23:32 ` [PATCH 1/4] sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY " Frederic Weisbecker
@ 2021-09-20 23:32 ` Frederic Weisbecker
  2021-09-21  7:09   ` Peter Zijlstra
  2021-09-21 16:10   ` Ard Biesheuvel
  2021-09-20 23:32 ` [PATCH 3/4] arm64: Implement IRQ exit preemption static call for dynamic preemption Frederic Weisbecker
  2021-09-20 23:32 ` [PATCH 4/4] arm64: Implement HAVE_PREEMPT_DYNAMIC Frederic Weisbecker
  3 siblings, 2 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2021-09-20 23:32 UTC (permalink / raw)
  To: Peter Zijlstra, Catalin Marinas, Will Deacon
  Cc: LKML, Ard Biesheuvel, James Morse, Frederic Weisbecker,
	Quentin Perret, Mark Rutland

From: Ard Biesheuvel <ardb@kernel.org>

[fweisbec: rebased against 5.15-rc2. There have been quite a few changes
 on arm64 since then, especially around insn/patching, so some naming may
 no longer be relevant]

Implement arm64 support for the 'unoptimized' static call variety, which
routes all calls through a single trampoline that is patched to perform a
tail call to the selected function.

Since static call targets may be located in modules loaded out of direct
branching range, we need to use an ADRP/LDR pair to load the branch target
into x16 and use a branch-to-register (BR) instruction to perform an
indirect call. Unlike on x86, there is no pressing need on arm64 to avoid
indirect calls at all cost, but hiding it from the compiler as is done
here does have some benefits:
- the literal is located in .rodata, which gives us the same robustness
  advantage that code patching does;
- no performance hit on CFI enabled Clang builds that decorate compiler
  emitted indirect calls with branch target validity checks.
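
For context, these trampolines back the generic static call API, which is
used roughly as follows (generic kernel API; "my_call", "f" and "g" are
made-up names for illustration):

	/* f and g are hypothetical functions with signature void (*)(void) */
	DEFINE_STATIC_CALL(my_call, f);		/* emits the trampoline, initially targeting f */

	static_call(my_call)();			/* call through the trampoline */
	static_call_update(my_call, g);		/* retarget the trampoline at runtime */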

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/insn.h        |  2 ++
 arch/arm64/include/asm/static_call.h | 28 ++++++++++++++++++++++++++++
 arch/arm64/kernel/Makefile           |  4 ++--
 arch/arm64/kernel/patching.c         | 14 +++++++++++---
 arch/arm64/kernel/vmlinux.lds.S      |  1 +
 6 files changed, 45 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/static_call.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5c7ae4c3954b..5b51b359ccda 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -192,6 +192,7 @@ config ARM64
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_STATIC_CALL
 	select HAVE_FUNCTION_ARG_ACCESS_API
 	select HAVE_FUTEX_CMPXCHG if FUTEX
 	select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 6b776c8667b2..681c08b170df 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -547,6 +547,8 @@ u32 aarch64_set_branch_offset(u32 insn, s32 offset);
 s32 aarch64_insn_adrp_get_offset(u32 insn);
 u32 aarch64_insn_adrp_set_offset(u32 insn, s32 offset);
 
+int aarch64_literal_write(void *addr, u64 literal);
+
 bool aarch32_insn_is_wide(u32 insn);
 
 #define A32_RN_OFFSET	16
diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
new file mode 100644
index 000000000000..665ec2a7cdb2
--- /dev/null
+++ b/arch/arm64/include/asm/static_call.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_STATIC_CALL_H
+#define _ASM_STATIC_CALL_H
+
+#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target)			    \
+	asm("	.pushsection	.static_call.text, \"ax\"		\n" \
+	    "	.align		3					\n" \
+	    "	.globl		" STATIC_CALL_TRAMP_STR(name) "		\n" \
+	    STATIC_CALL_TRAMP_STR(name) ":				\n" \
+	    "	hint 	34	/* BTI C */				\n" \
+	    "	adrp	x16, 1f						\n" \
+	    "	ldr	x16, [x16, :lo12:1f]				\n" \
+	    "	cbz	x16, 0f						\n" \
+	    "	br	x16						\n" \
+	    "0:	ret							\n" \
+	    "	.popsection						\n" \
+	    "	.pushsection	.rodata, \"a\"				\n" \
+	    "	.align		3					\n" \
+	    "1:	.quad		" target "				\n" \
+	    "	.popsection						\n")
+
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)			\
+	__ARCH_DEFINE_STATIC_CALL_TRAMP(name, #func)
+
+#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)			\
+	__ARCH_DEFINE_STATIC_CALL_TRAMP(name, "0x0")
+
+#endif /* _ASM_STATIC_CALL_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 3f1490bfb938..83f03fc1e402 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -28,8 +28,8 @@ obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
 			   return_address.o cpuinfo.o cpu_errata.o		\
 			   cpufeature.o alternative.o cacheinfo.o		\
 			   smp.o smp_spin_table.o topology.o smccc-call.o	\
-			   syscall.o proton-pack.o idreg-override.o idle.o	\
-			   patching.o
+			   syscall.o proton-pack.o static_call.o		\
+			   idreg-override.o idle.o patching.o
 
 targets			+= efi-entry.o
 
diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index 771f543464e0..841c0499eca5 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -66,7 +66,7 @@ int __kprobes aarch64_insn_read(void *addr, u32 *insnp)
 	return ret;
 }
 
-static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
+static int __kprobes __aarch64_insn_write(void *addr, void *insn, int size)
 {
 	void *waddr = addr;
 	unsigned long flags = 0;
@@ -75,7 +75,7 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
 	raw_spin_lock_irqsave(&patch_lock, flags);
 	waddr = patch_map(addr, FIX_TEXT_POKE0);
 
-	ret = copy_to_kernel_nofault(waddr, &insn, AARCH64_INSN_SIZE);
+	ret = copy_to_kernel_nofault(waddr, insn, size);
 
 	patch_unmap(FIX_TEXT_POKE0);
 	raw_spin_unlock_irqrestore(&patch_lock, flags);
@@ -85,7 +85,15 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
 
 int __kprobes aarch64_insn_write(void *addr, u32 insn)
 {
-	return __aarch64_insn_write(addr, cpu_to_le32(insn));
+	__le32 i = cpu_to_le32(insn);
+
+	return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
+}
+
+int aarch64_literal_write(void *addr, u64 literal)
+{
+	BUG_ON(!IS_ALIGNED((u64)addr, sizeof(u64)));
+	return __aarch64_insn_write(addr, &literal, sizeof(u64));
 }
 
 int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index f6b1a88245db..ceb35c35192c 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -161,6 +161,7 @@ SECTIONS
 			IDMAP_TEXT
 			HIBERNATE_TEXT
 			TRAMP_TEXT
+			STATIC_CALL_TEXT
 			*(.fixup)
 			*(.gnu.warning)
 		. = ALIGN(16);
-- 
2.25.1



* [PATCH 3/4] arm64: Implement IRQ exit preemption static call for dynamic preemption
  2021-09-20 23:32 [PATCH 0/4] arm64: Support dynamic preemption Frederic Weisbecker
  2021-09-20 23:32 ` [PATCH 1/4] sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY " Frederic Weisbecker
  2021-09-20 23:32 ` [PATCH 2/4] arm64: implement support for static call trampolines Frederic Weisbecker
@ 2021-09-20 23:32 ` Frederic Weisbecker
  2021-09-20 23:32 ` [PATCH 4/4] arm64: Implement HAVE_PREEMPT_DYNAMIC Frederic Weisbecker
  3 siblings, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2021-09-20 23:32 UTC (permalink / raw)
  To: Peter Zijlstra, Catalin Marinas, Will Deacon
  Cc: LKML, Frederic Weisbecker, Ard Biesheuvel, James Morse,
	Quentin Perret, Mark Rutland

arm64 doesn't support generic entry yet, so the architecture's own IRQ
exit preemption path needs to be exposed through the relevant static
call.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/preempt.h |  7 +++++++
 arch/arm64/kernel/entry-common.c | 15 ++++++++++++---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/preempt.h b/arch/arm64/include/asm/preempt.h
index e83f0982b99c..4fbbe644532f 100644
--- a/arch/arm64/include/asm/preempt.h
+++ b/arch/arm64/include/asm/preempt.h
@@ -3,6 +3,7 @@
 #define __ASM_PREEMPT_H
 
 #include <linux/thread_info.h>
+#include <linux/static_call_types.h>
 
 #define PREEMPT_NEED_RESCHED	BIT(32)
 #define PREEMPT_ENABLED	(PREEMPT_NEED_RESCHED)
@@ -86,4 +87,10 @@ void preempt_schedule_notrace(void);
 #define __preempt_schedule_notrace() preempt_schedule_notrace()
 #endif /* CONFIG_PREEMPTION */
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+void arm64_preempt_schedule_irq(void);
+#define __irqentry_exit_cond_resched_func arm64_preempt_schedule_irq
+DECLARE_STATIC_CALL(irqentry_exit_cond_resched, __irqentry_exit_cond_resched_func);
+#endif /* CONFIG_PREEMPT_DYNAMIC */
+
 #endif /* __ASM_PREEMPT_H */
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 32f9796c4ffe..f1c739dd874d 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -12,6 +12,7 @@
 #include <linux/sched.h>
 #include <linux/sched/debug.h>
 #include <linux/thread_info.h>
+#include <linux/static_call.h>
 
 #include <asm/cpufeature.h>
 #include <asm/daifflags.h>
@@ -235,7 +236,7 @@ static void noinstr exit_el1_irq_or_nmi(struct pt_regs *regs)
 		exit_to_kernel_mode(regs);
 }
 
-static void __sched arm64_preempt_schedule_irq(void)
+void __sched arm64_preempt_schedule_irq(void)
 {
 	lockdep_assert_irqs_disabled();
 
@@ -259,6 +260,9 @@ static void __sched arm64_preempt_schedule_irq(void)
 	if (system_capabilities_finalized())
 		preempt_schedule_irq();
 }
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_CALL(irqentry_exit_cond_resched, arm64_preempt_schedule_irq);
+#endif
 
 static void do_interrupt_handler(struct pt_regs *regs,
 				 void (*handler)(struct pt_regs *))
@@ -446,8 +450,13 @@ static void noinstr el1_interrupt(struct pt_regs *regs,
 	 * preempt_count().
 	 */
 	if (IS_ENABLED(CONFIG_PREEMPTION) &&
-	    READ_ONCE(current_thread_info()->preempt_count) == 0)
-		arm64_preempt_schedule_irq();
+	    READ_ONCE(current_thread_info()->preempt_count) == 0) {
+#ifdef CONFIG_PREEMPT_DYNAMIC
+			static_call(irqentry_exit_cond_resched)();
+#else
+			arm64_preempt_schedule_irq();
+#endif
+	}
 
 	exit_el1_irq_or_nmi(regs);
 }
-- 
2.25.1



* [PATCH 4/4] arm64: Implement HAVE_PREEMPT_DYNAMIC
  2021-09-20 23:32 [PATCH 0/4] arm64: Support dynamic preemption Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2021-09-20 23:32 ` [PATCH 3/4] arm64: Implement IRQ exit preemption static call for dynamic preemption Frederic Weisbecker
@ 2021-09-20 23:32 ` Frederic Weisbecker
  3 siblings, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2021-09-20 23:32 UTC (permalink / raw)
  To: Peter Zijlstra, Catalin Marinas, Will Deacon
  Cc: LKML, Frederic Weisbecker, Ard Biesheuvel, James Morse,
	Quentin Perret, Mark Rutland

Provide the static calls for the common preemption points and report
arm64's ability to support dynamic preemption.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/Kconfig               |  1 +
 arch/arm64/include/asm/preempt.h | 20 +++++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5b51b359ccda..e28bcca8954c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -191,6 +191,7 @@ config ARM64
 	select HAVE_PERF_EVENTS
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
+	select HAVE_PREEMPT_DYNAMIC
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_STATIC_CALL
 	select HAVE_FUNCTION_ARG_ACCESS_API
diff --git a/arch/arm64/include/asm/preempt.h b/arch/arm64/include/asm/preempt.h
index 4fbbe644532f..69d1cc491d3b 100644
--- a/arch/arm64/include/asm/preempt.h
+++ b/arch/arm64/include/asm/preempt.h
@@ -82,15 +82,29 @@ static inline bool should_resched(int preempt_offset)
 
 #ifdef CONFIG_PREEMPTION
 void preempt_schedule(void);
-#define __preempt_schedule() preempt_schedule()
 void preempt_schedule_notrace(void);
-#define __preempt_schedule_notrace() preempt_schedule_notrace()
-#endif /* CONFIG_PREEMPTION */
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
+
+#define __preempt_schedule_func preempt_schedule
+DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
+#define __preempt_schedule() static_call(preempt_schedule)()
+
+#define __preempt_schedule_notrace_func preempt_schedule_notrace
+DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+#define __preempt_schedule_notrace() static_call(preempt_schedule_notrace)()
+
 void arm64_preempt_schedule_irq(void);
 #define __irqentry_exit_cond_resched_func arm64_preempt_schedule_irq
 DECLARE_STATIC_CALL(irqentry_exit_cond_resched, __irqentry_exit_cond_resched_func);
+
+#else /* !CONFIG_PREEMPT_DYNAMIC */
+
+#define __preempt_schedule() preempt_schedule()
+#define __preempt_schedule_notrace() preempt_schedule_notrace()
+
 #endif /* CONFIG_PREEMPT_DYNAMIC */
 
+#endif /* CONFIG_PREEMPTION */
+
 #endif /* __ASM_PREEMPT_H */
-- 
2.25.1



* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-20 23:32 ` [PATCH 2/4] arm64: implement support for static call trampolines Frederic Weisbecker
@ 2021-09-21  7:09   ` Peter Zijlstra
  2021-09-21 14:44     ` Ard Biesheuvel
  2021-09-21 16:10   ` Ard Biesheuvel
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2021-09-21  7:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Catalin Marinas, Will Deacon, LKML, Ard Biesheuvel, James Morse,
	Quentin Perret, Mark Rutland, christophe.leroy

On Tue, Sep 21, 2021 at 01:32:35AM +0200, Frederic Weisbecker wrote:

> +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target)			    \
> +	asm("	.pushsection	.static_call.text, \"ax\"		\n" \
> +	    "	.align		3					\n" \
> +	    "	.globl		" STATIC_CALL_TRAMP_STR(name) "		\n" \
> +	    STATIC_CALL_TRAMP_STR(name) ":				\n" \
> +	    "	hint 	34	/* BTI C */				\n" \
> +	    "	adrp	x16, 1f						\n" \
> +	    "	ldr	x16, [x16, :lo12:1f]				\n" \
> +	    "	cbz	x16, 0f						\n" \
> +	    "	br	x16						\n" \
> +	    "0:	ret							\n" \
> +	    "	.popsection						\n" \
> +	    "	.pushsection	.rodata, \"a\"				\n" \
> +	    "	.align		3					\n" \
> +	    "1:	.quad		" target "				\n" \
> +	    "	.popsection						\n")

So I like what Christophe did for PPC32:

  https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu

Where he starts with an unconditional jmp and uses that IFF the offset
fits and only does the data load when it doesn't. Ard, woulnd't that
also make sense on ARM64? I'm thinking most in-kernel function pointers
would actually fit, it's just the module muck that gets to have too
large pointers, no?

> +#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)			\
> +	__ARCH_DEFINE_STATIC_CALL_TRAMP(name, #func)
> +
> +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)			\
> +	__ARCH_DEFINE_STATIC_CALL_TRAMP(name, "0x0")


* Re: [PATCH 1/4] sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY dynamic preemption
  2021-09-20 23:32 ` [PATCH 1/4] sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY " Frederic Weisbecker
@ 2021-09-21  7:10   ` Peter Zijlstra
  2021-09-21 13:50     ` Mark Rutland
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2021-09-21  7:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Catalin Marinas, Will Deacon, LKML, Ard Biesheuvel, James Morse,
	Quentin Perret, Mark Rutland

On Tue, Sep 21, 2021 at 01:32:34AM +0200, Frederic Weisbecker wrote:
> There is no need to force dynamic preemption to depend on the generic
> entry code. The latter is convenient but not mandatory. An architecture
> that doesn't support it just need to provide a static call on its
> kernel IRQ exit preemption path.

True; but at the same time ARM64 is also moving to generic entry. Mark?


* Re: [PATCH 1/4] sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY dynamic preemption
  2021-09-21  7:10   ` Peter Zijlstra
@ 2021-09-21 13:50     ` Mark Rutland
  0 siblings, 0 replies; 32+ messages in thread
From: Mark Rutland @ 2021-09-21 13:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Catalin Marinas, Will Deacon, LKML,
	Ard Biesheuvel, James Morse, Quentin Perret

On Tue, Sep 21, 2021 at 09:10:10AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 21, 2021 at 01:32:34AM +0200, Frederic Weisbecker wrote:
> > There is no need to force dynamic preemption to depend on the generic
> > entry code. The latter is convenient but not mandatory. An architecture
> > that doesn't support it just need to provide a static call on its
> > kernel IRQ exit preemption path.
> 
> True; but at the same time ARM64 is also moving to generic entry. Mark?

That's the aspiration, but it's going to take a while to rework the
arm64 and common code. So far I've just been focusing on the groundwork
of moving stuff out of asm so that we can see the wood for the trees.

Generally my preference would be to move things over in stages, to avoid
a flag day where there's the potential for many things to break
simultaneously. So if this is relatively self-contained, I think it
may be worthwhile to do on its own, but I don't have very strong feelings
on that.

Thanks,
Mark.


* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-21  7:09   ` Peter Zijlstra
@ 2021-09-21 14:44     ` Ard Biesheuvel
  2021-09-21 15:08       ` Peter Zijlstra
  2021-09-21 15:33       ` Mark Rutland
  0 siblings, 2 replies; 32+ messages in thread
From: Ard Biesheuvel @ 2021-09-21 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Catalin Marinas, Will Deacon, LKML,
	James Morse, Quentin Perret, Mark Rutland, Christophe Leroy

On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Sep 21, 2021 at 01:32:35AM +0200, Frederic Weisbecker wrote:
>
> > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target)                            \
> > +     asm("   .pushsection    .static_call.text, \"ax\"               \n" \
> > +         "   .align          3                                       \n" \
> > +         "   .globl          " STATIC_CALL_TRAMP_STR(name) "         \n" \
> > +         STATIC_CALL_TRAMP_STR(name) ":                              \n" \
> > +         "   hint    34      /* BTI C */                             \n" \
> > +         "   adrp    x16, 1f                                         \n" \
> > +         "   ldr     x16, [x16, :lo12:1f]                            \n" \
> > +         "   cbz     x16, 0f                                         \n" \
> > +         "   br      x16                                             \n" \
> > +         "0: ret                                                     \n" \
> > +         "   .popsection                                             \n" \
> > +         "   .pushsection    .rodata, \"a\"                          \n" \
> > +         "   .align          3                                       \n" \
> > +         "1: .quad           " target "                              \n" \
> > +         "   .popsection                                             \n")
>
> So I like what Christophe did for PPC32:
>
>   https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
>
> Where he starts with an unconditional jmp and uses that IFF the offset
> fits and only does the data load when it doesn't. Ard, woulnd't that
> also make sense on ARM64? I'm thinking most in-kernel function pointers
> would actually fit, it's just the module muck that gets to have too
> large pointers, no?
>

Yeah, I'd have to page that back in. But it seems like the following

  bti c
  <branch>
  adrp x16, <literal>
  ldr x16, [x16, ...]
  br x16

with <branch> either set to 'b target' for the near targets, 'ret' for
the NULL target, and 'nop' for the far targets should work, and the
architecture permits patching branches into NOPs and vice versa
without special synchronization. But I must be missing something here,
or why did we have that long discussion before?


* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-21 14:44     ` Ard Biesheuvel
@ 2021-09-21 15:08       ` Peter Zijlstra
  2021-09-21 15:33       ` Mark Rutland
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2021-09-21 15:08 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Frederic Weisbecker, Catalin Marinas, Will Deacon, LKML,
	James Morse, Quentin Perret, Mark Rutland, Christophe Leroy

On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Sep 21, 2021 at 01:32:35AM +0200, Frederic Weisbecker wrote:
> >
> > > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target)                            \
> > > +     asm("   .pushsection    .static_call.text, \"ax\"               \n" \
> > > +         "   .align          3                                       \n" \
> > > +         "   .globl          " STATIC_CALL_TRAMP_STR(name) "         \n" \
> > > +         STATIC_CALL_TRAMP_STR(name) ":                              \n" \
> > > +         "   hint    34      /* BTI C */                             \n" \
> > > +         "   adrp    x16, 1f                                         \n" \
> > > +         "   ldr     x16, [x16, :lo12:1f]                            \n" \
> > > +         "   cbz     x16, 0f                                         \n" \
> > > +         "   br      x16                                             \n" \
> > > +         "0: ret                                                     \n" \
> > > +         "   .popsection                                             \n" \
> > > +         "   .pushsection    .rodata, \"a\"                          \n" \
> > > +         "   .align          3                                       \n" \
> > > +         "1: .quad           " target "                              \n" \
> > > +         "   .popsection                                             \n")
> >
> > So I like what Christophe did for PPC32:
> >
> >   https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
> >
> > Where he starts with an unconditional jmp and uses that IFF the offset
> > fits and only does the data load when it doesn't. Ard, woulnd't that
> > also make sense on ARM64? I'm thinking most in-kernel function pointers
> > would actually fit, it's just the module muck that gets to have too
> > large pointers, no?
> >
> 
> Yeah, I'd have to page that back in. But it seems like the following
> 
>   bti c
>   <branch>
>   adrp x16, <literal>
>   ldr x16, [x16, ...]
>   br x16
> 
> with <branch> either set to 'b target' for the near targets, 'ret' for
> the NULL target, and 'nop' for the far targets should work, and the
> architecture permits patching branches into NOPs and vice versa
> without special synchronization. But I must be missing something here,
> or why did we have that long discussion before?

So the fundamental constraint is that we can only modify a single
instruction at a time and need to consider concurrent execution.

I think the first round of discussions was around getting the normal arm
pattern of constructing a long pointer 'working'. My initial suggestion
was to have 2 slots for that, then you came up with this data load
thing.


* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-21 14:44     ` Ard Biesheuvel
  2021-09-21 15:08       ` Peter Zijlstra
@ 2021-09-21 15:33       ` Mark Rutland
  2021-09-21 15:55         ` Ard Biesheuvel
  1 sibling, 1 reply; 32+ messages in thread
From: Mark Rutland @ 2021-09-21 15:33 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Peter Zijlstra, Frederic Weisbecker, Catalin Marinas,
	Will Deacon, LKML, James Morse, Quentin Perret, Christophe Leroy

On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Sep 21, 2021 at 01:32:35AM +0200, Frederic Weisbecker wrote:
> >
> > > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target)                            \
> > > +     asm("   .pushsection    .static_call.text, \"ax\"               \n" \
> > > +         "   .align          3                                       \n" \
> > > +         "   .globl          " STATIC_CALL_TRAMP_STR(name) "         \n" \
> > > +         STATIC_CALL_TRAMP_STR(name) ":                              \n" \
> > > +         "   hint    34      /* BTI C */                             \n" \
> > > +         "   adrp    x16, 1f                                         \n" \
> > > +         "   ldr     x16, [x16, :lo12:1f]                            \n" \
> > > +         "   cbz     x16, 0f                                         \n" \
> > > +         "   br      x16                                             \n" \
> > > +         "0: ret                                                     \n" \
> > > +         "   .popsection                                             \n" \
> > > +         "   .pushsection    .rodata, \"a\"                          \n" \
> > > +         "   .align          3                                       \n" \
> > > +         "1: .quad           " target "                              \n" \
> > > +         "   .popsection                                             \n")
> >
> > So I like what Christophe did for PPC32:
> >
> >   https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
> >
> > Where he starts with an unconditional jmp and uses that IFF the offset
> > fits and only does the data load when it doesn't. Ard, woulnd't that
> > also make sense on ARM64? I'm thinking most in-kernel function pointers
> > would actually fit, it's just the module muck that gets to have too
> > large pointers, no?
> >
> 
> Yeah, I'd have to page that back in. But it seems like the following
> 
>   bti c
>   <branch>
>   adrp x16, <literal>
>   ldr x16, [x16, ...]
>   br x16
>
> with <branch> either set to 'b target' for the near targets, 'ret' for
> the NULL target, and 'nop' for the far targets should work, and the
> architecture permits patching branches into NOPs and vice versa
> without special synchronization.

I think so, yes. We can do slightly better with an inline literal pool
and a PC-relative LDR to fold the ADRP+LDR, e.g.

	.align 3
tramp:
	BTI	C
	{B <func> | RET | NOP}
	LDR	X16, 1f
	BR	X16
1:	.quad	<literal>

Since that's in the .text, it's RO for regular accesses anyway.

> But I must be missing something here, or why did we have that long
> discussion before?

I think the long discussion was because v2 had some more complex options
(mostly due to trying to use ADRP+ADD) and atomicity/preemption issues
meant we could only transition between some of those one-way, and it was
subtle/complex:

https://lore.kernel.org/linux-arm-kernel/20201028184114.6834-1-ardb@kernel.org/

For v3, that was all gone, but we didn't have a user.

Since the common case *should* be handled by {B <func> | RET | NOP }, I
reckon it's fine to have just that and the literal pool fallback (which
I'll definitely need for the sorts of kernel I run when fuzzing, where
the kernel Image itself can be 100s of MiBs).

Thanks,
Mark.


* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-21 15:33       ` Mark Rutland
@ 2021-09-21 15:55         ` Ard Biesheuvel
  2021-09-21 16:28           ` Mark Rutland
  0 siblings, 1 reply; 32+ messages in thread
From: Ard Biesheuvel @ 2021-09-21 15:55 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, Frederic Weisbecker, Catalin Marinas,
	Will Deacon, LKML, James Morse, Quentin Perret, Christophe Leroy

On Tue, 21 Sept 2021 at 17:33, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> > On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
...
> > >
> > > So I like what Christophe did for PPC32:
> > >
> > >   https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
> > >
> > > Where he starts with an unconditional jmp and uses that IFF the offset
> > > fits and only does the data load when it doesn't. Ard, woulnd't that
> > > also make sense on ARM64? I'm thinking most in-kernel function pointers
> > > would actually fit, it's just the module muck that gets to have too
> > > large pointers, no?
> > >
> >
> > Yeah, I'd have to page that back in. But it seems like the following
> >
> >   bti c
> >   <branch>
> >   adrp x16, <literal>
> >   ldr x16, [x16, ...]
> >   br x16
> >
> > with <branch> either set to 'b target' for the near targets, 'ret' for
> > the NULL target, and 'nop' for the far targets should work, and the
> > architecture permits patching branches into NOPs and vice versa
> > without special synchronization.
>
> I think so, yes. We can do sligntly better with an inline literal pool
> and a PC-relative LDR to fold the ADRP+LDR, e.g.
>
>         .align 3
> tramp:
>         BTI     C
>         {B <func> | RET | NOP}
>         LDR     X16, 1f
>         BR      X16
> 1:      .quad   <literal>
>
> Since that's in the .text, it's RO for regular accesses anyway.
>

I tried to keep the literal in .rodata to avoid inadvertent gadgets
and/or anticipate exec-only mappings of .text, but that may be a bit
overzealous.

> > But I must be missing something here, or why did we have that long
> > discussion before?
>
> I think the long discussion was because v2 had some more complex options
> (mostly due to trying to use ADRP+ADD) and atomicity/preemption issues
> meant we could only transition between some of those one-way, and it was
> subtle/complex:
>
> https://lore.kernel.org/linux-arm-kernel/20201028184114.6834-1-ardb@kernel.org/
>

Ah yes, I was trying to use ADRP/ADD to avoid the load, and this is
what created all the complexity.

> For v3, that was all gone, but we didn't have a user.
>
> Since the common case *should* be handled by {B <func> | RET | NOP }, I
> reckon it's fine to have just that and the literal pool fallback (which
> I'll definitely need for the sorts of kernel I run when fuzzing, where
> the kernel Image itself can be 100s of MiBs).

Ack. So I'll respin this along these lines. Do we care deeply about
the branch and the literal being transiently out of sync?


* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-20 23:32 ` [PATCH 2/4] arm64: implement support for static call trampolines Frederic Weisbecker
  2021-09-21  7:09   ` Peter Zijlstra
@ 2021-09-21 16:10   ` Ard Biesheuvel
  1 sibling, 0 replies; 32+ messages in thread
From: Ard Biesheuvel @ 2021-09-21 16:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Catalin Marinas, Will Deacon, LKML, James Morse,
	Quentin Perret, Mark Rutland

On Tue, 21 Sept 2021 at 01:32, Frederic Weisbecker <frederic@kernel.org> wrote:
>
> From: Ard Biesheuvel <ardb@kernel.org>
>
> [fweisbec: rebased against 5.15-rc2. There has been quite some changes
>  on arm64 since then, especially with insn/patching, so some naming may
>  not be relevant anymore]
>

This patch does not include the static_call.c file, references to which
are being added below.
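
For the record, that file would mostly boil down to an
arch_static_call_transform() that rewrites the literal. A rough sketch
against the trampoline layout in this patch (illustration only, not the
missing file; it assumes the literal can be located by decoding the
ADRP/LDR pair):

	/* arch/arm64/kernel/static_call.c -- illustrative sketch */
	#include <linux/kernel.h>
	#include <linux/static_call.h>
	#include <asm/insn.h>

	void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
	{
		/*
		 * Trampoline layout (see asm/static_call.h in this patch):
		 *   tramp + 0: BTI C
		 *   tramp + 4: ADRP x16, <literal>
		 *   tramp + 8: LDR  x16, [x16, :lo12:<literal>]
		 * Recover the literal's address from the ADRP/LDR pair and
		 * rewrite the literal itself; the instructions never change.
		 */
		u32 adrp = le32_to_cpu(*(__le32 *)(tramp + 4));
		u32 ldr  = le32_to_cpu(*(__le32 *)(tramp + 8));
		u64 lit;

		lit  = ((u64)tramp + 4) & ~0xfffULL;		/* ADRP page base */
		lit += aarch64_insn_adrp_get_offset(adrp);	/* page offset */
		lit += aarch64_insn_decode_immediate(AARCH64_INSN_IMM_12, ldr) << 3;

		aarch64_literal_write((void *)lit, (u64)func);	/* NULL makes the CBZ take the RET */
	}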


> Implement arm64 support for the 'unoptimized' static call variety, which
> routes all calls through a single trampoline that is patched to perform a
> tail call to the selected function.
>
> Since static call targets may be located in modules loaded out of direct
> branching range, we need to use a ADRP/ADD pair to load the branch target
> into R16 and use a branch-to-register (BR) instruction to perform an
> indirect call. Unlike on x86, there is no pressing need on arm64 to avoid
> indirect calls at all cost, but hiding it from the compiler as is done
> here does have some benefits:
> - the literal is located in .rodata, which gives us the same robustness
>   advantage that code patching does;
> - no performance hit on CFI enabled Clang builds that decorate compiler
>   emitted indirect calls with branch target validity checks.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Quentin Perret <qperret@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: James Morse <james.morse@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  arch/arm64/Kconfig                   |  1 +
>  arch/arm64/include/asm/insn.h        |  2 ++
>  arch/arm64/include/asm/static_call.h | 28 ++++++++++++++++++++++++++++
>  arch/arm64/kernel/Makefile           |  4 ++--
>  arch/arm64/kernel/patching.c         | 14 +++++++++++---
>  arch/arm64/kernel/vmlinux.lds.S      |  1 +
>  6 files changed, 45 insertions(+), 5 deletions(-)
>  create mode 100644 arch/arm64/include/asm/static_call.h
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 5c7ae4c3954b..5b51b359ccda 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -192,6 +192,7 @@ config ARM64
>         select HAVE_PERF_REGS
>         select HAVE_PERF_USER_STACK_DUMP
>         select HAVE_REGS_AND_STACK_ACCESS_API
> +       select HAVE_STATIC_CALL
>         select HAVE_FUNCTION_ARG_ACCESS_API
>         select HAVE_FUTEX_CMPXCHG if FUTEX
>         select MMU_GATHER_RCU_TABLE_FREE
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index 6b776c8667b2..681c08b170df 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -547,6 +547,8 @@ u32 aarch64_set_branch_offset(u32 insn, s32 offset);
>  s32 aarch64_insn_adrp_get_offset(u32 insn);
>  u32 aarch64_insn_adrp_set_offset(u32 insn, s32 offset);
>
> +int aarch64_literal_write(void *addr, u64 literal);
> +
>  bool aarch32_insn_is_wide(u32 insn);
>
>  #define A32_RN_OFFSET  16
> diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
> new file mode 100644
> index 000000000000..665ec2a7cdb2
> --- /dev/null
> +++ b/arch/arm64/include/asm/static_call.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_STATIC_CALL_H
> +#define _ASM_STATIC_CALL_H
> +
> +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target)                      \
> +       asm("   .pushsection    .static_call.text, \"ax\"               \n" \
> +           "   .align          3                                       \n" \
> +           "   .globl          " STATIC_CALL_TRAMP_STR(name) "         \n" \
> +           STATIC_CALL_TRAMP_STR(name) ":                              \n" \
> +           "   hint    34      /* BTI C */                             \n" \
> +           "   adrp    x16, 1f                                         \n" \
> +           "   ldr     x16, [x16, :lo12:1f]                            \n" \
> +           "   cbz     x16, 0f                                         \n" \
> +           "   br      x16                                             \n" \
> +           "0: ret                                                     \n" \
> +           "   .popsection                                             \n" \
> +           "   .pushsection    .rodata, \"a\"                          \n" \
> +           "   .align          3                                       \n" \
> +           "1: .quad           " target "                              \n" \
> +           "   .popsection                                             \n")
> +
> +#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)                      \
> +       __ARCH_DEFINE_STATIC_CALL_TRAMP(name, #func)
> +
> +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)                       \
> +       __ARCH_DEFINE_STATIC_CALL_TRAMP(name, "0x0")
> +
> +#endif /* _ASM_STATIC_CALL_H */
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index 3f1490bfb938..83f03fc1e402 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -28,8 +28,8 @@ obj-y                 := debug-monitors.o entry.o irq.o fpsimd.o              \
>                            return_address.o cpuinfo.o cpu_errata.o              \
>                            cpufeature.o alternative.o cacheinfo.o               \
>                            smp.o smp_spin_table.o topology.o smccc-call.o       \
> -                          syscall.o proton-pack.o idreg-override.o idle.o      \
> -                          patching.o
> +                          syscall.o proton-pack.o static_call.o                \
> +                          idreg-override.o idle.o patching.o
>
>  targets                        += efi-entry.o
>
> diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
> index 771f543464e0..841c0499eca5 100644
> --- a/arch/arm64/kernel/patching.c
> +++ b/arch/arm64/kernel/patching.c
> @@ -66,7 +66,7 @@ int __kprobes aarch64_insn_read(void *addr, u32 *insnp)
>         return ret;
>  }
>
> -static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
> +static int __kprobes __aarch64_insn_write(void *addr, void *insn, int size)
>  {
>         void *waddr = addr;
>         unsigned long flags = 0;
> @@ -75,7 +75,7 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
>         raw_spin_lock_irqsave(&patch_lock, flags);
>         waddr = patch_map(addr, FIX_TEXT_POKE0);
>
> -       ret = copy_to_kernel_nofault(waddr, &insn, AARCH64_INSN_SIZE);
> +       ret = copy_to_kernel_nofault(waddr, insn, size);
>
>         patch_unmap(FIX_TEXT_POKE0);
>         raw_spin_unlock_irqrestore(&patch_lock, flags);
> @@ -85,7 +85,15 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
>
>  int __kprobes aarch64_insn_write(void *addr, u32 insn)
>  {
> -       return __aarch64_insn_write(addr, cpu_to_le32(insn));
> +       __le32 i = cpu_to_le32(insn);
> +
> +       return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
> +}
> +
> +int aarch64_literal_write(void *addr, u64 literal)
> +{
> +       BUG_ON(!IS_ALIGNED((u64)addr, sizeof(u64)));
> +       return __aarch64_insn_write(addr, &literal, sizeof(u64));
>  }
>
>  int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index f6b1a88245db..ceb35c35192c 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -161,6 +161,7 @@ SECTIONS
>                         IDMAP_TEXT
>                         HIBERNATE_TEXT
>                         TRAMP_TEXT
> +                       STATIC_CALL_TEXT
>                         *(.fixup)
>                         *(.gnu.warning)
>                 . = ALIGN(16);
> --
> 2.25.1
>


* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-21 15:55         ` Ard Biesheuvel
@ 2021-09-21 16:28           ` Mark Rutland
  2021-09-25 17:46             ` David Laight
  0 siblings, 1 reply; 32+ messages in thread
From: Mark Rutland @ 2021-09-21 16:28 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Peter Zijlstra, Frederic Weisbecker, Catalin Marinas,
	Will Deacon, LKML, James Morse, Quentin Perret, Christophe Leroy

On Tue, Sep 21, 2021 at 05:55:11PM +0200, Ard Biesheuvel wrote:
> On Tue, 21 Sept 2021 at 17:33, Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
> ...
> > > >
> > > > So I like what Christophe did for PPC32:
> > > >
> > > >   https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
> > > >
> > > > Where he starts with an unconditional jmp and uses that IFF the offset
> > > > fits and only does the data load when it doesn't. Ard, woulnd't that
> > > > also make sense on ARM64? I'm thinking most in-kernel function pointers
> > > > would actually fit, it's just the module muck that gets to have too
> > > > large pointers, no?
> > > >
> > >
> > > Yeah, I'd have to page that back in. But it seems like the following
> > >
> > >   bti c
> > >   <branch>
> > >   adrp x16, <literal>
> > >   ldr x16, [x16, ...]
> > >   br x16
> > >
> > > with <branch> either set to 'b target' for the near targets, 'ret' for
> > > the NULL target, and 'nop' for the far targets should work, and the
> > > architecture permits patching branches into NOPs and vice versa
> > > without special synchronization.
> >
> > I think so, yes. We can do sligntly better with an inline literal pool
> > and a PC-relative LDR to fold the ADRP+LDR, e.g.
> >
> >         .align 3
> > tramp:
> >         BTI     C
> >         {B <func> | RET | NOP}
> >         LDR     X16, 1f
> >         BR      X16
> > 1:      .quad   <literal>
> >
> > Since that's in the .text, it's RO for regular accesses anyway.
> >
> 
> I tried to keep the literal in .rodata to avoid inadvertent gadgets
> and/or anticipate exec-only mappings of .text, but that may be a bit
> overzealous.

I think that in practice the risk of gadgetisation is minimal, and
having it inline means we only need to record a single address per
trampoline, so there's less risk that we get the patching wrong.

> > > But I must be missing something here, or why did we have that long
> > > discussion before?
> >
> > I think the long discussion was because v2 had some more complex options
> > (mostly due to trying to use ADRP+ADD) and atomicity/preemption issues
> > meant we could only transition between some of those one-way, and it was
> > subtle/complex:
> >
> > https://lore.kernel.org/linux-arm-kernel/20201028184114.6834-1-ardb@kernel.org/
> >
> 
> Ah yes, I was trying to use ADRP/ADD to avoid the load, and this is
> what created all the complexity.
> 
> > For v3, that was all gone, but we didn't have a user.
> >
> > Since the common case *should* be handled by {B <func> | RET | NOP }, I
> > reckon it's fine to have just that and the literal pool fallback (which
> > I'll definitely need for the sorts of kernel I run when fuzzing, where
> > the kernel Image itself can be 100s of MiBs).
> 
> Ack. So I'll respin this along these lines.

Sounds good!

> Do we care deeply about the branch and the literal being transiently
> out of sync?

I don't think we care about the transient window, since even if we just
patched a branch, a thread could be preempted immediately after the
branch and sit around blocked for a while. So it's always necessary to
either handle such threads taking stale branches, or to flip the branch
such that this doesn't matter (e.g. done once at boot time).

That said, I'd suggest that we always patch the literal, then patch the
{B | RET | NOP}, so that outside of patch times those are consistent with
one another and we can't accidentally get into a state where we use a
stale/bogus target after multiple patches. We can align the trampoline
such that we know it falls within a single page, so that we only need to
map/unmap it once (and the cost of the extra STR will be far smaller
than the map/unmap anyhow).
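
In pseudo-C, purely to illustrate that ordering (helper names as in patch 2
and the existing insn code; a sketch, not the eventual implementation;
NULL-target/RET handling omitted for brevity):

	/* retarget: BTI C; {B <func> | RET | NOP}; LDR x16, lit; BR x16; lit: .quad */
	static void retarget_tramp(void *tramp, void *func)
	{
		void *pbr = tramp + 4;			/* the {B | RET | NOP} slot */
		u64 *lit  = tramp + 16;			/* the inline literal */
		long disp = (long)func - (long)pbr;
		u32 insn;

		/* 1) make the literal point at the new target first... */
		aarch64_literal_write(lit, (u64)func);

		/* 2) ...then flip the first patchable instruction */
		if (disp >= -(long)SZ_128M && disp < (long)SZ_128M)
			insn = aarch64_insn_gen_branch_imm((u64)pbr, (u64)func,
							   AARCH64_INSN_BRANCH_NOLINK);	/* B <func> */
		else
			insn = aarch64_insn_gen_nop();	/* far target: fall back to the literal */

		aarch64_insn_patch_text_nosync(pbr, insn);
	}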

Thanks,
Mark.


* RE: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-21 16:28           ` Mark Rutland
@ 2021-09-25 17:46             ` David Laight
  2021-09-27  8:58               ` Mark Rutland
  0 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2021-09-25 17:46 UTC (permalink / raw)
  To: 'Mark Rutland', Ard Biesheuvel
  Cc: Peter Zijlstra, Frederic Weisbecker, Catalin Marinas,
	Will Deacon, LKML, James Morse, Quentin Perret, Christophe Leroy

From: Mark Rutland
> Sent: 21 September 2021 17:28
> 
> On Tue, Sep 21, 2021 at 05:55:11PM +0200, Ard Biesheuvel wrote:
> > On Tue, 21 Sept 2021 at 17:33, Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> > > > On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
> > ...
...
> > >
> > > I think so, yes. We can do sligntly better with an inline literal pool
> > > and a PC-relative LDR to fold the ADRP+LDR, e.g.
> > >
> > >         .align 3
> > > tramp:
> > >         BTI     C
> > >         {B <func> | RET | NOP}
> > >         LDR     X16, 1f
> > >         BR      X16
> > > 1:      .quad   <literal>
> > >
> > > Since that's in the .text, it's RO for regular accesses anyway.
> > >
> >
> > I tried to keep the literal in .rodata to avoid inadvertent gadgets
> > and/or anticipate exec-only mappings of .text, but that may be a bit
> > overzealous.
> 
> I think that in practice the risk of gadgetisation is minimal, and
> having it inline means we only need to record a single address per
> trampoline, so there's less risk that we get the patching wrong.

But doesn't that mean that it is almost certainly a data cache miss?
You really want an instruction that reads the constant from the I-cache.
Or at least be able to 'bunch together' the constants so they
stand a chance of sharing a D-cache line.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-09-25 17:46             ` David Laight
@ 2021-09-27  8:58               ` Mark Rutland
  0 siblings, 0 replies; 32+ messages in thread
From: Mark Rutland @ 2021-09-27  8:58 UTC (permalink / raw)
  To: David Laight
  Cc: Ard Biesheuvel, Peter Zijlstra, Frederic Weisbecker,
	Catalin Marinas, Will Deacon, LKML, James Morse, Quentin Perret,
	Christophe Leroy

On Sat, Sep 25, 2021 at 05:46:23PM +0000, David Laight wrote:
> From: Mark Rutland
> > Sent: 21 September 2021 17:28
> > 
> > On Tue, Sep 21, 2021 at 05:55:11PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 21 Sept 2021 at 17:33, Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> > > > > On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
> > > ...
> ...
> > > >
> > > > I think so, yes. We can do sligntly better with an inline literal pool
> > > > and a PC-relative LDR to fold the ADRP+LDR, e.g.
> > > >
> > > >         .align 3
> > > > tramp:
> > > >         BTI     C
> > > >         {B <func> | RET | NOP}
> > > >         LDR     X16, 1f
> > > >         BR      X16
> > > > 1:      .quad   <literal>
> > > >
> > > > Since that's in the .text, it's RO for regular accesses anyway.
> > > >
> > >
> > > I tried to keep the literal in .rodata to avoid inadvertent gadgets
> > > and/or anticipate exec-only mappings of .text, but that may be a bit
> > > overzealous.
> > 
> > I think that in practice the risk of gadgetisation is minimal, and
> > having it inline means we only need to record a single address per
> > trampoline, so there's less risk that we get the patching wrong.
> 
> But doesn't that mean that it is almost certainly a data cache miss?
> You really want an instruction that reads the constant from the I-cache.
> Or at least be able to 'bunch together' the constants so they
> stand a chance of sharing a D-cache line.

The idea is that in the common case we don't even use the literal, and
the `B <func>` goes to the target.

The literal is there as a fallback for when the target is a sufficiently
long distance away (more than +/-128MiB from the `BR X16`). By default
we try to keep modules within 128MiB of the kernel image, and this
should only happen in uncommon configs (e.g. my debug kernel configs
when the kernel can be 100s of MiBs).

With that in mind, I'd strongly prefer to optimize for simplicity rather
than making the uncommon case faster.

Thanks,
Mark.


* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-26 11:06                   ` David Laight
@ 2021-10-27 12:47                     ` Mark Rutland
  0 siblings, 0 replies; 32+ messages in thread
From: Mark Rutland @ 2021-10-27 12:47 UTC (permalink / raw)
  Cc: Ard Biesheuvel, Peter Zijlstra, Frederic Weisbecker, LKML,
	James Morse, Quentin Perret, Catalin Marinas, Will Deacon

On Tue, Oct 26, 2021 at 11:06:11AM +0000, David Laight wrote:
> From: Mark Rutland
> > Sent: 26 October 2021 11:37
> ...
> > My preference overall is to keep the trampoline self-contained, and I'd
> > prefer to keep the RET inline in the trampoline rather than trying to
> > factor it out so that all the control-flow is clearly in one place.
> > 
> > So I'd prefer that we have the sequence as-is:
> > 
> > | 0:	.quad 0x0
> > | 	bti	c
> > | 	< insn >
> > | 	ldr	x16, 0b
> > | 	cbz	x16, 1f
> > | 	br	x16
> > | 1:	ret
> 
> What is wrong with:
> 0:	.quad 1f
> 	bti	c
> 	< insn >
> 	ldr	x16, 0b
> 	br	x16
> 1:	bti	c
> 	ret
> 
> Self-contained and reasonably easy to read.

FWIW, that would work for me too.

Thanks,
Mark.


* RE: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-26 10:36                 ` Mark Rutland
  2021-10-26 10:45                   ` Peter Zijlstra
@ 2021-10-26 11:06                   ` David Laight
  2021-10-27 12:47                     ` Mark Rutland
  1 sibling, 1 reply; 32+ messages in thread
From: David Laight @ 2021-10-26 11:06 UTC (permalink / raw)
  To: 'Mark Rutland', Ard Biesheuvel
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, James Morse,
	Quentin Perret, Catalin Marinas, Will Deacon

From: Mark Rutland
> Sent: 26 October 2021 11:37
...
> My preference overall is to keep the trampoline self-contained, and I'd
> prefer to keep the RET inline in the trampoline rather than trying to
> factor it out so that all the control-flow is clearly in one place.
> 
> So I'd prefer that we have the sequence as-is:
> 
> | 0:	.quad 0x0
> | 	bti	c
> | 	< insn >
> | 	ldr	x16, 0b
> | 	cbz	x16, 1f
> | 	br	x16
> | 1:	ret

What is wrong with:
0:	.quad 1f
	bti	c
	< insn >
	ldr	x16, 0b
	br	x16
1:	bti	c
	ret

Self-contained and reasonably easy to read.

	David



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-26 10:36                 ` Mark Rutland
@ 2021-10-26 10:45                   ` Peter Zijlstra
  2021-10-26 11:06                   ` David Laight
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2021-10-26 10:45 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Ard Biesheuvel, Frederic Weisbecker, LKML, James Morse,
	David Laight, Quentin Perret, Catalin Marinas, Will Deacon

On Tue, Oct 26, 2021 at 11:36:55AM +0100, Mark Rutland wrote:

> My preference overall is to keep the trampoline self-contained, and I'd
> prefer to keep the RET inline in the trampoline rather than trying to
> factor it out so that all the control-flow is clearly in one place.
> 
> So I'd prefer that we have the sequence as-is:
> 
> | 0:	.quad 0x0
> | 	bti	c
> | 	< insn >
> | 	ldr	x16, 0b
> | 	cbz	x16, 1f
> | 	br	x16
> | 1:	ret

OK, fair enough. In that case:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Although I do think that function can use a comment to explain the magic
involved.

> If we knew these were only called with IRQs enabled (and so we can take
> an IPI to generate a context synchronization event), we could patch
> <insn> to a RET and point the literal back at the BTI, e.g.

Given the static_call() usage on x86 I'm pretty sure you'll want them
with IRQs disabled.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 15:10               ` Ard Biesheuvel
@ 2021-10-26 10:36                 ` Mark Rutland
  2021-10-26 10:45                   ` Peter Zijlstra
  2021-10-26 11:06                   ` David Laight
  0 siblings, 2 replies; 32+ messages in thread
From: Mark Rutland @ 2021-10-26 10:36 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, James Morse,
	David Laight, Quentin Perret, Catalin Marinas, Will Deacon

On Mon, Oct 25, 2021 at 05:10:24PM +0200, Ard Biesheuvel wrote:
> On Mon, 25 Oct 2021 at 17:05, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Oct 25, 2021 at 04:55:17PM +0200, Ard Biesheuvel wrote:
> > > On Mon, 25 Oct 2021 at 16:47, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > Perhaps a little something like so.. Shaves 2 instructions off each
> > > > trampoline.
> > > >
> > > > --- a/arch/arm64/include/asm/static_call.h
> > > > +++ b/arch/arm64/include/asm/static_call.h
> > > > @@ -11,9 +11,7 @@
> > > >             "   hint    34      /* BTI C */                             \n" \
> > > >                 insn "                                                  \n" \
> > > >             "   ldr     x16, 0b                                         \n" \
> > > > -           "   cbz     x16, 1f                                         \n" \
> > > >             "   br      x16                                             \n" \
> > > > -           "1: ret                                                     \n" \
> > > >             "   .popsection                                             \n")
> > > >
> > > >  #define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)                      \
> > > > --- a/arch/arm64/kernel/patching.c
> > > > +++ b/arch/arm64/kernel/patching.c
> > > > @@ -90,6 +90,11 @@ int __kprobes aarch64_insn_write(void *a
> > > >         return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
> > > >  }
> > > >
> > > > +asm("__static_call_ret:                \n"
> > > > +    "  ret                     \n")
> > > > +
> > >
> > > This breaks BTI as it lacks the landing pad, and it will be called indirectly.
> >
> > Argh!
> >
> > > > +extern void __static_call_ret(void);
> > > > +
> > >
> > > Better to have an ordinary C function here (with consistent linkage),
> > > but we need to take the address in a way that works with Clang CFI.
> >
> > There is that.
> >
> > > As the two additional instructions are on an ice cold path anyway, I'm
> > > not sure this is an obvious improvement tbh.
> >
> > For me it's both simpler -- by virtue of being more consistent, and
> > smaller. So double win :-)
> >
> > That is; you're already relying on the literal being unconditionally
> > updated for the normal B foo -> NOP path, and having the RET -> NOP path
> > be handled differently is just confusing.
> >
> > At least, that's how I'm seeing it today...
> 
> Fair enough. I don't have a strong opinion either way, so I'll let
> some other arm64 folks chime in as well.

My preference overall is to keep the trampoline self-contained, and I'd
prefer to keep the RET inline in the trampoline rather than trying to
factor it out so that all the control-flow is clearly in one place.

So I'd prefer that we have the sequence as-is:

| 0:	.quad 0x0
| 	bti	c
| 	< insn >
| 	ldr	x16, 0b
| 	cbz	x16, 1f
| 	br	x16
| 1:	ret

If we knew these were only called with IRQs enabled (and so we can take
an IPI to generate a context synchronization event), we could patch
<insn> to a RET and point the literal back at the BTI, e.g.

| 0:	.quad 0x0
| 	bti	c
| 	< insn >
| 	ldr	x16, 0b
| 	br	x16

... but I'm pretty sure there are CPUs that will never re-fetch <insn>
in that case, and will get stuck in an infinite loop.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 15:03             ` Peter Zijlstra
@ 2021-10-25 15:10               ` Ard Biesheuvel
  2021-10-26 10:36                 ` Mark Rutland
  0 siblings, 1 reply; 32+ messages in thread
From: Ard Biesheuvel @ 2021-10-25 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, James Morse, David Laight,
	Quentin Perret, Catalin Marinas, Will Deacon, Mark Rutland

On Mon, 25 Oct 2021 at 17:05, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Oct 25, 2021 at 04:55:17PM +0200, Ard Biesheuvel wrote:
> > On Mon, 25 Oct 2021 at 16:47, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Perhaps a little something like so.. Shaves 2 instructions off each
> > > trampoline.
> > >
> > > --- a/arch/arm64/include/asm/static_call.h
> > > +++ b/arch/arm64/include/asm/static_call.h
> > > @@ -11,9 +11,7 @@
> > >             "   hint    34      /* BTI C */                             \n" \
> > >                 insn "                                                  \n" \
> > >             "   ldr     x16, 0b                                         \n" \
> > > -           "   cbz     x16, 1f                                         \n" \
> > >             "   br      x16                                             \n" \
> > > -           "1: ret                                                     \n" \
> > >             "   .popsection                                             \n")
> > >
> > >  #define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)                      \
> > > --- a/arch/arm64/kernel/patching.c
> > > +++ b/arch/arm64/kernel/patching.c
> > > @@ -90,6 +90,11 @@ int __kprobes aarch64_insn_write(void *a
> > >         return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
> > >  }
> > >
> > > +asm("__static_call_ret:                \n"
> > > +    "  ret                     \n")
> > > +
> >
> > This breaks BTI as it lacks the landing pad, and it will be called indirectly.
>
> Argh!
>
> > > +extern void __static_call_ret(void);
> > > +
> >
> > Better to have an ordinary C function here (with consistent linkage),
> > but we need to take the address in a way that works with Clang CFI.
>
> There is that.
>
> > As the two additional instructions are on an ice cold path anyway, I'm
> > not sure this is an obvious improvement tbh.
>
> For me it's both simpler -- by virtue of being more consistent, and
> smaller. So double win :-)
>
> That is; you're already relying on the literal being unconditionally
> updated for the normal B foo -> NOP path, and having the RET -> NOP path
> be handled differently is just confusing.
>
> At least, that's how I'm seeing it today...

Fair enough. I don't have a strong opinion either way, so I'll let
some other arm64 folks chime in as well.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 14:55           ` Ard Biesheuvel
  2021-10-25 15:03             ` Peter Zijlstra
@ 2021-10-25 15:03             ` David Laight
  1 sibling, 0 replies; 32+ messages in thread
From: David Laight @ 2021-10-25 15:03 UTC (permalink / raw)
  To: 'Ard Biesheuvel', Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, James Morse, Quentin Perret,
	Catalin Marinas, Will Deacon, Mark Rutland

From: Ard Biesheuvel
> Sent: 25 October 2021 15:55
> 
> On Mon, 25 Oct 2021 at 16:47, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Oct 25, 2021 at 04:19:16PM +0200, Peter Zijlstra wrote:
> > > On Mon, Oct 25, 2021 at 04:08:37PM +0200, Ard Biesheuvel wrote:
> >
> > > > > Ooohh, but what if you go from !func to NOP.
> > > > >
> > > > > assuming:
> > > > >
> > > > >         .literal = 0
> > > > >         BTI C
> > > > >         RET
> > > > >
> > > > > Then
> > > > >
> > > > >         CPU0                    CPU1
> > > > >
> > > > >         [S] literal = func      [I] NOP
> > > > >         [S] insn[1] = NOP       [L] x16 = literal (NULL)
> > > > >                                 b x16
> > > > >                                 *BANG*
> > > > >
> > > > > Is that possible? (total lack of memory ordering etc..)
> > > > >
> > > >
> > > > The CBZ will branch to the RET instruction if x16 == 0x0, so this
> > > > should not happen.
> > >
> > > Oooh, I missed that :/ I was about to suggest writing the address of a
> > > bare 'ret' trampoline instead of NULL into the literal.
> >
> > Perhaps a little something like so.. Shaves 2 instructions off each
> > trampoline.
> >
> > --- a/arch/arm64/include/asm/static_call.h
> > +++ b/arch/arm64/include/asm/static_call.h
> > @@ -11,9 +11,7 @@
> >             "   hint    34      /* BTI C */                             \n" \
> >                 insn "                                                  \n" \
> >             "   ldr     x16, 0b                                         \n" \
> > -           "   cbz     x16, 1f                                         \n" \
> >             "   br      x16                                             \n" \
> > -           "1: ret                                                     \n" \
> >             "   .popsection                                             \n")
> >
> >  #define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)                      \
> > --- a/arch/arm64/kernel/patching.c
> > +++ b/arch/arm64/kernel/patching.c
> > @@ -90,6 +90,11 @@ int __kprobes aarch64_insn_write(void *a
> >         return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
> >  }
> >
> > +asm("__static_call_ret:                \n"
> > +    "  ret                     \n")
> > +
> 
> This breaks BTI as it lacks the landing pad, and it will be called indirectly.
> 
> > +extern void __static_call_ret(void);
> > +
> 
> Better to have an ordinary C function here (with consistent linkage),
> but we need to take the address in a way that works with Clang CFI.
> 
> As the two additional instructions are on an ice cold path anyway, I'm
> not sure this is an obvious improvement tbh.

If my sums are correct the code block is exactly 32 bytes.
So no point saving an instruction.
But you could have:
	0:	.quad 1f
	label:
		bti  c
		nop/branch
		ldr  x16, 0b
		br   x16
	1:    bti  c
		ret

That is all self-contained.

	David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 14:55           ` Ard Biesheuvel
@ 2021-10-25 15:03             ` Peter Zijlstra
  2021-10-25 15:10               ` Ard Biesheuvel
  2021-10-25 15:03             ` David Laight
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2021-10-25 15:03 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Frederic Weisbecker, LKML, James Morse, David Laight,
	Quentin Perret, Catalin Marinas, Will Deacon, Mark Rutland

On Mon, Oct 25, 2021 at 04:55:17PM +0200, Ard Biesheuvel wrote:
> On Mon, 25 Oct 2021 at 16:47, Peter Zijlstra <peterz@infradead.org> wrote:

> > Perhaps a little something like so.. Shaves 2 instructions off each
> > trampoline.
> >
> > --- a/arch/arm64/include/asm/static_call.h
> > +++ b/arch/arm64/include/asm/static_call.h
> > @@ -11,9 +11,7 @@
> >             "   hint    34      /* BTI C */                             \n" \
> >                 insn "                                                  \n" \
> >             "   ldr     x16, 0b                                         \n" \
> > -           "   cbz     x16, 1f                                         \n" \
> >             "   br      x16                                             \n" \
> > -           "1: ret                                                     \n" \
> >             "   .popsection                                             \n")
> >
> >  #define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)                      \
> > --- a/arch/arm64/kernel/patching.c
> > +++ b/arch/arm64/kernel/patching.c
> > @@ -90,6 +90,11 @@ int __kprobes aarch64_insn_write(void *a
> >         return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
> >  }
> >
> > +asm("__static_call_ret:                \n"
> > +    "  ret                     \n")
> > +
> 
> This breaks BTI as it lacks the landing pad, and it will be called indirectly.

Argh!

> > +extern void __static_call_ret(void);
> > +
> 
> Better to have an ordinary C function here (with consistent linkage),
> but we need to take the address in a way that works with Clang CFI.

There is that.

> As the two additional instructions are on an ice cold path anyway, I'm
> not sure this is an obvious improvement tbh.

For me it's both simpler -- by virtue of being more consistent, and
smaller. So double win :-)

That is; you're already relying on the literal being unconditionally
updated for the normal B foo -> NOP path, and having the RET -> NOP path
be handled differently is just confusing.

At least, that's how I'm seeing it today...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 14:44         ` Peter Zijlstra
@ 2021-10-25 14:55           ` Ard Biesheuvel
  2021-10-25 15:03             ` Peter Zijlstra
  2021-10-25 15:03             ` David Laight
  0 siblings, 2 replies; 32+ messages in thread
From: Ard Biesheuvel @ 2021-10-25 14:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, James Morse, David Laight,
	Quentin Perret, Catalin Marinas, Will Deacon, Mark Rutland

On Mon, 25 Oct 2021 at 16:47, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Oct 25, 2021 at 04:19:16PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 25, 2021 at 04:08:37PM +0200, Ard Biesheuvel wrote:
>
> > > > Ooohh, but what if you go from !func to NOP.
> > > >
> > > > assuming:
> > > >
> > > >         .literal = 0
> > > >         BTI C
> > > >         RET
> > > >
> > > > Then
> > > >
> > > >         CPU0                    CPU1
> > > >
> > > >         [S] literal = func      [I] NOP
> > > >         [S] insn[1] = NOP       [L] x16 = literal (NULL)
> > > >                                 b x16
> > > >                                 *BANG*
> > > >
> > > > Is that possible? (total lack of memory ordering etc..)
> > > >
> > >
> > > The CBZ will branch to the RET instruction if x16 == 0x0, so this
> > > should not happen.
> >
> > Oooh, I missed that :/ I was about to suggest writing the address of a
> > bare 'ret' trampoline instead of NULL into the literal.
>
> Perhaps a little something like so.. Shaves 2 instructions off each
> trampoline.
>
> --- a/arch/arm64/include/asm/static_call.h
> +++ b/arch/arm64/include/asm/static_call.h
> @@ -11,9 +11,7 @@
>             "   hint    34      /* BTI C */                             \n" \
>                 insn "                                                  \n" \
>             "   ldr     x16, 0b                                         \n" \
> -           "   cbz     x16, 1f                                         \n" \
>             "   br      x16                                             \n" \
> -           "1: ret                                                     \n" \
>             "   .popsection                                             \n")
>
>  #define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)                      \
> --- a/arch/arm64/kernel/patching.c
> +++ b/arch/arm64/kernel/patching.c
> @@ -90,6 +90,11 @@ int __kprobes aarch64_insn_write(void *a
>         return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
>  }
>
> +asm("__static_call_ret:                \n"
> +    "  ret                     \n")
> +

This breaks BTI as it lacks the landing pad, and it will be called indirectly.

> +extern void __static_call_ret(void);
> +

Better to have an ordinary C function here (with consistent linkage),
but we need to take the address in a way that works with Clang CFI.

As the two additional instructions are on an ice cold path anyway, I'm
not sure this is an obvious improvement tbh.

>  void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
>  {
>         /*
> @@ -97,9 +102,7 @@ void arch_static_call_transform(void *si
>          *  0x0 bti c           <--- trampoline entry point
>          *  0x4 <branch or nop>
>          *  0x8 ldr x16, <literal>
> -        *  0xc cbz x16, 20
> -        * 0x10 br x16
> -        * 0x14 ret
> +        *  0xc br x16
>          */
>         struct {
>                 u64     literal;
> @@ -113,6 +116,7 @@ void arch_static_call_transform(void *si
>         insns.insn[0] = cpu_to_le32(insn);
>
>         if (!func) {
> +               insns.literal = (unsigned long)&__static_call_ret;
>                 insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
>                                                    AARCH64_INSN_BRANCH_RETURN);
>         } else {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 14:19       ` Peter Zijlstra
@ 2021-10-25 14:44         ` Peter Zijlstra
  2021-10-25 14:55           ` Ard Biesheuvel
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2021-10-25 14:44 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Frederic Weisbecker, LKML, James Morse, David Laight,
	Quentin Perret, Catalin Marinas, Will Deacon, Mark Rutland

On Mon, Oct 25, 2021 at 04:19:16PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 25, 2021 at 04:08:37PM +0200, Ard Biesheuvel wrote:

> > > Ooohh, but what if you go from !func to NOP.
> > >
> > > assuming:
> > >
> > >         .literal = 0
> > >         BTI C
> > >         RET
> > >
> > > Then
> > >
> > >         CPU0                    CPU1
> > >
> > >         [S] literal = func      [I] NOP
> > >         [S] insn[1] = NOP       [L] x16 = literal (NULL)
> > >                                 b x16
> > >                                 *BANG*
> > >
> > > Is that possible? (total lack of memory ordering etc..)
> > >
> > 
> > The CBZ will branch to the RET instruction if x16 == 0x0, so this
> > should not happen.
> 
> Oooh, I missed that :/ I was about to suggest writing the address of a
> bare 'ret' trampoline instead of NULL into the literal.

Perhaps a little something like so.. Shaves 2 instructions off each
trampoline.

--- a/arch/arm64/include/asm/static_call.h
+++ b/arch/arm64/include/asm/static_call.h
@@ -11,9 +11,7 @@
 	    "	hint 	34	/* BTI C */				\n" \
 		insn "							\n" \
 	    "	ldr	x16, 0b						\n" \
-	    "	cbz	x16, 1f						\n" \
 	    "	br	x16						\n" \
-	    "1:	ret							\n" \
 	    "	.popsection						\n")
 
 #define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)			\
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -90,6 +90,11 @@ int __kprobes aarch64_insn_write(void *a
 	return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
 }
 
+asm("__static_call_ret:		\n"
+    "	ret			\n")
+
+extern void __static_call_ret(void);
+
 void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
 {
 	/*
@@ -97,9 +102,7 @@ void arch_static_call_transform(void *si
 	 *  0x0	bti c		<--- trampoline entry point
 	 *  0x4	<branch or nop>
 	 *  0x8	ldr x16, <literal>
-	 *  0xc	cbz x16, 20
-	 * 0x10	br x16
-	 * 0x14	ret
+	 *  0xc	br x16
 	 */
 	struct {
 		u64	literal;
@@ -113,6 +116,7 @@ void arch_static_call_transform(void *si
 	insns.insn[0] = cpu_to_le32(insn);
 
 	if (!func) {
+		insns.literal = (unsigned long)&__static_call_ret;
 		insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
 						   AARCH64_INSN_BRANCH_RETURN);
 	} else {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 14:31     ` Ard Biesheuvel
@ 2021-10-25 14:38       ` David Laight
  0 siblings, 0 replies; 32+ messages in thread
From: David Laight @ 2021-10-25 14:38 UTC (permalink / raw)
  To: 'Ard Biesheuvel'
  Cc: Frederic Weisbecker, Peter Zijlstra, LKML, James Morse,
	Quentin Perret, Catalin Marinas, Will Deacon, Mark Rutland

From: Ard Biesheuvel
> Sent: 25 October 2021 15:32
...
> On arm64, we can only patch NOPs into branch instructions or vice
> versa, or we'd have to run the whole thing under stop_machine() to
> ensure that other cores don't fetch garbage.

Ok, I was thinking it would be safe to patch a single instruction.
Clearly you can't patch more than one without danger of 'garbage'.

	David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 14:25   ` David Laight
@ 2021-10-25 14:31     ` Ard Biesheuvel
  2021-10-25 14:38       ` David Laight
  0 siblings, 1 reply; 32+ messages in thread
From: Ard Biesheuvel @ 2021-10-25 14:31 UTC (permalink / raw)
  To: David Laight
  Cc: Frederic Weisbecker, Peter Zijlstra, LKML, James Morse,
	Quentin Perret, Catalin Marinas, Will Deacon, Mark Rutland

On Mon, 25 Oct 2021 at 16:25, David Laight <David.Laight@aculab.com> wrote:
>
> From: Frederic Weisbecker
> > Sent: 25 October 2021 13:21
> >
> > Implement arm64 support for the 'unoptimized' static call variety, which
> > routes all calls through a single trampoline that is patched to perform a
> > tail call to the selected function.
> >
> > It is expected that the direct branch instruction will be able to cover
> > the common case. However, given that static call targets may be located
> > in modules loaded out of direct branching range, we need a fallback path
> > that loads the address into R16 and uses a branch-to-register (BR)
> > instruction to perform an indirect call.
> >
> ...
> > +void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
> > +{
> > +     /*
> > +      * -0x8 <literal>
> > +      *  0x0 bti c           <--- trampoline entry point
> > +      *  0x4 <branch or nop>
> > +      *  0x8 ldr x16, <literal>
> > +      *  0xc cbz x16, 20
> > +      * 0x10 br x16
> > +      * 0x14 ret
> > +      */
>
> Since the 'ldr x16, <literal>' is just a 32bit constant
> (for a pc-relative load).
>

I don't follow. Are you saying it is a 32-bit opcode? This applies to
all AArch64 opcodes.

> Can't you save a word by making offset 0x4 <branch or ldr x16, <literal>> ?
>
> Or am I missing something?
>

On arm64, we can only patch NOPs into branch instructions or vice
versa, or we'd have to run the whole thing under stop_machine() to
ensure that other cores don't fetch garbage.
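
As a sketch of what such a single-instruction patch looks like (purely
illustrative; repatch_direct_slot() is a made-up name, but the aarch64_insn_*
helpers are the existing ones the patch itself uses for the slot at
tramp + 4):

static void repatch_direct_slot(void *tramp, void *func)
{
	u32 insn;

	insn = aarch64_insn_gen_branch_imm((u64)tramp + 4, (u64)func,
					   AARCH64_INSN_BRANCH_NOLINK);
	if (insn == AARCH64_BREAK_FAULT)	/* target out of B range */
		insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP);

	/* one aligned 32-bit store plus I-cache maintenance, no stop_machine() */
	aarch64_insn_patch_text_nosync(tramp + 4, insn);
}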

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 12:21 ` [PATCH 2/4] arm64: implement support for static call trampolines Frederic Weisbecker
  2021-10-25 13:56   ` Peter Zijlstra
@ 2021-10-25 14:25   ` David Laight
  2021-10-25 14:31     ` Ard Biesheuvel
  1 sibling, 1 reply; 32+ messages in thread
From: David Laight @ 2021-10-25 14:25 UTC (permalink / raw)
  To: 'Frederic Weisbecker', Peter Zijlstra, Ard Biesheuvel
  Cc: LKML, James Morse, Quentin Perret, Catalin Marinas, Will Deacon,
	Mark Rutland

From: Frederic Weisbecker
> Sent: 25 October 2021 13:21
> 
> Implement arm64 support for the 'unoptimized' static call variety, which
> routes all calls through a single trampoline that is patched to perform a
> tail call to the selected function.
> 
> It is expected that the direct branch instruction will be able to cover
> the common case. However, given that static call targets may be located
> in modules loaded out of direct branching range, we need a fallback path
> that loads the address into R16 and uses a branch-to-register (BR)
> instruction to perform an indirect call.
> 
...
> +void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
> +{
> +	/*
> +	 * -0x8	<literal>
> +	 *  0x0	bti c		<--- trampoline entry point
> +	 *  0x4	<branch or nop>
> +	 *  0x8	ldr x16, <literal>
> +	 *  0xc	cbz x16, 20
> +	 * 0x10	br x16
> +	 * 0x14	ret
> +	 */

Since the 'ldr x16, <literal>' is just a 32bit constant
(for a pc-relative load).

Can't you save a word by making offset 0x4 <branch or ldr x16, <literal>> ?

Or am I missing something?

	David



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 14:08     ` Ard Biesheuvel
@ 2021-10-25 14:19       ` Peter Zijlstra
  2021-10-25 14:44         ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2021-10-25 14:19 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Frederic Weisbecker, LKML, James Morse, David Laight,
	Quentin Perret, Catalin Marinas, Will Deacon, Mark Rutland

On Mon, Oct 25, 2021 at 04:08:37PM +0200, Ard Biesheuvel wrote:
> On Mon, 25 Oct 2021 at 15:57, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Oct 25, 2021 at 02:21:00PM +0200, Frederic Weisbecker wrote:
> >
> > > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insn)                      \
> > > +     asm("   .pushsection    .static_call.text, \"ax\"               \n" \
> > > +         "   .align          4                                       \n" \
> > > +         "   .globl          " STATIC_CALL_TRAMP_STR(name) "         \n" \
> > > +         "0: .quad   0x0                                             \n" \
> > > +         STATIC_CALL_TRAMP_STR(name) ":                              \n" \
> > > +         "   hint    34      /* BTI C */                             \n" \
> > > +             insn "                                                  \n" \
> > > +         "   ldr     x16, 0b                                         \n" \
> > > +         "   cbz     x16, 1f                                         \n" \
> > > +         "   br      x16                                             \n" \
> > > +         "1: ret                                                     \n" \
> > > +         "   .popsection                                             \n")
> >

> > OK, that's pretty magical...
> >
> > So you're writing the literal and the two instructions with 2 u64
> > stores. Relying on alignment to guarantee both are in a single page and
> > that copy_to_kernel_nofault() selects u64 writes.
> >
> 
> To be honest, it just seemed tidier and less likely to produce weird
> corner cases to put the literal and the patched insn in the smallest
> possible power-of-2 aligned window, as it ensures that the D-side view
> is always consistent.
> 
> However, the actual fetch of the instruction could still produce a
> stale value before the cache maintenance completes.
> 
> > By unconditionally writing the literal, you avoid there ever being a
> > stale value, which in turn avoids there being a race where you switch
> > from 'J @func' relative addressing to 'NOP; do-literal-thing' and cross
> > CPU execution gets the ordering inverted.
> >
> 
> Indeed.
> 
> > Ooohh, but what if you go from !func to NOP.
> >
> > assuming:
> >
> >         .literal = 0
> >         BTI C
> >         RET
> >
> > Then
> >
> >         CPU0                    CPU1
> >
> >         [S] literal = func      [I] NOP
> >         [S] insn[1] = NOP       [L] x16 = literal (NULL)
> >                                 b x16
> >                                 *BANG*
> >
> > Is that possible? (total lack of memory ordering etc..)
> >
> 
> The CBZ will branch to the RET instruction if x16 == 0x0, so this
> should not happen.

Oooh, I missed that :/ I was about to suggest writing the address of a
bare 'ret' trampoline instead of NULL into the literal.

> > On IRC you just alluded to the fact that this relies on it all being in
> > a single cacheline (i-fetch windows don't need to be cacheline sized,
> > but provided they're at least 16 bytes, this should still work given the
> > alignment).
> >
> > But is I$ and D$ coherent? One load is through I-fetch, the other is a
> > regular D-fetch.
> >
> > However, Will has previously expressed reluctance to rely on such
> > things.
> >
> 
> No they are not. That is why the CBZ is there. So the only issue we
> might see is where the branch instruction is out of sync with the
> literal, and so we may call the old function while switching to the
> new one and the I-cache maintenance hasn't completed yet.

OK, agreed. Perhaps put in a comment to explain some of this though. The
next poor sod trying to untangle this code is sure to have a question or
two :-)
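
Something along these lines could serve as that comment (a sketch only,
matching the sequence in the patch):

	/*
	 * Trampoline layout; the whole 16-byte block at tramp - 8 is
	 * rewritten in one go:
	 *
	 *  -0x8  .quad <target>        literal read by the LDR below
	 *   0x0  bti c                 entry point / BTI landing pad
	 *   0x4  b <func>|nop|ret      direct branch when the target is in
	 *                              range, nop to fall through to the
	 *                              literal path, ret when func is NULL
	 *   0x8  ldr x16, <literal>    load the target from the literal
	 *   0xc  cbz x16, 0x14         literal may be NULL: skip the BR
	 *  0x10  br  x16               tail-call an out-of-range target
	 *  0x14  ret                   NULL target: behave like a no-op call
	 */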

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 13:56   ` Peter Zijlstra
@ 2021-10-25 14:08     ` Ard Biesheuvel
  2021-10-25 14:19       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Ard Biesheuvel @ 2021-10-25 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, James Morse, David Laight,
	Quentin Perret, Catalin Marinas, Will Deacon, Mark Rutland

On Mon, 25 Oct 2021 at 15:57, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Oct 25, 2021 at 02:21:00PM +0200, Frederic Weisbecker wrote:
>
> > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insn)                      \
> > +     asm("   .pushsection    .static_call.text, \"ax\"               \n" \
> > +         "   .align          4                                       \n" \
> > +         "   .globl          " STATIC_CALL_TRAMP_STR(name) "         \n" \
> > +         "0: .quad   0x0                                             \n" \
> > +         STATIC_CALL_TRAMP_STR(name) ":                              \n" \
> > +         "   hint    34      /* BTI C */                             \n" \
> > +             insn "                                                  \n" \
> > +         "   ldr     x16, 0b                                         \n" \
> > +         "   cbz     x16, 1f                                         \n" \
> > +         "   br      x16                                             \n" \
> > +         "1: ret                                                     \n" \
> > +         "   .popsection                                             \n")
>
> > +void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
> > +{
> > +     /*
> > +      * -0x8 <literal>
> > +      *  0x0 bti c           <--- trampoline entry point
> > +      *  0x4 <branch or nop>
> > +      *  0x8 ldr x16, <literal>
> > +      *  0xc cbz x16, 20
> > +      * 0x10 br x16
> > +      * 0x14 ret
> > +      */
> > +     struct {
> > +             u64     literal;
> > +             __le32  insn[2];
> > +     } insns;
> > +     u32 insn;
> > +     int ret;
> > +
> > +     insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_BTIC);
> > +     insns.literal = (u64)func;
> > +     insns.insn[0] = cpu_to_le32(insn);
> > +
> > +     if (!func) {
> > +             insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
> > +                                                AARCH64_INSN_BRANCH_RETURN);
> > +     } else {
> > +             insn = aarch64_insn_gen_branch_imm((u64)tramp + 4, (u64)func,
> > +                                                AARCH64_INSN_BRANCH_NOLINK);
> > +
> > +             /*
> > +              * Use a NOP if the branch target is out of range, and rely on
> > +              * the indirect call instead.
> > +              */
> > +             if (insn == AARCH64_BREAK_FAULT)
> > +                     insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP);
> > +     }
> > +     insns.insn[1] = cpu_to_le32(insn);
> > +
> > +     ret = __aarch64_insn_write(tramp - 8, &insns, sizeof(insns));
>
> OK, that's pretty magical...
>
> So you're writing the literal and the two instructions with 2 u64
> stores. Relying on alignment to guarantee both are in a single page and
> that copy_to_kernel_nofault() selects u64 writes.
>

To be honest, it just seemed tidier and less likely to produce weird
corner cases to put the literal and the patched insn in the smallest
possible power-of-2 aligned window, as it ensures that the D-side view
is always consistent.

However, the actual fetch of the instruction could still produce a
stale value before the cache maintenance completes.

> By unconditionally writing the literal, you avoid there ever being a
> stale value, which in turn avoids there being a race where you switch
> from 'J @func' relative addressing to 'NOP; do-literal-thing' and cross
> CPU execution gets the ordering inverted.
>

Indeed.

> Ooohh, but what if you go from !func to NOP.
>
> assuming:
>
>         .literal = 0
>         BTI C
>         RET
>
> Then
>
>         CPU0                    CPU1
>
>         [S] literal = func      [I] NOP
>         [S] insn[1] = NOP       [L] x16 = literal (NULL)
>                                 b x16
>                                 *BANG*
>
> Is that possible? (total lack of memory ordering etc..)
>

The CBZ will branch to the RET instruction if x16 == 0x0, so this
should not happen.

> On IRC you just alluded to the fact that this relies on it all being in
> a single cacheline (i-fetch windows don't need to be cacheline sized,
> but provided they're at least 16 bytes, this should still work given the
> alignment).
>
> But is I$ and D$ coherent? One load is through I-fetch, the other is a
> regular D-fetch.
>
> However, Will has previously expressed reluctance to rely on such
> things.
>

No they are not. That is why the CBZ is there. So the only issue we
might see is where the branch instruction is out of sync with the
literal, and so we may call the old function while switching to the
new one and the I-cache maintenance hasn't completed yet.

> > +     if (!WARN_ON(ret))
> > +             caches_clean_inval_pou((u64)tramp - 8, sizeof(insns));
> >  }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 12:21 ` [PATCH 2/4] arm64: implement support for static call trampolines Frederic Weisbecker
@ 2021-10-25 13:56   ` Peter Zijlstra
  2021-10-25 14:08     ` Ard Biesheuvel
  2021-10-25 14:25   ` David Laight
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2021-10-25 13:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ard Biesheuvel, LKML, James Morse, David Laight, Quentin Perret,
	Catalin Marinas, Will Deacon, Mark Rutland

On Mon, Oct 25, 2021 at 02:21:00PM +0200, Frederic Weisbecker wrote:

> +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insn)			    \
> +	asm("	.pushsection	.static_call.text, \"ax\"		\n" \
> +	    "	.align		4					\n" \
> +	    "	.globl		" STATIC_CALL_TRAMP_STR(name) "		\n" \
> +	    "0:	.quad	0x0						\n" \
> +	    STATIC_CALL_TRAMP_STR(name) ":				\n" \
> +	    "	hint 	34	/* BTI C */				\n" \
> +		insn "							\n" \
> +	    "	ldr	x16, 0b						\n" \
> +	    "	cbz	x16, 1f						\n" \
> +	    "	br	x16						\n" \
> +	    "1:	ret							\n" \
> +	    "	.popsection						\n")

> +void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
> +{
> +	/*
> +	 * -0x8	<literal>
> +	 *  0x0	bti c		<--- trampoline entry point
> +	 *  0x4	<branch or nop>
> +	 *  0x8	ldr x16, <literal>
> +	 *  0xc	cbz x16, 20
> +	 * 0x10	br x16
> +	 * 0x14	ret
> +	 */
> +	struct {
> +		u64	literal;
> +		__le32	insn[2];
> +	} insns;
> +	u32 insn;
> +	int ret;
> +
> +	insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_BTIC);
> +	insns.literal = (u64)func;
> +	insns.insn[0] = cpu_to_le32(insn);
> +
> +	if (!func) {
> +		insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
> +						   AARCH64_INSN_BRANCH_RETURN);
> +	} else {
> +		insn = aarch64_insn_gen_branch_imm((u64)tramp + 4, (u64)func,
> +						   AARCH64_INSN_BRANCH_NOLINK);
> +
> +		/*
> +		 * Use a NOP if the branch target is out of range, and rely on
> +		 * the indirect call instead.
> +		 */
> +		if (insn == AARCH64_BREAK_FAULT)
> +			insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP);
> +	}
> +	insns.insn[1] = cpu_to_le32(insn);
> +
> +	ret = __aarch64_insn_write(tramp - 8, &insns, sizeof(insns));

OK, that's pretty magical...

So you're writing the literal and the two instructions with 2 u64
stores. Relying on alignment to guarantee both are in a single page and
that copy_to_kernel_nofault() selects u64 writes.
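
For reference, the 16-byte window being rewritten looks like this (a sketch
mirroring the struct in the patch; the assertion is only illustrative):

struct tramp_window {			/* lives at tramp - 8 */
	u64	literal;		/* -0x8: target address, read by LDR x16 */
	__le32	insn[2];		/*  0x0: BTI C,  0x4: B <func>/NOP/RET */
};

/*
 * The trampoline's ".align 4" gives 16-byte alignment, so the window never
 * straddles a page and copy_to_kernel_nofault() can issue two aligned u64
 * stores for it.
 */
static_assert(sizeof(struct tramp_window) == 16);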

By unconditionally writing the literal, you avoid there ever being a
stale value, which in turn avoids there being a race where you switch
from 'J @func' relative addressing to 'NOP; do-literal-thing' and cross
CPU execution gets the ordering inverted.

Ooohh, but what if you go from !func to NOP.

assuming:

	.literal = 0
	BTI C
	RET

Then

	CPU0			CPU1

	[S] literal = func	[I] NOP
	[S] insn[1] = NOP	[L] x16 = literal (NULL)
				b x16
				*BANG*

Is that possible? (total lack of memory ordering etc..)

On IRC you just alluded to the fact that this relies on it all being in
a single cacheline (i-fetch windows don't need to be cacheline sized,
but provided they're at least 16 bytes, this should still work given the
alignment).

But is I$ and D$ coherent? One load is through I-fetch, the other is a
regular D-fetch.

However, Will has previously expressed reluctance to rely on such
things.

> +	if (!WARN_ON(ret))
> +		caches_clean_inval_pou((u64)tramp - 8, sizeof(insns));
>  }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/4] arm64: implement support for static call trampolines
  2021-10-25 12:20 [PATCH 0/4] arm64: Support dynamic preemption v2 Frederic Weisbecker
@ 2021-10-25 12:21 ` Frederic Weisbecker
  2021-10-25 13:56   ` Peter Zijlstra
  2021-10-25 14:25   ` David Laight
  0 siblings, 2 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2021-10-25 12:21 UTC (permalink / raw)
  To: Peter Zijlstra, Ard Biesheuvel
  Cc: LKML, James Morse, David Laight, Frederic Weisbecker,
	Quentin Perret, Catalin Marinas, Will Deacon, Mark Rutland

From: Ard Biesheuvel <ardb@kernel.org>

Implement arm64 support for the 'unoptimized' static call variety, which
routes all calls through a single trampoline that is patched to perform a
tail call to the selected function.

It is expected that the direct branch instruction will be able to cover
the common case. However, given that static call targets may be located
in modules loaded out of direct branching range, we need a fallback path
that loads the address into R16 and uses a branch-to-register (BR)
instruction to perform an indirect call.

Unlike on x86, there is no pressing need on arm64 to avoid indirect
calls at all costs, but hiding them from the compiler as is done here does
have some benefits:
- the literal is located in .text, which gives us the same robustness
  advantage that code patching does;
- no performance hit on CFI enabled Clang builds that decorate compiler
  emitted indirect calls with branch target validity checks.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/static_call.h | 25 ++++++++++++++
 arch/arm64/kernel/patching.c         | 51 ++++++++++++++++++++++++++--
 arch/arm64/kernel/vmlinux.lds.S      |  1 +
 4 files changed, 75 insertions(+), 3 deletions(-)
 create mode 100644 arch/arm64/include/asm/static_call.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index d13677f4731d..34b175b1e247 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -192,6 +192,7 @@ config ARM64
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_STATIC_CALL
 	select HAVE_FUNCTION_ARG_ACCESS_API
 	select HAVE_FUTEX_CMPXCHG if FUTEX
 	select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
new file mode 100644
index 000000000000..4871374d584b
--- /dev/null
+++ b/arch/arm64/include/asm/static_call.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_STATIC_CALL_H
+#define _ASM_STATIC_CALL_H
+
+#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insn)			    \
+	asm("	.pushsection	.static_call.text, \"ax\"		\n" \
+	    "	.align		4					\n" \
+	    "	.globl		" STATIC_CALL_TRAMP_STR(name) "		\n" \
+	    "0:	.quad	0x0						\n" \
+	    STATIC_CALL_TRAMP_STR(name) ":				\n" \
+	    "	hint 	34	/* BTI C */				\n" \
+		insn "							\n" \
+	    "	ldr	x16, 0b						\n" \
+	    "	cbz	x16, 1f						\n" \
+	    "	br	x16						\n" \
+	    "1:	ret							\n" \
+	    "	.popsection						\n")
+
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)			\
+	__ARCH_DEFINE_STATIC_CALL_TRAMP(name, "b " #func)
+
+#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)			\
+	__ARCH_DEFINE_STATIC_CALL_TRAMP(name, "ret")
+
+#endif /* _ASM_STATIC_CALL_H */
diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index 771f543464e0..f98127d92e1f 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -66,7 +66,7 @@ int __kprobes aarch64_insn_read(void *addr, u32 *insnp)
 	return ret;
 }
 
-static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
+static int __kprobes __aarch64_insn_write(void *addr, void *insn, int size)
 {
 	void *waddr = addr;
 	unsigned long flags = 0;
@@ -75,7 +75,7 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
 	raw_spin_lock_irqsave(&patch_lock, flags);
 	waddr = patch_map(addr, FIX_TEXT_POKE0);
 
-	ret = copy_to_kernel_nofault(waddr, &insn, AARCH64_INSN_SIZE);
+	ret = copy_to_kernel_nofault(waddr, insn, size);
 
 	patch_unmap(FIX_TEXT_POKE0);
 	raw_spin_unlock_irqrestore(&patch_lock, flags);
@@ -85,7 +85,52 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
 
 int __kprobes aarch64_insn_write(void *addr, u32 insn)
 {
-	return __aarch64_insn_write(addr, cpu_to_le32(insn));
+	__le32 i = cpu_to_le32(insn);
+
+	return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
+}
+
+void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
+{
+	/*
+	 * -0x8	<literal>
+	 *  0x0	bti c		<--- trampoline entry point
+	 *  0x4	<branch or nop>
+	 *  0x8	ldr x16, <literal>
+	 *  0xc	cbz x16, 20
+	 * 0x10	br x16
+	 * 0x14	ret
+	 */
+	struct {
+		u64	literal;
+		__le32	insn[2];
+	} insns;
+	u32 insn;
+	int ret;
+
+	insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_BTIC);
+	insns.literal = (u64)func;
+	insns.insn[0] = cpu_to_le32(insn);
+
+	if (!func) {
+		insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
+						   AARCH64_INSN_BRANCH_RETURN);
+	} else {
+		insn = aarch64_insn_gen_branch_imm((u64)tramp + 4, (u64)func,
+						   AARCH64_INSN_BRANCH_NOLINK);
+
+		/*
+		 * Use a NOP if the branch target is out of range, and rely on
+		 * the indirect call instead.
+		 */
+		if (insn == AARCH64_BREAK_FAULT)
+			insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP);
+	}
+	insns.insn[1] = cpu_to_le32(insn);
+
+	ret = __aarch64_insn_write(tramp - 8, &insns, sizeof(insns));
+	if (!WARN_ON(ret))
+		caches_clean_inval_pou((u64)tramp - 8, sizeof(insns));
 }
 
 int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index f6b1a88245db..ceb35c35192c 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -161,6 +161,7 @@ SECTIONS
 			IDMAP_TEXT
 			HIBERNATE_TEXT
 			TRAMP_TEXT
+			STATIC_CALL_TEXT
 			*(.fixup)
 			*(.gnu.warning)
 		. = ALIGN(16);
-- 
2.25.1
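
For context, a usage sketch of the generic static_call API that these
trampolines back (the pick() example is made up; the API itself comes from
<linux/static_call.h>):

#include <linux/static_call.h>

static int pick_default(int x)
{
	return x;
}

static int pick_fast(int x)
{
	return x + 1;
}

DEFINE_STATIC_CALL(pick, pick_default);

int do_pick(int x)
{
	/* emits a direct call into the STATIC_CALL_TRAMP(pick) trampoline */
	return static_call(pick)(x);
}

void enable_fast_pick(void)
{
	/* ends up in arch_static_call_transform() to repatch the trampoline */
	static_call_update(pick, &pick_fast);
}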


^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-10-27 12:47 UTC | newest]

Thread overview: 32+ messages
2021-09-20 23:32 [PATCH 0/4] arm64: Support dynamic preemption Frederic Weisbecker
2021-09-20 23:32 ` [PATCH 1/4] sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY " Frederic Weisbecker
2021-09-21  7:10   ` Peter Zijlstra
2021-09-21 13:50     ` Mark Rutland
2021-09-20 23:32 ` [PATCH 2/4] arm64: implement support for static call trampolines Frederic Weisbecker
2021-09-21  7:09   ` Peter Zijlstra
2021-09-21 14:44     ` Ard Biesheuvel
2021-09-21 15:08       ` Peter Zijlstra
2021-09-21 15:33       ` Mark Rutland
2021-09-21 15:55         ` Ard Biesheuvel
2021-09-21 16:28           ` Mark Rutland
2021-09-25 17:46             ` David Laight
2021-09-27  8:58               ` Mark Rutland
2021-09-21 16:10   ` Ard Biesheuvel
2021-09-20 23:32 ` [PATCH 3/4] arm64: Implement IRQ exit preemption static call for dynamic preemption Frederic Weisbecker
2021-09-20 23:32 ` [PATCH 4/4] arm64: Implement HAVE_PREEMPT_DYNAMIC Frederic Weisbecker
2021-10-25 12:20 [PATCH 0/4] arm64: Support dynamic preemption v2 Frederic Weisbecker
2021-10-25 12:21 ` [PATCH 2/4] arm64: implement support for static call trampolines Frederic Weisbecker
2021-10-25 13:56   ` Peter Zijlstra
2021-10-25 14:08     ` Ard Biesheuvel
2021-10-25 14:19       ` Peter Zijlstra
2021-10-25 14:44         ` Peter Zijlstra
2021-10-25 14:55           ` Ard Biesheuvel
2021-10-25 15:03             ` Peter Zijlstra
2021-10-25 15:10               ` Ard Biesheuvel
2021-10-26 10:36                 ` Mark Rutland
2021-10-26 10:45                   ` Peter Zijlstra
2021-10-26 11:06                   ` David Laight
2021-10-27 12:47                     ` Mark Rutland
2021-10-25 15:03             ` David Laight
2021-10-25 14:25   ` David Laight
2021-10-25 14:31     ` Ard Biesheuvel
2021-10-25 14:38       ` David Laight
