* [PATCH v3 0/6] Static calls
@ 2019-01-09 22:59 Josh Poimboeuf
  2019-01-09 22:59 ` [PATCH v3 1/6] compiler.h: Make __ADDRESSABLE() symbol truly unique Josh Poimboeuf
                   ` (8 more replies)
  0 siblings, 9 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-09 22:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

With this version, I stopped trying to use text_poke_bp(), and instead
went with a different approach: if the call site destination doesn't
cross a cacheline boundary, just do an atomic write.  Otherwise, keep
using the trampoline indefinitely.

NOTE: At least experimentally, the call destination writes seem to be
atomic with respect to instruction fetching.  On Nehalem I can easily
trigger crashes when writing a call destination across cachelines while
reading the instruction on another CPU, but I get no such crashes when
respecting cacheline boundaries.

BUT, the SDM doesn't document this approach, so it would be great if any
CPU people can confirm that it's safe!
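
To make the constraint concrete, here is a minimal sketch (not part of the
series) of the check that patch 6 performs with its within_cache_line()
helper; the example offsets are made up:

  /* The 4-byte rel32 of a CALL/JMP starts one byte past the opcode. */
  static bool rel32_within_cache_line(unsigned long call_addr)
  {
  	unsigned long a = call_addr + 1;

  	return (a >> L1_CACHE_SHIFT) == ((a + sizeof(s32)) >> L1_CACHE_SHIFT);
  }

  /*
   * With 64-byte cache lines, a CALL at 0x38 (rel32 at 0x39..0x3c) can be
   * patched with a single 32-bit write; a CALL at 0x3d (rel32 at 0x3e..0x41)
   * straddles the 0x40 boundary and has to keep using the trampoline.
   */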


v3:
- Split up the patches a bit more so that out-of-line static calls can
  be merged separately.  Inline is more controversial, and other
  approaches or improvements might be considered.  For example, Nadav is
  looking at implementing it with the help of a GCC plugin to ensure the
  call sites don't cross cacheline boundaries.

- Get rid of the use of text_poke_bp(), in favor of atomic writes.
  Out-of-line calls will be promoted to inline only if the call sites
  don't cross cache line boundaries. [Linus/Andy]

- Converge the inline and out-of-line trampolines into a single
  implementation, which uses a direct jump.  This was made possible by
  making static_call_update() safe to be called during early boot.

- Remove trampoline poisoning for now, since trampolines may still be
  needed for call sites which cross cache line boundaries.

- Rename CONFIG_HAVE_STATIC_CALL_OUTLINE -> CONFIG_HAVE_STATIC_CALL [Steven]

- Add missing __static_call_update() call to static_call_update() [Edward]

- Add missing key->func update in __static_call_update() [Edward]

- Put trampoline in a separate section to prevent 2-byte tail calls [Linus]


v2:
- fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
- rename optimized/unoptimized -> inline/out-of-line [Ard]
- tweak arch interfaces for PLT and add key->tramp field [Ard]
- rename 'poison' to 'defuse' and do it after all sites have been patched [Ard]
- fix .init handling [Ard, Steven]
- add CONFIG_HAVE_STATIC_CALL [Steven]
- make interfaces more consistent across configs to allow tracepoints to
  use them [Steven]
- move __ADDRESSABLE() to static_call() macro [Steven]
- prevent 2-byte jumps [Steven]
- add offset to asm-offsets.c instead of hard coding key->func offset
- add kernel_text_address() sanity check
- make __ADDRESSABLE() symbols truly unique


Static calls are a replacement for global function pointers.  They use
code patching to allow direct calls to be used instead of indirect
calls.  They give the flexibility of function pointers, but with
improved performance.  This is especially important for cases where
retpolines would otherwise be used, as retpolines can significantly
impact performance.

The concept and code are an extension of previous work done by Ard
Biesheuvel and Steven Rostedt:

  https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheuvel@linaro.org
  https://lkml.kernel.org/r/20181006015110.653946300@goodmis.org

There are three implementations, depending on arch support:

1) basic function pointers
2) out-of-line: patched trampolines (CONFIG_HAVE_STATIC_CALL)
3) inline: patched call sites (CONFIG_HAVE_STATIC_CALL_INLINE)
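
For reference, the API (documented in the include/linux/static_call.h
comments added by patch 2) is used roughly like this; func_a(), func_b()
and my_key are placeholder names:

  /* Two functions with identical prototypes: */
  int func_a(int arg1, int arg2);
  int func_b(int arg1, int arg2);

  /* Define a 'my_key' static call, initially routed to func_a(): */
  DEFINE_STATIC_CALL(my_key, func_a);

  /* Callers go through the key; this calls func_a(): */
  static_call(my_key, arg1, arg2);

  /* Retarget the key; subsequent calls go to func_b(): */
  static_call_update(my_key, func_b);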



Josh Poimboeuf (6):
  compiler.h: Make __ADDRESSABLE() symbol truly unique
  static_call: Add basic static call infrastructure
  x86/static_call: Add out-of-line static call implementation
  static_call: Add inline static call infrastructure
  x86/alternative: Use a single access in text_poke() where possible
  x86/static_call: Add inline static call implementation for x86-64

 arch/Kconfig                                  |   7 +
 arch/x86/Kconfig                              |   4 +-
 arch/x86/include/asm/static_call.h            |  33 ++
 arch/x86/kernel/Makefile                      |   1 +
 arch/x86/kernel/alternative.c                 |  31 +-
 arch/x86/kernel/static_call.c                 |  57 ++++
 arch/x86/kernel/vmlinux.lds.S                 |   1 +
 include/asm-generic/vmlinux.lds.h             |  15 +
 include/linux/compiler.h                      |   2 +-
 include/linux/module.h                        |  10 +
 include/linux/static_call.h                   | 196 +++++++++++
 include/linux/static_call_types.h             |  22 ++
 kernel/Makefile                               |   1 +
 kernel/module.c                               |   5 +
 kernel/static_call.c                          | 316 ++++++++++++++++++
 scripts/Makefile.build                        |   3 +
 tools/objtool/Makefile                        |   3 +-
 tools/objtool/builtin-check.c                 |   3 +-
 tools/objtool/builtin.h                       |   2 +-
 tools/objtool/check.c                         | 131 +++++++-
 tools/objtool/check.h                         |   2 +
 tools/objtool/elf.h                           |   1 +
 .../objtool/include/linux/static_call_types.h |  22 ++
 tools/objtool/sync-check.sh                   |   1 +
 24 files changed, 860 insertions(+), 9 deletions(-)
 create mode 100644 arch/x86/include/asm/static_call.h
 create mode 100644 arch/x86/kernel/static_call.c
 create mode 100644 include/linux/static_call.h
 create mode 100644 include/linux/static_call_types.h
 create mode 100644 kernel/static_call.c
 create mode 100644 tools/objtool/include/linux/static_call_types.h

-- 
2.17.2



* [PATCH v3 1/6] compiler.h: Make __ADDRESSABLE() symbol truly unique
  2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
@ 2019-01-09 22:59 ` Josh Poimboeuf
  2019-01-09 22:59 ` [PATCH v3 2/6] static_call: Add basic static call infrastructure Josh Poimboeuf
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-09 22:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

The __ADDRESSABLE() macro uses the __LINE__ macro to create a temporary
symbol which has a unique name.  However, if the macro is used multiple
times from within another macro, the line number will always be the
same, resulting in duplicate symbols.

Make the temporary symbols truly unique by using __UNIQUE_ID instead of
__LINE__.
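
As a hypothetical illustration (this wrapper macro is not in the tree), a
macro which expands __ADDRESSABLE() twice on the same line shows the
problem, and why a __COUNTER__-based name generator fixes it:

  #define ADD_TWO_REFS(sym)	\
  	__ADDRESSABLE(sym)	\
  	__ADDRESSABLE(sym)

  /*
   * With __LINE__, both expansions emit the same
   * __addressable_<sym><line> symbol, causing a redefinition error.
   * __UNIQUE_ID() appends a __COUNTER__-based suffix instead, so each
   * expansion gets its own symbol.
   */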

Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 include/linux/compiler.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index fc5004a4b07d..5e45bc988b1e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -291,7 +291,7 @@ unsigned long read_word_at_a_time(const void *addr)
  */
 #define __ADDRESSABLE(sym) \
 	static void * __section(".discard.addressable") __used \
-		__PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
+		__UNIQUE_ID(__addressable_##sym) = (void *)&sym;
 
 /**
  * offset_to_ptr - convert a relative memory offset to an absolute pointer
-- 
2.17.2



* [PATCH v3 2/6] static_call: Add basic static call infrastructure
  2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
  2019-01-09 22:59 ` [PATCH v3 1/6] compiler.h: Make __ADDRESSABLE() symbol truly unique Josh Poimboeuf
@ 2019-01-09 22:59 ` Josh Poimboeuf
  2019-01-10 14:03   ` Edward Cree
  2019-01-09 22:59 ` [PATCH v3 3/6] x86/static_call: Add out-of-line static call implementation Josh Poimboeuf
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-09 22:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

Static calls are a replacement for global function pointers.  They use
code patching to allow direct calls to be used instead of indirect
calls.  They give the flexibility of function pointers, but with
improved performance.  This is especially important for cases where
retpolines would otherwise be used, as retpolines can significantly
impact performance.

The concept and code are an extension of previous work done by Ard
Biesheuvel and Steven Rostedt:

  https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheuvel@linaro.org
  https://lkml.kernel.org/r/20181006015110.653946300@goodmis.org

There are two implementations, depending on arch support:

1) out-of-line: patched trampolines (CONFIG_HAVE_STATIC_CALL)
2) basic function pointers

For more details, see the comments in include/linux/static_call.h.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/Kconfig                      |   3 +
 include/linux/static_call.h       | 135 ++++++++++++++++++++++++++++++
 include/linux/static_call_types.h |  13 +++
 3 files changed, 151 insertions(+)
 create mode 100644 include/linux/static_call.h
 create mode 100644 include/linux/static_call_types.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de48f79..7e469a693da3 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -885,6 +885,9 @@ config HAVE_ARCH_PREL32_RELOCATIONS
 	  architectures, and don't require runtime relocation on relocatable
 	  kernels.
 
+config HAVE_STATIC_CALL
+	bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
new file mode 100644
index 000000000000..9e85c479cd11
--- /dev/null
+++ b/include/linux/static_call.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_STATIC_CALL_H
+#define _LINUX_STATIC_CALL_H
+
+/*
+ * Static call support
+ *
+ * Static calls use code patching to hard-code function pointers into direct
+ * branch instructions.  They give the flexibility of function pointers, but
+ * with improved performance.  This is especially important for cases where
+ * retpolines would otherwise be used, as retpolines can significantly impact
+ * performance.
+ *
+ *
+ * API overview:
+ *
+ *   DECLARE_STATIC_CALL(key, func);
+ *   DEFINE_STATIC_CALL(key, func);
+ *   static_call(key, args...);
+ *   static_call_update(key, func);
+ *
+ *
+ * Usage example:
+ *
+ *   # Start with the following functions (with identical prototypes):
+ *   int func_a(int arg1, int arg2);
+ *   int func_b(int arg1, int arg2);
+ *
+ *   # Define a 'my_key' reference, associated with func_a() by default
+ *   DEFINE_STATIC_CALL(my_key, func_a);
+ *
+ *   # Call func_a()
+ *   static_call(my_key, arg1, arg2);
+ *
+ *   # Update 'my_key' to point to func_b()
+ *   static_call_update(my_key, func_b);
+ *
+ *   # Call func_b()
+ *   static_call(my_key, arg1, arg2);
+ *
+ *
+ * Implementation details:
+ *
+ *    This requires some arch-specific code (CONFIG_HAVE_STATIC_CALL).
+ *    Otherwise basic indirect calls are used (with function pointers).
+ *
+ *    Each static_call() site calls into a trampoline associated with the key.
+ *    The trampoline has a direct branch to the default function.  Updates to a
+ *    key will modify the trampoline's branch destination.
+ */
+
+#include <linux/types.h>
+#include <linux/cpu.h>
+#include <linux/static_call_types.h>
+
+#ifdef CONFIG_HAVE_STATIC_CALL
+#include <asm/static_call.h>
+extern void arch_static_call_transform(void *site, void *tramp, void *func);
+#endif
+
+
+#define DECLARE_STATIC_CALL(key, func)					\
+	extern struct static_call_key key;				\
+	extern typeof(func) STATIC_CALL_TRAMP(key)
+
+
+#if defined(CONFIG_HAVE_STATIC_CALL)
+
+struct static_call_key {
+	void *func, *tramp;
+};
+
+#define DEFINE_STATIC_CALL(key, _func)					\
+	DECLARE_STATIC_CALL(key, _func);				\
+	struct static_call_key key = {					\
+		.func = _func,						\
+		.tramp = STATIC_CALL_TRAMP(key),			\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_TRAMP(key, _func)
+
+#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
+
+static inline void __static_call_update(struct static_call_key *key, void *func)
+{
+	cpus_read_lock();
+	WRITE_ONCE(key->func, func);
+	arch_static_call_transform(NULL, key->tramp, func);
+	cpus_read_unlock();
+}
+
+#define static_call_update(key, func)					\
+({									\
+	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
+	__static_call_update(&key, func);				\
+})
+
+#define EXPORT_STATIC_CALL(key)						\
+	EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
+
+#define EXPORT_STATIC_CALL_GPL(key)					\
+	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
+
+
+#else /* Generic implementation */
+
+struct static_call_key {
+	void *func;
+};
+
+#define DEFINE_STATIC_CALL(key, _func)					\
+	DECLARE_STATIC_CALL(key, _func);				\
+	struct static_call_key key = {					\
+		.func = _func,						\
+	}
+
+#define static_call(key, args...)					\
+	((typeof(STATIC_CALL_TRAMP(key))*)(key.func))(args)
+
+static inline void __static_call_update(struct static_call_key *key, void *func)
+{
+	WRITE_ONCE(key->func, func);
+}
+
+#define static_call_update(key, func)					\
+({									\
+	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
+	__static_call_update(&key, func);				\
+})
+
+#define EXPORT_STATIC_CALL(key) EXPORT_SYMBOL(key)
+#define EXPORT_STATIC_CALL_GPL(key) EXPORT_SYMBOL_GPL(key)
+
+#endif /* CONFIG_HAVE_STATIC_CALL */
+
+#endif /* _LINUX_STATIC_CALL_H */
diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
new file mode 100644
index 000000000000..0baaf3f02476
--- /dev/null
+++ b/include/linux/static_call_types.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _STATIC_CALL_TYPES_H
+#define _STATIC_CALL_TYPES_H
+
+#include <linux/stringify.h>
+
+#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
+#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
+
+#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
+#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
+
+#endif /* _STATIC_CALL_TYPES_H */
-- 
2.17.2



* [PATCH v3 3/6] x86/static_call: Add out-of-line static call implementation
  2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
  2019-01-09 22:59 ` [PATCH v3 1/6] compiler.h: Make __ADDRESSABLE() symbol truly unique Josh Poimboeuf
  2019-01-09 22:59 ` [PATCH v3 2/6] static_call: Add basic static call infrastructure Josh Poimboeuf
@ 2019-01-09 22:59 ` Josh Poimboeuf
  2019-01-10  0:16   ` Nadav Amit
  2019-01-09 22:59 ` [PATCH v3 4/6] static_call: Add inline static call infrastructure Josh Poimboeuf
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-09 22:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

Add the x86 out-of-line static call implementation.  For each key, a
permanent trampoline is created which is the destination for all static
calls for the given key.  The trampoline has a direct jump which gets
patched by static_call_update() when the destination function changes.

This relies on the fact that call destinations can be atomically updated
as long as they don't cross cache line boundaries.
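
In other words, a trampoline update amounts to something like the sketch
below (simplified from arch_static_call_transform() in this patch; no
locking, opcode checks or early-boot handling shown):

  static void sketch_retarget_tramp(void *tramp, void *func)
  {
  	/* The trampoline starts with a 5-byte JMP: e9 <rel32>. */
  	s32 rel32 = (long)func - ((long)tramp + CALL_INSN_SIZE);

  	text_poke(tramp + 1, &rel32, sizeof(rel32));	/* single 4-byte write */
  }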

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/static_call.h | 27 +++++++++++++++++++++
 arch/x86/kernel/Makefile           |  1 +
 arch/x86/kernel/static_call.c      | 38 ++++++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S      |  1 +
 include/asm-generic/vmlinux.lds.h  | 15 ++++++++++++
 6 files changed, 83 insertions(+)
 create mode 100644 arch/x86/include/asm/static_call.h
 create mode 100644 arch/x86/kernel/static_call.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6185d4f33296..421097322f1b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -190,6 +190,7 @@ config X86
 	select HAVE_FUNCTION_ARG_ACCESS_API
 	select HAVE_STACKPROTECTOR		if CC_HAS_SANE_STACKPROTECTOR
 	select HAVE_STACK_VALIDATION		if X86_64
+	select HAVE_STATIC_CALL
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
new file mode 100644
index 000000000000..fab5facade03
--- /dev/null
+++ b/arch/x86/include/asm/static_call.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_STATIC_CALL_H
+#define _ASM_STATIC_CALL_H
+
+#include <asm/asm-offsets.h>
+
+/*
+ * This trampoline is used for out-of-line static calls.  It has a direct jump
+ * which gets patched by static_call_update().
+ *
+ * Trampolines are placed in the .static_call.text section to prevent two-byte
+ * tail calls to the trampoline and two-byte jumps from the trampoline.
+ *
+ * IMPORTANT: The JMP instruction's 4-byte destination must never cross
+ *            cacheline boundaries!  The patching code relies on that to ensure
+ *            atomic updates.
+ */
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)			\
+	asm(".pushsection .static_call.text, \"ax\"		\n"	\
+	    ".align 8						\n"	\
+	    ".globl " STATIC_CALL_TRAMP_STR(key) "		\n"	\
+	    ".type " STATIC_CALL_TRAMP_STR(key) ", @function	\n"	\
+	    STATIC_CALL_TRAMP_STR(key) ":			\n"	\
+	    "jmp " #func "					\n"	\
+	    ".popsection					\n")
+
+#endif /* _ASM_STATIC_CALL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 00b7e27bc2b7..f1329a79fd3b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -63,6 +63,7 @@ obj-y			+= tsc.o tsc_msr.o io_delay.o rtc.o
 obj-y			+= pci-iommu_table.o
 obj-y			+= resource.o
 obj-y			+= irqflags.o
+obj-y			+= static_call.o
 
 obj-y				+= process.o
 obj-y				+= fpu/
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
new file mode 100644
index 000000000000..e6ef53fbce20
--- /dev/null
+++ b/arch/x86/kernel/static_call.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/static_call.h>
+#include <linux/memory.h>
+#include <linux/bug.h>
+#include <asm/text-patching.h>
+#include <asm/nospec-branch.h>
+
+#define CALL_INSN_SIZE 5
+
+void __ref arch_static_call_transform(void *site, void *tramp, void *func)
+{
+	s32 dest_relative;
+	unsigned char opcode;
+	void *(*poker)(void *, const void *, size_t);
+	void *insn = tramp;
+
+	mutex_lock(&text_mutex);
+
+	/*
+	 * For x86-64, a 32-bit cross-modifying write to a call destination is
+	 * safe as long as it's within a cache line.
+	 */
+	opcode = *(unsigned char *)insn;
+	if (opcode != 0xe8 && opcode != 0xe9) {
+		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
+			  opcode, insn);
+		goto done;
+	}
+
+	dest_relative = (long)(func) - (long)(insn + CALL_INSN_SIZE);
+
+	poker = early_boot_irqs_disabled ? text_poke_early : text_poke;
+	poker(insn + 1, &dest_relative, sizeof(dest_relative));
+
+done:
+	mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(arch_static_call_transform);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 0d618ee634ac..17470e88ac40 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -128,6 +128,7 @@ SECTIONS
 		CPUIDLE_TEXT
 		LOCK_TEXT
 		KPROBES_TEXT
+		STATIC_CALL_TEXT
 		ALIGN_ENTRY_TEXT_BEGIN
 		ENTRY_TEXT
 		IRQENTRY_TEXT
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 3d7a6a9c2370..f2981a0161f2 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -320,6 +320,7 @@
 	__start_ro_after_init = .;					\
 	*(.data..ro_after_init)						\
 	JUMP_TABLE_DATA							\
+	STATIC_CALL_SITES						\
 	__end_ro_after_init = .;
 #endif
 
@@ -530,6 +531,10 @@
 		*(.kprobes.text)					\
 		__kprobes_text_end = .;
 
+#define STATIC_CALL_TEXT						\
+		ALIGN_FUNCTION();					\
+		*(.static_call.text)
+
 #define ENTRY_TEXT							\
 		ALIGN_FUNCTION();					\
 		__entry_text_start = .;					\
@@ -725,6 +730,16 @@
 #define BUG_TABLE
 #endif
 
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+#define STATIC_CALL_SITES						\
+	. = ALIGN(8);							\
+	__start_static_call_sites = .;					\
+	KEEP(*(.static_call_sites))					\
+	__stop_static_call_sites = .;
+#else
+#define STATIC_CALL_SITES
+#endif
+
 #ifdef CONFIG_UNWINDER_ORC
 #define ORC_UNWIND_TABLE						\
 	. = ALIGN(4);							\
-- 
2.17.2



* [PATCH v3 4/6] static_call: Add inline static call infrastructure
  2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
                   ` (2 preceding siblings ...)
  2019-01-09 22:59 ` [PATCH v3 3/6] x86/static_call: Add out-of-line static call implementation Josh Poimboeuf
@ 2019-01-09 22:59 ` Josh Poimboeuf
  2019-01-09 22:59 ` [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible Josh Poimboeuf
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-09 22:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

Add infrastructure for an arch-specific CONFIG_HAVE_STATIC_CALL_INLINE
option, which is a faster version of CONFIG_HAVE_STATIC_CALL.  At
runtime, the static call sites are patched directly, rather than using
the out-of-line trampolines.

Compared to out-of-line static calls, the performance benefits are more
modest, but still measurable.  Steven Rostedt did some tracepoint
measurements:

  https://lkml.kernel.org/r/20181126155405.72b4f718@gandalf.local.home

This code is heavily inspired by the jump label code (aka "static
jumps"), as some of the concepts are very similar.

For more details, see the comments in include/linux/static_call.h.
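
For the curious, each .static_call_sites entry is a pair of 32-bit
PC-relative offsets generated by the tooling; decoding one back into
pointers mirrors the static_call_addr()/static_call_key() helpers below
(the sketch_* names here are just for illustration, and bit 0 of the key
offset is the STATIC_CALL_INIT flag for init-section sites):

  static void *sketch_site_addr(struct static_call_site *site)
  {
  	return (void *)((long)site->addr + (long)&site->addr);
  }

  static struct static_call_key *sketch_site_key(struct static_call_site *site)
  {
  	/* bit 0 of the key offset doubles as the STATIC_CALL_INIT flag */
  	return (struct static_call_key *)
  	       (((long)site->key + (long)&site->key) & ~1L);
  }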

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/Kconfig                      |   4 +
 include/linux/module.h            |  10 +
 include/linux/static_call.h       |  63 +++++-
 include/linux/static_call_types.h |   9 +
 kernel/Makefile                   |   1 +
 kernel/module.c                   |   5 +
 kernel/static_call.c              | 316 ++++++++++++++++++++++++++++++
 7 files changed, 407 insertions(+), 1 deletion(-)
 create mode 100644 kernel/static_call.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 7e469a693da3..173f2f564ef9 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -888,6 +888,10 @@ config HAVE_ARCH_PREL32_RELOCATIONS
 config HAVE_STATIC_CALL
 	bool
 
+config HAVE_STATIC_CALL_INLINE
+	bool
+	depends on HAVE_STATIC_CALL
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/include/linux/module.h b/include/linux/module.h
index 9a21fe3509af..7af718767ba3 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -21,6 +21,7 @@
 #include <linux/rbtree_latch.h>
 #include <linux/error-injection.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/static_call_types.h>
 
 #include <linux/percpu.h>
 #include <asm/module.h>
@@ -454,6 +455,10 @@ struct module {
 	unsigned int num_ftrace_callsites;
 	unsigned long *ftrace_callsites;
 #endif
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+	int num_static_call_sites;
+	struct static_call_site *static_call_sites;
+#endif
 
 #ifdef CONFIG_LIVEPATCH
 	bool klp; /* Is this a livepatch module? */
@@ -693,6 +698,11 @@ static inline bool is_module_text_address(unsigned long addr)
 	return false;
 }
 
+static inline bool within_module_init(unsigned long addr, const struct module *mod)
+{
+	return false;
+}
+
 /* Get/put a kernel symbol (calls should be symmetric) */
 #define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
 #define symbol_put(x) do { } while (0)
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 9e85c479cd11..b641ce40af1d 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -47,6 +47,12 @@
  *    Each static_call() site calls into a trampoline associated with the key.
  *    The trampoline has a direct branch to the default function.  Updates to a
  *    key will modify the trampoline's branch destination.
+ *
+ *    If the arch has CONFIG_HAVE_STATIC_CALL_INLINE, then the call sites
+ *    themselves will be patched at runtime to call the functions directly,
+ *    rather than calling through the trampoline.  This requires objtool or a
+ *    compiler plugin to detect all the static_call() sites and annotate them
+ *    in the .static_call_sites section.
  */
 
 #include <linux/types.h>
@@ -64,7 +70,62 @@ extern void arch_static_call_transform(void *site, void *tramp, void *func);
 	extern typeof(func) STATIC_CALL_TRAMP(key)
 
 
-#if defined(CONFIG_HAVE_STATIC_CALL)
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+
+struct static_call_key {
+	void *func, *tramp;
+	/*
+	 * List of modules (including vmlinux) and their call sites associated
+	 * with this key.
+	 */
+	struct list_head site_mods;
+};
+
+struct static_call_mod {
+	struct list_head list;
+	struct module *mod; /* for vmlinux, mod == NULL */
+	struct static_call_site *sites;
+};
+
+extern void __static_call_update(struct static_call_key *key, void *func);
+extern int static_call_mod_init(struct module *mod);
+
+#define DEFINE_STATIC_CALL(key, _func)					\
+	DECLARE_STATIC_CALL(key, _func);				\
+	struct static_call_key key = {					\
+		.func = _func,						\
+		.tramp = STATIC_CALL_TRAMP(key),			\
+		.site_mods = LIST_HEAD_INIT(key.site_mods),		\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_TRAMP(key, _func)
+
+/*
+ * __ADDRESSABLE() is used to ensure the key symbol doesn't get stripped from
+ * the symbol table so objtool can reference it when it generates the
+ * static_call_site structs.
+ */
+#define static_call(key, args...)					\
+({									\
+	__ADDRESSABLE(key);						\
+	STATIC_CALL_TRAMP(key)(args);					\
+})
+
+#define static_call_update(key, func)					\
+({									\
+	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
+	__static_call_update(&key, func);				\
+})
+
+#define EXPORT_STATIC_CALL(key)						\
+	EXPORT_SYMBOL(key);						\
+	EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
+
+#define EXPORT_STATIC_CALL_GPL(key)					\
+	EXPORT_SYMBOL_GPL(key);						\
+	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
+
+
+#elif defined(CONFIG_HAVE_STATIC_CALL)
 
 struct static_call_key {
 	void *func, *tramp;
diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
index 0baaf3f02476..09b0a1db7a51 100644
--- a/include/linux/static_call_types.h
+++ b/include/linux/static_call_types.h
@@ -10,4 +10,13 @@
 #define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
 #define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
 
+/*
+ * The static call site table needs to be created by external tooling (objtool
+ * or a compiler plugin).
+ */
+struct static_call_site {
+	s32 addr;
+	s32 key;
+};
+
 #endif /* _STATIC_CALL_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 6aa7543bcdb2..8e1c6ca0f6e7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
 obj-$(CONFIG_BPF) += bpf/
+obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/module.c b/kernel/module.c
index 2ad1b5239910..c09e3a868d4c 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3130,6 +3130,11 @@ static int find_module_sections(struct module *mod, struct load_info *info)
 	mod->ei_funcs = section_objs(info, "_error_injection_whitelist",
 					    sizeof(*mod->ei_funcs),
 					    &mod->num_ei_funcs);
+#endif
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+	mod->static_call_sites = section_objs(info, ".static_call_sites",
+					      sizeof(*mod->static_call_sites),
+					      &mod->num_static_call_sites);
 #endif
 	mod->extable = section_objs(info, "__ex_table",
 				    sizeof(*mod->extable), &mod->num_exentries);
diff --git a/kernel/static_call.c b/kernel/static_call.c
new file mode 100644
index 000000000000..b9fdf6fab4d1
--- /dev/null
+++ b/kernel/static_call.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/init.h>
+#include <linux/static_call.h>
+#include <linux/bug.h>
+#include <linux/smp.h>
+#include <linux/sort.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/cpu.h>
+#include <linux/processor.h>
+#include <asm/sections.h>
+
+extern struct static_call_site __start_static_call_sites[],
+			       __stop_static_call_sites[];
+
+static bool static_call_initialized;
+
+#define STATIC_CALL_INIT 1UL
+
+/* mutex to protect key modules/sites */
+static DEFINE_MUTEX(static_call_mutex);
+
+static void static_call_lock(void)
+{
+	mutex_lock(&static_call_mutex);
+}
+
+static void static_call_unlock(void)
+{
+	mutex_unlock(&static_call_mutex);
+}
+
+static inline void *static_call_addr(struct static_call_site *site)
+{
+	return (void *)((long)site->addr + (long)&site->addr);
+}
+
+
+static inline struct static_call_key *static_call_key(const struct static_call_site *site)
+{
+	return (struct static_call_key *)
+		(((long)site->key + (long)&site->key) & ~STATIC_CALL_INIT);
+}
+
+/* These assume the key is word-aligned. */
+static inline bool static_call_is_init(struct static_call_site *site)
+{
+	return ((long)site->key + (long)&site->key) & STATIC_CALL_INIT;
+}
+
+static inline void static_call_set_init(struct static_call_site *site)
+{
+	site->key = ((long)static_call_key(site) | STATIC_CALL_INIT) -
+		    (long)&site->key;
+}
+
+static int static_call_site_cmp(const void *_a, const void *_b)
+{
+	const struct static_call_site *a = _a;
+	const struct static_call_site *b = _b;
+	const struct static_call_key *key_a = static_call_key(a);
+	const struct static_call_key *key_b = static_call_key(b);
+
+	if (key_a < key_b)
+		return -1;
+
+	if (key_a > key_b)
+		return 1;
+
+	return 0;
+}
+
+static void static_call_site_swap(void *_a, void *_b, int size)
+{
+	long delta = (unsigned long)_a - (unsigned long)_b;
+	struct static_call_site *a = _a;
+	struct static_call_site *b = _b;
+	struct static_call_site tmp = *a;
+
+	a->addr = b->addr  - delta;
+	a->key  = b->key   - delta;
+
+	b->addr = tmp.addr + delta;
+	b->key  = tmp.key  + delta;
+}
+
+static inline void static_call_sort_entries(struct static_call_site *start,
+					    struct static_call_site *stop)
+{
+	sort(start, stop - start, sizeof(struct static_call_site),
+	     static_call_site_cmp, static_call_site_swap);
+}
+
+void __static_call_update(struct static_call_key *key, void *func)
+{
+	struct static_call_mod *site_mod;
+	struct static_call_site *site, *stop;
+
+	cpus_read_lock();
+	static_call_lock();
+
+	if (key->func == func)
+		goto done;
+
+	key->func = func;
+
+	/*
+	 * If called before init, leave the call sites unpatched for now.
+	 * In the meantime they'll continue to call the temporary trampoline.
+	 */
+	if (!static_call_initialized)
+		goto done;
+
+	list_for_each_entry(site_mod, &key->site_mods, list) {
+		if (!site_mod->sites) {
+			/*
+			 * This can happen if the static call key is defined in
+			 * a module which doesn't use it.
+			 */
+			continue;
+		}
+
+		stop = __stop_static_call_sites;
+
+#ifdef CONFIG_MODULES
+		if (site_mod->mod) {
+			stop = site_mod->mod->static_call_sites +
+			       site_mod->mod->num_static_call_sites;
+		}
+#endif
+
+		for (site = site_mod->sites;
+		     site < stop && static_call_key(site) == key; site++) {
+			void *site_addr = static_call_addr(site);
+			struct module *mod = site_mod->mod;
+
+			if (static_call_is_init(site)) {
+				/*
+				 * Don't write to call sites which were in
+				 * initmem and have since been freed.
+				 */
+				if (!mod && system_state >= SYSTEM_RUNNING)
+					continue;
+				if (mod && !within_module_init((unsigned long)site_addr, mod))
+					continue;
+			}
+
+			if (!kernel_text_address((unsigned long)site_addr)) {
+				WARN_ONCE(1, "can't patch static call site at %pS",
+					  site_addr);
+				continue;
+			}
+
+			arch_static_call_transform(site_addr, key->tramp, func);
+		}
+	}
+
+done:
+	static_call_unlock();
+	cpus_read_unlock();
+}
+EXPORT_SYMBOL_GPL(__static_call_update);
+
+#ifdef CONFIG_MODULES
+
+static int static_call_add_module(struct module *mod)
+{
+	struct static_call_site *start = mod->static_call_sites;
+	struct static_call_site *stop = mod->static_call_sites +
+					mod->num_static_call_sites;
+	struct static_call_site *site;
+	struct static_call_key *key, *prev_key = NULL;
+	struct static_call_mod *site_mod;
+
+	if (start == stop)
+		return 0;
+
+	static_call_sort_entries(start, stop);
+
+	for (site = start; site < stop; site++) {
+		void *site_addr = static_call_addr(site);
+
+		if (within_module_init((unsigned long)site_addr, mod))
+			static_call_set_init(site);
+
+		key = static_call_key(site);
+		if (key != prev_key) {
+			prev_key = key;
+
+			site_mod = kzalloc(sizeof(*site_mod), GFP_KERNEL);
+			if (!site_mod)
+				return -ENOMEM;
+
+			site_mod->mod = mod;
+			site_mod->sites = site;
+			list_add_tail(&site_mod->list, &key->site_mods);
+		}
+
+		arch_static_call_transform(site_addr, key->tramp, key->func);
+	}
+
+	return 0;
+}
+
+static void static_call_del_module(struct module *mod)
+{
+	struct static_call_site *start = mod->static_call_sites;
+	struct static_call_site *stop = mod->static_call_sites +
+					mod->num_static_call_sites;
+	struct static_call_site *site;
+	struct static_call_key *key, *prev_key = NULL;
+	struct static_call_mod *site_mod;
+
+	for (site = start; site < stop; site++) {
+		key = static_call_key(site);
+		if (key == prev_key)
+			continue;
+		prev_key = key;
+
+		list_for_each_entry(site_mod, &key->site_mods, list) {
+			if (site_mod->mod == mod) {
+				list_del(&site_mod->list);
+				kfree(site_mod);
+				break;
+			}
+		}
+	}
+}
+
+static int static_call_module_notify(struct notifier_block *nb,
+				     unsigned long val, void *data)
+{
+	struct module *mod = data;
+	int ret = 0;
+
+	cpus_read_lock();
+	static_call_lock();
+
+	switch (val) {
+	case MODULE_STATE_COMING:
+		module_disable_ro(mod);
+		ret = static_call_add_module(mod);
+		module_enable_ro(mod, false);
+		if (ret) {
+			WARN(1, "Failed to allocate memory for static calls");
+			static_call_del_module(mod);
+		}
+		break;
+	case MODULE_STATE_GOING:
+		static_call_del_module(mod);
+		break;
+	}
+
+	static_call_unlock();
+	cpus_read_unlock();
+
+	return notifier_from_errno(ret);
+}
+
+static struct notifier_block static_call_module_nb = {
+	.notifier_call = static_call_module_notify,
+};
+
+#endif /* CONFIG_MODULES */
+
+static void __init static_call_init(void)
+{
+	struct static_call_site *start = __start_static_call_sites;
+	struct static_call_site *stop  = __stop_static_call_sites;
+	struct static_call_site *site;
+
+	if (start == stop) {
+		pr_warn("WARNING: empty static call table\n");
+		return;
+	}
+
+	cpus_read_lock();
+	static_call_lock();
+
+	static_call_sort_entries(start, stop);
+
+	for (site = start; site < stop; site++) {
+		struct static_call_key *key = static_call_key(site);
+		void *site_addr = static_call_addr(site);
+
+		if (init_section_contains(site_addr, 1))
+			static_call_set_init(site);
+
+		if (list_empty(&key->site_mods)) {
+			struct static_call_mod *site_mod;
+
+			site_mod = kzalloc(sizeof(*site_mod), GFP_KERNEL);
+			if (!site_mod) {
+				WARN(1, "Failed to allocate memory for static calls");
+				goto done;
+			}
+
+			site_mod->sites = site;
+			list_add_tail(&site_mod->list, &key->site_mods);
+		}
+
+		arch_static_call_transform(site_addr, key->tramp, key->func);
+	}
+
+	static_call_initialized = true;
+
+done:
+	static_call_unlock();
+	cpus_read_unlock();
+
+#ifdef CONFIG_MODULES
+	if (static_call_initialized)
+		register_module_notifier(&static_call_module_nb);
+#endif
+}
+early_initcall(static_call_init);
-- 
2.17.2



* [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
                   ` (3 preceding siblings ...)
  2019-01-09 22:59 ` [PATCH v3 4/6] static_call: Add inline static call infrastructure Josh Poimboeuf
@ 2019-01-09 22:59 ` Josh Poimboeuf
  2019-01-10  9:32   ` Nadav Amit
  2019-01-09 22:59 ` [PATCH v3 6/6] x86/static_call: Add inline static call implementation for x86-64 Josh Poimboeuf
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-09 22:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

Static call inline patching will need to use single 32-bit writes.
Change text_poke() to do so where possible.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/alternative.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ebeac487a20c..607f48a90097 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -692,7 +692,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 void *text_poke(void *addr, const void *opcode, size_t len)
 {
 	unsigned long flags;
-	char *vaddr;
+	unsigned long vaddr;
 	struct page *pages[2];
 	int i;
 
@@ -714,14 +714,39 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 	}
 	BUG_ON(!pages[0]);
 	local_irq_save(flags);
+
 	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
 	if (pages[1])
 		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
-	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
-	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
+
+	vaddr = fix_to_virt(FIX_TEXT_POKE0) + ((unsigned long)addr & ~PAGE_MASK);
+
+	/*
+	 * Use a single access where possible.  Note that a single unaligned
+	 * multi-byte write will not necessarily be atomic on x86-32, or if the
+	 * address crosses a cache line boundary.
+	 */
+	switch (len) {
+	case 1:
+		WRITE_ONCE(*(u8 *)vaddr, *(u8 *)opcode);
+		break;
+	case 2:
+		WRITE_ONCE(*(u16 *)vaddr, *(u16 *)opcode);
+		break;
+	case 4:
+		WRITE_ONCE(*(u32 *)vaddr, *(u32 *)opcode);
+		break;
+	case 8:
+		WRITE_ONCE(*(u64 *)vaddr, *(u64 *)opcode);
+		break;
+	default:
+		memcpy((void *)vaddr, opcode, len);
+	}
+
 	clear_fixmap(FIX_TEXT_POKE0);
 	if (pages[1])
 		clear_fixmap(FIX_TEXT_POKE1);
+
 	local_flush_tlb();
 	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
-- 
2.17.2



* [PATCH v3 6/6] x86/static_call: Add inline static call implementation for x86-64
  2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
                   ` (4 preceding siblings ...)
  2019-01-09 22:59 ` [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible Josh Poimboeuf
@ 2019-01-09 22:59 ` Josh Poimboeuf
  2019-01-10  1:21 ` [PATCH v3 0/6] Static calls Nadav Amit
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-09 22:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

Add an inline static call implementation for x86-64.  Use objtool to
detect all the call sites and annotate them in the .static_call_sites
section.

During boot (and module init), the call sites are patched to call the
destination directly rather than the out-of-line trampoline.

To avoid the complexity of emulating a call instruction in
text_poke_bp(), just do atomic writes of the call destinations instead.
Only the call sites whose destinations *don't* cross cache line
boundaries can be safely patched.

The small percentage of sites which happen to cross cache line
boundaries are out of luck -- they will continue to use the out-of-line
trampoline.

The cross-modifying code implementation was suggested by Andy
Lutomirski, thanks to guidance from Linus Torvalds.
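
Conceptually, the boot-time promotion ties the pieces of the series
together roughly like this (sketch only: locking, modules and
init-section handling are omitted, and patch_rel32() is a made-up name
for the 4-byte write that arch_static_call_transform() performs):

  static void __init sketch_promote_sites(void)
  {
  	struct static_call_site *site;

  	for (site = __start_static_call_sites;
  	     site < __stop_static_call_sites; site++) {
  		struct static_call_key *key = static_call_key(site);
  		void *insn = static_call_addr(site);	/* the CALL insn */

  		if (within_cache_line(insn + 1, sizeof(s32)))
  			patch_rel32(insn, key->func);	/* now an inline call */
  		/* else: the site keeps calling the out-of-line trampoline */
  	}
  }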

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/Kconfig                              |   3 +-
 arch/x86/include/asm/static_call.h            |   6 +
 arch/x86/kernel/static_call.c                 |  23 ++-
 scripts/Makefile.build                        |   3 +
 tools/objtool/Makefile                        |   3 +-
 tools/objtool/builtin-check.c                 |   3 +-
 tools/objtool/builtin.h                       |   2 +-
 tools/objtool/check.c                         | 131 +++++++++++++++++-
 tools/objtool/check.h                         |   2 +
 tools/objtool/elf.h                           |   1 +
 .../objtool/include/linux/static_call_types.h |  22 +++
 tools/objtool/sync-check.sh                   |   1 +
 12 files changed, 193 insertions(+), 7 deletions(-)
 create mode 100644 tools/objtool/include/linux/static_call_types.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 421097322f1b..b306c0ca8d92 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -191,6 +191,7 @@ config X86
 	select HAVE_STACKPROTECTOR		if CC_HAS_SANE_STACKPROTECTOR
 	select HAVE_STACK_VALIDATION		if X86_64
 	select HAVE_STATIC_CALL
+	select HAVE_STATIC_CALL_INLINE		if HAVE_STACK_VALIDATION
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
@@ -205,6 +206,7 @@ config X86
 	select RTC_MC146818_LIB
 	select SPARSE_IRQ
 	select SRCU
+	select STACK_VALIDATION			if HAVE_STACK_VALIDATION && (HAVE_STATIC_CALL_INLINE || RETPOLINE)
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
 	select USER_STACKTRACE_SUPPORT
@@ -440,7 +442,6 @@ config GOLDFISH
 config RETPOLINE
 	bool "Avoid speculative indirect branches in kernel"
 	default y
-	select STACK_VALIDATION if HAVE_STACK_VALIDATION
 	help
 	  Compile kernel with the retpoline compiler options to guard against
 	  kernel-to-user data leaks by avoiding speculative indirect
diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
index fab5facade03..3286c2fa83f7 100644
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -8,6 +8,12 @@
  * This trampoline is used for out-of-line static calls.  It has a direct jump
  * which gets patched by static_call_update().
  *
+ * With CONFIG_HAVE_STATIC_CALL_INLINE enabled, if a call site fits within a
+ * cache line, it gets promoted to an inline static call and the trampoline is
+ * no longer used for that site.  In this case the name of this trampoline has
+ * a magical aspect: objtool uses it to find static call sites so it can create
+ * the .static_call_sites section.
+ *
  * Trampolines are placed in the .static_call.text section to prevent two-byte
  * tail calls to the trampoline and two-byte jumps from the trampoline.
  *
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index e6ef53fbce20..019aafc3c7f9 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -7,19 +7,38 @@
 
 #define CALL_INSN_SIZE 5
 
+static inline bool within_cache_line(void *addr, int len)
+{
+	unsigned long a = (unsigned long)addr;
+
+	return (a >> L1_CACHE_SHIFT) == ((a + len) >> L1_CACHE_SHIFT);
+}
+
 void __ref arch_static_call_transform(void *site, void *tramp, void *func)
 {
 	s32 dest_relative;
 	unsigned char opcode;
 	void *(*poker)(void *, const void *, size_t);
-	void *insn = tramp;
+	void *insn;
 
 	mutex_lock(&text_mutex);
 
 	/*
 	 * For x86-64, a 32-bit cross-modifying write to a call destination is
-	 * safe as long as it's within a cache line.
+	 * safe as long as it's within a cache line.  In the inline case, if
+	 * the call destination is not within a cache line, fall back to using
+	 * the out-of-line trampoline.
+	 *
+	 * We could instead use text_poke_bp() here, which would allow all
+	 * static calls to be promoted to inline, but that would require some
+	 * trickery to fake a call instruction in the BP handler.
 	 */
+	if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) &&
+	    within_cache_line(site + 1, sizeof(dest_relative)))
+		insn = site;
+	else
+		insn = tramp;
+
 	opcode = *(unsigned char *)insn;
 	if (opcode != 0xe8 && opcode != 0xe9) {
 		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index fd03d60f6c5a..850f444de56f 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -223,6 +223,9 @@ endif
 ifdef CONFIG_RETPOLINE
   objtool_args += --retpoline
 endif
+ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+  objtool_args += --static-call
+endif
 
 # 'OBJECT_FILES_NON_STANDARD := y': skip objtool checking for a directory
 # 'OBJECT_FILES_NON_STANDARD_foo.o := 'y': skip objtool checking for a file
diff --git a/tools/objtool/Makefile b/tools/objtool/Makefile
index c9d038f91af6..fb1afa34f10d 100644
--- a/tools/objtool/Makefile
+++ b/tools/objtool/Makefile
@@ -29,7 +29,8 @@ all: $(OBJTOOL)
 
 INCLUDES := -I$(srctree)/tools/include \
 	    -I$(srctree)/tools/arch/$(HOSTARCH)/include/uapi \
-	    -I$(srctree)/tools/objtool/arch/$(ARCH)/include
+	    -I$(srctree)/tools/objtool/arch/$(ARCH)/include \
+	    -I$(srctree)/tools/objtool/include
 WARNINGS := $(EXTRA_WARNINGS) -Wno-switch-default -Wno-switch-enum -Wno-packed
 CFLAGS   += -Werror $(WARNINGS) $(KBUILD_HOSTCFLAGS) -g $(INCLUDES)
 LDFLAGS  += -lelf $(LIBSUBCMD) $(KBUILD_HOSTLDFLAGS)
diff --git a/tools/objtool/builtin-check.c b/tools/objtool/builtin-check.c
index 694abc628e9b..c480f49571d6 100644
--- a/tools/objtool/builtin-check.c
+++ b/tools/objtool/builtin-check.c
@@ -29,7 +29,7 @@
 #include "builtin.h"
 #include "check.h"
 
-bool no_fp, no_unreachable, retpoline, module;
+bool no_fp, no_unreachable, retpoline, module, static_call;
 
 static const char * const check_usage[] = {
 	"objtool check [<options>] file.o",
@@ -41,6 +41,7 @@ const struct option check_options[] = {
 	OPT_BOOLEAN('u', "no-unreachable", &no_unreachable, "Skip 'unreachable instruction' warnings"),
 	OPT_BOOLEAN('r', "retpoline", &retpoline, "Validate retpoline assumptions"),
 	OPT_BOOLEAN('m', "module", &module, "Indicates the object will be part of a kernel module"),
+	OPT_BOOLEAN('s', "static-call", &static_call, "Create static call table"),
 	OPT_END(),
 };
 
diff --git a/tools/objtool/builtin.h b/tools/objtool/builtin.h
index 28ff40e19a14..7b59163a293e 100644
--- a/tools/objtool/builtin.h
+++ b/tools/objtool/builtin.h
@@ -20,7 +20,7 @@
 #include <subcmd/parse-options.h>
 
 extern const struct option check_options[];
-extern bool no_fp, no_unreachable, retpoline, module;
+extern bool no_fp, no_unreachable, retpoline, module, static_call;
 
 extern int cmd_check(int argc, const char **argv);
 extern int cmd_orc(int argc, const char **argv);
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0414a0d52262..c3de329f22f0 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -27,6 +27,7 @@
 
 #include <linux/hashtable.h>
 #include <linux/kernel.h>
+#include <linux/static_call_types.h>
 
 struct alternative {
 	struct list_head list;
@@ -165,6 +166,7 @@ static int __dead_end_function(struct objtool_file *file, struct symbol *func,
 		"fortify_panic",
 		"usercopy_abort",
 		"machine_real_restart",
+		"rewind_stack_do_exit",
 	};
 
 	if (func->bind == STB_WEAK)
@@ -525,6 +527,10 @@ static int add_jump_destinations(struct objtool_file *file)
 		} else {
 			/* sibling call */
 			insn->jump_dest = 0;
+			if (rela->sym->static_call_tramp) {
+				list_add_tail(&insn->static_call_node,
+					      &file->static_call_list);
+			}
 			continue;
 		}
 
@@ -1202,6 +1208,26 @@ static int read_retpoline_hints(struct objtool_file *file)
 	return 0;
 }
 
+static int read_static_call_tramps(struct objtool_file *file)
+{
+	struct section *sec;
+	struct symbol *func;
+
+	if (!static_call)
+		return 0;
+
+	for_each_sec(file, sec) {
+		list_for_each_entry(func, &sec->symbol_list, list) {
+			if (func->bind == STB_GLOBAL &&
+			    !strncmp(func->name, STATIC_CALL_TRAMP_PREFIX_STR,
+				     strlen(STATIC_CALL_TRAMP_PREFIX_STR)))
+				func->static_call_tramp = true;
+		}
+	}
+
+	return 0;
+}
+
 static void mark_rodata(struct objtool_file *file)
 {
 	struct section *sec;
@@ -1267,6 +1293,10 @@ static int decode_sections(struct objtool_file *file)
 	if (ret)
 		return ret;
 
+	ret = read_static_call_tramps(file);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -1920,6 +1950,11 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 			if (is_fentry_call(insn))
 				break;
 
+			if (insn->call_dest->static_call_tramp) {
+				list_add_tail(&insn->static_call_node,
+					      &file->static_call_list);
+			}
+
 			ret = dead_end_function(file, insn->call_dest);
 			if (ret == 1)
 				return 0;
@@ -2167,6 +2202,92 @@ static int validate_reachable_instructions(struct objtool_file *file)
 	return 0;
 }
 
+static int create_static_call_sections(struct objtool_file *file)
+{
+	struct section *sec, *rela_sec;
+	struct rela *rela;
+	struct static_call_site *site;
+	struct instruction *insn;
+	char *key_name;
+	struct symbol *key_sym;
+	int idx;
+
+	if (!static_call)
+		return 0;
+
+	sec = find_section_by_name(file->elf, ".static_call_sites");
+	if (sec) {
+		WARN("file already has .static_call_sites section, skipping");
+		return 0;
+	}
+
+	if (list_empty(&file->static_call_list))
+		return 0;
+
+	idx = 0;
+	list_for_each_entry(insn, &file->static_call_list, static_call_node)
+		idx++;
+
+	sec = elf_create_section(file->elf, ".static_call_sites",
+				 sizeof(struct static_call_site), idx);
+	if (!sec)
+		return -1;
+
+	rela_sec = elf_create_rela_section(file->elf, sec);
+	if (!rela_sec)
+		return -1;
+
+	idx = 0;
+	list_for_each_entry(insn, &file->static_call_list, static_call_node) {
+
+		site = (struct static_call_site *)sec->data->d_buf + idx;
+		memset(site, 0, sizeof(struct static_call_site));
+
+		/* populate rela for 'addr' */
+		rela = malloc(sizeof(*rela));
+		if (!rela) {
+			perror("malloc");
+			return -1;
+		}
+		memset(rela, 0, sizeof(*rela));
+		rela->sym = insn->sec->sym;
+		rela->addend = insn->offset;
+		rela->type = R_X86_64_PC32;
+		rela->offset = idx * sizeof(struct static_call_site);
+		list_add_tail(&rela->list, &rela_sec->rela_list);
+		hash_add(rela_sec->rela_hash, &rela->hash, rela->offset);
+
+		/* find key symbol */
+		key_name = insn->call_dest->name + strlen(STATIC_CALL_TRAMP_PREFIX_STR);
+		key_sym = find_symbol_by_name(file->elf, key_name);
+		if (!key_sym) {
+			WARN("can't find static call key symbol: %s", key_name);
+			return -1;
+		}
+
+		/* populate rela for 'key' */
+		rela = malloc(sizeof(*rela));
+		if (!rela) {
+			perror("malloc");
+			return -1;
+		}
+		memset(rela, 0, sizeof(*rela));
+		rela->sym = key_sym;
+		rela->addend = 0;
+		rela->type = R_X86_64_PC32;
+		rela->offset = idx * sizeof(struct static_call_site) + 4;
+		list_add_tail(&rela->list, &rela_sec->rela_list);
+		hash_add(rela_sec->rela_hash, &rela->hash, rela->offset);
+
+		idx++;
+	}
+
+	if (elf_rebuild_rela_section(rela_sec))
+		return -1;
+
+	return 0;
+}
+
 static void cleanup(struct objtool_file *file)
 {
 	struct instruction *insn, *tmpinsn;
@@ -2191,12 +2312,13 @@ int check(const char *_objname, bool orc)
 
 	objname = _objname;
 
-	file.elf = elf_open(objname, orc ? O_RDWR : O_RDONLY);
+	file.elf = elf_open(objname, O_RDWR);
 	if (!file.elf)
 		return 1;
 
 	INIT_LIST_HEAD(&file.insn_list);
 	hash_init(file.insn_hash);
+	INIT_LIST_HEAD(&file.static_call_list);
 	file.whitelist = find_section_by_name(file.elf, ".discard.func_stack_frame_non_standard");
 	file.c_file = find_section_by_name(file.elf, ".comment");
 	file.ignore_unreachables = no_unreachable;
@@ -2236,6 +2358,11 @@ int check(const char *_objname, bool orc)
 		warnings += ret;
 	}
 
+	ret = create_static_call_sections(&file);
+	if (ret < 0)
+		goto out;
+	warnings += ret;
+
 	if (orc) {
 		ret = create_orc(&file);
 		if (ret < 0)
@@ -2244,7 +2371,9 @@ int check(const char *_objname, bool orc)
 		ret = create_orc_sections(&file);
 		if (ret < 0)
 			goto out;
+	}
 
+	if (orc || !list_empty(&file.static_call_list)) {
 		ret = elf_write(file.elf);
 		if (ret < 0)
 			goto out;
diff --git a/tools/objtool/check.h b/tools/objtool/check.h
index e6e8a655b556..56b8b7fb1bd1 100644
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -39,6 +39,7 @@ struct insn_state {
 struct instruction {
 	struct list_head list;
 	struct hlist_node hash;
+	struct list_head static_call_node;
 	struct section *sec;
 	unsigned long offset;
 	unsigned int len;
@@ -60,6 +61,7 @@ struct objtool_file {
 	struct elf *elf;
 	struct list_head insn_list;
 	DECLARE_HASHTABLE(insn_hash, 16);
+	struct list_head static_call_list;
 	struct section *whitelist;
 	bool ignore_unreachables, c_file, hints, rodata;
 };
diff --git a/tools/objtool/elf.h b/tools/objtool/elf.h
index bc97ed86b9cd..3cf44d7cc3ac 100644
--- a/tools/objtool/elf.h
+++ b/tools/objtool/elf.h
@@ -62,6 +62,7 @@ struct symbol {
 	unsigned long offset;
 	unsigned int len;
 	struct symbol *pfunc, *cfunc;
+	bool static_call_tramp;
 };
 
 struct rela {
diff --git a/tools/objtool/include/linux/static_call_types.h b/tools/objtool/include/linux/static_call_types.h
new file mode 100644
index 000000000000..09b0a1db7a51
--- /dev/null
+++ b/tools/objtool/include/linux/static_call_types.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _STATIC_CALL_TYPES_H
+#define _STATIC_CALL_TYPES_H
+
+#include <linux/stringify.h>
+
+#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
+#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
+
+#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
+#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
+
+/*
+ * The static call site table needs to be created by external tooling (objtool
+ * or a compiler plugin).
+ */
+struct static_call_site {
+	s32 addr;
+	s32 key;
+};
+
+#endif /* _STATIC_CALL_TYPES_H */
diff --git a/tools/objtool/sync-check.sh b/tools/objtool/sync-check.sh
index 1470e74e9d66..e1a204bf3556 100755
--- a/tools/objtool/sync-check.sh
+++ b/tools/objtool/sync-check.sh
@@ -10,6 +10,7 @@ arch/x86/include/asm/insn.h
 arch/x86/include/asm/inat.h
 arch/x86/include/asm/inat_types.h
 arch/x86/include/asm/orc_types.h
+include/linux/static_call_types.h
 '
 
 check()
-- 
2.17.2



* Re: [PATCH v3 3/6] x86/static_call: Add out-of-line static call implementation
  2019-01-09 22:59 ` [PATCH v3 3/6] x86/static_call: Add out-of-line static call implementation Josh Poimboeuf
@ 2019-01-10  0:16   ` Nadav Amit
  2019-01-10 16:28     ` Josh Poimboeuf
  0 siblings, 1 reply; 90+ messages in thread
From: Nadav Amit @ 2019-01-10  0:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 

<snip>

> +
> +void __ref arch_static_call_transform(void *site, void *tramp, void *func)
> +{
> +	s32 dest_relative;
> +	unsigned char opcode;
> +	void *(*poker)(void *, const void *, size_t);
> +	void *insn = tramp;
> +
> +	mutex_lock(&text_mutex);
> +
> +	/*
> +	 * For x86-64, a 32-bit cross-modifying write to a call destination is
> +	 * safe as long as it's within a cache line.
> +	 */
> +	opcode = *(unsigned char *)insn;
> +	if (opcode != 0xe8 && opcode != 0xe9) {
> +		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
> +			  opcode, insn);
> +		goto done;
> +	}
> +
> +	dest_relative = (long)(func) - (long)(insn + CALL_INSN_SIZE);
> +
> +	poker = early_boot_irqs_disabled ? text_poke_early : text_poke;
> +	poker(insn + 1, &dest_relative, sizeof(dest_relative));
> +
> +done:
> +	mutex_unlock(&text_mutex);
> +}
> +EXPORT_SYMBOL_GPL(arch_static_call_transform);

Err… I was rewriting __jump_label_transform(), so if this code duplication can
be avoided, this would be great.

(See https://lkml.org/lkml/2018/11/14/72 )

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
                   ` (5 preceding siblings ...)
  2019-01-09 22:59 ` [PATCH v3 6/6] x86/static_call: Add inline static call implementation for x86-64 Josh Poimboeuf
@ 2019-01-10  1:21 ` Nadav Amit
  2019-01-10 16:44   ` Josh Poimboeuf
  2019-01-10 17:31 ` Linus Torvalds
  2019-01-10 20:30 ` Peter Zijlstra
  8 siblings, 1 reply; 90+ messages in thread
From: Nadav Amit @ 2019-01-10  1:21 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> With this version, I stopped trying to use text_poke_bp(), and instead
> went with a different approach: if the call site destination doesn't
> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> using the trampoline indefinitely.
> 
> NOTE: At least experimentally, the call destination writes seem to be
> atomic with respect to instruction fetching.  On Nehalem I can easily
> trigger crashes when writing a call destination across cachelines while
> reading the instruction on other CPU; but I get no such crashes when
> respecting cacheline boundaries.
> 
> BUT, the SDM doesn't document this approach, so it would be great if any
> CPU people can confirm that it's safe!
> 

I (still) think that having a compiler plugin can make things much cleaner
(as done in [1]). The callers would not need to be changed, and the key can
be provided through an attribute.

Using a plugin should also allow to use Steven’s proposal for doing
text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
by the plugin, you can be guaranteed that registers are clobbered. Then, you
can store in the assembly block the return address in one of these
registers.

[1] https://lkml.org/lkml/2018/12/31/25
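
To make the 'asm ("call func")' idea concrete, a hand-written sketch of what
such a call site could conceptually look like is below.  The macro and section
names are made up here, argument setup is left out, and this is not the
plugin's actual output:

  #define PATCHABLE_CALL(func)						\
          asm volatile("1: call " #func "\n\t"				\
                       ".pushsection .discard.static_call_sites, \"a\"\n\t" \
                       ".long 1b - .\n\t"				\
                       ".popsection"					\
                       : : : "rax", "rcx", "rdx", "rsi", "rdi",		\
                             "r8", "r9", "r10", "r11", "memory", "cc")

The asm block records the call-site address in a section and, through the
clobber list, tells the compiler which registers the call may overwrite.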





^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-09 22:59 ` [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible Josh Poimboeuf
@ 2019-01-10  9:32   ` Nadav Amit
  2019-01-10 17:20     ` Josh Poimboeuf
  0 siblings, 1 reply; 90+ messages in thread
From: Nadav Amit @ 2019-01-10  9:32 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> Static call inline patching will need to use single 32-bit writes.
> Change text_poke() to do so where possible.
> 
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
> arch/x86/kernel/alternative.c | 31 ++++++++++++++++++++++++++++---
> 1 file changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index ebeac487a20c..607f48a90097 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -692,7 +692,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
> void *text_poke(void *addr, const void *opcode, size_t len)
> {
> 	unsigned long flags;
> -	char *vaddr;
> +	unsigned long vaddr;
> 	struct page *pages[2];
> 	int i;
> 
> @@ -714,14 +714,39 @@ void *text_poke(void *addr, const void *opcode, size_t len)
> 	}
> 	BUG_ON(!pages[0]);
> 	local_irq_save(flags);
> +
> 	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
> 	if (pages[1])
> 		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
> -	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
> -	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
> +
> +	vaddr = fix_to_virt(FIX_TEXT_POKE0) + ((unsigned long)addr & ~PAGE_MASK);
> +
> +	/*
> +	 * Use a single access where possible.  Note that a single unaligned
> +	 * multi-byte write will not necessarily be atomic on x86-32, or if the
> +	 * address crosses a cache line boundary.
> +	 */
> +	switch (len) {
> +	case 1:
> +		WRITE_ONCE(*(u8 *)vaddr, *(u8 *)opcode);
> +		break;
> +	case 2:
> +		WRITE_ONCE(*(u16 *)vaddr, *(u16 *)opcode);
> +		break;
> +	case 4:
> +		WRITE_ONCE(*(u32 *)vaddr, *(u32 *)opcode);
> +		break;
> +	case 8:
> +		WRITE_ONCE(*(u64 *)vaddr, *(u64 *)opcode);
> +		break;
> +	default:
> +		memcpy((void *)vaddr, opcode, len);
> +	}
> +

Even if Intel and AMD CPUs are guaranteed to run instructions from L1
atomically, this may break instruction emulators, such as those that
hypervisors use. They might not read instructions atomically if on SMP VMs
when the VM's text_poke() races with the emulated instruction fetch.

While I can't find a reason for hypervisors to emulate this instruction,
smarter people might find ways to turn it into a security exploit.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 2/6] static_call: Add basic static call infrastructure
  2019-01-09 22:59 ` [PATCH v3 2/6] static_call: Add basic static call infrastructure Josh Poimboeuf
@ 2019-01-10 14:03   ` Edward Cree
  2019-01-10 18:37     ` Josh Poimboeuf
  0 siblings, 1 reply; 90+ messages in thread
From: Edward Cree @ 2019-01-10 14:03 UTC (permalink / raw)
  To: Josh Poimboeuf, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Daniel Bristot de Oliveira

On 09/01/19 22:59, Josh Poimboeuf wrote:
> Static calls are a replacement for global function pointers.  They use
> code patching to allow direct calls to be used instead of indirect
> calls.  They give the flexibility of function pointers, but with
> improved performance.  This is especially important for cases where
> retpolines would otherwise be used, as retpolines can significantly
> impact performance.
>
> The concept and code are an extension of previous work done by Ard
> Biesheuvel and Steven Rostedt:
>
>   https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheuvel@linaro.org
>   https://lkml.kernel.org/r/20181006015110.653946300@goodmis.org
>
> There are two implementations, depending on arch support:
>
> 1) out-of-line: patched trampolines (CONFIG_HAVE_STATIC_CALL)
> 2) basic function pointers
>
> For more details, see the comments in include/linux/static_call.h.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
>  arch/Kconfig                      |   3 +
>  include/linux/static_call.h       | 135 ++++++++++++++++++++++++++++++
>  include/linux/static_call_types.h |  13 +++
>  3 files changed, 151 insertions(+)
>  create mode 100644 include/linux/static_call.h
>  create mode 100644 include/linux/static_call_types.h
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 4cfb6de48f79..7e469a693da3 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -885,6 +885,9 @@ config HAVE_ARCH_PREL32_RELOCATIONS
>  	  architectures, and don't require runtime relocation on relocatable
>  	  kernels.
>  
> +config HAVE_STATIC_CALL
> +	bool
> +
>  source "kernel/gcov/Kconfig"
>  
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> new file mode 100644
> index 000000000000..9e85c479cd11
> --- /dev/null
> +++ b/include/linux/static_call.h
> @@ -0,0 +1,135 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_STATIC_CALL_H
> +#define _LINUX_STATIC_CALL_H
> +
> +/*
> + * Static call support
> + *
> + * Static calls use code patching to hard-code function pointers into direct
> + * branch instructions.  They give the flexibility of function pointers, but
> + * with improved performance.  This is especially important for cases where
> + * retpolines would otherwise be used, as retpolines can significantly impact
> + * performance.
> + *
> + *
> + * API overview:
> + *
> + *   DECLARE_STATIC_CALL(key, func);
> + *   DEFINE_STATIC_CALL(key, func);
> + *   static_call(key, args...);
> + *   static_call_update(key, func);
> + *
> + *
> + * Usage example:
> + *
> + *   # Start with the following functions (with identical prototypes):
> + *   int func_a(int arg1, int arg2);
> + *   int func_b(int arg1, int arg2);
> + *
> + *   # Define a 'my_key' reference, associated with func_a() by default
> + *   DEFINE_STATIC_CALL(my_key, func_a);
> + *
> + *   # Call func_a()
> + *   static_call(my_key, arg1, arg2);
> + *
> + *   # Update 'my_key' to point to func_b()
> + *   static_call_update(my_key, func_b);
> + *
> + *   # Call func_b()
> + *   static_call(my_key, arg1, arg2);
> + *
> + *
> + * Implementation details:
> + *
> + *    This requires some arch-specific code (CONFIG_HAVE_STATIC_CALL).
> + *    Otherwise basic indirect calls are used (with function pointers).
> + *
> + *    Each static_call() site calls into a trampoline associated with the key.
> + *    The trampoline has a direct branch to the default function.  Updates to a
> + *    key will modify the trampoline's branch destination.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/cpu.h>
> +#include <linux/static_call_types.h>
> +
> +#ifdef CONFIG_HAVE_STATIC_CALL
> +#include <asm/static_call.h>
> +extern void arch_static_call_transform(void *site, void *tramp, void *func);
> +#endif
> +
> +
> +#define DECLARE_STATIC_CALL(key, func)					\
> +	extern struct static_call_key key;				\
> +	extern typeof(func) STATIC_CALL_TRAMP(key)
> +
> +
> +#if defined(CONFIG_HAVE_STATIC_CALL)
> +
> +struct static_call_key {
> +	void *func, *tramp;
> +};
> +
> +#define DEFINE_STATIC_CALL(key, _func)					\
> +	DECLARE_STATIC_CALL(key, _func);				\
> +	struct static_call_key key = {					\
> +		.func = _func,						\
> +		.tramp = STATIC_CALL_TRAMP(key),			\
> +	};								\
> +	ARCH_DEFINE_STATIC_CALL_TRAMP(key, _func)
> +
> +#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
> +
> +static inline void __static_call_update(struct static_call_key *key, void *func)
> +{
> +	cpus_read_lock();
> +	WRITE_ONCE(key->func, func);
> +	arch_static_call_transform(NULL, key->tramp, func);
> +	cpus_read_unlock();
> +}
> +
> +#define static_call_update(key, func)					\
> +({									\
> +	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
> +	__static_call_update(&key, func);				\
> +})
> +
> +#define EXPORT_STATIC_CALL(key)						\
> +	EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
> +
> +#define EXPORT_STATIC_CALL_GPL(key)					\
> +	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
> +
> +
> +#else /* Generic implementation */
> +
> +struct static_call_key {
> +	void *func;
> +};
> +
> +#define DEFINE_STATIC_CALL(key, _func)					\
> +	DECLARE_STATIC_CALL(key, _func);				\
> +	struct static_call_key key = {					\
> +		.func = _func,						\
> +	}
> +
> +#define static_call(key, args...)					\
> +	((typeof(STATIC_CALL_TRAMP(key))*)(key.func))(args)
> +
> +static inline void __static_call_update(struct static_call_key *key, void *func)
> +{
> +	WRITE_ONCE(key->func, func);
> +}
> +
> +#define static_call_update(key, func)					\
> +({									\
> +	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
> +	__static_call_update(&key, func);				\
> +})
> +
> +#define EXPORT_STATIC_CALL(key) EXPORT_SYMBOL(key)
> +#define EXPORT_STATIC_CALL_GPL(key) EXPORT_SYMBOL_GPL(key)
> +
> +#endif /* CONFIG_HAVE_STATIC_CALL_INLINE */
CONFIG_HAVE_STATIC_CALL?

-Ed
> +
> +#endif /* _LINUX_STATIC_CALL_H */
> diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
> new file mode 100644
> index 000000000000..0baaf3f02476
> --- /dev/null
> +++ b/include/linux/static_call_types.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _STATIC_CALL_TYPES_H
> +#define _STATIC_CALL_TYPES_H
> +
> +#include <linux/stringify.h>
> +
> +#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
> +#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
> +
> +#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
> +#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
> +
> +#endif /* _STATIC_CALL_TYPES_H */


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 3/6] x86/static_call: Add out-of-line static call implementation
  2019-01-10  0:16   ` Nadav Amit
@ 2019-01-10 16:28     ` Josh Poimboeuf
  0 siblings, 0 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-10 16:28 UTC (permalink / raw)
  To: Nadav Amit
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 12:16:34AM +0000, Nadav Amit wrote:
> > On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> 
> <snip>
> 
> > +
> > +void __ref arch_static_call_transform(void *site, void *tramp, void *func)
> > +{
> > +	s32 dest_relative;
> > +	unsigned char opcode;
> > +	void *(*poker)(void *, const void *, size_t);
> > +	void *insn = tramp;
> > +
> > +	mutex_lock(&text_mutex);
> > +
> > +	/*
> > +	 * For x86-64, a 32-bit cross-modifying write to a call destination is
> > +	 * safe as long as it's within a cache line.
> > +	 */
> > +	opcode = *(unsigned char *)insn;
> > +	if (opcode != 0xe8 && opcode != 0xe9) {
> > +		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
> > +			  opcode, insn);
> > +		goto done;
> > +	}
> > +
> > +	dest_relative = (long)(func) - (long)(insn + CALL_INSN_SIZE);
> > +
> > +	poker = early_boot_irqs_disabled ? text_poke_early : text_poke;
> > +	poker(insn + 1, &dest_relative, sizeof(dest_relative));
> > +
> > +done:
> > +	mutex_unlock(&text_mutex);
> > +}
> > +EXPORT_SYMBOL_GPL(arch_static_call_transform);
> 
> Err… I was rewriting __jump_label_transform(), so if this code duplication can
> be avoided, this would be great.
> 
> (See https://lkml.org/lkml/2018/11/14/72 )

I don't see much code duplication, because __jump_label_transform() uses
text_poke_bp(), whereas this uses text_poke().

It's true they both fall back to text_poke_early(), but I don't see any
opportunities for sharing code there, unless we decide to go back to
using breakpoints.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10  1:21 ` [PATCH v3 0/6] Static calls Nadav Amit
@ 2019-01-10 16:44   ` Josh Poimboeuf
  2019-01-10 17:32     ` Nadav Amit
  0 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-10 16:44 UTC (permalink / raw)
  To: Nadav Amit
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 01:21:00AM +0000, Nadav Amit wrote:
> > On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > With this version, I stopped trying to use text_poke_bp(), and instead
> > went with a different approach: if the call site destination doesn't
> > cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> > using the trampoline indefinitely.
> > 
> > NOTE: At least experimentally, the call destination writes seem to be
> > atomic with respect to instruction fetching.  On Nehalem I can easily
> > trigger crashes when writing a call destination across cachelines while
> > reading the instruction on other CPU; but I get no such crashes when
> > respecting cacheline boundaries.
> > 
> > BUT, the SDM doesn't document this approach, so it would be great if any
> > CPU people can confirm that it's safe!
> > 
> 
> I (still) think that having a compiler plugin can make things much cleaner
> (as done in [1]). The callers would not need to be changed, and the key can
> be provided through an attribute.
> 
> Using a plugin should also allow to use Steven’s proposal for doing
> text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
> by the plugin, you can be guaranteed that registers are clobbered. Then, you
> can store in the assembly block the return address in one of these
> registers.

I'm no GCC expert (why do I find myself saying this a lot lately?), but
this sounds to me like it could be tricky to get right.

I suppose you'd have to do it in an early pass, to allow GCC to clobber
the registers in a later pass.  So it would necessarily have side
effects, but I don't know what the risks are.

Would it work with more than 5 arguments, where args get passed on the
stack?

At the very least, it would (at least partially) defeat the point of the
callee-saved paravirt ops.

What if we just used a plugin in a simpler fashion -- to do call site
alignment, if necessary, to ensure the instruction doesn't cross
cacheline boundaries.  This could be done in a later pass, with no side
effects other than code layout.  And it would allow us to avoid
breakpoints altogether -- again, assuming somebody can verify that
intra-cacheline call destination writes are atomic with respect to
instruction decoder reads.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10  9:32   ` Nadav Amit
@ 2019-01-10 17:20     ` Josh Poimboeuf
  2019-01-10 17:29       ` Nadav Amit
  2019-01-10 17:32       ` Steven Rostedt
  0 siblings, 2 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-10 17:20 UTC (permalink / raw)
  To: Nadav Amit
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 09:32:23AM +0000, Nadav Amit wrote:
> > @@ -714,14 +714,39 @@ void *text_poke(void *addr, const void *opcode, size_t len)
> > 	}
> > 	BUG_ON(!pages[0]);
> > 	local_irq_save(flags);
> > +
> > 	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
> > 	if (pages[1])
> > 		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
> > -	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
> > -	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
> > +
> > +	vaddr = fix_to_virt(FIX_TEXT_POKE0) + ((unsigned long)addr & ~PAGE_MASK);
> > +
> > +	/*
> > +	 * Use a single access where possible.  Note that a single unaligned
> > +	 * multi-byte write will not necessarily be atomic on x86-32, or if the
> > +	 * address crosses a cache line boundary.
> > +	 */
> > +	switch (len) {
> > +	case 1:
> > +		WRITE_ONCE(*(u8 *)vaddr, *(u8 *)opcode);
> > +		break;
> > +	case 2:
> > +		WRITE_ONCE(*(u16 *)vaddr, *(u16 *)opcode);
> > +		break;
> > +	case 4:
> > +		WRITE_ONCE(*(u32 *)vaddr, *(u32 *)opcode);
> > +		break;
> > +	case 8:
> > +		WRITE_ONCE(*(u64 *)vaddr, *(u64 *)opcode);
> > +		break;
> > +	default:
> > +		memcpy((void *)vaddr, opcode, len);
> > +	}
> > +
> 
> Even if Intel and AMD CPUs are guaranteed to run instructions from L1
> atomically, this may break instruction emulators, such as those that
> hypervisors use. They might not read instructions atomically if on SMP VMs
> when the VM's text_poke() races with the emulated instruction fetch.
> 
> While I can't find a reason for hypervisors to emulate this instruction,
> smarter people might find ways to turn it into a security exploit.

Interesting point... but I wonder if it's a realistic concern.  BTW,
text_poke_bp() also relies on undocumented behavior.

The entire instruction doesn't need to be read atomically; just the
32-bit call destination.  Assuming the hypervisor is x86-64, and it uses
a 32-bit access to read the call destination (which seems logical), the
intra-cacheline reads will be atomic, as stated in the SDM.

If the above assumptions are not true, and the hypervisor reads the call
destination non-atomically (which seems unlikely IMO), even then I don't
see how it could be realistically exploitable.  It would just oops from
calling a corrupt address.
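
For what it's worth, the "doesn't cross a cache line" condition on the 4-byte
destination is cheap to check.  Illustrative only, assuming 64-byte cache
lines (not the exact code in the series):

  static inline bool call_dest_within_cacheline(void *insn)
  {
          unsigned long dest = (unsigned long)insn + 1;   /* skip the e8/e9 opcode */

          /* the 4-byte imm32 must not straddle a 64-byte line */
          return (dest >> 6) == ((dest + 3) >> 6);
  }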

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10 17:20     ` Josh Poimboeuf
@ 2019-01-10 17:29       ` Nadav Amit
  2019-01-10 17:32       ` Steven Rostedt
  1 sibling, 0 replies; 90+ messages in thread
From: Nadav Amit @ 2019-01-10 17:29 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 10, 2019, at 9:20 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Thu, Jan 10, 2019 at 09:32:23AM +0000, Nadav Amit wrote:
>>> @@ -714,14 +714,39 @@ void *text_poke(void *addr, const void *opcode, size_t len)
>>> 	}
>>> 	BUG_ON(!pages[0]);
>>> 	local_irq_save(flags);
>>> +
>>> 	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
>>> 	if (pages[1])
>>> 		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
>>> -	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
>>> -	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
>>> +
>>> +	vaddr = fix_to_virt(FIX_TEXT_POKE0) + ((unsigned long)addr & ~PAGE_MASK);
>>> +
>>> +	/*
>>> +	 * Use a single access where possible.  Note that a single unaligned
>>> +	 * multi-byte write will not necessarily be atomic on x86-32, or if the
>>> +	 * address crosses a cache line boundary.
>>> +	 */
>>> +	switch (len) {
>>> +	case 1:
>>> +		WRITE_ONCE(*(u8 *)vaddr, *(u8 *)opcode);
>>> +		break;
>>> +	case 2:
>>> +		WRITE_ONCE(*(u16 *)vaddr, *(u16 *)opcode);
>>> +		break;
>>> +	case 4:
>>> +		WRITE_ONCE(*(u32 *)vaddr, *(u32 *)opcode);
>>> +		break;
>>> +	case 8:
>>> +		WRITE_ONCE(*(u64 *)vaddr, *(u64 *)opcode);
>>> +		break;
>>> +	default:
>>> +		memcpy((void *)vaddr, opcode, len);
>>> +	}
>>> +
>> 
>> Even if Intel and AMD CPUs are guaranteed to run instructions from L1
>> atomically, this may break instruction emulators, such as those that
>> hypervisors use. They might not read instructions atomically if on SMP VMs
>> when the VM's text_poke() races with the emulated instruction fetch.
>> 
>> While I can't find a reason for hypervisors to emulate this instruction,
>> smarter people might find ways to turn it into a security exploit.
> 
> Interesting point... but I wonder if it's a realistic concern.  BTW,
> text_poke_bp() also relies on undocumented behavior.
> 
> The entire instruction doesn't need to be read atomically; just the
> 32-bit call destination.  Assuming the hypervisor is x86-64, and it uses
> a 32-bit access to read the call destination (which seems logical), the
> intra-cacheline reads will be atomic, as stated in the SDM.

At least in KVM, it doesn’t do so intentionally - eventually the emulated
fetch is done using __copy_from_user(). So now you rely on
__copy_from_user() doing it correctly.

> If the above assumptions are not true, and the hypervisor reads the call
> destination non-atomically (which seems unlikely IMO), even then I don't
> see how it could be realistically exploitable.  It would just oops from
> calling a corrupt address.

It might still be exploitable as a DoS though (again, not that I can think of
exactly how). Having said that, I might be negative just because I’ve put a lot
of effort into avoiding this problem according to the SDM…


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
                   ` (6 preceding siblings ...)
  2019-01-10  1:21 ` [PATCH v3 0/6] Static calls Nadav Amit
@ 2019-01-10 17:31 ` Linus Torvalds
  2019-01-10 20:51   ` H. Peter Anvin
  2019-01-10 20:30 ` Peter Zijlstra
  8 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2019-01-10 17:31 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Andy Lutomirski, Steven Rostedt, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Wed, Jan 9, 2019 at 2:59 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> NOTE: At least experimentally, the call destination writes seem to be
> atomic with respect to instruction fetching.  On Nehalem I can easily
> trigger crashes when writing a call destination across cachelines while
> reading the instruction on other CPU; but I get no such crashes when
> respecting cacheline boundaries.

I still doubt ifetch is atomic on a cacheline boundary for the simple
reason that the bus between the IU and the L1 I$ is narrower in older
CPUs.

Also, the fill of the L1 I$ from the (cache coherent L2) may not be a
cacheline at a time either.

That said, the fetch may be sufficiently ordered that it works in
practice. It _would_ be absolutely lovely to be able to do things like
this.

I do agree with Nadav that if there's some way to avoid this, it would
be good. I'm not in general a huge fan of compiler plugins (compiler
instability is just about my worst fear, and I feel plugins tend to
open up that area a lot), but it does feel like this might be
something where compiler tweaking would possibly be the cleanest
approach.

             Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 16:44   ` Josh Poimboeuf
@ 2019-01-10 17:32     ` Nadav Amit
  2019-01-10 18:18       ` Josh Poimboeuf
  0 siblings, 1 reply; 90+ messages in thread
From: Nadav Amit @ 2019-01-10 17:32 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Thu, Jan 10, 2019 at 01:21:00AM +0000, Nadav Amit wrote:
>>> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> 
>>> With this version, I stopped trying to use text_poke_bp(), and instead
>>> went with a different approach: if the call site destination doesn't
>>> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
>>> using the trampoline indefinitely.
>>> 
>>> NOTE: At least experimentally, the call destination writes seem to be
>>> atomic with respect to instruction fetching.  On Nehalem I can easily
>>> trigger crashes when writing a call destination across cachelines while
>>> reading the instruction on other CPU; but I get no such crashes when
>>> respecting cacheline boundaries.
>>> 
>>> BUT, the SDM doesn't document this approach, so it would be great if any
>>> CPU people can confirm that it's safe!
>> 
>> I (still) think that having a compiler plugin can make things much cleaner
>> (as done in [1]). The callers would not need to be changed, and the key can
>> be provided through an attribute.
>> 
>> Using a plugin should also allow to use Steven’s proposal for doing
>> text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
>> by the plugin, you can be guaranteed that registers are clobbered. Then, you
>> can store in the assembly block the return address in one of these
>> registers.
> 
> I'm no GCC expert (why do I find myself saying this a lot lately?), but
> this sounds to me like it could be tricky to get right.
> 
> I suppose you'd have to do it in an early pass, to allow GCC to clobber
> the registers in a later pass.  So it would necessarily have side
> effects, but I don't know what the risks are.

I’m not GCC expert either and writing this code was not making me full of
joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I
don’t think an early pass is needed, as long as hardware registers were not
allocated.

> Would it work with more than 5 arguments, where args get passed on the
> stack?

It does.

> 
> At the very least, it would (at least partially) defeat the point of the
> callee-saved paravirt ops.

Actually, I think you can even deal with callee-saved functions and remove
all the (terrible) macros. You would need to tell the extension not to
clobber the registers through a new attribute.

> What if we just used a plugin in a simpler fashion -- to do call site
> alignment, if necessary, to ensure the instruction doesn't cross
> cacheline boundaries.  This could be done in a later pass, with no side
> effects other than code layout.  And it would allow us to avoid
> breakpoints altogether -- again, assuming somebody can verify that
> intra-cacheline call destination writes are atomic with respect to
> instruction decoder reads.

The plugin should not be able to do so. Layout of the bytecode is done by
the assembler, so I don’t think a plugin would help you with this one.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10 17:20     ` Josh Poimboeuf
  2019-01-10 17:29       ` Nadav Amit
@ 2019-01-10 17:32       ` Steven Rostedt
  2019-01-10 17:42         ` Sean Christopherson
  1 sibling, 1 reply; 90+ messages in thread
From: Steven Rostedt @ 2019-01-10 17:32 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Nadav Amit, X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Thu, 10 Jan 2019 11:20:04 -0600
Josh Poimboeuf <jpoimboe@redhat.com> wrote:


> > While I can't find a reason for hypervisors to emulate this instruction,
> > smarter people might find ways to turn it into a security exploit.  
> 
> Interesting point... but I wonder if it's a realistic concern.  BTW,
> text_poke_bp() also relies on undocumented behavior.

But we did get an official OK from Intel that it will work. Took a bit
of arm twisting to get them to do so, but they did. And it really is
pretty robust.

I would really like an acknowledgment from the HW vendors before we do
go this route.

-- Steve


> 
> The entire instruction doesn't need to be read atomically; just the
> 32-bit call destination.  Assuming the hypervisor is x86-64, and it uses
> a 32-bit access to read the call destination (which seems logical), the
> intra-cacheline reads will be atomic, as stated in the SDM.
> 
> If the above assumptions are not true, and the hypervisor reads the call
> destination non-atomically (which seems unlikely IMO), even then I don't
> see how it could be realistically exploitable.  It would just oops from
> calling a corrupt address.
> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10 17:32       ` Steven Rostedt
@ 2019-01-10 17:42         ` Sean Christopherson
  2019-01-10 17:57           ` Steven Rostedt
  2019-01-11  0:59           ` hpa
  0 siblings, 2 replies; 90+ messages in thread
From: Sean Christopherson @ 2019-01-10 17:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Josh Poimboeuf, Nadav Amit, X86 ML, LKML, Ard Biesheuvel,
	Andy Lutomirski, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
> On Thu, 10 Jan 2019 11:20:04 -0600
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> 
> > > While I can't find a reason for hypervisors to emulate this instruction,
> > > smarter people might find ways to turn it into a security exploit.  
> > 
> > Interesting point... but I wonder if it's a realistic concern.  BTW,
> > text_poke_bp() also relies on undocumented behavior.
> 
> But we did get an official OK from Intel that it will work. Took a bit
> of arm twisting to get them to do so, but they did. And it really is
> pretty robust.

Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
certain atomicity guarantees go out the window if WC memtype is used.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10 17:42         ` Sean Christopherson
@ 2019-01-10 17:57           ` Steven Rostedt
  2019-01-10 18:04             ` Sean Christopherson
  2019-01-11  0:59           ` hpa
  1 sibling, 1 reply; 90+ messages in thread
From: Steven Rostedt @ 2019-01-10 17:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Josh Poimboeuf, Nadav Amit, X86 ML, LKML, Ard Biesheuvel,
	Andy Lutomirski, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Thu, 10 Jan 2019 09:42:57 -0800
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
> > On Thu, 10 Jan 2019 11:20:04 -0600
> > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> >   
> > > > While I can't find a reason for hypervisors to emulate this instruction,
> > > > smarter people might find ways to turn it into a security exploit.    
> > > 
> > > Interesting point... but I wonder if it's a realistic concern.  BTW,
> > > text_poke_bp() also relies on undocumented behavior.  
> > 
> > But we did get an official OK from Intel that it will work. Took a bit
> > of arm twisting to get them to do so, but they did. And it really is
> > pretty robust.  
> 
> Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
> certain atomicity guarantees go out the window if WC memtype is used.

Note, the text_poke_bp() process was this: (nothing to do with atomic
guarantees)

add breakpoint (one byte) to instruction.

Sync all cores (send an IPI to each one).

change the back half of the instruction (the rest of the instruction
after the breakpoint).

Sync all cores

Remove the breakpoint with the new byte of the new instruction.
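
In rough pseudocode (a sketch of the sequence above, not the actual kernel
implementation, and ignoring the int3 handler side):

  static void text_poke_bp_sketch(void *addr, const void *insn, size_t len)
  {
          unsigned char int3 = 0xcc;

          text_poke(addr, &int3, 1);              /* add the breakpoint byte   */
          on_each_cpu(do_sync_core, NULL, 1);     /* sync all cores            */

          if (len > 1)                            /* write the instruction tail */
                  text_poke(addr + 1, insn + 1, len - 1);
          on_each_cpu(do_sync_core, NULL, 1);     /* sync all cores again      */

          text_poke(addr, insn, 1);               /* remove the breakpoint by  */
                                                  /* restoring the first byte  */
  }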


What atomicity guarantee does the above require?

-- Steve

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10 17:57           ` Steven Rostedt
@ 2019-01-10 18:04             ` Sean Christopherson
  2019-01-10 18:21               ` Josh Poimboeuf
                                 ` (2 more replies)
  0 siblings, 3 replies; 90+ messages in thread
From: Sean Christopherson @ 2019-01-10 18:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Josh Poimboeuf, Nadav Amit, X86 ML, LKML, Ard Biesheuvel,
	Andy Lutomirski, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 12:57:57PM -0500, Steven Rostedt wrote:
> On Thu, 10 Jan 2019 09:42:57 -0800
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
> > > On Thu, 10 Jan 2019 11:20:04 -0600
> > > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > 
> > >   
> > > > > While I can't find a reason for hypervisors to emulate this instruction,
> > > > > smarter people might find ways to turn it into a security exploit.    
> > > > 
> > > > Interesting point... but I wonder if it's a realistic concern.  BTW,
> > > > text_poke_bp() also relies on undocumented behavior.  
> > > 
> > > But we did get an official OK from Intel that it will work. Took a bit
> > > of arm twisting to get them to do so, but they did. And it really is
> > > pretty robust.  
> > 
> > Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
> > certain atomicity guarantees go out the window if WC memtype is used.
> 
> Note, the text_poke_bp() process was this: (nothing to do with atomic
> guarantees)
> 
> add breakpoint (one byte) to instruction.
> 
> Sync all cores (send an IPI to each one).
> 
> change the back half of the instruction (the rest of the instruction
> after the breakpoint).
> 
> Sync all cores
> 
> Remove the breakpoint with the new byte of the new instruction.
> 
> 
> What atomicity guarantee does the above require?

I was asking in the context of static calls.  My understanding is that
the write to change the imm32 of the CALL needs to be atomic from a
code fetch perspective so that we don't jump to a junk address.

Or were you saying that Intel gave an official OK on text_poke_bp()?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 17:32     ` Nadav Amit
@ 2019-01-10 18:18       ` Josh Poimboeuf
  2019-01-10 19:45         ` Nadav Amit
  0 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-10 18:18 UTC (permalink / raw)
  To: Nadav Amit
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 05:32:08PM +0000, Nadav Amit wrote:
> > On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > On Thu, Jan 10, 2019 at 01:21:00AM +0000, Nadav Amit wrote:
> >>> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>> 
> >>> With this version, I stopped trying to use text_poke_bp(), and instead
> >>> went with a different approach: if the call site destination doesn't
> >>> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> >>> using the trampoline indefinitely.
> >>> 
> >>> NOTE: At least experimentally, the call destination writes seem to be
> >>> atomic with respect to instruction fetching.  On Nehalem I can easily
> >>> trigger crashes when writing a call destination across cachelines while
> >>> reading the instruction on other CPU; but I get no such crashes when
> >>> respecting cacheline boundaries.
> >>> 
> >>> BUT, the SDM doesn't document this approach, so it would be great if any
> >>> CPU people can confirm that it's safe!
> >> 
> >> I (still) think that having a compiler plugin can make things much cleaner
> >> (as done in [1]). The callers would not need to be changed, and the key can
> >> be provided through an attribute.
> >> 
> >> Using a plugin should also allow to use Steven’s proposal for doing
> >> text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
> >> by the plugin, you can be guaranteed that registers are clobbered. Then, you
> >> can store in the assembly block the return address in one of these
> >> registers.
> > 
> > I'm no GCC expert (why do I find myself saying this a lot lately?), but
> > this sounds to me like it could be tricky to get right.
> > 
> > I suppose you'd have to do it in an early pass, to allow GCC to clobber
> > the registers in a later pass.  So it would necessarily have side
> > effects, but I don't know what the risks are.
> 
> I’m not GCC expert either and writing this code was not making me full of
> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I
> don’t think an early pass is needed, as long as hardware registers were not
> allocated.
> 
> > Would it work with more than 5 arguments, where args get passed on the
> > stack?
> 
> It does.
> 
> > 
> > At the very least, it would (at least partially) defeat the point of the
> > callee-saved paravirt ops.
> 
> Actually, I think you can even deal with callee-saved functions and remove
> all the (terrible) macros. You would need to tell the extension not to
> clobber the registers through a new attribute.

Ok, it does sound interesting then.  I assume you'll be sharing the
code?

> > What if we just used a plugin in a simpler fashion -- to do call site
> > alignment, if necessary, to ensure the instruction doesn't cross
> > cacheline boundaries.  This could be done in a later pass, with no side
> > effects other than code layout.  And it would allow us to avoid
> > breakpoints altogether -- again, assuming somebody can verify that
> > intra-cacheline call destination writes are atomic with respect to
> > instruction decoder reads.
> 
> The plugin should not be able to do so. Layout of the bytecode is done by
> the assembler, so I don’t think a plugin would help you with this one.

Actually I think we could use .bundle_align_mode for this purpose:

  https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10 18:04             ` Sean Christopherson
@ 2019-01-10 18:21               ` Josh Poimboeuf
  2019-01-10 18:24               ` Steven Rostedt
  2019-01-11 12:10               ` Alexandre Chartre
  2 siblings, 0 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-10 18:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Steven Rostedt, Nadav Amit, X86 ML, LKML, Ard Biesheuvel,
	Andy Lutomirski, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 10:04:28AM -0800, Sean Christopherson wrote:
> On Thu, Jan 10, 2019 at 12:57:57PM -0500, Steven Rostedt wrote:
> > On Thu, 10 Jan 2019 09:42:57 -0800
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > > On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
> > > > On Thu, 10 Jan 2019 11:20:04 -0600
> > > > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > > 
> > > >   
> > > > > > While I can't find a reason for hypervisors to emulate this instruction,
> > > > > > smarter people might find ways to turn it into a security exploit.    
> > > > > 
> > > > > Interesting point... but I wonder if it's a realistic concern.  BTW,
> > > > > text_poke_bp() also relies on undocumented behavior.  
> > > > 
> > > > But we did get an official OK from Intel that it will work. Took a bit
> > > > of arm twisting to get them to do so, but they did. And it really is
> > > > pretty robust.  
> > > 
> > > Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
> > > certain atomicity guarantees go out the window if WC memtype is used.
> > 
> > Note, the text_poke_bp() process was this: (nothing to do with atomic
> > guarantees)
> > 
> > add breakpoint (one byte) to instruction.
> > 
> > Sync all cores (send an IPI to each one).
> > 
> > change the back half of the instruction (the rest of the instruction
> > after the breakpoint).
> > 
> > Sync all cores
> > 
> > Remove the breakpoint with the new byte of the new instruction.
> > 
> > 
> > What atomicity guarantee does the above require?
> 
> I was asking in the context of static calls.  My understanding is that
> the write to change the imm32 of the CALL needs to be atomic from a
> code fetch perspective so that we don't jump to a junk address.
> 
> Or were you saying that Intel gave an official OK on text_poke_bp()?

Yeah, I'm pretty sure he was saying that.

Whose arms can we twist for finding out about static calls? :-)

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10 18:04             ` Sean Christopherson
  2019-01-10 18:21               ` Josh Poimboeuf
@ 2019-01-10 18:24               ` Steven Rostedt
  2019-01-11 12:10               ` Alexandre Chartre
  2 siblings, 0 replies; 90+ messages in thread
From: Steven Rostedt @ 2019-01-10 18:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Josh Poimboeuf, Nadav Amit, X86 ML, LKML, Ard Biesheuvel,
	Andy Lutomirski, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Thu, 10 Jan 2019 10:04:28 -0800
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> > What atomicity guarantee does the above require?  
> 
> I was asking in the context of static calls.  My understanding is that
> the write to change the imm32 of the CALL needs to be atomic from a
> code fetch perspective so that we don't jump to a junk address.
> 
> Or were you saying that Intel gave an official OK on text_poke_bp()?

Yes, the latter. I was talking about Intel giving the official OK for
text_poke_bp().

-- Steve

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 2/6] static_call: Add basic static call infrastructure
  2019-01-10 14:03   ` Edward Cree
@ 2019-01-10 18:37     ` Josh Poimboeuf
  0 siblings, 0 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-10 18:37 UTC (permalink / raw)
  To: Edward Cree
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 02:03:20PM +0000, Edward Cree wrote:
> > +#endif /* CONFIG_HAVE_STATIC_CALL_INLINE */
> CONFIG_HAVE_STATIC_CALL?

Right, thanks.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 18:18       ` Josh Poimboeuf
@ 2019-01-10 19:45         ` Nadav Amit
  2019-01-10 20:32           ` Josh Poimboeuf
  0 siblings, 1 reply; 90+ messages in thread
From: Nadav Amit @ 2019-01-10 19:45 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 10, 2019, at 10:18 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Thu, Jan 10, 2019 at 05:32:08PM +0000, Nadav Amit wrote:
>>> On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> 
>>> On Thu, Jan 10, 2019 at 01:21:00AM +0000, Nadav Amit wrote:
>>>>> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>> 
>>>>> With this version, I stopped trying to use text_poke_bp(), and instead
>>>>> went with a different approach: if the call site destination doesn't
>>>>> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
>>>>> using the trampoline indefinitely.
>>>>> 
>>>>> NOTE: At least experimentally, the call destination writes seem to be
>>>>> atomic with respect to instruction fetching.  On Nehalem I can easily
>>>>> trigger crashes when writing a call destination across cachelines while
>>>>> reading the instruction on other CPU; but I get no such crashes when
>>>>> respecting cacheline boundaries.
>>>>> 
>>>>> BUT, the SDM doesn't document this approach, so it would be great if any
>>>>> CPU people can confirm that it's safe!
>>>> 
>>>> I (still) think that having a compiler plugin can make things much cleaner
>>>> (as done in [1]). The callers would not need to be changed, and the key can
>>>> be provided through an attribute.
>>>> 
>>>> Using a plugin should also allow to use Steven’s proposal for doing
>>>> text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
>>>> by the plugin, you can be guaranteed that registers are clobbered. Then, you
>>>> can store in the assembly block the return address in one of these
>>>> registers.
>>> 
>>> I'm no GCC expert (why do I find myself saying this a lot lately?), but
>>> this sounds to me like it could be tricky to get right.
>>> 
>>> I suppose you'd have to do it in an early pass, to allow GCC to clobber
>>> the registers in a later pass.  So it would necessarily have side
>>> effects, but I don't know what the risks are.
>> 
>> I’m not GCC expert either and writing this code was not making me full of
>> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I
>> don’t think an early pass is needed, as long as hardware registers were not
>> allocated.
>> 
>>> Would it work with more than 5 arguments, where args get passed on the
>>> stack?
>> 
>> It does.
>> 
>>> At the very least, it would (at least partially) defeat the point of the
>>> callee-saved paravirt ops.
>> 
>> Actually, I think you can even deal with callee-saved functions and remove
>> all the (terrible) macros. You would need to tell the extension not to
>> clobber the registers through a new attribute.
> 
> Ok, it does sound interesting then.  I assume you'll be sharing the
> code?

Of course. If this is what is going to convince you, I’ll make a small version
for PV callee-saved first.

>>> What if we just used a plugin in a simpler fashion -- to do call site
>>> alignment, if necessary, to ensure the instruction doesn't cross
>>> cacheline boundaries.  This could be done in a later pass, with no side
>>> effects other than code layout.  And it would allow us to avoid
>>> breakpoints altogether -- again, assuming somebody can verify that
>>> intra-cacheline call destination writes are atomic with respect to
>>> instruction decoder reads.
>> 
>> The plugin should not be able to do so. Layout of the bytecode is done by
>> the assembler, so I don’t think a plugin would help you with this one.
> 
> Actually I think we could use .bundle_align_mode for this purpose:
> 
>  https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html

Hm… I don’t understand what you have in mind (i.e., when these assembly
directives would be emitted).




^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
                   ` (7 preceding siblings ...)
  2019-01-10 17:31 ` Linus Torvalds
@ 2019-01-10 20:30 ` Peter Zijlstra
  2019-01-10 20:52   ` Josh Poimboeuf
  8 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2019-01-10 20:30 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote:
> With this version, I stopped trying to use text_poke_bp(), and instead
> went with a different approach: if the call site destination doesn't
> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> using the trampoline indefinitely.

> - Get rid of the use of text_poke_bp(), in favor of atomic writes.
>   Out-of-line calls will be promoted to inline only if the call sites
>   don't cross cache line boundaries. [Linus/Andy]

Can we preserve why text_poke_bp() didn't work? I seem to have forgotten
again. The problem was poking the return address onto the stack from the
int3 handler, or something along those lines?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 19:45         ` Nadav Amit
@ 2019-01-10 20:32           ` Josh Poimboeuf
  2019-01-10 20:48             ` Nadav Amit
  0 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-10 20:32 UTC (permalink / raw)
  To: Nadav Amit
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 07:45:26PM +0000, Nadav Amit wrote:
> >> I’m not GCC expert either and writing this code was not making me full of
> >> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I
> >> don’t think an early pass is needed, as long as hardware registers were not
> >> allocated.
> >> 
> >>> Would it work with more than 5 arguments, where args get passed on the
> >>> stack?
> >> 
> >> It does.
> >> 
> >>> At the very least, it would (at least partially) defeat the point of the
> >>> callee-saved paravirt ops.
> >> 
> >> Actually, I think you can even deal with callee-saved functions and remove
> >> all the (terrible) macros. You would need to tell the extension not to
> >> clobber the registers through a new attribute.
> > 
> > Ok, it does sound interesting then.  I assume you'll be sharing the
> > code?
> 
> Of course. If this is what is going to convince you, I’ll make a small version
> for PV callee-saved first.

It wasn't *only* the PV callee-saved part which interested me, so if you
already have something which implements the other parts, I'd still like
to see it.

> >>> What if we just used a plugin in a simpler fashion -- to do call site
> >>> alignment, if necessary, to ensure the instruction doesn't cross
> >>> cacheline boundaries.  This could be done in a later pass, with no side
> >>> effects other than code layout.  And it would allow us to avoid
> >>> breakpoints altogether -- again, assuming somebody can verify that
> >>> intra-cacheline call destination writes are atomic with respect to
> >>> instruction decoder reads.
> >> 
> >> The plugin should not be able to do so. Layout of the bytecode is done by
> >> the assembler, so I don’t think a plugin would help you with this one.
> > 
> > Actually I think we could use .bundle_align_mode for this purpose:
> > 
> >  https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html
> 
> Hm… I don’t understand what you have in mind (i.e., when would this
> assembly directives would be emitted).

For example, it could replace

  callq ____static_call_tramp_my_key

with

  .bundle_align_mode 6
  callq ____static_call_tramp_my_key
  .bundle_align_mode 0

which ensures the instruction is within a cache line, aligning it with
NOPs if necessary.  That would allow my current implementation to
upgrade out-of-line calls to inline calls 100% of the time, instead of
95% of the time.

-- 
Josh
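
For illustration, a minimal sketch of where such directives could live, assuming
the static_call() macro emits its call through inline asm; the trampoline symbol
name is taken from the example above and everything else is illustrative:

  /*
   * Sketch only: emit the call inside a bundle so the assembler pads
   * with NOPs whenever the 5-byte call would otherwise straddle a
   * 64-byte cache line.  The symbol name matches the example above;
   * the wrapper function is purely illustrative.
   */
  static inline void my_key_static_call(void)
  {
          asm volatile(".bundle_align_mode 6\n\t"
                       "call ____static_call_tramp_my_key\n\t"
                       ".bundle_align_mode 0"
                       : : : "memory");
  }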

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 20:32           ` Josh Poimboeuf
@ 2019-01-10 20:48             ` Nadav Amit
  2019-01-10 20:57               ` Josh Poimboeuf
  0 siblings, 1 reply; 90+ messages in thread
From: Nadav Amit @ 2019-01-10 20:48 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Thu, Jan 10, 2019 at 07:45:26PM +0000, Nadav Amit wrote:
>>>> I’m not GCC expert either and writing this code was not making me full of
>>>> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I
>>>> don’t think an early pass is needed, as long as hardware registers were not
>>>> allocated.
>>>> 
>>>>> Would it work with more than 5 arguments, where args get passed on the
>>>>> stack?
>>>> 
>>>> It does.
>>>> 
>>>>> At the very least, it would (at least partially) defeat the point of the
>>>>> callee-saved paravirt ops.
>>>> 
>>>> Actually, I think you can even deal with callee-saved functions and remove
>>>> all the (terrible) macros. You would need to tell the extension not to
>>>> clobber the registers through a new attribute.
>>> 
>>> Ok, it does sound interesting then.  I assume you'll be sharing the
>>> code?
>> 
>> Of course. If this what is going to convince, I’ll make a small version for
>> PV callee-saved first.
> 
> It wasn't *only* the PV callee-saved part which interested me, so if you
> already have something which implements the other parts, I'd still like
> to see it.

Did you have a look at https://lore.kernel.org/lkml/20181231072112.21051-4-namit@vmware.com/ ?

See the changes to x86_call_markup_plugin.c .

The missing part (that I just finished but need to cleanup) is attributes
that allow you to provide key and dynamically enable the patching.

>>>>> What if we just used a plugin in a simpler fashion -- to do call site
>>>>> alignment, if necessary, to ensure the instruction doesn't cross
>>>>> cacheline boundaries.  This could be done in a later pass, with no side
>>>>> effects other than code layout.  And it would allow us to avoid
>>>>> breakpoints altogether -- again, assuming somebody can verify that
>>>>> intra-cacheline call destination writes are atomic with respect to
>>>>> instruction decoder reads.
>>>> 
>>>> The plugin should not be able to do so. Layout of the bytecode is done by
>>>> the assembler, so I don’t think a plugin would help you with this one.
>>> 
>>> Actually I think we could use .bundle_align_mode for this purpose:
>>> 
> >> https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html
>> 
>> Hm… I don’t understand what you have in mind (i.e., when would this
>> assembly directives would be emitted).
> 
> For example, it could replace
> 
>  callq ____static_call_tramp_my_key
> 
> with
> 
>  .bundle_align_mode 6
>  callq ____static_call_tramp_my_key
>  .bundle_align_mode 0
> 
> which ensures the instruction is within a cache line, aligning it with
> NOPs if necessary.  That would allow my current implementation to
> upgrade out-of-line calls to inline calls 100% of the time, instead of
> 95% of the time.

Heh. I almost wrote based on the feature description that this will add
unnecessary padding no matter what, but actually (experimentally) it works
well…



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 17:31 ` Linus Torvalds
@ 2019-01-10 20:51   ` H. Peter Anvin
  0 siblings, 0 replies; 90+ messages in thread
From: H. Peter Anvin @ 2019-01-10 20:51 UTC (permalink / raw)
  To: Linus Torvalds, Josh Poimboeuf
  Cc: the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Andy Lutomirski, Steven Rostedt, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On 1/10/19 9:31 AM, Linus Torvalds wrote:
> On Wed, Jan 9, 2019 at 2:59 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>
>> NOTE: At least experimentally, the call destination writes seem to be
>> atomic with respect to instruction fetching.  On Nehalem I can easily
>> trigger crashes when writing a call destination across cachelines while
>> reading the instruction on other CPU; but I get no such crashes when
>> respecting cacheline boundaries.
> 
> I still doubt ifetch is atomic on a cacheline boundary for the simple
> reason that the bus between the IU and the L1 I$ is narrower in older
> CPU's.
> 

As far as I understand, on P6+ (and possibly earlier, but I don't know) it is
atomic on a 16-byte fetch datum, at least for Intel CPUs.

However, single byte accesses are always going to be safe.

	-hpa


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 20:30 ` Peter Zijlstra
@ 2019-01-10 20:52   ` Josh Poimboeuf
  2019-01-10 23:02     ` Linus Torvalds
  2020-02-17 21:10     ` Jann Horn
  0 siblings, 2 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-10 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 09:30:23PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote:
> > With this version, I stopped trying to use text_poke_bp(), and instead
> > went with a different approach: if the call site destination doesn't
> > cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> > using the trampoline indefinitely.
> 
> > - Get rid of the use of text_poke_bp(), in favor of atomic writes.
> >   Out-of-line calls will be promoted to inline only if the call sites
> >   don't cross cache line boundaries. [Linus/Andy]
> 
> Can we preserve why text_poke_bp() didn't work? I seem to have forgotten
> again. The problem was poking the return address onto the stack from the
> int3 handler, or something along those lines?

Right, emulating a call instruction from the #BP handler is ugly,
because you have to somehow grow the stack to make room for the return
address.  Personally I liked the idea of shifting the iret frame by 16
bytes in the #DB entry code, but others hated it.

So many bad-but-not-completely-unacceptable options to choose from.

-- 
Josh
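
For reference, a minimal sketch of what such call emulation could look like,
assuming the #BP entry code has already made room below the iret frame so the
write is safe; the helpers are illustrative, not part of this series:

  /*
   * Sketch only: emulate "call target" on behalf of the patched site.
   * Assumes the entry code left a gap so that writing just below
   * regs->sp cannot clobber anything.  1 is the size of the int3 byte,
   * 5 the size of the original call instruction.
   */
  static void int3_emulate_push(struct pt_regs *regs, unsigned long val)
  {
          regs->sp -= sizeof(unsigned long);
          *(unsigned long *)regs->sp = val;
  }

  static void int3_emulate_call(struct pt_regs *regs, unsigned long target)
  {
          /* return address: the instruction after the original call */
          int3_emulate_push(regs, regs->ip - 1 + 5);
          regs->ip = target;
  }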

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 20:48             ` Nadav Amit
@ 2019-01-10 20:57               ` Josh Poimboeuf
  2019-01-10 21:47                 ` Nadav Amit
  0 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-10 20:57 UTC (permalink / raw)
  To: Nadav Amit
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 08:48:31PM +0000, Nadav Amit wrote:
> > On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > On Thu, Jan 10, 2019 at 07:45:26PM +0000, Nadav Amit wrote:
> >>>> I’m not GCC expert either and writing this code was not making me full of
> >>>> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I
> >>>> don’t think an early pass is needed, as long as hardware registers were not
> >>>> allocated.
> >>>> 
> >>>>> Would it work with more than 5 arguments, where args get passed on the
> >>>>> stack?
> >>>> 
> >>>> It does.
> >>>> 
> >>>>> At the very least, it would (at least partially) defeat the point of the
> >>>>> callee-saved paravirt ops.
> >>>> 
> >>>> Actually, I think you can even deal with callee-saved functions and remove
> >>>> all the (terrible) macros. You would need to tell the extension not to
> >>>> clobber the registers through a new attribute.
> >>> 
> >>> Ok, it does sound interesting then.  I assume you'll be sharing the
> >>> code?
> >> 
> >> Of course. If this what is going to convince, I’ll make a small version for
> >> PV callee-saved first.
> > 
> > It wasn't *only* the PV callee-saved part which interested me, so if you
> > already have something which implements the other parts, I'd still like
> > to see it.
> 
> Did you have a look at https://lore.kernel.org/lkml/20181231072112.21051-4-namit@vmware.com/ ?
> 
> See the changes to x86_call_markup_plugin.c .
> 
> The missing part (that I just finished but need to cleanup) is attributes
> that allow you to provide key and dynamically enable the patching.

Aha, so it's basically the same plugin you had for optpolines.  I
missed that.  I'll need to stare at the code for a little bit.

> >>>>> What if we just used a plugin in a simpler fashion -- to do call site
> >>>>> alignment, if necessary, to ensure the instruction doesn't cross
> >>>>> cacheline boundaries.  This could be done in a later pass, with no side
> >>>>> effects other than code layout.  And it would allow us to avoid
> >>>>> breakpoints altogether -- again, assuming somebody can verify that
> >>>>> intra-cacheline call destination writes are atomic with respect to
> >>>>> instruction decoder reads.
> >>>> 
> >>>> The plugin should not be able to do so. Layout of the bytecode is done by
> >>>> the assembler, so I don’t think a plugin would help you with this one.
> >>> 
> >>> Actually I think we could use .bundle_align_mode for this purpose:
> >>> 
> >>> https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html
> >> 
> >> Hm… I don’t understand what you have in mind (i.e., when would this
> >> assembly directives would be emitted).
> > 
> > For example, it could replace
> > 
> >  callq ____static_call_tramp_my_key
> > 
> > with
> > 
> >  .bundle_align_mode 6
> >  callq ____static_call_tramp_my_key
> >  .bundle_align_mode 0
> > 
> > which ensures the instruction is within a cache line, aligning it with
> > NOPs if necessary.  That would allow my current implementation to
> > upgrade out-of-line calls to inline calls 100% of the time, instead of
> > 95% of the time.
> 
> Heh. I almost wrote based on the feature description that this will add
> unnecessary padding no matter what, but actually (experimentally) it works
> well…

Yeah, based on the poorly worded docs, I made the same assumption, until
I tried it.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 20:57               ` Josh Poimboeuf
@ 2019-01-10 21:47                 ` Nadav Amit
  0 siblings, 0 replies; 90+ messages in thread
From: Nadav Amit @ 2019-01-10 21:47 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 10, 2019, at 12:57 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Thu, Jan 10, 2019 at 08:48:31PM +0000, Nadav Amit wrote:
>>> On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> 
>>> On Thu, Jan 10, 2019 at 07:45:26PM +0000, Nadav Amit wrote:
>>>>>> I’m not GCC expert either and writing this code was not making me full of
>>>>>> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I
>>>>>> don’t think an early pass is needed, as long as hardware registers were not
>>>>>> allocated.
>>>>>> 
>>>>>>> Would it work with more than 5 arguments, where args get passed on the
>>>>>>> stack?
>>>>>> 
>>>>>> It does.
>>>>>> 
>>>>>>> At the very least, it would (at least partially) defeat the point of the
>>>>>>> callee-saved paravirt ops.
>>>>>> 
>>>>>> Actually, I think you can even deal with callee-saved functions and remove
>>>>>> all the (terrible) macros. You would need to tell the extension not to
>>>>>> clobber the registers through a new attribute.
>>>>> 
>>>>> Ok, it does sound interesting then.  I assume you'll be sharing the
>>>>> code?
>>>> 
>>>> Of course. If this what is going to convince, I’ll make a small version for
>>>> PV callee-saved first.
>>> 
>>> It wasn't *only* the PV callee-saved part which interested me, so if you
>>> already have something which implements the other parts, I'd still like
>>> to see it.
>> 
>> Did you have a look at https://lore.kernel.org/lkml/20181231072112.21051-4-namit@vmware.com/ ?
>> 
>> See the changes to x86_call_markup_plugin.c .
>> 
>> The missing part (that I just finished but need to cleanup) is attributes
>> that allow you to provide key and dynamically enable the patching.
> 
> Aha, so it's basically the same plugin you had for optpolines.  I
> missed that.  I'll need to stare at the code for a little bit.

Pretty much. You would want to change the assembly code block, and based on
the use-case (e.g., callee-saved, static-calls) clobber or set as an input
more or fewer registers.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 20:52   ` Josh Poimboeuf
@ 2019-01-10 23:02     ` Linus Torvalds
  2019-01-11  0:56       ` Andy Lutomirski
  2020-02-17 21:10     ` Jann Horn
  1 sibling, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2019-01-10 23:02 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Masami Hiramatsu,
	Jason Baron, Jiri Kosina, David Laight, Borislav Petkov,
	Julia Cartwright, Jessica Yu, H. Peter Anvin, Nadav Amit,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> Right, emulating a call instruction from the #BP handler is ugly,
> because you have to somehow grow the stack to make room for the return
> address.  Personally I liked the idea of shifting the iret frame by 16
> bytes in the #DB entry code, but others hated it.

Yeah, I hated it.

But I'm starting to think it's the simplest solution.

So still not loving it, but all the other models have had huge issues too.

                Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 23:02     ` Linus Torvalds
@ 2019-01-11  0:56       ` Andy Lutomirski
  2019-01-11  1:47         ` Nadav Amit
  0 siblings, 1 reply; 90+ messages in thread
From: Andy Lutomirski @ 2019-01-11  0:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josh Poimboeuf, Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Masami Hiramatsu,
	Jason Baron, Jiri Kosina, David Laight, Borislav Petkov,
	Julia Cartwright, Jessica Yu, H. Peter Anvin, Nadav Amit,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 3:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > Right, emulating a call instruction from the #BP handler is ugly,
> > because you have to somehow grow the stack to make room for the return
> > address.  Personally I liked the idea of shifting the iret frame by 16
> > bytes in the #DB entry code, but others hated it.
>
> Yeah, I hated it.
>
> But I'm starting to think it's the simplest solution.
>
> So still not loving it, but all the other models have had huge issues too.
>

Putting my maintainer hat on:

I'm okay-ish with shifting the stack by 16 bytes.  If this is done, I
want an assertion in do_int3() or wherever the fixup happens that the
write isn't overlapping pt_regs (which is easy to implement because
that code has the relevant pt_regs pointer).  And I want some code
that explicitly triggers the fixup when a CONFIG_DEBUG_ENTRY=y or
similar kernel is built so that this whole mess actually gets
exercised.  Because the fixup only happens when a
really-quite-improbable race gets hit, and the issues depend on stack
alignment, which is presumably why Josh was able to submit a buggy
series without noticing.

BUT: this is going to be utterly gross whenever anyone tries to
implement shadow stacks for the kernel, and we might need to switch to
a longjmp-like approach if that happens.

--Andy

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10 17:42         ` Sean Christopherson
  2019-01-10 17:57           ` Steven Rostedt
@ 2019-01-11  0:59           ` hpa
  2019-01-11  1:34             ` Sean Christopherson
  1 sibling, 1 reply; 90+ messages in thread
From: hpa @ 2019-01-11  0:59 UTC (permalink / raw)
  To: Sean Christopherson, Steven Rostedt
  Cc: Josh Poimboeuf, Nadav Amit, X86 ML, LKML, Ard Biesheuvel,
	Andy Lutomirski, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On January 10, 2019 9:42:57 AM PST, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
>> On Thu, 10 Jan 2019 11:20:04 -0600
>> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> 
>> 
>> > > While I can't find a reason for hypervisors to emulate this
>instruction,
>> > > smarter people might find ways to turn it into a security
>exploit.  
>> > 
>> > Interesting point... but I wonder if it's a realistic concern. 
>BTW,
>> > text_poke_bp() also relies on undocumented behavior.
>> 
>> But we did get an official OK from Intel that it will work. Took a
>bit
>> of arm twisting to get them to do so, but they did. And it really is
>> pretty robust.
>
>Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
>certain atomicity guarantees go out the window if WC memtype is used.

If you run code from non-WB memory, all bets are off and you better not be doing cross-modifying code.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-11  0:59           ` hpa
@ 2019-01-11  1:34             ` Sean Christopherson
  2019-01-11  8:13               ` hpa
  0 siblings, 1 reply; 90+ messages in thread
From: Sean Christopherson @ 2019-01-11  1:34 UTC (permalink / raw)
  To: hpa
  Cc: Steven Rostedt, Josh Poimboeuf, Nadav Amit, X86 ML, LKML,
	Ard Biesheuvel, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 04:59:55PM -0800, hpa@zytor.com wrote:
> On January 10, 2019 9:42:57 AM PST, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
> >> On Thu, 10 Jan 2019 11:20:04 -0600
> >> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> 
> >> 
> >> > > While I can't find a reason for hypervisors to emulate this
> >instruction,
> >> > > smarter people might find ways to turn it into a security
> >exploit.  
> >> > 
> >> > Interesting point... but I wonder if it's a realistic concern. 
> >BTW,
> >> > text_poke_bp() also relies on undocumented behavior.
> >> 
> >> But we did get an official OK from Intel that it will work. Took a
> >bit
> >> of arm twisting to get them to do so, but they did. And it really is
> >> pretty robust.
> >
> >Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
> >certain atomicity guarantees go out the window if WC memtype is used.
> 
> If you run code from non-WB memory, all bets are off and you better
> not be doing cross-modifying code.

I wasn't thinking of running code from non-WB, but rather running code
in WB while doing a CMC write via WC.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11  0:56       ` Andy Lutomirski
@ 2019-01-11  1:47         ` Nadav Amit
  2019-01-11 15:15           ` Josh Poimboeuf
  0 siblings, 1 reply; 90+ messages in thread
From: Nadav Amit @ 2019-01-11  1:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Josh Poimboeuf, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 10, 2019, at 4:56 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Thu, Jan 10, 2019 at 3:02 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> Right, emulating a call instruction from the #BP handler is ugly,
>>> because you have to somehow grow the stack to make room for the return
>>> address.  Personally I liked the idea of shifting the iret frame by 16
>>> bytes in the #DB entry code, but others hated it.
>> 
>> Yeah, I hated it.
>> 
>> But I'm starting to think it's the simplest solution.
>> 
>> So still not loving it, but all the other models have had huge issues too.
> 
> Putting my maintainer hat on:
> 
> I'm okay-ish with shifting the stack by 16 bytes.  If this is done, I
> want an assertion in do_int3() or wherever the fixup happens that the
> write isn't overlapping pt_regs (which is easy to implement because
> that code has the relevant pt_regs pointer).  And I want some code
> that explicitly triggers the fixup when a CONFIG_DEBUG_ENTRY=y or
> similar kernel is built so that this whole mess actually gets
> exercised.  Because the fixup only happens when a
> really-quite-improbable race gets hit, and the issues depend on stack
> alignment, which is presumably why Josh was able to submit a buggy
> series without noticing.
> 
> BUT: this is going to be utterly gross whenever anyone tries to
> implement shadow stacks for the kernel, and we might need to switch to
> a longjmp-like approach if that happens.

Here is an alternative idea (although similar to Steven’s and my code).

Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow
should be done by the calling convention (and gcc plugin should allow us to
enforce). Also assume that we hold a table with all source RIP and the
matching target.

Now, in the int3 handler can you take the faulting RIP and search for it in
the “static-calls” table, writing the RIP+5 (offset) into R10 (return
address) and the target into R11. You make the int3 handler to divert the
code execution by changing pt_regs->rip to point to a new function that does:

	push R10
	jmp __x86_indirect_thunk_r11

And then you are done. No?
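
A minimal sketch of the int3-handler side of this idea; the site table, its
lookup helper and the stub symbol are assumptions, only the register usage
follows the description above:

  /*
   * Sketch only.  find_static_call_site() is assumed to exist;
   * static_call_stub is the two-instruction sequence
   * "push %r10; jmp __x86_indirect_thunk_r11" described above.
   */
  struct static_call_site {
          unsigned long rip;      /* address of the patched call */
          unsigned long target;   /* current call target         */
  };

  static bool static_call_int3_divert(struct pt_regs *regs)
  {
          /* regs->ip points just past the int3 byte */
          struct static_call_site *site = find_static_call_site(regs->ip - 1);

          if (!site)
                  return false;

          regs->r10 = site->rip + 5;      /* return address */
          regs->r11 = site->target;       /* call target    */
          regs->ip  = (unsigned long)static_call_stub;
          return true;
  }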


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-11  1:34             ` Sean Christopherson
@ 2019-01-11  8:13               ` hpa
  0 siblings, 0 replies; 90+ messages in thread
From: hpa @ 2019-01-11  8:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Steven Rostedt, Josh Poimboeuf, Nadav Amit, X86 ML, LKML,
	Ard Biesheuvel, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On January 10, 2019 5:34:21 PM PST, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>On Thu, Jan 10, 2019 at 04:59:55PM -0800, hpa@zytor.com wrote:
>> On January 10, 2019 9:42:57 AM PST, Sean Christopherson
><sean.j.christopherson@intel.com> wrote:
>> >On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
>> >> On Thu, 10 Jan 2019 11:20:04 -0600
>> >> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> 
>> >> 
>> >> > > While I can't find a reason for hypervisors to emulate this
>> >instruction,
>> >> > > smarter people might find ways to turn it into a security
>> >exploit.  
>> >> > 
>> >> > Interesting point... but I wonder if it's a realistic concern. 
>> >BTW,
>> >> > text_poke_bp() also relies on undocumented behavior.
>> >> 
>> >> But we did get an official OK from Intel that it will work. Took a
>> >bit
>> >> of arm twisting to get them to do so, but they did. And it really
>is
>> >> pretty robust.
>> >
>> >Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
>> >certain atomicity guarantees go out the window if WC memtype is
>used.
>> 
>> If you run code from non-WB memory, all bets are off and you better
>> not be doing cross-modifying code.
>
>I wasn't thinking of running code from non-WB, but rather running code
>in WB while doing a CMC write via WC.

Same thing. Don't do that.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-10 18:04             ` Sean Christopherson
  2019-01-10 18:21               ` Josh Poimboeuf
  2019-01-10 18:24               ` Steven Rostedt
@ 2019-01-11 12:10               ` Alexandre Chartre
  2019-01-11 15:28                 ` Josh Poimboeuf
  2 siblings, 1 reply; 90+ messages in thread
From: Alexandre Chartre @ 2019-01-11 12:10 UTC (permalink / raw)
  To: Sean Christopherson, Steven Rostedt
  Cc: Josh Poimboeuf, Nadav Amit, X86 ML, LKML, Ard Biesheuvel,
	Andy Lutomirski, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira



On 01/10/2019 07:04 PM, Sean Christopherson wrote:
> On Thu, Jan 10, 2019 at 12:57:57PM -0500, Steven Rostedt wrote:
>> On Thu, 10 Jan 2019 09:42:57 -0800
>> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>
>>> On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
>>>> On Thu, 10 Jan 2019 11:20:04 -0600
>>>> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>
>>>>
>>>>>> While I can't find a reason for hypervisors to emulate this instruction,
>>>>>> smarter people might find ways to turn it into a security exploit.
>>>>>
>>>>> Interesting point... but I wonder if it's a realistic concern.  BTW,
>>>>> text_poke_bp() also relies on undocumented behavior.
>>>>
>>>> But we did get an official OK from Intel that it will work. Took a bit
>>>> of arm twisting to get them to do so, but they did. And it really is
>>>> pretty robust.
>>>
>>> Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
>>> certain atomicity guarantees go out the window if WC memtype is used.
>>
>> Note, the text_poke_bp() process was this: (nothing to do with atomic
>> guarantees)
>>
>> add breakpoint (one byte) to instruction.
>>
>> Sync all cores (send an IPI to each one).
>>
>> change the back half of the instruction (the rest of the instruction
>> after the breakpoint).
>>
>> Sync all cores
>>
>> Remove the breakpoint with the new byte of the new instruction.
>>
>>
>> What atomicity guarantee does the above require?
>
> I was asking in the context of static calls.  My understanding is that
> the write to change the imm32 of the CALL needs to be atomic from a
> code fetch perspective so that we don't jump to a junk address.
>

To avoid any issue with live patching the call instruction, what about
toggling between two call instructions: one would be the currently active
call, while the other would currently be inactive but to be used after a
change is made. You can safely patch the inactive call and then toggle
the call flow (using a jump label) between the active and inactive calls.

So instead of having a single call instruction:

	call function

You would have:

	STATIC_JUMP_IF_TRUE label, key
	call function1
	jmp done
label:
	call function2
done:

If the key is set so that function1 is currently called then you can
safely update the call instruction for function2. Once this is done,
just flip the key to make the function2 call active. On a next update,
you would, of course, have to switch and update the call for function1.


alex.
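
A minimal sketch of the update side of this scheme as described above; struct
twin_call and its fields are assumptions, while text_poke() and the static-key
helpers are existing kernel interfaces.  Note that the follow-ups below point
out a remaining race when two updates happen back to back:

  /*
   * Sketch only.  One call site is live, the other is never executed;
   * patch the dead one, then flip the key.
   */
  struct twin_call {
          struct static_key key;
          void *call1_site;       /* address of "call function1" */
          void *call2_site;       /* address of "call function2" */
  };

  static void twin_call_update(struct twin_call *tc, void *new_target)
  {
          void *site = static_key_enabled(&tc->key) ? tc->call1_site
                                                    : tc->call2_site;
          s32 rel = (s32)((long)new_target - ((long)site + 5));

          /* patch the rel32 of the currently inactive call */
          text_poke(site + 1, &rel, 4);

          /* then atomically redirect the control flow to it */
          if (static_key_enabled(&tc->key))
                  static_key_disable(&tc->key);
          else
                  static_key_enable(&tc->key);
  }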

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11  1:47         ` Nadav Amit
@ 2019-01-11 15:15           ` Josh Poimboeuf
  2019-01-11 15:48             ` Nadav Amit
  2019-01-11 19:03             ` Linus Torvalds
  0 siblings, 2 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 15:15 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 01:47:01AM +0000, Nadav Amit wrote:
> Here is an alternative idea (although similar to Steven’s and my code).
> 
> Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow
> should be done by the calling convention (and gcc plugin should allow us to
> enforce). Also assume that we hold a table with all source RIP and the
> matching target.
> 
> Now, in the int3 handler can you take the faulting RIP and search for it in
> the “static-calls” table, writing the RIP+5 (offset) into R10 (return
> address) and the target into R11. You make the int3 handler to divert the
> code execution by changing pt_regs->rip to point to a new function that does:
> 
> 	push R10
> 	jmp __x86_indirect_thunk_r11
> 
> And then you are done. No?

IIUC, that sounds pretty much like what Steven proposed:

  https://lkml.kernel.org/r/20181129122000.7fb4fb04@gandalf.local.home

I liked the idea, BUT, how would it work for callee-saved PV ops?  In
that case there's only one clobbered register to work with (rax).

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-11 12:10               ` Alexandre Chartre
@ 2019-01-11 15:28                 ` Josh Poimboeuf
  2019-01-11 16:46                   ` Alexandre Chartre
  0 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 15:28 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Sean Christopherson, Steven Rostedt, Nadav Amit, X86 ML, LKML,
	Ard Biesheuvel, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 01:10:52PM +0100, Alexandre Chartre wrote:
> To avoid any issue with live patching the call instruction, what about
> toggling between two call instructions: one would be the currently active
> call, while the other would currently be inactive but to be used after a
> change is made. You can safely patch the inactive call and then toggle
> the call flow (using a jump label) between the active and inactive calls.
> 
> So instead of having a single call instruction:
> 
> 	call function
> 
> You would have:
> 
> 	STATIC_JUMP_IF_TRUE label, key
> 	call function1
> 	jmp done
> label:
> 	call function2
> done:
> 
> If the key is set so that function1 is currently called then you can
> safely update the call instruction for function2. Once this is done,
> just flip the key to make the function2 call active. On a next update,
> you would, of course, have to switch and update the call for function1.

What about the following race?

CPU1						CPU2
static key is false, doesn't jump
task gets preempted before calling function1
						change static key to true
						start patching "call function1"
task resumes, sees inconsistent call instruction

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 15:15           ` Josh Poimboeuf
@ 2019-01-11 15:48             ` Nadav Amit
  2019-01-11 16:07               ` Josh Poimboeuf
  2019-01-11 19:03             ` Linus Torvalds
  1 sibling, 1 reply; 90+ messages in thread
From: Nadav Amit @ 2019-01-11 15:48 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 11, 2019, at 7:15 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Fri, Jan 11, 2019 at 01:47:01AM +0000, Nadav Amit wrote:
>> Here is an alternative idea (although similar to Steven’s and my code).
>> 
>> Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow
>> should be done by the calling convention (and gcc plugin should allow us to
>> enforce). Also assume that we hold a table with all source RIP and the
>> matching target.
>> 
>> Now, in the int3 handler can you take the faulting RIP and search for it in
>> the “static-calls” table, writing the RIP+5 (offset) into R10 (return
>> address) and the target into R11. You make the int3 handler to divert the
>> code execution by changing pt_regs->rip to point to a new function that does:
>> 
>> 	push R10
>> 	jmp __x86_indirect_thunk_r11
>> 
>> And then you are done. No?
> 
> IIUC, that sounds pretty much like what Steven proposed:
> 
>  https://lkml.kernel.org/r/20181129122000.7fb4fb04@gandalf.local.home

Stupid me. I’ve remembered it slightly differently (the caller saving the
target in a register).

> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
> that case there's only one clobbered register to work with (rax).

That would be more tricky. How about using a per-CPU trampoline to
hold a direct call to the target and temporarily disable preemption (which
might be simpler by disabling IRQs):

Static-call modifier:

        1. synchronize_sched() to ensure per-cpu trampoline is not used
	2. Patches the jmp in a per-cpu trampoline (see below)
	3. Saves the call source RIP in [per-cpu scratchpad RIP] (below) 
	4. Configures the int3 handler to use static-call int3 handler
	5. Patches the call target (as it currently does).

Static-call int3 handler:
	1. Changes flags on the stack to keep IRQs disabled on return
	2. Jumps to per-cpu trampoline on return

Per-cpu trampoline:
	push [per-CPU scratchpad RIP]
	sti
	jmp [ target ] (this one is patched)

Note that no IRQ should be possible between the STI and the JMP due to STI
blocking.

What do you say?
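
A minimal sketch of the "static-call int3 handler" step described above;
static_call_tramp stands for the (patched) trampoline and the hook itself is
illustrative:

  /*
   * Sketch only: keep IRQs off across the return and land in the
   * trampoline, which re-enables them with sti right before the
   * patched jmp.
   */
  static bool static_call_int3(struct pt_regs *regs)
  {
          regs->flags &= ~X86_EFLAGS_IF;                  /* step 1 */
          regs->ip = (unsigned long)static_call_tramp;    /* step 2 */
          return true;
  }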


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 15:48             ` Nadav Amit
@ 2019-01-11 16:07               ` Josh Poimboeuf
  2019-01-11 17:23                 ` Nadav Amit
  0 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 16:07 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 03:48:59PM +0000, Nadav Amit wrote:
> > I liked the idea, BUT, how would it work for callee-saved PV ops?  In
> > that case there's only one clobbered register to work with (rax).
> 
> That would be more tricky. How about using a per-CPU trampoline to
> hold a direct call to the target and temporarily disable preemption (which
> might be simpler by disabling IRQs):
> 
> Static-call modifier:
> 
>         1. synchronize_sched() to ensure per-cpu trampoline is not used
> 	2. Patches the jmp in a per-cpu trampoline (see below)
> 	3. Saves the call source RIP in [per-cpu scratchpad RIP] (below) 
> 	4. Configures the int3 handler to use static-call int3 handler
> 	5. Patches the call target (as it currently does).
> 
> Static-call int3 handler:
> 	1. Changes flags on the stack to keep IRQs disabled on return
> 	2. Jumps to per-cpu trampoline on return
> 
> Per-cpu trampoline:
> 	push [per-CPU scratchpad RIP]
> 	sti
> 	jmp [ target ] (this one is patched)
> 
> Note that no IRQ should be possible between the STI and the JMP due to STI
> blocking.
> 
> What do you say?

This could work, but it's more complex than I was hoping for.

My current leading contender is to do call emulation in the #BP handler,
either by making a gap or by doing Andy's longjmp-style thingie.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-11 15:28                 ` Josh Poimboeuf
@ 2019-01-11 16:46                   ` Alexandre Chartre
  2019-01-11 16:57                     ` Josh Poimboeuf
  0 siblings, 1 reply; 90+ messages in thread
From: Alexandre Chartre @ 2019-01-11 16:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Sean Christopherson, Steven Rostedt, Nadav Amit, X86 ML, LKML,
	Ard Biesheuvel, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira



On 01/11/2019 04:28 PM, Josh Poimboeuf wrote:
> On Fri, Jan 11, 2019 at 01:10:52PM +0100, Alexandre Chartre wrote:
>> To avoid any issue with live patching the call instruction, what about
>> toggling between two call instructions: one would be the currently active
>> call, while the other would currently be inactive but to be used after a
>> change is made. You can safely patch the inactive call and then toggle
>> the call flow (using a jump label) between the active and inactive calls.
>>
>> So instead of having a single call instruction:
>>
>> 	call function
>>
>> You would have:
>>
>> 	STATIC_JUMP_IF_TRUE label, key
>> 	call function1
>> 	jmp done
>> label:
>> 	call function2
>> done:
>>
>> If the key is set so that function1 is currently called then you can
>> safely update the call instruction for function2. Once this is done,
>> just flip the key to make the function2 call active. On a next update,
>> you would, of course, have to switch and update the call for function1.
>
> What about the following race?
>
> CPU1						CPU2
> static key is false, doesn't jump
> task gets preempted before calling function1
> 						change static key to true
> 						start patching "call function1"
> task resumes, sees inconsistent call instruction
>

If the function1 call is active then it won't be changed, you will change
function2. However, I presume you can still have a race but if the function
is changed twice before calling function1:

CPU1						CPU2
static key is false, doesn't jump
task gets preempted before calling function1
                                                 -- first function change --
                                                 patch "call function2"
                                                 change static key to true
                                                 -- second function change --
                                                 start patching "call function1"
task resumes, sees inconsistent call instruction

So right, that's a problem.

alex.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-11 16:46                   ` Alexandre Chartre
@ 2019-01-11 16:57                     ` Josh Poimboeuf
  2019-01-11 17:41                       ` Jason Baron
  2019-01-15 11:10                       ` Alexandre Chartre
  0 siblings, 2 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 16:57 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Sean Christopherson, Steven Rostedt, Nadav Amit, X86 ML, LKML,
	Ard Biesheuvel, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 05:46:36PM +0100, Alexandre Chartre wrote:
> 
> 
> On 01/11/2019 04:28 PM, Josh Poimboeuf wrote:
> > On Fri, Jan 11, 2019 at 01:10:52PM +0100, Alexandre Chartre wrote:
> > > To avoid any issue with live patching the call instruction, what about
> > > toggling between two call instructions: one would be the currently active
> > > call, while the other would currently be inactive but to be used after a
> > > change is made. You can safely patch the inactive call and then toggle
> > > the call flow (using a jump label) between the active and inactive calls.
> > > 
> > > So instead of having a single call instruction:
> > > 
> > > 	call function
> > > 
> > > You would have:
> > > 
> > > 	STATIC_JUMP_IF_TRUE label, key
> > > 	call function1
> > > 	jmp done
> > > label:
> > > 	call function2
> > > done:
> > > 
> > > If the key is set so that function1 is currently called then you can
> > > safely update the call instruction for function2. Once this is done,
> > > just flip the key to make the function2 call active. On a next update,
> > > you would, of course, have to switch and update the call for function1.
> > 
> > What about the following race?
> > 
> > CPU1						CPU2
> > static key is false, doesn't jump
> > task gets preempted before calling function1
> > 						change static key to true
> > 						start patching "call function1"
> > task resumes, sees inconsistent call instruction
> > 
> 
> If the function1 call is active then it won't be changed, you will change
> function2. However, I presume you can still have a race but if the function
> is changed twice before calling function1:
> 
> CPU1						CPU2
> static key is false, doesn't jump
> task gets preempted before calling function1
>                                                 -- first function change --
>                                                 patch "call function2"
>                                                 change static key to true
>                                                 -- second function change --
>                                                 start patching "call function1"
> task resumes, sees inconsistent call instruction
> 
> So right, that's a problem.

Right, that's what I meant to say :-)

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 16:07               ` Josh Poimboeuf
@ 2019-01-11 17:23                 ` Nadav Amit
  0 siblings, 0 replies; 90+ messages in thread
From: Nadav Amit @ 2019-01-11 17:23 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 11, 2019, at 8:07 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Fri, Jan 11, 2019 at 03:48:59PM +0000, Nadav Amit wrote:
>>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>>> that case there's only one clobbered register to work with (rax).
>> 
>> That would be more tricky. How about using a per-CPU trampoline to
>> hold a direct call to the target and temporarily disable preemption (which
>> might be simpler by disabling IRQs):
>> 

Allow me to simplify/correct:

>> Static-call modifier:
>> 
>>        1. synchronize_sched() to ensure per-cpu trampoline is not used
No need for (1) (We are going to sync using IPI).

>> 	2. Patches the jmp in a per-cpu trampoline (see below)
>> 	3. Saves the call source RIP in [per-cpu scratchpad RIP] (below)
Neither of these needs to be per-cpu

>> 	4. Configures the int3 handler to use static-call int3 handler
>> 	5. Patches the call target (as it currently does).
Note that text_poke_bp() would eventually do:

	on_each_cpu(do_sync_core, NULL, 1);

So you can be sure no cores run the trampoline after (5).


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-11 16:57                     ` Josh Poimboeuf
@ 2019-01-11 17:41                       ` Jason Baron
  2019-01-11 17:54                         ` Nadav Amit
  2019-01-15 11:10                       ` Alexandre Chartre
  1 sibling, 1 reply; 90+ messages in thread
From: Jason Baron @ 2019-01-11 17:41 UTC (permalink / raw)
  To: Josh Poimboeuf, Alexandre Chartre
  Cc: Sean Christopherson, Steven Rostedt, Nadav Amit, X86 ML, LKML,
	Ard Biesheuvel, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On 1/11/19 11:57 AM, Josh Poimboeuf wrote:
> On Fri, Jan 11, 2019 at 05:46:36PM +0100, Alexandre Chartre wrote:
>>
>>
>> On 01/11/2019 04:28 PM, Josh Poimboeuf wrote:
>>> On Fri, Jan 11, 2019 at 01:10:52PM +0100, Alexandre Chartre wrote:
>>>> To avoid any issue with live patching the call instruction, what about
>>>> toggling between two call instructions: one would be the currently active
>>>> call, while the other would currently be inactive but to be used after a
>>>> change is made. You can safely patch the inactive call and then toggle
>>>> the call flow (using a jump label) between the active and inactive calls.
>>>>
>>>> So instead of having a single call instruction:
>>>>
>>>> 	call function
>>>>
>>>> You would have:
>>>>
>>>> 	STATIC_JUMP_IF_TRUE label, key
>>>> 	call function1
>>>> 	jmp done
>>>> label:
>>>> 	call function2
>>>> done:
>>>>
>>>> If the key is set so that function1 is currently called then you can
>>>> safely update the call instruction for function2. Once this is done,
>>>> just flip the key to make the function2 call active. On a next update,
>>>> you would, of course, have to switch and update the call for function1.
>>>
>>> What about the following race?
>>>
>>> CPU1						CPU2
>>> static key is false, doesn't jump
>>> task gets preempted before calling function1
>>> 						change static key to true
>>> 						start patching "call function1"
>>> task resumes, sees inconsistent call instruction
>>>
>>
>> If the function1 call is active then it won't be changed, you will change
>> function2. However, I presume you can still have a race but if the function
>> is changed twice before calling function1:
>>
>> CPU1						CPU2
>> static key is false, doesn't jump
>> task gets preempted before calling function1
>>                                                 -- first function change --
>>                                                 patch "call function2"
>>                                                 change static key to true
>>                                                 -- second function change --
>>                                                 start patching "call function1"
>> task resumes, sees inconsistent call instruction
>>
>> So right, that's a problem.
> 
> Right, that's what I meant to say :-)
> 

could you use something like synchronize_rcu_tasks() between successive
updates to guarantee nobody's stuck in the middle of the call
instruction update? Yes, it's really slow but the update path is slow anyways.

Thanks,

-Jason
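
A minimal sketch of what that could look like on top of the two-call-site
update sketched earlier; synchronize_rcu_tasks() is the existing kernel
primitive, the rest is illustrative:

  /*
   * Sketch only: after flipping the key, wait until every task has
   * passed a voluntary context switch, so nobody can still be
   * preempted between the static jump and the call it selected when
   * the other slot gets patched again.
   */
  static void twin_call_update_sync(struct twin_call *tc, void *new_target)
  {
          twin_call_update(tc, new_target);       /* as sketched earlier */
          synchronize_rcu_tasks();
  }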


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-11 17:41                       ` Jason Baron
@ 2019-01-11 17:54                         ` Nadav Amit
  0 siblings, 0 replies; 90+ messages in thread
From: Nadav Amit @ 2019-01-11 17:54 UTC (permalink / raw)
  To: Jason Baron
  Cc: Josh Poimboeuf, Alexandre Chartre, Sean Christopherson,
	Steven Rostedt, X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jiri Kosina, David Laight, Borislav Petkov,
	Julia Cartwright, Jessica Yu, H. Peter Anvin, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

> On Jan 11, 2019, at 9:41 AM, Jason Baron <jbaron@akamai.com> wrote:
> 
> On 1/11/19 11:57 AM, Josh Poimboeuf wrote:
>> On Fri, Jan 11, 2019 at 05:46:36PM +0100, Alexandre Chartre wrote:
>>> On 01/11/2019 04:28 PM, Josh Poimboeuf wrote:
>>>> On Fri, Jan 11, 2019 at 01:10:52PM +0100, Alexandre Chartre wrote:
>>>>> To avoid any issue with live patching the call instruction, what about
>>>>> toggling between two call instructions: one would be the currently active
>>>>> call, while the other would currently be inactive but to be used after a
>>>>> change is made. You can safely patch the inactive call and then toggle
>>>>> the call flow (using a jump label) between the active and inactive calls.
>>>>> 
>>>>> So instead of having a single call instruction:
>>>>> 
>>>>> 	call function
>>>>> 
>>>>> You would have:
>>>>> 
>>>>> 	STATIC_JUMP_IF_TRUE label, key
>>>>> 	call function1
>>>>> 	jmp done
>>>>> label:
>>>>> 	call function2
>>>>> done:
>>>>> 
>>>>> If the key is set so that function1 is currently called then you can
>>>>> safely update the call instruction for function2. Once this is done,
>>>>> just flip the key to make the function2 call active. On a next update,
>>>>> you would, of course, have to switch and update the call for function1.
>>>> 
>>>> What about the following race?
>>>> 
>>>> CPU1						CPU2
>>>> static key is false, doesn't jump
>>>> task gets preempted before calling function1
>>>> 						change static key to true
>>>> 						start patching "call function1"
>>>> task resumes, sees inconsistent call instruction
>>> 
>>> If the function1 call is active then it won't be changed, you will change
>>> function2. However, I presume you can still have a race but if the function
>>> is changed twice before calling function1:
>>> 
>>> CPU1						CPU2
>>> static key is false, doesn't jump
>>> task gets preempted before calling function1
>>>                                                -- first function change --
>>>                                                patch "call function2"
>>>                                                change static key to true
>>>                                                -- second function change --
>>>                                                start patching "call function1"
>>> task resumes, sees inconsistent call instruction
>>> 
>>> So right, that's a problem.
>> 
>> Right, that's what I meant to say :-)
> 
> could you use something like synchronize_rcu_tasks() between successive
> updates to guarantee nobody's stuck in the middle of the call
> instruction update? Yes, it's really slow but the update path is slow anyways.

You would need to disable preemption or IRQs before the call, which is not
something you want to do (when would you enable it?)

Having said that, I suggested something somewhat similar things here:
https://lore.kernel.org/lkml/F6735FF5-4D62-4C4E-A145-751E6469CE9E@vmware.com/
https://lore.kernel.org/lkml/CCF7D3C7-9D12-489B-B778-C2156D2DBF47@vmware.com/


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 15:15           ` Josh Poimboeuf
  2019-01-11 15:48             ` Nadav Amit
@ 2019-01-11 19:03             ` Linus Torvalds
  2019-01-11 19:17               ` Nadav Amit
                                 ` (2 more replies)
  1 sibling, 3 replies; 90+ messages in thread
From: Linus Torvalds @ 2019-01-11 19:03 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> >
> > Now, in the int3 handler can you take the faulting RIP and search for it in
> > the “static-calls” table, writing the RIP+5 (offset) into R10 (return
> > address) and the target into R11. You make the int3 handler to divert the
> > code execution by changing pt_regs->rip to point to a new function that does:
> >
> >       push R10
> >       jmp __x86_indirect_thunk_r11
> >
> > And then you are done. No?
>
> IIUC, that sounds pretty much like what Steven proposed:
>
>   https://lkml.kernel.org/r/20181129122000.7fb4fb04@gandalf.local.home
>
> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
> that case there's only one clobbered register to work with (rax).

Actually, there's a much simpler model now that I think about it.

The BP fixup just fixes up %rip to point to "bp_int3_handler".

And that's just a random text address set up by "text_poke_bp()".

So how about the static call rewriting simply do this:

 - for each static call:

 1)   create a fixup code stub that does

        push $returnaddressforTHIScall
        jmp targetforTHIScall

 2) do

        on_each_cpu(do_sync_core, NULL, 1);

     to make sure all CPUs see this generated code

  3) do

        text_poke_bp(addr, newcode, newlen, generatedcode);

Ta-daa! Done.

In fact, it turns out that even the extra "do_sync_core()" in #2 is
unnecessary, because taking the BP will be serializing on the CPU that
takes it, so we can skip it.

End result: the text_poke_bp() function will do the two do_sync_core
IPIs that guarantee that by the time it returns, no other CPU is
using the generated code any more, so it can be re-used for the next
static call fixup.

Notice? No odd emulation, no need to adjust the stack in the BP
handler, just the regular "return to a different IP".
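
(To make that concrete, a rough C sketch of the rewrite sequence for one
call site is below.  emit_stub(), emit_call() and stub_page are made-up
names for illustration; only text_poke_bp(), on_each_cpu() and the
do_sync_core() helper from arch/x86/kernel/alternative.c are real:)

        /* Sketch only: repoint one 5-byte static call site at 'target'. */
        static void static_call_rewrite_one(void *call_site, void *target)
        {
                unsigned char call_insn[5];
                void *ret_addr = call_site + 5;   /* insn after the CALL */

                /* 1) generate "push $ret_addr; jmp target" into the stub */
                emit_stub(stub_page, ret_addr, target);

                /* 2) make sure every CPU sees the freshly written stub */
                on_each_cpu(do_sync_core, NULL, 1);

                /* 3) patch the call site; a CPU hitting the int3 meanwhile
                 *    has its %rip pointed at the stub by poke_int3_handler() */
                emit_call(call_insn, call_site, target);
                text_poke_bp(call_site, call_insn, sizeof(call_insn), stub_page);
        }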

Now, there is a nasty special case with that stub, though.

So the nasty thing with the whole "generate a stub for each call" case:
because it's dynamic and because of the re-use of the stub, you could
be in the situation where:

  CPU1                  CPU2
  ----                  ----

  generate a stub
  on_each_cpu(do_sync_core..)
  text_poke_bp()
  ...

  rewrite to BP
                        trigger the BP
                        return to the stub
                        run the first instruction of the stub
                        *INTERRUPT causes rescheduling*

  on_each_cpu(do_sync_core..)
  rewrite to good instruction
  on_each_cpu(do_sync_core..)

  free or re-generate the stub

                        !! The stub is still in use !!

So that simple "just generate the stub dynamically" isn't so simple after all.

But it turns out that that is really simple to handle too. How do we do that?

We do that by giving the BP handler *two* code sequences, and we make
the BP handler pick one depending on whether it is returning to a
"interrupts disabled" or "interrupts enabled" case.

So the BP handler does this:

 - if we're returning with interrupts disabled, pick the simple stub

 - if we're returning with interrupts enabled, clear IF in the return
%rflags, and pick a *slightly* more complex stub:

        push $returnaddressforTHIScall
        sti
        jmp targetforTHIScall

and now the STI shadow will mean that this sequence is uninterruptible.

So we'd not do complex emulation of the call instruction at BP time,
but we'd do that *trivial* change at BP time.

This seems simple, doesn't need any temporary registers at all, and
doesn't need any extra stack magic. It literally needs just a trivial
sequence in poke_int3_handler().

Then we'd change the end of poke_int3_handler() to do something like
this instead:

        void *newip = bp_int3_handler;
        ..
        if (new == magic_static_call_bp_int3_handler) {
                if (regs->flags & X86_EFLAGS_IF) {
                        newip = magic_static_call_bp_int3_handler_sti;
                        regs->flags &= ~X86_EFLAGS_IF;
                }
        }
        regs->ip = (unsigned long) newip;
        return 1;

AND now we're *really* done.

Does anybody see any issues in this?

              Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 19:03             ` Linus Torvalds
@ 2019-01-11 19:17               ` Nadav Amit
  2019-01-11 19:23               ` hpa
  2019-01-11 20:04               ` Josh Poimboeuf
  2 siblings, 0 replies; 90+ messages in thread
From: Nadav Amit @ 2019-01-11 19:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josh Poimboeuf, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 11, 2019, at 11:03 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> Now, in the int3 handler can you take the faulting RIP and search for it in
>>> the “static-calls” table, writing the RIP+5 (offset) into R10 (return
>>> address) and the target into R11. You make the int3 handler to divert the
>>> code execution by changing pt_regs->rip to point to a new function that does:
>>> 
>>>      push R10
>>>      jmp __x86_indirect_thunk_r11
>>> 
>>> And then you are done. No?
>> 
>> IIUC, that sounds pretty much like what Steven proposed:
>> 
>>  https://lkml.kernel.org/r/20181129122000.7fb4fb04@gandalf.local.home
>> 
>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>> that case there's only one clobbered register to work with (rax).
> 
> Actually, there's a much simpler model now that I think about it.
> 
> The BP fixup just fixes up %rip to to point to "bp_int3_handler".
> 
> And that's just a random text address set up by "text_poke_bp()".
> 
> So how about the static call rewriting simply do this:
> 
> - for each static call:
> 
> 1)   create a fixup code stub that does
> 
>        push $returnaddressforTHIScall
>        jmp targetforTHIScall
> 
> 2) do
> 
>        on_each_cpu(do_sync_core, NULL, 1);
> 
>     to make sure all CPU's see this generated code
> 
>  3) do
> 
>        text_poke_bp(addr, newcode, newlen, generatedcode);
> 
> Ta-daa! Done.
> 
> In fact, it turns out that even the extra "do_sync_core()" in #2 is
> unnecessary, because taking the BP will be serializing on the CPU that
> takes it, so we can skip it.
> 
> End result: the text_poke_bp() function will do the two do_sync_core
> IPI's that guarantee that by the time it returns, no other CPU is
> using the generated code any more, so it can be re-used for the next
> static call fixup.
> 
> Notice? No odd emulation, no need to adjust the stack in the BP
> handler, just the regular "return to a different IP".
> 
> Now, there is a nasty special case with that stub, though.
> 
> So nasty thing with the whole "generate a stub for each call" case:
> because it's dynamic and because of the re-use of the stub, you could
> be in the situation where:
> 
>  CPU1                  CPU2
>  ----                  ----
> 
>  generate a stub
>  on_each_cpu(do_sync_core..)
>  text_poke_bp()
>  ...
> 
>  rewrite to BP
>                        trigger the BP
>                        return to the stub
>                        fun the first instruction of the stub
>                        *INTERRUPT causes rescheduling*
> 
>  on_each_cpu(do_sync_core..)
>  rewrite to good instruction
>  on_each_cpu(do_sync_core..)
> 
>  free or re-generate the stub
> 
>                        !! The stub is still in use !!
> 
> So that simple "just generate the stub dynamically" isn't so simple after all.
> 
> But it turns out that that is really simple to handle too. How do we do that?
> 
> We do that by giving the BP handler *two* code sequences, and we make
> the BP handler pick one depending on whether it is returning to a
> "interrupts disabled" or "interrupts enabled" case.
> 
> So the BP handler does this:
> 
> - if we're returning with interrupts disabled, pick the simple stub
> 
> - if we're returning with interrupts enabled, clkear IF in the return
> %rflags, and pick a *slightly* more complex stub:
> 
>        push $returnaddressforTHIScall
>        sti
>        jmp targetforTHIScall
> 
> and now the STI shadow will mean that this sequence is uninterruptible.
> 
> So we'd not do complex emulation of the call instruction at BP time,
> but we'd do that *trivial* change at BP time.
> 
> This seems simple, doesn't need any temporary registers at all, and
> doesn't need any extra stack magic. It literally needs just a trivial
> sequence in poke_int3_handler().
> 
> The we'd change the end of poke_int3_handler() to do something like
> this instead:
> 
>        void *newip = bp_int3_handler;
>        ..
>        if (new == magic_static_call_bp_int3_handler) {
>                if (regs->flags &X86_FLAGS_IF) {
>                        newip = magic_static_call_bp_int3_handler_sti;
>                        regs->flags &= ~X86_FLAGS_IF;
>        }
>        regs->ip = (unsigned long) newip;
>        return 1;
> 
> AAND now we're *really* done.
> 
> Does anybody see any issues in this?

I think it’s a better articulated, more detailed version of the solution I
proposed in this thread (which also used a patched trampoline with STI+JMP).

So obviously I like it. ;-)

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 19:03             ` Linus Torvalds
  2019-01-11 19:17               ` Nadav Amit
@ 2019-01-11 19:23               ` hpa
  2019-01-11 19:33                 ` Nadav Amit
                                   ` (2 more replies)
  2019-01-11 20:04               ` Josh Poimboeuf
  2 siblings, 3 replies; 90+ messages in thread
From: hpa @ 2019-01-11 19:23 UTC (permalink / raw)
  To: Linus Torvalds, Josh Poimboeuf
  Cc: Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On January 11, 2019 11:03:30 AM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf <jpoimboe@redhat.com>
>wrote:
>>
>> >
>> > Now, in the int3 handler can you take the faulting RIP and search
>for it in
>> > the “static-calls” table, writing the RIP+5 (offset) into R10
>(return
>> > address) and the target into R11. You make the int3 handler to
>divert the
>> > code execution by changing pt_regs->rip to point to a new function
>that does:
>> >
>> >       push R10
>> >       jmp __x86_indirect_thunk_r11
>> >
>> > And then you are done. No?
>>
>> IIUC, that sounds pretty much like what Steven proposed:
>>
>>  
>https://lkml.kernel.org/r/20181129122000.7fb4fb04@gandalf.local.home
>>
>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>> that case there's only one clobbered register to work with (rax).
>
>Actually, there's a much simpler model now that I think about it.
>
>The BP fixup just fixes up %rip to to point to "bp_int3_handler".
>
>And that's just a random text address set up by "text_poke_bp()".
>
>So how about the static call rewriting simply do this:
>
> - for each static call:
>
> 1)   create a fixup code stub that does
>
>        push $returnaddressforTHIScall
>        jmp targetforTHIScall
>
> 2) do
>
>        on_each_cpu(do_sync_core, NULL, 1);
>
>     to make sure all CPU's see this generated code
>
>  3) do
>
>        text_poke_bp(addr, newcode, newlen, generatedcode);
>
>Ta-daa! Done.
>
>In fact, it turns out that even the extra "do_sync_core()" in #2 is
>unnecessary, because taking the BP will be serializing on the CPU that
>takes it, so we can skip it.
>
>End result: the text_poke_bp() function will do the two do_sync_core
>IPI's that guarantee that by the time it returns, no other CPU is
>using the generated code any more, so it can be re-used for the next
>static call fixup.
>
>Notice? No odd emulation, no need to adjust the stack in the BP
>handler, just the regular "return to a different IP".
>
>Now, there is a nasty special case with that stub, though.
>
>So nasty thing with the whole "generate a stub for each call" case:
>because it's dynamic and because of the re-use of the stub, you could
>be in the situation where:
>
>  CPU1                  CPU2
>  ----                  ----
>
>  generate a stub
>  on_each_cpu(do_sync_core..)
>  text_poke_bp()
>  ...
>
>  rewrite to BP
>                        trigger the BP
>                        return to the stub
>                        fun the first instruction of the stub
>                        *INTERRUPT causes rescheduling*
>
>  on_each_cpu(do_sync_core..)
>  rewrite to good instruction
>  on_each_cpu(do_sync_core..)
>
>  free or re-generate the stub
>
>                        !! The stub is still in use !!
>
>So that simple "just generate the stub dynamically" isn't so simple
>after all.
>
>But it turns out that that is really simple to handle too. How do we do
>that?
>
>We do that by giving the BP handler *two* code sequences, and we make
>the BP handler pick one depending on whether it is returning to a
>"interrupts disabled" or "interrupts enabled" case.
>
>So the BP handler does this:
>
> - if we're returning with interrupts disabled, pick the simple stub
>
> - if we're returning with interrupts enabled, clkear IF in the return
>%rflags, and pick a *slightly* more complex stub:
>
>        push $returnaddressforTHIScall
>        sti
>        jmp targetforTHIScall
>
>and now the STI shadow will mean that this sequence is uninterruptible.
>
>So we'd not do complex emulation of the call instruction at BP time,
>but we'd do that *trivial* change at BP time.
>
>This seems simple, doesn't need any temporary registers at all, and
>doesn't need any extra stack magic. It literally needs just a trivial
>sequence in poke_int3_handler().
>
>The we'd change the end of poke_int3_handler() to do something like
>this instead:
>
>        void *newip = bp_int3_handler;
>        ..
>        if (new == magic_static_call_bp_int3_handler) {
>                if (regs->flags &X86_FLAGS_IF) {
>                        newip = magic_static_call_bp_int3_handler_sti;
>                        regs->flags &= ~X86_FLAGS_IF;
>        }
>        regs->ip = (unsigned long) newip;
>        return 1;
>
>AAND now we're *really* done.
>
>Does anybody see any issues in this?
>
>              Linus

I still don't see why we can't simply spin in the #BP handler until the patch is complete.

We can't have the #BP handler do any additional patching, as previously discussed, but spinning should be perfectly safe. The simplest way to spin is to just IRET; that both serializes and will re-take the exception if the patch is still in progress.

It requires exactly *no* awareness in the #BP handler, and allows for the call to be replaced with inline code or a simple NOP if desired (or vice versa, as long as it is a single instruction).
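
(A minimal sketch of that, assuming a hypothetical is_static_call_site()
lookup so the handler can recognize its own poke sites:)

        /* Sketch only: "spin" by re-executing the poked byte after IRET. */
        int poke_int3_handler_spin(struct pt_regs *regs)
        {
                if (!is_static_call_site(regs->ip - 1))
                        return 0;               /* not one of our poke sites */

                /*
                 * Back up over the int3; the IRET then re-executes whatever
                 * byte is there now: int3 again (spin) or the new instruction.
                 */
                regs->ip -= 1;
                return 1;
        }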

If I'm missing something, then please educate me or point me to previous discussion; I would greatly appreciate it.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 19:23               ` hpa
@ 2019-01-11 19:33                 ` Nadav Amit
  2019-01-11 19:34                 ` Linus Torvalds
  2019-01-11 19:39                 ` Jiri Kosina
  2 siblings, 0 replies; 90+ messages in thread
From: Nadav Amit @ 2019-01-11 19:33 UTC (permalink / raw)
  To: hpa
  Cc: Linus Torvalds, Josh Poimboeuf, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

> On Jan 11, 2019, at 11:23 AM, hpa@zytor.com wrote:
> 
> On January 11, 2019 11:03:30 AM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf <jpoimboe@redhat.com>
>> wrote:
>>>> Now, in the int3 handler can you take the faulting RIP and search
>> for it in
>>>> the “static-calls” table, writing the RIP+5 (offset) into R10
>> (return
>>>> address) and the target into R11. You make the int3 handler to
>> divert the
>>>> code execution by changing pt_regs->rip to point to a new function
>> that does:
>>>>      push R10
>>>>      jmp __x86_indirect_thunk_r11
>>>> 
>>>> And then you are done. No?
>>> 
>>> IIUC, that sounds pretty much like what Steven proposed:
>> https://lkml.kernel.org/r/20181129122000.7fb4fb04@gandalf.local.home
>>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>>> that case there's only one clobbered register to work with (rax).
>> 
>> Actually, there's a much simpler model now that I think about it.
>> 
>> The BP fixup just fixes up %rip to to point to "bp_int3_handler".
>> 
>> And that's just a random text address set up by "text_poke_bp()".
>> 
>> So how about the static call rewriting simply do this:
>> 
>> - for each static call:
>> 
>> 1)   create a fixup code stub that does
>> 
>>       push $returnaddressforTHIScall
>>       jmp targetforTHIScall
>> 
>> 2) do
>> 
>>       on_each_cpu(do_sync_core, NULL, 1);
>> 
>>    to make sure all CPU's see this generated code
>> 
>> 3) do
>> 
>>       text_poke_bp(addr, newcode, newlen, generatedcode);
>> 
>> Ta-daa! Done.
>> 
>> In fact, it turns out that even the extra "do_sync_core()" in #2 is
>> unnecessary, because taking the BP will be serializing on the CPU that
>> takes it, so we can skip it.
>> 
>> End result: the text_poke_bp() function will do the two do_sync_core
>> IPI's that guarantee that by the time it returns, no other CPU is
>> using the generated code any more, so it can be re-used for the next
>> static call fixup.
>> 
>> Notice? No odd emulation, no need to adjust the stack in the BP
>> handler, just the regular "return to a different IP".
>> 
>> Now, there is a nasty special case with that stub, though.
>> 
>> So nasty thing with the whole "generate a stub for each call" case:
>> because it's dynamic and because of the re-use of the stub, you could
>> be in the situation where:
>> 
>> CPU1                  CPU2
>> ----                  ----
>> 
>> generate a stub
>> on_each_cpu(do_sync_core..)
>> text_poke_bp()
>> ...
>> 
>> rewrite to BP
>>                       trigger the BP
>>                       return to the stub
>>                       fun the first instruction of the stub
>>                       *INTERRUPT causes rescheduling*
>> 
>> on_each_cpu(do_sync_core..)
>> rewrite to good instruction
>> on_each_cpu(do_sync_core..)
>> 
>> free or re-generate the stub
>> 
>>                       !! The stub is still in use !!
>> 
>> So that simple "just generate the stub dynamically" isn't so simple
>> after all.
>> 
>> But it turns out that that is really simple to handle too. How do we do
>> that?
>> 
>> We do that by giving the BP handler *two* code sequences, and we make
>> the BP handler pick one depending on whether it is returning to a
>> "interrupts disabled" or "interrupts enabled" case.
>> 
>> So the BP handler does this:
>> 
>> - if we're returning with interrupts disabled, pick the simple stub
>> 
>> - if we're returning with interrupts enabled, clkear IF in the return
>> %rflags, and pick a *slightly* more complex stub:
>> 
>>       push $returnaddressforTHIScall
>>       sti
>>       jmp targetforTHIScall
>> 
>> and now the STI shadow will mean that this sequence is uninterruptible.
>> 
>> So we'd not do complex emulation of the call instruction at BP time,
>> but we'd do that *trivial* change at BP time.
>> 
>> This seems simple, doesn't need any temporary registers at all, and
>> doesn't need any extra stack magic. It literally needs just a trivial
>> sequence in poke_int3_handler().
>> 
>> The we'd change the end of poke_int3_handler() to do something like
>> this instead:
>> 
>>       void *newip = bp_int3_handler;
>>       ..
>>       if (new == magic_static_call_bp_int3_handler) {
>>               if (regs->flags &X86_FLAGS_IF) {
>>                       newip = magic_static_call_bp_int3_handler_sti;
>>                       regs->flags &= ~X86_FLAGS_IF;
>>       }
>>       regs->ip = (unsigned long) newip;
>>       return 1;
>> 
>> AAND now we're *really* done.
>> 
>> Does anybody see any issues in this?
>> 
>>             Linus
> 
> I still don't see why can't simply spin in the #BP handler until the patch is complete.
> 
> We can't have the #BP handler do any additional patching, as previously discussed, but spinning should be perfectly safe. The simplest way to spin it to just IRET; that both serializes and will re-take the exception if the patch is still in progress.
> 
> It requires exactly *no* awareness in the #BP handler, allows for the call to be replaced with inline code or a simple NOP if desired (or vice versa, as long as it is a single instruction.)
> 
> If I'm missing something, then please educate me or point me to previous discussion; I would greatly appreciate it.

One thing that comes to mind is that text_poke_bp() runs the following after
patching in the int3 and before patching in the new instruction:

	on_each_cpu(do_sync_core, NULL, 1);

If IRQs are disabled when the BP is hit, spinning can cause the system to
hang.
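
(For reference, the text_poke_bp() sequence at this point looks roughly
like the following, paraphrased from arch/x86/kernel/alternative.c with
details elided:)

        text_poke(addr, &int3, 1);                  /* write the int3 byte */
        on_each_cpu(do_sync_core, NULL, 1);

        if (len > 1) {
                text_poke(addr + 1, opcode + 1, len - 1);   /* write the tail */
                on_each_cpu(do_sync_core, NULL, 1);
        }

        text_poke(addr, opcode, 1);                 /* replace the first byte */
        on_each_cpu(do_sync_core, NULL, 1);

A CPU that hits the int3 with IRQs disabled and just spins can never take
the do_sync_core IPI, so the patching CPU never reaches the step that
replaces the int3, and the spinner spins forever.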


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 19:23               ` hpa
  2019-01-11 19:33                 ` Nadav Amit
@ 2019-01-11 19:34                 ` Linus Torvalds
  2019-01-13  0:34                   ` hpa
  2019-01-13  0:36                   ` hpa
  2019-01-11 19:39                 ` Jiri Kosina
  2 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2019-01-11 19:34 UTC (permalink / raw)
  To: Peter Anvin
  Cc: Josh Poimboeuf, Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 11:24 AM <hpa@zytor.com> wrote:
>
> I still don't see why can't simply spin in the #BP handler until the patch is complete.

So here's at least one problem:

text_poke_bp()
  text_poke(addr, &int3, sizeof(int3));
   *interrupt*
      interrupt has a static call
        *BP*
          poke_int3_handler
             *BOOM*

Note how at BOOM we cannot just spin (or return) to wait for the
'int3' to be switched back. Because it never will. Because we are
interrupting the thing that would do that switch-back.

So we'd have to do the 'text_poke_bp()' sequence with interrupts
disabled. Which we can't do right now at least, because part of that
sequence involves that on_each_cpu(do_sync_core) thing, which needs
interrupts enabled.

See?

Or am I missing something?

            Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 19:23               ` hpa
  2019-01-11 19:33                 ` Nadav Amit
  2019-01-11 19:34                 ` Linus Torvalds
@ 2019-01-11 19:39                 ` Jiri Kosina
  2019-01-14  2:31                   ` H. Peter Anvin
  2 siblings, 1 reply; 90+ messages in thread
From: Jiri Kosina @ 2019-01-11 19:39 UTC (permalink / raw)
  To: hpa
  Cc: Linus Torvalds, Josh Poimboeuf, Nadav Amit, Andy Lutomirski,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, 11 Jan 2019, hpa@zytor.com wrote:

> I still don't see why can't simply spin in the #BP handler until the 
> patch is complete.

I think this brings us to the already discussed possible deadlock, when 
CPU#0 is in the middle of text_poke_bp(), CPU#1 is spinning inside 
spin_lock_irq*(&lock) and CPU#2 hits the breakpoint while holding that very 
'lock'.

Then we're stuck forever, because CPU#1 will never handle the pending 
sync_core() IPI (it's not NMI).

Or have I misunderstood what you meant?

-- 
Jiri Kosina
SUSE Labs


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 19:03             ` Linus Torvalds
  2019-01-11 19:17               ` Nadav Amit
  2019-01-11 19:23               ` hpa
@ 2019-01-11 20:04               ` Josh Poimboeuf
  2019-01-11 20:12                 ` Linus Torvalds
  2 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 20:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 11:03:30AM -0800, Linus Torvalds wrote:
> The we'd change the end of poke_int3_handler() to do something like
> this instead:
> 
>         void *newip = bp_int3_handler;
>         ..
>         if (new == magic_static_call_bp_int3_handler) {
>                 if (regs->flags & X86_EFLAGS_IF) {
>                         newip = magic_static_call_bp_int3_handler_sti;
>                         regs->flags &= ~X86_EFLAGS_IF;
>                 }
>         }
>         regs->ip = (unsigned long) newip;
>         return 1;
> 
> AAND now we're *really* done.
> 
> Does anybody see any issues in this?

This sounds ok, with a possible tweak: instead of the sti tricks,
couldn't we just use synchronize_rcu_tasks() (as Jason suggested), to
make sure the stubs are no longer used by a preempted task?

But really, to me, having to create and manage all those custom
trampolines still feels a lot more complex than just making a gap on the
stack.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 20:04               ` Josh Poimboeuf
@ 2019-01-11 20:12                 ` Linus Torvalds
  2019-01-11 20:31                   ` Josh Poimboeuf
  0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2019-01-11 20:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 12:04 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> But really, to me, having to create and manage all those custom
> trampolines still feels a lot more complex than just making a gap on the
> stack.

There are no "all those custom trampolines".

There is literally *one* custom trampoline that you generate as you do
the rewriting.

Well, two, since you need the version with the "sti" before the jmp.

It would be possible to generate the custom trampoline on the fly in
the BP handler itself, and just have a magic flag for that case. But
it's probably simpler to do it in the caller, since you need to
generate that special writable and executable code sequence. You
probably don't want to do that at BP time.

You probably want to use a FIX_TEXT_POKE2 page for the generated
sequence that just maps some generated code executably for a short
while. Or something like that.

              Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 20:12                 ` Linus Torvalds
@ 2019-01-11 20:31                   ` Josh Poimboeuf
  2019-01-11 20:46                     ` Linus Torvalds
  0 siblings, 1 reply; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 12:12:30PM -0800, Linus Torvalds wrote:
> On Fri, Jan 11, 2019 at 12:04 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > But really, to me, having to create and manage all those custom
> > trampolines still feels a lot more complex than just making a gap on the
> > stack.
> 
> There are no "all those custom trampolines".
> 
> There is literally *one* custom trampoline that you generate as you do
> the rewriting.
> 
> Well, two, since you need the version with the "sti" before the jmp.
> 
> It would be possible to generate the custom trampoline on the fly in
> the BP handler itself, and just have a magic flag for that case. But
> it's probably simpler to do it in the caller, since you need to
> generate that special writable and executable code sequence. You
> probably don't want to do that at BP time.
> 
> You probably want to use a FIX_TEXT_POKE2 page for the generated
> sequence that just maps some generated code executably for a short
> while. Or something like that.

I was referring to the fact that a single static call key update will
usually result in patching multiple call sites.  But you're right, it's
only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
we may want to batch all the writes like what Daniel has proposed for
jump labels, to reduce IPIs.

Regardless, the trampoline management seems more complex to me.  But
it's easier to argue about actual code, so maybe I'll code it up to make
it easier to compare solutions.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 20:31                   ` Josh Poimboeuf
@ 2019-01-11 20:46                     ` Linus Torvalds
  2019-01-11 21:05                       ` Andy Lutomirski
  2019-01-11 21:22                       ` Josh Poimboeuf
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2019-01-11 20:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> I was referring to the fact that a single static call key update will
> usually result in patching multiple call sites.  But you're right, it's
> only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> we may want to batch all the writes like what Daniel has proposed for
> jump labels, to reduce IPIs.

Yeah, my suggestion doesn't allow for batching, since it would
basically generate one trampoline for every rewritten instruction.

> Regardless, the trampoline management seems more complex to me.  But
> it's easier to argue about actual code, so maybe I'll code it up to make
> it easier to compare solutions.

I do agree that the stack games are likely "simpler" in one sense. I
just abhor playing those kinds of games with the entry code and entry
stack.

A small bit of extra complexity in the code that actually does the
rewriting would be much more palatable to me than the complexity in
the entry code. I prefer seeing the onus of complexity being on the
code that introduces the problem, not on an innocent bystander.

I'd like to say that our entry code actually looks fairly sane these
days.  I'd _like_ to say that, but I'd be lying through my teeth if I
did. The macros we use make any normal person's head spin.

The workaround for the stack corruption was fairly simple, but the
subtlety behind the *reason* for it was what made my hackles rise
about that code.

The x86 entry code is some of the nastiest in the kernel, I feel, with
all the subtle interactions about odd stack switches, odd CPU bugs
causing odd TLB switches, NMI interactions etc etc.

So I am fully cognizant that the workaround to shift the stack in the
entry code was just a couple of lines, and not very complicated.

And I agree that I may be a bit oversensitive about that area, but it
really is one of those places where I go "damn, I think I know some
low-level x86 stuff better than most, but that code scares *me*".

Which is why I'd accept a rather bigger complexity hit just about
anywhere else in the code...

               Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 20:46                     ` Linus Torvalds
@ 2019-01-11 21:05                       ` Andy Lutomirski
  2019-01-11 21:10                         ` Linus Torvalds
  2019-01-14 12:28                         ` Peter Zijlstra
  2019-01-11 21:22                       ` Josh Poimboeuf
  1 sibling, 2 replies; 90+ messages in thread
From: Andy Lutomirski @ 2019-01-11 21:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josh Poimboeuf, Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 12:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > I was referring to the fact that a single static call key update will
> > usually result in patching multiple call sites.  But you're right, it's
> > only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> > we may want to batch all the writes like what Daniel has proposed for
> > jump labels, to reduce IPIs.
>
> Yeah, my suggestion doesn't allow for batching, since it would
> basically generate one trampoline for every rewritten instruction.

Sure it does.  Just make 1000 trampolines and patch 1000 sites in a
batch :)  As long as the number of trampolines is smallish (e.g. fits
in a page), then we should be in good shape.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:05                       ` Andy Lutomirski
@ 2019-01-11 21:10                         ` Linus Torvalds
  2019-01-11 21:32                           ` Josh Poimboeuf
  2019-01-14 12:28                         ` Peter Zijlstra
  1 sibling, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2019-01-11 21:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Nadav Amit, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 1:05 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> > Yeah, my suggestion doesn't allow for batching, since it would
> > basically generate one trampoline for every rewritten instruction.
>
> Sure it does.  Just make 1000 trampolines and patch 1000 sites in a
> batch :)  As long as the number of trampolines is smallish (e.g. fits
> in a page), then we should be in good shape.

Yeah, I guess that's true.

But let's not worry about it. I don't think we do batching for any
alternative instruction rewriting right now, do we?

             Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 20:46                     ` Linus Torvalds
  2019-01-11 21:05                       ` Andy Lutomirski
@ 2019-01-11 21:22                       ` Josh Poimboeuf
  2019-01-11 21:23                         ` Josh Poimboeuf
                                           ` (3 more replies)
  1 sibling, 4 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 21:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > I was referring to the fact that a single static call key update will
> > usually result in patching multiple call sites.  But you're right, it's
> > only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> > we may want to batch all the writes like what Daniel has proposed for
> > jump labels, to reduce IPIs.
> 
> Yeah, my suggestion doesn't allow for batching, since it would
> basically generate one trampoline for every rewritten instruction.

As Andy said, I think batching would still be possible, it's just that
we'd have to create multiple trampolines at a time.

Or... we could do a hybrid approach: create a single custom trampoline
which has the call destination patched in, but put the return address in
%rax -- which is always clobbered, even for callee-saved PV ops.  Like:

trampoline:
	push %rax
	call patched-dest

That way the batching could be done with a single trampoline
(particularly if using rcu-sched to avoid the sti hack).

If you don't completely hate that approach then I may try it.
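
(Roughly, the int3-handler side of that would be something like the
sketch below.  static_call_trampoline is a made-up name, and the "+ 4"
is because regs->ip already points one byte past the int3 that replaced
the first byte of the 5-byte CALL:)

        /* Sketch only: return address goes in the always-clobbered %rax. */
        regs->ax = regs->ip + 4;        /* insn after the original CALL */
        regs->ip = (unsigned long)static_call_trampoline;
        return 1;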

> > Regardless, the trampoline management seems more complex to me.  But
> > it's easier to argue about actual code, so maybe I'll code it up to make
> > it easier to compare solutions.
> 
> I do agree hat the stack games are likely "simpler" in one sense. I
> just abhor playing those kinds of games with the entry code and entry
> stack.
> 
> A small bit of extra complexity in the code that actually does the
> rewriting would be much more palatable to me than the complexity in
> the entry code. I prefer seeing the onus of complexity being on the
> code that introduces the problem, not on a innocent bystander.
> 
> I'd like to say that our entry code actually looks fairly sane these
> days.  I'd _like_ to say that, but I'd be lying through my teeth if I
> did. The macros we use make any normal persons head spin.
> 
> The workaround for the stack corruption was fairly simple, but the
> subtlety behind the *reason* for it was what made my hackles rise
> about that code.
> 
> The x86 entry code is some of the nastiest in the kernel, I feel, with
> all the subtle interactions about odd stack switches, odd CPU bugs
> causing odd TLB switches, NMI interactions etc etc.
> 
> So I am fully cognizant that the workaround to shift the stack in the
> entry code was just a couple of lines, and not very complicated.
> 
> And I agree that I may be a bit oversensitive about that area, but it
> really is one of those places where I go "damn, I think I know some
> low-level x86 stuff better than most, but that code scares *me*".
> 
> Which is why I'd accept a rather bigger complexity hit just about
> anywhere else in the code...

I agree that, to a certain extent, it can make sense to put the "onus of
complexity" on the code that introduces the problem.  But of course it's
not an absolute rule, and should be considered in context with the
relative complexities of the competing solutions.

But I think where I see things differently is that:

a) The entry code is, by far, in the best shape it's ever been [*],
   thanks to Andy's considerable efforts.  I find it to be quite
   readable, but that might be due to many hours of intense study...

   [*] overlooking recent unavoidable meltdown/spectre hacks

b) Adding a gap to the #DB entry stack is (in my mind) a simple
   localized change, which is easily understandable by a reader of the
   entry code -- assuming certain personality characteristics of a
   person whose life decisions have resulted in them reading entry code
   in the first place...

c) Doing so is an order of magnitude simpler than the custom trampoline
   thing (or really any of the other many alternatives we've discussed).
   At least that's my feeling.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:22                       ` Josh Poimboeuf
@ 2019-01-11 21:23                         ` Josh Poimboeuf
  2019-01-11 21:25                         ` Josh Poimboeuf
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 21:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 03:22:10PM -0600, Josh Poimboeuf wrote:
> trampoline:
> 	push %rax
> 	call patched-dest

That should be a JMP of course.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:22                       ` Josh Poimboeuf
  2019-01-11 21:23                         ` Josh Poimboeuf
@ 2019-01-11 21:25                         ` Josh Poimboeuf
  2019-01-11 21:36                         ` Nadav Amit
  2019-01-12 23:54                         ` Andy Lutomirski
  3 siblings, 0 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 21:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 03:22:10PM -0600, Josh Poimboeuf wrote:
> b) Adding a gap to the #DB entry stack

And that should be #BP of course...  Not sure why my fingers keep doing
that!

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:10                         ` Linus Torvalds
@ 2019-01-11 21:32                           ` Josh Poimboeuf
  0 siblings, 0 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 21:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Nadav Amit, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 01:10:47PM -0800, Linus Torvalds wrote:
> On Fri, Jan 11, 2019 at 1:05 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > > Yeah, my suggestion doesn't allow for batching, since it would
> > > basically generate one trampoline for every rewritten instruction.
> >
> > Sure it does.  Just make 1000 trampolines and patch 1000 sites in a
> > batch :)  As long as the number of trampolines is smallish (e.g. fits
> > in a page), then we should be in good shape.
> 
> Yeah, I guess that's true.
> 
> But let's not worry about it. I don't think we do batching for any
> alternative instruction rewriting right now, do we?

Not at the moment, but Daniel has been working on some patches to do so:

  https://lkml.kernel.org/r/cover.1545228276.git.bristot@redhat.com

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:22                       ` Josh Poimboeuf
  2019-01-11 21:23                         ` Josh Poimboeuf
  2019-01-11 21:25                         ` Josh Poimboeuf
@ 2019-01-11 21:36                         ` Nadav Amit
  2019-01-11 21:41                           ` Josh Poimboeuf
  2019-01-12 23:54                         ` Andy Lutomirski
  3 siblings, 1 reply; 90+ messages in thread
From: Nadav Amit @ 2019-01-11 21:36 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Linus Torvalds, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
>> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> I was referring to the fact that a single static call key update will
>>> usually result in patching multiple call sites.  But you're right, it's
>>> only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
>>> we may want to batch all the writes like what Daniel has proposed for
>>> jump labels, to reduce IPIs.
>> 
>> Yeah, my suggestion doesn't allow for batching, since it would
>> basically generate one trampoline for every rewritten instruction.
> 
> As Andy said, I think batching would still be possible, it's just that
> we'd have to create multiple trampolines at a time.
> 
> Or... we could do a hybrid approach: create a single custom trampoline
> which has the call destination patched in, but put the return address in
> %rax -- which is always clobbered, even for callee-saved PV ops.  Like:
> 
> trampoline:
> 	push %rax
> 	call patched-dest
> 
> That way the batching could be done with a single trampoline
> (particularly if using rcu-sched to avoid the sti hack).

I don’t see how RCU-sched solves the problem if you don’t disable preemption. On
a fully preemptable kernel, you can get preempted between the push and the
call (jmp) or before the push. RCU-sched can then finish, and the preempted
task may later jump to a wrong patched-dest.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:36                         ` Nadav Amit
@ 2019-01-11 21:41                           ` Josh Poimboeuf
  2019-01-11 21:55                             ` Steven Rostedt
  2019-01-11 21:56                             ` Nadav Amit
  0 siblings, 2 replies; 90+ messages in thread
From: Josh Poimboeuf @ 2019-01-11 21:41 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linus Torvalds, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 09:36:59PM +0000, Nadav Amit wrote:
> > On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
> >> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>> I was referring to the fact that a single static call key update will
> >>> usually result in patching multiple call sites.  But you're right, it's
> >>> only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> >>> we may want to batch all the writes like what Daniel has proposed for
> >>> jump labels, to reduce IPIs.
> >> 
> >> Yeah, my suggestion doesn't allow for batching, since it would
> >> basically generate one trampoline for every rewritten instruction.
> > 
> > As Andy said, I think batching would still be possible, it's just that
> > we'd have to create multiple trampolines at a time.
> > 
> > Or... we could do a hybrid approach: create a single custom trampoline
> > which has the call destination patched in, but put the return address in
> > %rax -- which is always clobbered, even for callee-saved PV ops.  Like:
> > 
> > trampoline:
> > 	push %rax
> > 	call patched-dest
> > 
> > That way the batching could be done with a single trampoline
> > (particularly if using rcu-sched to avoid the sti hack).
> 
> I don’t see RCU-sched solves the problem if you don’t disable preemption. On
> a fully preemptable kernel, you can get preempted between the push and the
> call (jmp) or before the push. RCU-sched can then finish, and the preempted
> task may later jump to a wrong patched-dest.

Argh, I misspoke about RCU-sched.  Words are hard.

I meant synchronize_rcu_tasks(), which is a completely different animal.
My understanding is that it waits until all runnable tasks (including
preempted tasks) have gotten a chance to run.

-- 
Josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:41                           ` Josh Poimboeuf
@ 2019-01-11 21:55                             ` Steven Rostedt
  2019-01-11 21:59                               ` Nadav Amit
  2019-01-11 21:56                             ` Nadav Amit
  1 sibling, 1 reply; 90+ messages in thread
From: Steven Rostedt @ 2019-01-11 21:55 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Nadav Amit, Linus Torvalds, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Ingo Molnar, Thomas Gleixner, Masami Hiramatsu,
	Jason Baron, Jiri Kosina, David Laight, Borislav Petkov,
	Julia Cartwright, Jessica Yu, H. Peter Anvin, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Fri, 11 Jan 2019 15:41:22 -0600
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> > I don’t see RCU-sched solves the problem if you don’t disable preemption. On
> > a fully preemptable kernel, you can get preempted between the push and the
> > call (jmp) or before the push. RCU-sched can then finish, and the preempted
> > task may later jump to a wrong patched-dest.  
> 
> Argh, I misspoke about RCU-sched.  Words are hard.
> 
> I meant synchronize_rcu_tasks(), which is a completely different animal.
> My understanding is that it waits until all runnable tasks (including
> preempted tasks) have gotten a chance to run.

Not quite, but it does the same thing. It waits for all tasks to either
schedule voluntarily (not preempted), or be/go into idle, or be/go
into userspace. In any case, it makes sure code is off of trampolines.
I use this before freeing trampolines used by ftrace.
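
(So for a static-call stub the pattern would be, schematically --
patch_all_call_sites() and free_stub() being stand-in names, while
synchronize_rcu_tasks() is the real API:)

        patch_all_call_sites(new_target);   /* nobody will enter the stub anymore */
        synchronize_rcu_tasks();            /* wait out tasks preempted inside it */
        free_stub(stub);                    /* now safe to free or reuse it */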

-- Steve

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:41                           ` Josh Poimboeuf
  2019-01-11 21:55                             ` Steven Rostedt
@ 2019-01-11 21:56                             ` Nadav Amit
  1 sibling, 0 replies; 90+ messages in thread
From: Nadav Amit @ 2019-01-11 21:56 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Linus Torvalds, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 11, 2019, at 1:41 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Fri, Jan 11, 2019 at 09:36:59PM +0000, Nadav Amit wrote:
>>> On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> 
>>> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
>>>> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>> I was referring to the fact that a single static call key update will
>>>>> usually result in patching multiple call sites.  But you're right, it's
>>>>> only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
>>>>> we may want to batch all the writes like what Daniel has proposed for
>>>>> jump labels, to reduce IPIs.
>>>> 
>>>> Yeah, my suggestion doesn't allow for batching, since it would
>>>> basically generate one trampoline for every rewritten instruction.
>>> 
>>> As Andy said, I think batching would still be possible, it's just that
>>> we'd have to create multiple trampolines at a time.
>>> 
>>> Or... we could do a hybrid approach: create a single custom trampoline
>>> which has the call destination patched in, but put the return address in
>>> %rax -- which is always clobbered, even for callee-saved PV ops.  Like:
>>> 
>>> trampoline:
>>> 	push %rax
>>> 	call patched-dest
>>> 
>>> That way the batching could be done with a single trampoline
>>> (particularly if using rcu-sched to avoid the sti hack).
>> 
>> I don’t see RCU-sched solves the problem if you don’t disable preemption. On
>> a fully preemptable kernel, you can get preempted between the push and the
>> call (jmp) or before the push. RCU-sched can then finish, and the preempted
>> task may later jump to a wrong patched-dest.
> 
> Argh, I misspoke about RCU-sched.  Words are hard.
> 
> I meant synchronize_rcu_tasks(), which is a completely different animal.
> My understanding is that it waits until all runnable tasks (including
> preempted tasks) have gotten a chance to run.

Actually, I just used the term you used, and thought about
synchronize_sched(). If you look at my patch [1], you’ll see I did something
similar using synchronize_sched(). But this required some delicate work of
restarting any preempted “optpoline” (or whatever name you want) block. 

[Note that my implementation has a terrible bug in this respect].

This is required since running a preempted task does not prevent it from
being preempted again without making any “real” progress.

If we want to adapt the same solution to static_calls, this means that in
retint_kernel (entry_64.S), you need to check whether you got preempted inside
the trampoline and, in that case, change the saved RIP back to before the
static_call.

IMHO, sti+jmp is simpler.

[1] https://lore.kernel.org/lkml/20181231072112.21051-6-namit@vmware.com/

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:55                             ` Steven Rostedt
@ 2019-01-11 21:59                               ` Nadav Amit
  0 siblings, 0 replies; 90+ messages in thread
From: Nadav Amit @ 2019-01-11 21:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Josh Poimboeuf, Linus Torvalds, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Ingo Molnar, Thomas Gleixner, Masami Hiramatsu,
	Jason Baron, Jiri Kosina, David Laight, Borislav Petkov,
	Julia Cartwright, Jessica Yu, H. Peter Anvin, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

> On Jan 11, 2019, at 1:55 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Fri, 11 Jan 2019 15:41:22 -0600
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
>>> I don’t see RCU-sched solves the problem if you don’t disable preemption. On
>>> a fully preemptable kernel, you can get preempted between the push and the
>>> call (jmp) or before the push. RCU-sched can then finish, and the preempted
>>> task may later jump to a wrong patched-dest.  
>> 
>> Argh, I misspoke about RCU-sched.  Words are hard.
>> 
>> I meant synchronize_rcu_tasks(), which is a completely different animal.
>> My understanding is that it waits until all runnable tasks (including
>> preempted tasks) have gotten a chance to run.
> 
> Not quite, but does the same thing. It waits for all tasks to either
> schedule voluntarily (not preempted), or be / go into idle, or be /go
> into userspace. In any case, it makes sure code is off of trampolines.
> I use this before freeing trampolines used by ftrace.

Interesting. So my last email is completely wrong.
 

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:22                       ` Josh Poimboeuf
                                           ` (2 preceding siblings ...)
  2019-01-11 21:36                         ` Nadav Amit
@ 2019-01-12 23:54                         ` Andy Lutomirski
  3 siblings, 0 replies; 90+ messages in thread
From: Andy Lutomirski @ 2019-01-12 23:54 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Linus Torvalds, Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 1:22 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
> > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >
> > > I was referring to the fact that a single static call key update will
> > > usually result in patching multiple call sites.  But you're right, it's
> > > only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> > > we may want to batch all the writes like what Daniel has proposed for
> > > jump labels, to reduce IPIs.
> >
> > Yeah, my suggestion doesn't allow for batching, since it would
> > basically generate one trampoline for every rewritten instruction.
>
> As Andy said, I think batching would still be possible, it's just that
> we'd have to create multiple trampolines at a time.
>
> Or... we could do a hybrid approach: create a single custom trampoline
> which has the call destination patched in, but put the return address in
> %rax -- which is always clobbered, even for callee-saved PV ops.  Like:
>

One thing I particularly like about the current design is that there
are no requirements at all on the calling convention.  I think it
seems fragile to add a calling convention constraint that only applies
when there's a race.  I'd rather do a longjmp-like hack or a stack gap
adding hack than make the actual static calls more fragile.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 19:34                 ` Linus Torvalds
@ 2019-01-13  0:34                   ` hpa
  2019-01-13  0:36                   ` hpa
  1 sibling, 0 replies; 90+ messages in thread
From: hpa @ 2019-01-13  0:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josh Poimboeuf, Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On January 11, 2019 11:34:34 AM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Fri, Jan 11, 2019 at 11:24 AM <hpa@zytor.com> wrote:
>>
>> I still don't see why can't simply spin in the #BP handler until the
>patch is complete.
>
>So here's at least one problem:
>
>text_poke_bp()
>  text_poke(addr, &int3, sizeof(int3));
>   *interrupt*
>      interrupt has a static call
>        *BP*
>          poke_int3_handler
>             *BOOM*
>
>Note how at BOOM we cannot just spin (or return) to wait for the
>'int3' to be switched back. Because it never will. Because we are
>interrupting the thing that would do that switch-back.
>
>So we'd have to do the 'text_poke_bp()' sequence with interrupts
>disabled. Which we can't do right now at least, because part of that
>sequence involves that on_each_cpu(do_sync_core) thing, which needs
>interrupts enabled.
>
>See?
>
>Or am I missing something?
>
>            Linus

Let me think about it.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 19:34                 ` Linus Torvalds
  2019-01-13  0:34                   ` hpa
@ 2019-01-13  0:36                   ` hpa
  1 sibling, 0 replies; 90+ messages in thread
From: hpa @ 2019-01-13  0:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josh Poimboeuf, Nadav Amit, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On January 11, 2019 11:34:34 AM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Fri, Jan 11, 2019 at 11:24 AM <hpa@zytor.com> wrote:
>>
>> I still don't see why can't simply spin in the #BP handler until the
>patch is complete.
>
>So here's at least one problem:
>
>text_poke_bp()
>  text_poke(addr, &int3, sizeof(int3));
>   *interrupt*
>      interrupt has a static call
>        *BP*
>          poke_int3_handler
>             *BOOM*
>
>Note how at BOOM we cannot just spin (or return) to wait for the
>'int3' to be switched back. Because it never will. Because we are
>interrupting the thing that would do that switch-back.
>
>So we'd have to do the 'text_poke_bp()' sequence with interrupts
>disabled. Which we can't do right now at least, because part of that
>sequence involves that on_each_cpu(do_sync_core) thing, which needs
>interrupts enabled.
>
>See?
>
>Or am I missing something?
>
>            Linus

Ok, I was thinking more about spinning with an IRET and letting the exception be delivered. Patching with interrupts disabled has other problems... 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 19:39                 ` Jiri Kosina
@ 2019-01-14  2:31                   ` H. Peter Anvin
  2019-01-14  2:40                     ` H. Peter Anvin
  0 siblings, 1 reply; 90+ messages in thread
From: H. Peter Anvin @ 2019-01-14  2:31 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Josh Poimboeuf, Nadav Amit, Andy Lutomirski,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On 1/11/19 11:39 AM, Jiri Kosina wrote:
> On Fri, 11 Jan 2019, hpa@zytor.com wrote:
> 
>> I still don't see why can't simply spin in the #BP handler until the 
>> patch is complete.
> 
> I think this brings us to the already discussed possible deadlock, when 
> one CPU#0 is in the middle of text_poke_bp(), CPU#1 is spinning inside 
> spin_lock_irq*(&lock) and CPU#2 hits the breakpoint while holding that very 
> 'lock'.
> 
> Then we're stuck forever, because CPU#1 will never handle the pending 
> sync_core() IPI (it's not NMI).
> 
> Or have I misunderstood what you meant?
> 

OK, I was thinking about this quite a while ago, and even started hacking on
it, but apparently I managed to forget some key details.

Specifically, you do *not* want to use the acknowledgment of the IPI as the
blocking condition, so don't use a waiting IPI.

Instead, you want a CPU bitmap (or percpu variable) that the IPI handler
clears.  When you are spinning in the #BP handler, you *also* need to clear
that bit.  Doing so is safe even in the case of batched updates, because you
are guaranteed to execute an IRET before you get to patched code.

So the synchronization part of the patching routine becomes:

static cpumask_t text_poke_cpumask;

static void text_poke_sync(void)
{
	smp_wmb();
	cpumask_copy(&text_poke_cpumask, cpu_online_mask);
	smp_wmb();	/* Optional on x86 */
	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
	while (!cpumask_empty(&text_poke_cpumask)) {
		cpu_relax();
		smp_rmb();
	}
}

static void text_poke_sync_cpu(void *dummy)
{
	(void)dummy;

	smp_rmb();
	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
	/*
	 * We are guaranteed to return with an IRET, either from the
	 * IPI or the #BP handler; this provides serialization.
	 */
}

The spin routine then needs to add a call to do something like this. By
(optionally) not comparing to a specific breakpoint address we allow for
batching, but we may end up spinning on a breakpoint that is not actually a
patching breakpoint until the patching is done.

int poke_int3_handler(struct pt_regs *regs)
{
	/* In the current handler, but shouldn't be needed... */
	smp_rmb();

	if (likely(!atomic_read(&bp_patching_in_progress)))
		return 0;

	if (unlikely(!user_mode(regs) &&
		     atomic_read(&bp_patching_in_progress))) {
		text_poke_sync();
		regs->ip--;
		return 1;	/* May end up retaking the trap */
	} else {
		return 0;
	}
}

Unless I'm totally mistaken, the worst thing that will happen with this code
is that it may end up taking a harmless spurious IPI at a later time.
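
For context, a rough sketch of where this would plug into text_poke_bp() as it
is structured today: each of the existing on_each_cpu(do_sync_core) calls in
the int3 / rest-of-instruction / first-byte sequence simply becomes
text_poke_sync() (the bp_patching bookkeeping and the handler argument are
omitted here):

	static const unsigned char int3 = 0xcc;

	/* Step 1: arm the breakpoint on the first byte. */
	text_poke(addr, &int3, sizeof(int3));
	text_poke_sync();

	/* Step 2: write everything after the first byte. */
	if (len - sizeof(int3) > 0) {
		text_poke((char *)addr + sizeof(int3),
			  (const char *)opcode + sizeof(int3),
			  len - sizeof(int3));
		text_poke_sync();
	}

	/* Step 3: restore the first byte of the new instruction. */
	text_poke(addr, opcode, sizeof(int3));
	text_poke_sync();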

	-hpa

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-14  2:31                   ` H. Peter Anvin
@ 2019-01-14  2:40                     ` H. Peter Anvin
  2019-01-14 20:11                       ` Andy Lutomirski
  2019-01-14 22:00                       ` H. Peter Anvin
  0 siblings, 2 replies; 90+ messages in thread
From: H. Peter Anvin @ 2019-01-14  2:40 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Josh Poimboeuf, Nadav Amit, Andy Lutomirski,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On 1/13/19 6:31 PM, H. Peter Anvin wrote:
> 
> static cpumask_t text_poke_cpumask;
> 
> static void text_poke_sync(void)
> {
> 	smp_wmb();
> 	text_poke_cpumask = cpu_online_mask;
> 	smp_wmb();	/* Should be optional on x86 */
> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
> 	while (!cpumask_empty(&text_poke_cpumask)) {
> 		cpu_relax();
> 		smp_rmb();
> 	}
> }
> 
> static void text_poke_sync_cpu(void *dummy)
> {
> 	(void)dummy;
> 
> 	smp_rmb();
> 	cpumask_clear_cpu(&poke_bitmask, smp_processor_id());
> 	/*
> 	 * We are guaranteed to return with an IRET, either from the
> 	 * IPI or the #BP handler; this provides serialization.
> 	 */
> }
> 

The invariants here are:

1. The patching routine must set each bit in the cpumask after each event
   that requires synchronization is complete.
2. The bit can be (atomically) cleared on the target CPU only, and only in a
   place that guarantees a synchronizing event (e.g. IRET) before it may
   reach the poked instruction.
3. At a minimum the IPI handler and #BP handler need to clear the bit. It
   *is* also possible to clear it in other places, e.g. the NMI handler, if
   necessary as long as condition 2 is satisfied.

	-hpa

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-11 21:05                       ` Andy Lutomirski
  2019-01-11 21:10                         ` Linus Torvalds
@ 2019-01-14 12:28                         ` Peter Zijlstra
  1 sibling, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2019-01-14 12:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Josh Poimboeuf, Nadav Amit,
	the arch/x86 maintainers, Linux List Kernel Mailing,
	Ard Biesheuvel, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jan 11, 2019 at 01:05:20PM -0800, Andy Lutomirski wrote:
> On Fri, Jan 11, 2019 at 12:54 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >
> > > I was referring to the fact that a single static call key update will
> > > usually result in patching multiple call sites.  But you're right, it's
> > > only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> > > we may want to batch all the writes like what Daniel has proposed for
> > > jump labels, to reduce IPIs.
> >
> > Yeah, my suggestion doesn't allow for batching, since it would
> > basically generate one trampoline for every rewritten instruction.
> 
> Sure it does.  Just make 1000 trampolines and patch 1000 sites in a
> batch :)  As long as the number of trampolines is smallish (e.g. fits
> in a page), then we should be in good shape.

Much easier still would be to make the ARCH_DEFINE_STATIC_TRAMP thing
generate the two trampolines per callsite and simply keep them around.

Another advantage is that you then only have to patch the JMP target,
since the return address will always stay the same (since these things
are generated per call-site).


Anyway... the STI-shadow thing is very clever. But I'm with Josh in that
I think I prefer the IRET frame offset thing -- but yes, I've read
Linus' argument against that.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-14  2:40                     ` H. Peter Anvin
@ 2019-01-14 20:11                       ` Andy Lutomirski
  2019-01-14 22:00                       ` H. Peter Anvin
  1 sibling, 0 replies; 90+ messages in thread
From: Andy Lutomirski @ 2019-01-14 20:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jiri Kosina, Linus Torvalds, Josh Poimboeuf, Nadav Amit,
	Andy Lutomirski, Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Sun, Jan 13, 2019 at 6:41 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
> >
> > static cpumask_t text_poke_cpumask;
> >
> > static void text_poke_sync(void)
> > {
> >       smp_wmb();
> >       text_poke_cpumask = cpu_online_mask;
> >       smp_wmb();      /* Should be optional on x86 */
> >       cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> >       on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
> >       while (!cpumask_empty(&text_poke_cpumask)) {
> >               cpu_relax();
> >               smp_rmb();
> >       }
> > }
> >
> > static void text_poke_sync_cpu(void *dummy)
> > {
> >       (void)dummy;
> >
> >       smp_rmb();
> >       cpumask_clear_cpu(&poke_bitmask, smp_processor_id());
> >       /*
> >        * We are guaranteed to return with an IRET, either from the
> >        * IPI or the #BP handler; this provides serialization.
> >        */
> > }
> >
>
> The invariants here are:
>
> 1. The patching routine must set each bit in the cpumask after each event
>    that requires synchronization is complete.
> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>    place that guarantees a synchronizing event (e.g. IRET) before it may
>    reaching the poked instruction.
> 3. At a minimum the IPI handler and #BP handler needs to clear the bit. It
>    *is* also possible to clear it in other places, e.g. the NMI handler, if
>    necessary as long as condition 2 is satisfied.
>

I don't even think this is sufficient.  I think we also need everyone
who clears the bit to check if all bits are clear and, if so, remove
the breakpoint.  Otherwise we have a situation where, if you are in
text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
and that interrupt then hits the breakpoint, then you deadlock because
no one removes the breakpoint.

If we do this, and if we can guarantee that all CPUs make forward
progress, then maybe the problem is solved. Can we guarantee something
like all NMI handlers that might wait in a spinlock or for any other
reason will periodically check if a sync is needed while they're
spinning?
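
As a sketch of what such a periodic check might look like (the helper name is
invented here, and it reuses the text_poke_cpumask from the proposal quoted
above), something a spin-wait loop in NMI context could call:

	/*
	 * Hypothetical helper: a CPU spinning in NMI context calls this
	 * periodically so the patching CPU never waits on it forever.
	 * Per invariant 2 above, the serializing sync_core() must happen
	 * before the bit is cleared.
	 */
	static inline void text_poke_sync_check(void)
	{
		int cpu = raw_smp_processor_id();

		if (cpumask_test_cpu(cpu, &text_poke_cpumask)) {
			sync_core();
			cpumask_clear_cpu(cpu, &text_poke_cpumask);
		}
	}

This still leaves the "last CPU to clear its bit may also need to finish the
patch and remove the breakpoint" part above unaddressed.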

--Andy

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-14  2:40                     ` H. Peter Anvin
  2019-01-14 20:11                       ` Andy Lutomirski
@ 2019-01-14 22:00                       ` H. Peter Anvin
  2019-01-14 22:54                         ` H. Peter Anvin
  2019-01-14 23:27                         ` Andy Lutomirski
  1 sibling, 2 replies; 90+ messages in thread
From: H. Peter Anvin @ 2019-01-14 22:00 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Josh Poimboeuf, Nadav Amit, Andy Lutomirski,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

So I was already in the middle of composing this message when Andy posted:

> I don't even think this is sufficient.  I think we also need everyone
> who clears the bit to check if all bits are clear and, if so, remove
> the breakpoint.  Otherwise we have a situation where, if you are in
> text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
> and that interrupt then hits the breakpoint, then you deadlock because
> no one removes the breakpoint.
> 
> If we do this, and if we can guarantee that all CPUs make forward
> progress, then maybe the problem is solved. Can we guarantee something
> like all NMI handlers that might wait in a spinlock or for any other
> reason will periodically check if a sync is needed while they're
> spinning?

So the really, really nasty case is when an asynchronous event on the
*patching* processor gets stuck spinning on a resource which is
unavailable due to another processor spinning on the #BP. We can disable
interrupts, but we can't stop NMIs from coming in (although we could
test in the NMI handler if we are in that condition and return
immediately; I'm not sure we want to do that, and we still have to deal
with #MC and what not.)

The fundamental problem here is that we don't see the #BP on the
patching processor, in which case we could simply complete the patching
from the #BP handler on that processor.

On 1/13/19 6:40 PM, H. Peter Anvin wrote:
> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>>
>> static cpumask_t text_poke_cpumask;
>>
>> static void text_poke_sync(void)
>> {
>> 	smp_wmb();
>> 	text_poke_cpumask = cpu_online_mask;
>> 	smp_wmb();	/* Should be optional on x86 */
>> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>> 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>> 	while (!cpumask_empty(&text_poke_cpumask)) {
>> 		cpu_relax();
>> 		smp_rmb();
>> 	}
>> }
>>
>> static void text_poke_sync_cpu(void *dummy)
>> {
>> 	(void)dummy;
>>
>> 	smp_rmb();
>> 	cpumask_clear_cpu(&poke_bitmask, smp_processor_id());
>> 	/*
>> 	 * We are guaranteed to return with an IRET, either from the
>> 	 * IPI or the #BP handler; this provides serialization.
>> 	 */
>> }
>>
> 
> The invariants here are:
> 
> 1. The patching routine must set each bit in the cpumask after each event
>    that requires synchronization is complete.
> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>    place that guarantees a synchronizing event (e.g. IRET) before it may
>    reaching the poked instruction.
> 3. At a minimum the IPI handler and #BP handler needs to clear the bit. It
>    *is* also possible to clear it in other places, e.g. the NMI handler, if
>    necessary as long as condition 2 is satisfied.
> 

OK, so with interrupts enabled *on the processor doing the patching* we
still have a problem if it takes an interrupt which in turn takes a #BP.
 Disabling interrupts would not help, because an NMI and #MC could
still cause problems unless we can guarantee that no path which may be
invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
assumption.

Note: I am assuming preemption is disabled.

The easiest/sanest way to deal with this might be to switch the IDT (or
provide a hook in the generic exception entry code) on the patching
processor, such that if an asynchronous event comes in, we either roll
forward or revert. This is doable because the second sync we currently
do is not actually necessary per the hardware guys.

If we take that #BP during the breakpoint deployment phase -- that is,
before the first sync has completed -- restore the previous value of the
breakpoint byte. Upon return text_poke_bp() will then have to loop back
to the beginning and do it again.

If we take the #BP after that point, we can complete the patch in the
normal manner, by writing the rest of the instruction and then removing
the #BP. text_poke_bp() will complete the synchronization sequence on
return, but if another processor is spinning and sees the breakpoint
having been removed, it is good to go.
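
In C, that roll-forward / roll-back decision might look roughly like this
(every identifier below is illustrative only; nothing like it exists in the
tree), where "deployed" means the first sync after writing the int3 has
completed:

	/*
	 * Illustrative only: called from the hypothetical exception-entry
	 * hook on the patching CPU while a poke is in flight.
	 */
	static void bp_poke_async_fixup(void)
	{
		if (bp_int3_deployed) {
			/* Roll forward: finish the instruction, then drop the #BP. */
			if (bp_insn_len > 1)
				text_poke(bp_addr + 1, bp_new_insn + 1,
					  bp_insn_len - 1);
			text_poke(bp_addr, bp_new_insn, 1);
		} else {
			/*
			 * Roll back: restore the original first byte;
			 * text_poke_bp() then loops and starts over.
			 */
			text_poke(bp_addr, &bp_old_byte, 1);
		}
	}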

Right now we do completely unnecessary setup and teardown of the page table
entries for each phase of the patching. This would have to be removed,
so that the asynchronous event handler will always be able to do the
roll forward/roll back as required.

If this is unpalatable, the solution you touched on is probably also
doable, but I need to think *really* carefully about the sequencing
constraints, because now you also have to worry about events
interrupting a patch in progress but not completed. It would however
have the advantage that an arbitrary interrupt on the patching processor
is unlikely to cause a rollback, and so would be safer to execute with
interrupts enabled without causing a livelock.

Now, you can't just remove the breakpoint; you have to make sure that if
you do, during the first phase text_poke_bp() will loop, and in the
second phase complete the whole patching sequence.

Let me think about the second solution, the one you proposed, and what
it would require to avoid any possible race condition. If it is
practical, then I think it is probably the better option.

	-hpa

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-14 22:00                       ` H. Peter Anvin
@ 2019-01-14 22:54                         ` H. Peter Anvin
  2019-01-15  3:05                           ` Andy Lutomirski
  2019-01-14 23:27                         ` Andy Lutomirski
  1 sibling, 1 reply; 90+ messages in thread
From: H. Peter Anvin @ 2019-01-14 22:54 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Josh Poimboeuf, Nadav Amit, Andy Lutomirski,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

I think this sequence ought to work (keep in mind we are already under a
mutex, so the global data is safe even if we are preempted):

	set up page table entries
	invlpg
	set up bp patching global data

	cpu = get_cpu()

	bp_old_value = atomic_read(bp_write_addr)

	do {
		atomic_write(&bp_poke_state, 1)

		atomic_write(bp_write_addr, 0xcc)

		mask <- online_cpu_mask - self
		send IPIs
		wait for mask = 0

	} while (cmpxchg(&bp_poke_state, 1, 2) != 1);

	patch sites, remove breakpoints after patching each one

	atomic_write(&bp_poke_state, 3);

	mask <- online_cpu_mask - self
	send IPIs
	wait for mask = 0

	atomic_write(&bp_poke_state, 0);

	tear down patching global data
	tear down page table entries



The #BP handler would then look like:

	state = cmpxchg(&bp_poke_state, 1, 4);
	switch (state) {
		case 1:
		case 4:
			invlpg
			cmpxchg(bp_write_addr, 0xcc, bp_old_value)
			break;
		case 2:
			invlpg
			complete patch sequence
			remove breakpoint
			break;
		case 3:
			/* If we are here, the #BP will go away on its own */
			break;
		case 0:
			/* No patching in progress!!! */
			return 0;
	}

	clear bit in mask
	return 1;

The IPI handler:

	clear bit in mask
	sync_core	/* Needed if multiple IPI events are chained */

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-14 22:00                       ` H. Peter Anvin
  2019-01-14 22:54                         ` H. Peter Anvin
@ 2019-01-14 23:27                         ` Andy Lutomirski
  2019-01-14 23:51                           ` Nadav Amit
  2019-01-15  2:28                           ` hpa
  1 sibling, 2 replies; 90+ messages in thread
From: Andy Lutomirski @ 2019-01-14 23:27 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jiri Kosina, Linus Torvalds, Josh Poimboeuf, Nadav Amit,
	Andy Lutomirski, Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> So I was already in the middle of composing this message when Andy posted:
>
> > I don't even think this is sufficient.  I think we also need everyone
> > who clears the bit to check if all bits are clear and, if so, remove
> > the breakpoint.  Otherwise we have a situation where, if you are in
> > text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
> > and that interrupt then hits the breakpoint, then you deadlock because
> > no one removes the breakpoint.
> >
> > If we do this, and if we can guarantee that all CPUs make forward
> > progress, then maybe the problem is solved. Can we guarantee something
> > like all NMI handlers that might wait in a spinlock or for any other
> > reason will periodically check if a sync is needed while they're
> > spinning?
>
> So the really, really nasty case is when an asynchronous event on the
> *patching* processor gets stuck spinning on a resource which is
> unavailable due to another processor spinning on the #BP. We can disable
> interrupts, but we can't stop NMIs from coming in (although we could
> test in the NMI handler if we are in that condition and return
> immediately; I'm not sure we want to do that, and we still have to deal
> with #MC and what not.)
>
> The fundamental problem here is that we don't see the #BP on the
> patching processor, in which case we could simply complete the patching
> from the #BP handler on that processor.
>
> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
> > On 1/13/19 6:31 PM, H. Peter Anvin wrote:
> >>
> >> static cpumask_t text_poke_cpumask;
> >>
> >> static void text_poke_sync(void)
> >> {
> >>      smp_wmb();
> >>      text_poke_cpumask = cpu_online_mask;
> >>      smp_wmb();      /* Should be optional on x86 */
> >>      cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> >>      on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
> >>      while (!cpumask_empty(&text_poke_cpumask)) {
> >>              cpu_relax();
> >>              smp_rmb();
> >>      }
> >> }
> >>
> >> static void text_poke_sync_cpu(void *dummy)
> >> {
> >>      (void)dummy;
> >>
> >>      smp_rmb();
> >>      cpumask_clear_cpu(&poke_bitmask, smp_processor_id());
> >>      /*
> >>       * We are guaranteed to return with an IRET, either from the
> >>       * IPI or the #BP handler; this provides serialization.
> >>       */
> >> }
> >>
> >
> > The invariants here are:
> >
> > 1. The patching routine must set each bit in the cpumask after each event
> >    that requires synchronization is complete.
> > 2. The bit can be (atomically) cleared on the target CPU only, and only in a
> >    place that guarantees a synchronizing event (e.g. IRET) before it may
> >    reaching the poked instruction.
> > 3. At a minimum the IPI handler and #BP handler needs to clear the bit. It
> >    *is* also possible to clear it in other places, e.g. the NMI handler, if
> >    necessary as long as condition 2 is satisfied.
> >
>
> OK, so with interrupts enabled *on the processor doing the patching* we
> still have a problem if it takes an interrupt which in turn takes a #BP.
>  Disabling interrupts would not help, because but an NMI and #MC could
> still cause problems unless we can guarantee that no path which may be
> invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
> assumption.
>
> Note: I am assuming preemption is disabled.
>
> The easiest/sanest way to deal with this might be to switch the IDT (or
> provide a hook in the generic exception entry code) on the patching
> processor, such that if an asynchronous event comes in, we either roll
> forward or revert. This is doable because the second sync we currently
> do is not actually necessary per the hardware guys.

This is IMO insanely complicated.  I much prefer the kind of
complexity that is more or less deterministic and easy to test to the
kind of complexity (like this) that only happens in corner cases.

I see two solutions here:

1. Just suck it up and emulate the CALL.  And find a way to write a
test case so we know it works.

2. Find a non-deadlocky way to make the breakpoint handler wait for
the breakpoint to get removed, without any mucking at all with the
entry code.  And find a way to write a test case so we know it works.
(E.g. stick an actual static_call call site *in text_poke_bp()* that
fires once on boot so that the really awful recursive case gets
exercised all the time.)

But if we're going to do any mucking with the entry code, let's just
do the simple mucking to make emulating CALL work.
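
For reference, option 1 boils down to something like the following in the #BP
handler, assuming the entry code is changed to leave a gap below the kernel
IRET frame so there is room to push (that gap is the "simple mucking" being
referred to; the helper name and constants are invented for illustration, no
such helper exists in the tree as of this thread):

	#define INT3_LEN	1
	#define CALL_LEN	5

	/* Illustrative: emulate "call func" for a breakpoint on a call site. */
	static void emulate_static_call(struct pt_regs *regs, unsigned long func)
	{
		unsigned long *sp;
		/* regs->ip points just past the int3; the call is CALL_LEN bytes. */
		unsigned long ret_addr = regs->ip - INT3_LEN + CALL_LEN;

		regs->sp -= sizeof(unsigned long);
		sp = (unsigned long *)regs->sp;
		*sp = ret_addr;		/* push the return address */
		regs->ip = func;	/* "jump" to the call target */
	}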

--Andy

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-14 23:27                         ` Andy Lutomirski
@ 2019-01-14 23:51                           ` Nadav Amit
  2019-01-15  2:28                           ` hpa
  1 sibling, 0 replies; 90+ messages in thread
From: Nadav Amit @ 2019-01-14 23:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Jiri Kosina, Linus Torvalds, Josh Poimboeuf,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jan 14, 2019, at 3:27 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin <hpa@zytor.com> wrote:
>> So I was already in the middle of composing this message when Andy posted:
>> 
>>> I don't even think this is sufficient.  I think we also need everyone
>>> who clears the bit to check if all bits are clear and, if so, remove
>>> the breakpoint.  Otherwise we have a situation where, if you are in
>>> text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
>>> and that interrupt then hits the breakpoint, then you deadlock because
>>> no one removes the breakpoint.
>>> 
>>> If we do this, and if we can guarantee that all CPUs make forward
>>> progress, then maybe the problem is solved. Can we guarantee something
>>> like all NMI handlers that might wait in a spinlock or for any other
>>> reason will periodically check if a sync is needed while they're
>>> spinning?
>> 
>> So the really, really nasty case is when an asynchronous event on the
>> *patching* processor gets stuck spinning on a resource which is
>> unavailable due to another processor spinning on the #BP. We can disable
>> interrupts, but we can't stop NMIs from coming in (although we could
>> test in the NMI handler if we are in that condition and return
>> immediately; I'm not sure we want to do that, and we still have to deal
>> with #MC and what not.)
>> 
>> The fundamental problem here is that we don't see the #BP on the
>> patching processor, in which case we could simply complete the patching
>> from the #BP handler on that processor.
>> 
>> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
>>> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>>>> static cpumask_t text_poke_cpumask;
>>>> 
>>>> static void text_poke_sync(void)
>>>> {
>>>>     smp_wmb();
>>>>     text_poke_cpumask = cpu_online_mask;
>>>>     smp_wmb();      /* Should be optional on x86 */
>>>>     cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>>>>     on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>>>>     while (!cpumask_empty(&text_poke_cpumask)) {
>>>>             cpu_relax();
>>>>             smp_rmb();
>>>>     }
>>>> }
>>>> 
>>>> static void text_poke_sync_cpu(void *dummy)
>>>> {
>>>>     (void)dummy;
>>>> 
>>>>     smp_rmb();
>>>>     cpumask_clear_cpu(&poke_bitmask, smp_processor_id());
>>>>     /*
>>>>      * We are guaranteed to return with an IRET, either from the
>>>>      * IPI or the #BP handler; this provides serialization.
>>>>      */
>>>> }
>>> 
>>> The invariants here are:
>>> 
>>> 1. The patching routine must set each bit in the cpumask after each event
>>>   that requires synchronization is complete.
>>> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>>>   place that guarantees a synchronizing event (e.g. IRET) before it may
>>>   reaching the poked instruction.
>>> 3. At a minimum the IPI handler and #BP handler needs to clear the bit. It
>>>   *is* also possible to clear it in other places, e.g. the NMI handler, if
>>>   necessary as long as condition 2 is satisfied.
>> 
>> OK, so with interrupts enabled *on the processor doing the patching* we
>> still have a problem if it takes an interrupt which in turn takes a #BP.
>> Disabling interrupts would not help, because but an NMI and #MC could
>> still cause problems unless we can guarantee that no path which may be
>> invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
>> assumption.
>> 
>> Note: I am assuming preemption is disabled.
>> 
>> The easiest/sanest way to deal with this might be to switch the IDT (or
>> provide a hook in the generic exception entry code) on the patching
>> processor, such that if an asynchronous event comes in, we either roll
>> forward or revert. This is doable because the second sync we currently
>> do is not actually necessary per the hardware guys.
> 
> This is IMO insanely complicated.  I much prefer the kind of
> complexity that is more or less deterministic and easy to test to the
> kind of complexity (like this) that only happens in corner cases.
> 
> I see two solutions here:
> 
> 1. Just suck it up and emulate the CALL.  And find a way to write a
> test case so we know it works.
> 
> 2. Find a non-deadlocky way to make the breakpoint handler wait for
> the breakpoint to get removed, without any mucking at all with the
> entry code.  And find a way to write a test case so we know it works.
> (E.g. stick an actual static_call call site *in text_poke_bp()* that
> fires once on boot so that the really awful recursive case gets
> exercised all the time.
> 
> But if we're going to do any mucking with the entry code, let's just
> do the simple mucking to make emulating CALL work.

These two approaches still seem more complicated to me than having a 
trampoline as backup, which is patched dynamically.
 
IIUC, the current pushback against this option is that it makes batching
more complicated. I am not sure it is that bad, but there are other variants
of this solution, for example using a retpoline-like flow:

BP-handler:
	/* Set a per-CPU variable to hold the target */
	this_cpu_write(interrupted_static_call_target,
		       get_static_call_targets(regs->rip));

	/* Choose the landing pad based on whether IRQs were enabled
	 * in the interrupted context */
	if (regs->flags & X86_EFLAGS_IF) {
		regs->flags &= ~X86_EFLAGS_IF;
		regs->rip = user_handler_IRQ_disabled;
	} else {
		regs->rip = user_handler_IRQ_enabled;
	}

user_handler_IRQ_disabled:
	push PER_CPU_VAR(interrupted_static_call_target)
	sti  # not needed in the IRQ-enabled variant
	ret  # the sti shadow prevents preemption before the ret

Obviously, I don’t know how this coexists with CET.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-14 23:27                         ` Andy Lutomirski
  2019-01-14 23:51                           ` Nadav Amit
@ 2019-01-15  2:28                           ` hpa
  1 sibling, 0 replies; 90+ messages in thread
From: hpa @ 2019-01-15  2:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Linus Torvalds, Josh Poimboeuf, Nadav Amit,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On January 14, 2019 3:27:55 PM PST, Andy Lutomirski <luto@kernel.org> wrote:
>On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> So I was already in the middle of composing this message when Andy
>posted:
>>
>> > I don't even think this is sufficient.  I think we also need
>everyone
>> > who clears the bit to check if all bits are clear and, if so,
>remove
>> > the breakpoint.  Otherwise we have a situation where, if you are in
>> > text_poke_bp() and you take an NMI (or interrupt or MCE or
>whatever)
>> > and that interrupt then hits the breakpoint, then you deadlock
>because
>> > no one removes the breakpoint.
>> >
>> > If we do this, and if we can guarantee that all CPUs make forward
>> > progress, then maybe the problem is solved. Can we guarantee
>something
>> > like all NMI handlers that might wait in a spinlock or for any
>other
>> > reason will periodically check if a sync is needed while they're
>> > spinning?
>>
>> So the really, really nasty case is when an asynchronous event on the
>> *patching* processor gets stuck spinning on a resource which is
>> unavailable due to another processor spinning on the #BP. We can
>disable
>> interrupts, but we can't stop NMIs from coming in (although we could
>> test in the NMI handler if we are in that condition and return
>> immediately; I'm not sure we want to do that, and we still have to
>deal
>> with #MC and what not.)
>>
>> The fundamental problem here is that we don't see the #BP on the
>> patching processor, in which case we could simply complete the
>patching
>> from the #BP handler on that processor.
>>
>> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
>> > On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>> >>
>> >> static cpumask_t text_poke_cpumask;
>> >>
>> >> static void text_poke_sync(void)
>> >> {
>> >>      smp_wmb();
>> >>      text_poke_cpumask = cpu_online_mask;
>> >>      smp_wmb();      /* Should be optional on x86 */
>> >>      cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>> >>      on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu,
>NULL, false);
>> >>      while (!cpumask_empty(&text_poke_cpumask)) {
>> >>              cpu_relax();
>> >>              smp_rmb();
>> >>      }
>> >> }
>> >>
>> >> static void text_poke_sync_cpu(void *dummy)
>> >> {
>> >>      (void)dummy;
>> >>
>> >>      smp_rmb();
>> >>      cpumask_clear_cpu(&poke_bitmask, smp_processor_id());
>> >>      /*
>> >>       * We are guaranteed to return with an IRET, either from the
>> >>       * IPI or the #BP handler; this provides serialization.
>> >>       */
>> >> }
>> >>
>> >
>> > The invariants here are:
>> >
>> > 1. The patching routine must set each bit in the cpumask after each
>event
>> >    that requires synchronization is complete.
>> > 2. The bit can be (atomically) cleared on the target CPU only, and
>only in a
>> >    place that guarantees a synchronizing event (e.g. IRET) before
>it may
>> >    reaching the poked instruction.
>> > 3. At a minimum the IPI handler and #BP handler needs to clear the
>bit. It
>> >    *is* also possible to clear it in other places, e.g. the NMI
>handler, if
>> >    necessary as long as condition 2 is satisfied.
>> >
>>
>> OK, so with interrupts enabled *on the processor doing the patching*
>we
>> still have a problem if it takes an interrupt which in turn takes a
>#BP.
>>  Disabling interrupts would not help, because but an NMI and #MC
>could
>> still cause problems unless we can guarantee that no path which may
>be
>> invoked by NMI/#MC can do text_poke, which seems to be a very
>aggressive
>> assumption.
>>
>> Note: I am assuming preemption is disabled.
>>
>> The easiest/sanest way to deal with this might be to switch the IDT
>(or
>> provide a hook in the generic exception entry code) on the patching
>> processor, such that if an asynchronous event comes in, we either
>roll
>> forward or revert. This is doable because the second sync we
>currently
>> do is not actually necessary per the hardware guys.
>
>This is IMO insanely complicated.  I much prefer the kind of
>complexity that is more or less deterministic and easy to test to the
>kind of complexity (like this) that only happens in corner cases.
>
>I see two solutions here:
>
>1. Just suck it up and emulate the CALL.  And find a way to write a
>test case so we know it works.
>
>2. Find a non-deadlocky way to make the breakpoint handler wait for
>the breakpoint to get removed, without any mucking at all with the
>entry code.  And find a way to write a test case so we know it works.
>(E.g. stick an actual static_call call site *in text_poke_bp()* that
>fires once on boot so that the really awful recursive case gets
>exercised all the time.
>
>But if we're going to do any mucking with the entry code, let's just
>do the simple mucking to make emulating CALL work.
>
>--Andy

Ugh. So much for not really proofreading. Yes, I think the second solution is the right thing since I think I figured out how to do it without deadlock; see other mail.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-14 22:54                         ` H. Peter Anvin
@ 2019-01-15  3:05                           ` Andy Lutomirski
  2019-01-15  5:01                             ` H. Peter Anvin
  0 siblings, 1 reply; 90+ messages in thread
From: Andy Lutomirski @ 2019-01-15  3:05 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jiri Kosina, Linus Torvalds, Josh Poimboeuf, Nadav Amit,
	Andy Lutomirski, Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Mon, Jan 14, 2019 at 2:55 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> I think this sequence ought to work (keep in mind we are already under a
> mutex, so the global data is safe even if we are preempted):

I'm trying to wrap my head around this.  The states are:

0: normal operation
1: writing 0xcc, can be canceled
2: writing final instruction.  The 0xcc was definitely synced to all CPUs.
3: patch is definitely installed but maybe not sync_cored.

>
>         set up page table entries
>         invlpg
>         set up bp patching global data
>
>         cpu = get_cpu()
>
So we're assuming that the state is

>         bp_old_value = atomic_read(bp_write_addr)
>
>         do {

So we're assuming that the state is 0 here.  A WARN_ON_ONCE to check
that would be nice.

>                 atomic_write(&bp_poke_state, 1)
>
>                 atomic_write(bp_write_addr, 0xcc)
>
>                 mask <- online_cpu_mask - self
>                 send IPIs
>                 wait for mask = 0
>
>         } while (cmpxchg(&bp_poke_state, 1, 2) != 1);
>
>         patch sites, remove breakpoints after patching each one

Not sure what you mean by patch *sites*.  As written, this only
supports one patch site at a time, since there's only one
bp_write_addr, and fixing that may be complicated.  Not fixing it
might also be a scalability problem.

>
>         atomic_write(&bp_poke_state, 3);
>
>         mask <- online_cpu_mask - self
>         send IPIs
>         wait for mask = 0
>
>         atomic_write(&bp_poke_state, 0);
>
>         tear down patching global data
>         tear down page table entries
>
>
>
> The #BP handler would then look like:
>
>         state = cmpxchg(&bp_poke_state, 1, 4);
>         switch (state) {
>                 case 1:
>                 case 4:

What is state 4?

>                         invlpg
>                         cmpxchg(bp_write_addr, 0xcc, bp_old_value)
>                         break;
>                 case 2:
>                         invlpg
>                         complete patch sequence
>                         remove breakpoint
>                         break;

ISTM you might as well change state to 3 here, but it's arguably unnecessary.

>                 case 3:
>                         /* If we are here, the #BP will go away on its own */
>                         break;
>                 case 0:
>                         /* No patching in progress!!! */
>                         return 0;
>         }
>
>         clear bit in mask
>         return 1;
>
> The IPI handler:
>
>         clear bit in mask
>         sync_core       /* Needed if multiple IPI events are chained */

I really like that this doesn't require fixups -- text_poke_bp() just
works.  But I'm nervous about livelocks or maybe just extreme slowness
under nasty loads.  Suppose some perf NMI code does a static call or
uses a static call.  Now there's a situation where, under high
frequency perf sampling, the patch process might almost always hit the
breakpoint while in state 1.  It'll get reversed and done again, and
we get stuck.  It would be neat if we could get the same "no
deadlocks" property while significantly reducing the chance of a
rollback.

This is why I proposed something where we try to guarantee forward
progress by making sure that any NMI code that might spin and wait for
other CPUs is guaranteed to eventually sync_core(), clear its bit, and
possibly finish a patch.  But this is a bit gross.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-15  3:05                           ` Andy Lutomirski
@ 2019-01-15  5:01                             ` H. Peter Anvin
  2019-01-15  5:37                               ` H. Peter Anvin
  0 siblings, 1 reply; 90+ messages in thread
From: H. Peter Anvin @ 2019-01-15  5:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Linus Torvalds, Josh Poimboeuf, Nadav Amit,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On 1/14/19 7:05 PM, Andy Lutomirski wrote:
> On Mon, Jan 14, 2019 at 2:55 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> I think this sequence ought to work (keep in mind we are already under a
>> mutex, so the global data is safe even if we are preempted):
> 
> I'm trying to wrap my head around this.  The states are:
> 
> 0: normal operation
> 1: writing 0xcc, can be canceled
> 2: writing final instruction.  The 0xcc was definitely synced to all CPUs.
> 3: patch is definitely installed but maybe not sync_cored.
> 

4: breakpoint has been canceled; need to redo patching.
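
Putting the five states in one place (names invented purely for readability;
the sketch itself just uses the raw numbers):

	enum bp_poke_state {
		BP_IDLE		= 0,	/* normal operation */
		BP_ARMING	= 1,	/* writing 0xcc, can still be canceled */
		BP_PATCHING	= 2,	/* 0xcc synced everywhere; writing the rest */
		BP_SYNCING	= 3,	/* patch installed; final sync pending */
		BP_CANCELED	= 4,	/* breakpoint canceled; must redo patching */
	};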

>>
>>         set up page table entries
>>         invlpg
>>         set up bp patching global data
>>
>>         cpu = get_cpu()
>>
> So we're assuming that the state is
> 
>>         bp_old_value = atomic_read(bp_write_addr)
>>
>>         do {
> 
> So we're assuming that the state is 0 here.  A WARN_ON_ONCE to check
> that would be nice.

The state here can be 0 or 4.

>>                 atomic_write(&bp_poke_state, 1)
>>
>>                 atomic_write(bp_write_addr, 0xcc)
>>
>>                 mask <- online_cpu_mask - self
>>                 send IPIs
>>                 wait for mask = 0
>>
>>         } while (cmpxchg(&bp_poke_state, 1, 2) != 1);
>>
>>         patch sites, remove breakpoints after patching each one
> 
> Not sure what you mean by patch *sites*.  As written, this only
> supports one patch site at a time, since there's only one
> bp_write_addr, and fixing that may be complicated.  Not fixing it
> might also be a scalability problem.

Fixing it isn't all that complicated; we just need to have a list of
patch locations (which we need anyway!) and walk (or search) it instead
of checking just one; I omitted that detail for simplicity.

>>         atomic_write(&bp_poke_state, 3);
>>
>>         mask <- online_cpu_mask - self
>>         send IPIs
>>         wait for mask = 0
>>
>>         atomic_write(&bp_poke_state, 0);
>>
>>         tear down patching global data
>>         tear down page table entries
>>
>>
>>
>> The #BP handler would then look like:
>>
>>         state = cmpxchg(&bp_poke_state, 1, 4);
>>         switch (state) {
>>                 case 1:
>>                 case 4:
> 
> What is state 4?
> 
>>                         invlpg
>>                         cmpxchg(bp_write_addr, 0xcc, bp_old_value)

I'm 85% sure that the cmpxchg here is actually unnecessary, an
atomic_write() is sufficient.

>>                         break;
>>                 case 2:
>>                         invlpg
>>                         complete patch sequence
>>                         remove breakpoint
>>                         break;
> 
> ISTM you might as well change state to 3 here, but it's arguably unnecessary.

If and only if you have only one patch location you could, but again,
unnecessary.

>>                 case 3:
>>                         /* If we are here, the #BP will go away on its own */
>>                         break;
>>                 case 0:
>>                         /* No patching in progress!!! */
>>                         return 0;
>>         }
>>
>>         clear bit in mask
>>         return 1;
>>
>> The IPI handler:
>>
>>         clear bit in mask
>>         sync_core       /* Needed if multiple IPI events are chained */
> 
> I really like that this doesn't require fixups -- text_poke_bp() just
> works.  But I'm nervous about livelocks or maybe just extreme slowness
> under nasty loads.  Suppose some perf NMI code does a static call or
> uses a static call.  Now there's a situation where, under high
> frequency perf sampling, the patch process might almost always hit the
> breakpoint while in state 1.  It'll get reversed and done again, and
> we get stuck.  It would be neat if we could get the same "no
> deadlocks" property while significantly reducing the chance of a
> rollback.

This could be as simple as spinning for a limited time waiting for
states 0 or 3 if we are not the patching CPU. It is also not necessary
to wait for the mask to become zero for the first sync if we find
ourselves suddenly in state 4.

This wouldn't reduce the livelock probability to zero, but it ought to
reduce it enough that if we really are under such heavy event load we
may end up getting stuck in any number of ways...

> This is why I proposed something where we try to guarantee forward
> progress by making sure that any NMI code that might spin and wait for
> other CPUs is guaranteed to eventually sync_core(), clear its bit, and
> possibly finish a patch.  But this is a bit gross.

Yes, this gets really grotty and who knows how many code paths it would
touch.

	-hpa



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-15  5:01                             ` H. Peter Anvin
@ 2019-01-15  5:37                               ` H. Peter Anvin
  0 siblings, 0 replies; 90+ messages in thread
From: H. Peter Anvin @ 2019-01-15  5:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Linus Torvalds, Josh Poimboeuf, Nadav Amit,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Masami Hiramatsu, Jason Baron,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On 1/14/19 9:01 PM, H. Peter Anvin wrote:
> 
> This could be as simple as spinning for a limited time waiting for
> states 0 or 3 if we are not the patching CPU. It is also not necessary
> to wait for the mask to become zero for the first sync if we find
> ourselves suddenly in state 4.
> 

So this would look something like this for the #BP handler; I think this
is safe.  This uses the TLB miss on the write page intentionally to slow
down the loop a bit to reduce the risk of livelock.  Note that
"bp_write_addr" here refers to the write address for the breakpoint that
was taken.

	state = atomic_read(&bp_poke_state);
	if (state == 0)
		return 0;		/* No patching in progress */

recheck:
	clear bit in mask

	switch (state) {
		case 1:
		case 4:
			if (smp_processor_id() != bp_patching_cpu) {
				int retries = NNN;
				while (retries--) {
					invlpg
					if (*bp_write_addr != 0xcc)
						goto recheck;
					state = atomic_read(&bp_poke_state);
					if (state != 1 && state != 4)
						goto recheck;
				}
			}
			state = cmpxchg(&bp_poke_state, 1, 4);
			if (state != 1 && state != 4)
				goto recheck;
			atomic_write(bp_write_addr, bp_old_value);
			break;
		case 2:
			if (smp_processor_id() != bp_patching_cpu) {
				invlpg
				state = atomic_read(&bp_poke_state);
				if (state != 2)
					goto recheck;
			}
			complete patch sequence
			remove breakpoint
			break;

		case 3:
		case 0:
			/*
			 * If we are here, the #BP will go away on its
			 * own, or we will re-take it if it was a "real"
			 * breakpoint.
			 */
			break;
	}
	return 1;

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-11 16:57                     ` Josh Poimboeuf
  2019-01-11 17:41                       ` Jason Baron
@ 2019-01-15 11:10                       ` Alexandre Chartre
  2019-01-15 16:19                         ` Steven Rostedt
  1 sibling, 1 reply; 90+ messages in thread
From: Alexandre Chartre @ 2019-01-15 11:10 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Sean Christopherson, Steven Rostedt, Nadav Amit, X86 ML, LKML,
	Ard Biesheuvel, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira


On 01/11/2019 05:57 PM, Josh Poimboeuf wrote:
> On Fri, Jan 11, 2019 at 05:46:36PM +0100, Alexandre Chartre wrote:
>>
>>
>> On 01/11/2019 04:28 PM, Josh Poimboeuf wrote:
>>> On Fri, Jan 11, 2019 at 01:10:52PM +0100, Alexandre Chartre wrote:
>>>> To avoid any issue with live patching the call instruction, what about
>>>> toggling between two call instructions: one would be the currently active
>>>> call, while the other would currently be inactive but to be used after a
>>>> change is made. You can safely patch the inactive call and then toggle
>>>> the call flow (using a jump label) between the active and inactive calls.
>>>>
>>>> So instead of having a single call instruction:
>>>>
>>>> 	call function
>>>>
>>>> You would have:
>>>>
>>>> 	STATIC_JUMP_IF_TRUE label, key
>>>> 	call function1
>>>> 	jmp done
>>>> label:
>>>> 	call function2
>>>> done:
>>>>
>>>> If the key is set so that function1 is currently called then you can
>>>> safely update the call instruction for function2. Once this is done,
>>>> just flip the key to make the function2 call active. On a next update,
>>>> you would, of course, have to switch and update the call for function1.
>>>
>>> What about the following race?
>>>
>>> CPU1						CPU2
>>> static key is false, doesn't jump
>>> task gets preempted before calling function1
>>> 						change static key to true
>>> 						start patching "call function1"
>>> task resumes, sees inconsistent call instruction
>>>
>>
>> If the function1 call is active then it won't be changed, you will change
>> function2. However, I presume you can still have a race but if the function
>> is changed twice before calling function1:
>>
>> CPU1						CPU2
>> static key is false, doesn't jump
>> task gets preempted before calling function1
>>                                                 -- first function change --
>>                                                 patch "call function2"
>>                                                 change static key to true
>>                                                 -- second function change --
>>                                                 start patching "call function1"
>> task resumes, sees inconsistent call instruction
>>
>> So right, that's a problem.
>
> Right, that's what I meant to say :-)
>

Thinking more about it (and I've probably missed something or I am just being
totally stupid because this seems way too simple), can't we just replace the
"call" with "push+jmp" and patch the jmp instruction?

Instead of having:

     call target

Have:

     push $done
static_call:
     jmp target
done:

Then we can safely patch the "jmp" instruction to jump to a new target
with text_poke_bp(), using the new target as the text_poke_bp() handler:

   new_jmp_code = opcode of "jmp new_target"

   text_poke_bp(static_call, new_jmp_code, new_jmp_code_size, new_target);

Problems come with patching a call instruction, but there's no issue with patching
a jmp, no? (that's what jump labels do).
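
As a rough C sketch of that update path (static_call_retarget() and the
5-byte "jmp rel32" form are just assumptions here, not code from this
series, and I'm assuming the current 4-argument text_poke_bp() prototype
of addr/opcode/len/handler), it would be something like:

#include <linux/string.h>
#include <linux/types.h>
#include <asm/text-patching.h>	/* text_poke_bp() */

/* Sketch only: retarget one "jmp" static call site. */
static void static_call_retarget(void *site, void *new_target)
{
	unsigned char jmp_insn[5];
	s32 rel = (s32)((unsigned long)new_target - (unsigned long)site - 5);

	jmp_insn[0] = 0xe9;			/* jmp rel32 */
	memcpy(&jmp_insn[1], &rel, sizeof(rel));

	/*
	 * While the transient int3 is in place, text_poke_bp() redirects
	 * anyone hitting the site straight to the new target; the return
	 * address was already pushed by the preceding "push $done".
	 */
	text_poke_bp(site, jmp_insn, sizeof(jmp_insn), new_target);
}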

No change to the int3 handler, no thunk, this seems really too simple... :-)


alex.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-15 11:10                       ` Alexandre Chartre
@ 2019-01-15 16:19                         ` Steven Rostedt
  2019-01-15 16:45                           ` Alexandre Chartre
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Rostedt @ 2019-01-15 16:19 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Josh Poimboeuf, Sean Christopherson, Nadav Amit, X86 ML, LKML,
	Ard Biesheuvel, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, 15 Jan 2019 12:10:19 +0100
Alexandre Chartre <alexandre.chartre@oracle.com> wrote:

> Thinking more about it (and I've probably missed something or I am just being
> totally stupid because this seems way too simple), can't we just replace the
> "call" with "push+jmp" and patch the jmp instruction?
> 
> Instead of having:
> 
>      call target
> 
> Have:
> 
>      push $done
> static_call:
>      jmp target
> done:

But how do you implement it? Inline asm()? Then you need to be able to
do that for any number and type of function parameters (there will be
users that have 13 parameters!)
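
Just to illustrate the point (a hand-rolled sketch, not code from any of
these patches, and it assumes -mno-red-zone as in the kernel): even for
one fixed prototype like int (*)(int, int), "tramp" being a hypothetical
trampoline symbol, you end up with something like the below, and then
have to repeat it for every other argument list:

/* Sketch only: a push+jmp call site for one specific prototype. */
#define static_call_int_int(tramp, a, b)				\
({									\
	register long __arg1 asm("rdi") = (long)(a);			\
	register long __arg2 asm("rsi") = (long)(b);			\
	long __ret;							\
									\
	asm volatile("pushq $1f\n\t"		/* push $done	*/	\
		     "jmp " #tramp "\n\t"	/* jmp target	*/	\
		     "1:"			/* done:	*/	\
		     : "=a" (__ret), "+r" (__arg1), "+r" (__arg2)	\
		     : : "rdx", "rcx", "r8", "r9", "r10", "r11",	\
		         "memory", "cc");				\
	(int)__ret;							\
})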

I believe people have mentioned having a gcc plugin that would do it
for us, which was one of the suggested solutions.

-- Steve

> 
> Then we can safely patch the "jmp" instruction to jump to a new target
> with text_poke_bp(), using the new target as the text_poke_bp() handler:
> 
>    new_jmp_code = opcode of "jmp new_target"
> 
>    text_poke_bp(static_call, new_jmp_code, new_jmp_code_size, new_target);
> 
> Problems come with patching a call instruction, but there's no issue with patching
> a jmp, no? (that's what jump labels do).
> 
> No change to the int3 handler, no thunk, this seems really too simple... :-)


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible
  2019-01-15 16:19                         ` Steven Rostedt
@ 2019-01-15 16:45                           ` Alexandre Chartre
  0 siblings, 0 replies; 90+ messages in thread
From: Alexandre Chartre @ 2019-01-15 16:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Josh Poimboeuf, Sean Christopherson, Nadav Amit, X86 ML, LKML,
	Ard Biesheuvel, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira


On 01/15/2019 05:19 PM, Steven Rostedt wrote:
> On Tue, 15 Jan 2019 12:10:19 +0100
> Alexandre Chartre <alexandre.chartre@oracle.com> wrote:
>
>> Thinking more about it (and I've probably missed something or I am just being
>> totally stupid because this seems way too simple), can't we just replace the
>> "call" with "push+jmp" and patch the jmp instruction?
>>
>> Instead of having:
>>
>>      call target
>>
>> Have:
>>
>>      push $done
>> static_call:
>>      jmp target
>> done:
>
> But how do you implement it? Inline asm()? Then you need to be able to
> do that for any number and type of function parameters (there will be
> users that have 13 parameters!)
>
> I believe people have mentioned having a gcc plugin that would do it
> for us, which was one of the suggested solutions.
>

Ah okay, I think I get it now (hopefully; I probably lost track of the
discussion at some point): Linus' latest proposal avoids the gcc plugin
by keeping the call as-is and dealing with it in the int3 handler +
thunk.

Thanks, and sorry for the noise.

alex.

>>
>> Then we can safely patch the "jmp" instruction to jump to a new target
>> with text_poke_bp(), using the new target as the text_poke_bp() handler:
>>
>>    new_jmp_code = opcode of "jmp new_target"
>>
>>    text_poke_bp(static_call, new_jmp_code, new_jmp_code_size, new_target);
>>
>> Problems come with patching a call instruction, but there's no issue with patching
>> a jmp, no? (that's what jump labels do).
>>
>> No change to the int3 handler, no thunk, this seems really too simple... :-)
>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2019-01-10 20:52   ` Josh Poimboeuf
  2019-01-10 23:02     ` Linus Torvalds
@ 2020-02-17 21:10     ` Jann Horn
  2020-02-17 21:57       ` Steven Rostedt
  1 sibling, 1 reply; 90+ messages in thread
From: Jann Horn @ 2020-02-17 21:10 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, the arch/x86 maintainers, kernel list,
	Ard Biesheuvel, Andy Lutomirski, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Thu, Jan 10, 2019 at 9:52 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jan 10, 2019 at 09:30:23PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote:
> > > With this version, I stopped trying to use text_poke_bp(), and instead
> > > went with a different approach: if the call site destination doesn't
> > > cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> > > using the trampoline indefinitely.
> >
> > > - Get rid of the use of text_poke_bp(), in favor of atomic writes.
> > >   Out-of-line calls will be promoted to inline only if the call sites
> > >   don't cross cache line boundaries. [Linus/Andy]
> >
> > Can we preserve why text_poke_bp() didn't work? I seem to have forgotten
> > again. The problem was poking the return address onto the stack from the
> > int3 handler, or something along those lines?
>
> Right, emulating a call instruction from the #BP handler is ugly,
> because you have to somehow grow the stack to make room for the return
> address.  Personally I liked the idea of shifting the iret frame by 16
> bytes in the #DB entry code, but others hated it.
>
> So many bad-but-not-completely-unacceptable options to choose from.
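
(To make the problem concrete, emulating the call from the #BP handler
means roughly the sketch below; the names are made up and this is not
code from this series.)

#include <asm/ptrace.h>

/*
 * Sketch only: the return address has to be written to the interrupted
 * context's stack, below the sp recorded in the #BP frame, and nothing
 * has reserved room for it there.
 */
static void emulate_call_from_int3(struct pt_regs *regs, unsigned long func)
{
	/* ip points past the 1-byte int3; the original call was 5 bytes */
	unsigned long retaddr = regs->ip - 1 + 5;

	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = retaddr;
	regs->ip = func;
}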

Silly suggestion from someone who has skimmed the thread:

Wouldn't a retpoline-style trampoline solve this without needing
memory allocations? Let the interrupt handler stash the destination in
a percpu variable and clear IF in regs->flags. Something like:

void simulate_call(struct pt_regs *regs, unsigned long target) {
  __this_cpu_write(static_call_restore_if, (regs->flags & X86_EFLAGS_IF) != 0);
  regs->flags &= ~X86_EFLAGS_IF;
  __this_cpu_write(static_call_trampoline_source, regs->ip + 5);
  __this_cpu_write(static_call_trampoline_target, target);
  regs->ip = (unsigned long)magic_static_call_trampoline;
}

magic_static_call_trampoline:
; set up return address for returning from target function
pushq PER_CPU_VAR(static_call_trampoline_source)
; set up retpoline-style return address
pushq PER_CPU_VAR(static_call_trampoline_target)
; restore flags if needed
cmpl $0, PER_CPU_VAR(static_call_restore_if)
je 1f
sti ; NOTE: percpu data must not be accessed past this point
1:
ret ; "return" to the call target

By using a return to implement the call, we don't need any scratch
registers for the call.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 0/6] Static calls
  2020-02-17 21:10     ` Jann Horn
@ 2020-02-17 21:57       ` Steven Rostedt
  0 siblings, 0 replies; 90+ messages in thread
From: Steven Rostedt @ 2020-02-17 21:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: Josh Poimboeuf, Peter Zijlstra, the arch/x86 maintainers,
	kernel list, Ard Biesheuvel, Andy Lutomirski, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Mon, 17 Feb 2020 22:10:27 +0100
Jann Horn <jannh@google.com> wrote:

> On Thu, Jan 10, 2019 at 9:52 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Thu, Jan 10, 2019 at 09:30:23PM +0100, Peter Zijlstra wrote:  
> > > On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote:  
> > > > With this version, I stopped trying to use text_poke_bp(), and instead
> > > > went with a different approach: if the call site destination doesn't
> > > > cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> > > > using the trampoline indefinitely.  
> > >  
> > > > - Get rid of the use of text_poke_bp(), in favor of atomic writes.
> > > >   Out-of-line calls will be promoted to inline only if the call sites
> > > >   don't cross cache line boundaries. [Linus/Andy]  
> > >
> > > Can we preserve why text_poke_bp() didn't work? I seem to have forgotten
> > > again. The problem was poking the return address onto the stack from the
> > > int3 handler, or something along those lines?  
> >
> > Right, emulating a call instruction from the #BP handler is ugly,
> > because you have to somehow grow the stack to make room for the return
> > address.  Personally I liked the idea of shifting the iret frame by 16
> > bytes in the #DB entry code, but others hated it.
> >
> > So many bad-but-not-completely-unacceptable options to choose from.  
> 
> Silly suggestion from someone who has skimmed the thread:
> 
> Wouldn't a retpoline-style trampoline solve this without needing
> memory allocations? Let the interrupt handler stash the destination in
> a percpu variable and clear IF in regs->flags. Something like:

Linus actually suggested something similar for ftrace, but after
implementing it, it was hard to get right, and it caused havoc with
utilities like lockdep, and also with shadow stacks.

See this thread:

  https://lore.kernel.org/linux-kselftest/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=A@mail.gmail.com/


 -- Steve

^ permalink raw reply	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2020-02-17 21:57 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-09 22:59 [PATCH v3 0/6] Static calls Josh Poimboeuf
2019-01-09 22:59 ` [PATCH v3 1/6] compiler.h: Make __ADDRESSABLE() symbol truly unique Josh Poimboeuf
2019-01-09 22:59 ` [PATCH v3 2/6] static_call: Add basic static call infrastructure Josh Poimboeuf
2019-01-10 14:03   ` Edward Cree
2019-01-10 18:37     ` Josh Poimboeuf
2019-01-09 22:59 ` [PATCH v3 3/6] x86/static_call: Add out-of-line static call implementation Josh Poimboeuf
2019-01-10  0:16   ` Nadav Amit
2019-01-10 16:28     ` Josh Poimboeuf
2019-01-09 22:59 ` [PATCH v3 4/6] static_call: Add inline static call infrastructure Josh Poimboeuf
2019-01-09 22:59 ` [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible Josh Poimboeuf
2019-01-10  9:32   ` Nadav Amit
2019-01-10 17:20     ` Josh Poimboeuf
2019-01-10 17:29       ` Nadav Amit
2019-01-10 17:32       ` Steven Rostedt
2019-01-10 17:42         ` Sean Christopherson
2019-01-10 17:57           ` Steven Rostedt
2019-01-10 18:04             ` Sean Christopherson
2019-01-10 18:21               ` Josh Poimboeuf
2019-01-10 18:24               ` Steven Rostedt
2019-01-11 12:10               ` Alexandre Chartre
2019-01-11 15:28                 ` Josh Poimboeuf
2019-01-11 16:46                   ` Alexandre Chartre
2019-01-11 16:57                     ` Josh Poimboeuf
2019-01-11 17:41                       ` Jason Baron
2019-01-11 17:54                         ` Nadav Amit
2019-01-15 11:10                       ` Alexandre Chartre
2019-01-15 16:19                         ` Steven Rostedt
2019-01-15 16:45                           ` Alexandre Chartre
2019-01-11  0:59           ` hpa
2019-01-11  1:34             ` Sean Christopherson
2019-01-11  8:13               ` hpa
2019-01-09 22:59 ` [PATCH v3 6/6] x86/static_call: Add inline static call implementation for x86-64 Josh Poimboeuf
2019-01-10  1:21 ` [PATCH v3 0/6] Static calls Nadav Amit
2019-01-10 16:44   ` Josh Poimboeuf
2019-01-10 17:32     ` Nadav Amit
2019-01-10 18:18       ` Josh Poimboeuf
2019-01-10 19:45         ` Nadav Amit
2019-01-10 20:32           ` Josh Poimboeuf
2019-01-10 20:48             ` Nadav Amit
2019-01-10 20:57               ` Josh Poimboeuf
2019-01-10 21:47                 ` Nadav Amit
2019-01-10 17:31 ` Linus Torvalds
2019-01-10 20:51   ` H. Peter Anvin
2019-01-10 20:30 ` Peter Zijlstra
2019-01-10 20:52   ` Josh Poimboeuf
2019-01-10 23:02     ` Linus Torvalds
2019-01-11  0:56       ` Andy Lutomirski
2019-01-11  1:47         ` Nadav Amit
2019-01-11 15:15           ` Josh Poimboeuf
2019-01-11 15:48             ` Nadav Amit
2019-01-11 16:07               ` Josh Poimboeuf
2019-01-11 17:23                 ` Nadav Amit
2019-01-11 19:03             ` Linus Torvalds
2019-01-11 19:17               ` Nadav Amit
2019-01-11 19:23               ` hpa
2019-01-11 19:33                 ` Nadav Amit
2019-01-11 19:34                 ` Linus Torvalds
2019-01-13  0:34                   ` hpa
2019-01-13  0:36                   ` hpa
2019-01-11 19:39                 ` Jiri Kosina
2019-01-14  2:31                   ` H. Peter Anvin
2019-01-14  2:40                     ` H. Peter Anvin
2019-01-14 20:11                       ` Andy Lutomirski
2019-01-14 22:00                       ` H. Peter Anvin
2019-01-14 22:54                         ` H. Peter Anvin
2019-01-15  3:05                           ` Andy Lutomirski
2019-01-15  5:01                             ` H. Peter Anvin
2019-01-15  5:37                               ` H. Peter Anvin
2019-01-14 23:27                         ` Andy Lutomirski
2019-01-14 23:51                           ` Nadav Amit
2019-01-15  2:28                           ` hpa
2019-01-11 20:04               ` Josh Poimboeuf
2019-01-11 20:12                 ` Linus Torvalds
2019-01-11 20:31                   ` Josh Poimboeuf
2019-01-11 20:46                     ` Linus Torvalds
2019-01-11 21:05                       ` Andy Lutomirski
2019-01-11 21:10                         ` Linus Torvalds
2019-01-11 21:32                           ` Josh Poimboeuf
2019-01-14 12:28                         ` Peter Zijlstra
2019-01-11 21:22                       ` Josh Poimboeuf
2019-01-11 21:23                         ` Josh Poimboeuf
2019-01-11 21:25                         ` Josh Poimboeuf
2019-01-11 21:36                         ` Nadav Amit
2019-01-11 21:41                           ` Josh Poimboeuf
2019-01-11 21:55                             ` Steven Rostedt
2019-01-11 21:59                               ` Nadav Amit
2019-01-11 21:56                             ` Nadav Amit
2019-01-12 23:54                         ` Andy Lutomirski
2020-02-17 21:10     ` Jann Horn
2020-02-17 21:57       ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).